Assignment 2
Contents
Assignment 2#
Written by Eric Johnson and Madhav Mani
Most recently compiled on March 14, 2023
Assignment Due Dates (Fall 2020):
The Assignment Attempt is due on October 16, 2020 at 11:59PM CST
The Complete Assignment is due on October 23, 2020 at 11:59PM CST
Please follow the guidelines for assignment attempts and completion. In particular, the attempt must be a Jupyter notebook or PDF that clearly enumerates the problem attempts, and the completed assignment must be code and a PDF containing completed solutions. Please use complete sentences in your solutions. All figures should have labeled axes, legends, and appropriate annotations.
Please read the entire assignment before getting too deep into any one problem so that you can properly allocate your time and questions.
Learning Objectives:#
The module learning objectives assessed in this assignment are that a student will be able to:
PCS-1: Calculate estimates and intervals (E&I) using theoretical formulas.
PCS-2: Calculate E&I using bootstrapping.
PCS-3: Perform OLS linear regression.
PCS-4: Perform cross-validation to estimate model error.
TS-1: Understand and discuss methods for making parameter E&I.
TS-2: Understand and discuss OLS assumptions in model fitting.
MVD: Illustrate uncertainty in E&I from data.
NQP-1: Compare and critique different methods for calculating E&I and determine optimal methods for your data.
NQP-2: Assess whether a given model is under- or over-fit and describe how to improve the model or fitting method.
You can learn more about where to study and practice these skills in the curriculum alignment table.
Problem 1: Poisson Rate Estimation#
Consider the bacterial chemotaxis data from the first assignment here. Generate lists of the time intervals,
In Assignment 1, you examined the distributions of
Specifically, in this problem you will examine three different methods for estimating
1.1: Maximum Likelihood Estimation and Confidence Intervals#
As pointed out in the notes and worksheets, MLEs and Confidence Intervals involve analytically manipulating theoretical models. In this problem, we will examine three different sets of assumptions, each a refinement on the last, to underscore what this theoretical process actually looks like in practice.
In terms of the actual maximum likelihood point estimator, it can be shown that using
maximizes the likelihood function for exponentially distributed data,
Calculate
To complete the estimation, we need to also give a confidence interval for
PCS-1, TS-1, NQP-1 If you didn’t know anything about exponential distributions or confidence intervals, you may have seen that sometimes people give intervals as
mean
stddev
. Will this work for our estimate of the rate parameter, ? Try it out: calculatefor both sets of intervals
and . Describe these intervals. Are they reasonable? Are the respective MLEs, and in the intervals?PCS-1, TS-1, NQP-1 If you do some more reading, you might find that the formula for a confidence interval is often given as
(32)#where the standard error is given as
stderror
stddev
, and is the percentile of the standard normal distribution. You can calculate percentiles of a normal distribution usingscipy.stats.norm.ppf(p)
, we havemean
, so we just needstddev
to complete the formula. You may be tempted to use the standard deviation from the data as in the previous problem, but a quick Wikipedia search shows that the variance of exponentially distributed variables is , so that maybe we can usestddev
. Thus the interval now becomes(33)#Use this formula to calculate confidence intervals for
for both sets of intervals, and . Discuss these intervals. Are they reasonable? Is in the respective interval? Importantly, does it make sense for the confidence interval to be symmetric?PCS-1, TS-1, NQP-1 It turns out that the confidence interval formula (Equation (32)) you just implemented is actually only the formula for confidence intervals of normally-distributed random variables! However, you can do a good amount of manipulation, as shown here, here, and of course, here, that
is actually distributed according to a distribution (pronounced “chi-squared”, where “chi” rhymes with “sky”). As a result, we can write the confidence interval exactly as(34)#where
is the percentile of the distribution with degrees of freedom. In Python, you can calculate this usingscipy.stats.chi2.ppf(p, df=k)
.Use this formula to calculate confidence intervals for
for both sets of intervals, and . Discuss these intervals. Are they reasonable? Is in the respective interval? Importantly, how do they compare to the approximate intervals from the previous problem? Does it seem that a normal approximation is “good enough”? Are these intervals symmetric?PCS-1, MVD, NQP-1 Rather than leaving the last question vague and qualitative, let’s be precise and explore how good of an approximation Equation (33) provides compared to the exact answer in Equation (34). To do this, randomly select
intervals from either list, or , calculate and the exact and approximate confidence intervals. Using an appropriate set of values for , generate figures that show both how the size of the confidence interval depends on and how the approximate and exact intervals differ from each other. Can you determine how many samples are needed before the intervals are indistinguishable?
1.2: Visualizing Confidence Intervals#
To further unpack the differences in the confidence interval formulas, let’s try and visualize the problem more directly.
PCS-1 Randomly draw
intervals from one of the lists ( or ) and estimate the rate parameter (calculate a point estimate and and interval!). Save this estimate aslHat0
andlConfInt0
. Use the exact confidence interval formula in Equation (34).PCS-1,2, TS-1 Repeat the previous problem
times, where is a large number of your choosing. Show the distribution of estimates, and describe its shape. Where is it centered? On\lHat0
? On the value calculated in 1.a?PCS, TS-1 Equation (33) suggested that
is normally distributed. Overlay a plot of a normal distribution with mean, , and standard deviation , where these can be calculated using all the intervals. Qualitatively discuss how well the distribution of estimates match this curve.
4.PCS-2, TS-1 The exact confidence interval (Equation (34)) used the fact that scipy.stats.chi2.pdf(xCoords, df=2*N, scale=1/(2*N*tauBar))
to overlay a plot of this distribution. Qualitatively discuss how well the distribution of estimates match this curve.
TS-1 , MVD Indicate where
lHat0
andlConfInt0
fall on the distribution. Determine the fraction of the estimates that fall inlConfInt0
. Discuss this fraction - does it make sense to you? (If needed, run your code several times to convince yourself!)TS-1 , MVD Recreate your figure using (empirical) CDFs of the estimates and the theoretical curves. Show the
and percentiles of the estimates, the median of the estimates,lHat0
andlConfInt0
. Comment on any new insights from this figure.
1.3: Maximum A Posteriori Estimation and Credible Intervals#
We’ll now explore how you can make MAP estimates of the rate parameter,
In particular, it can be shown that if we use a gamma distribution to describe our prior expectation for
A gamma distribution has the form
where
For exponentially distributed random variables,
Putting the gamma prior and likelihood function into Bayes’ Theorem, we can rearrange and normalize to find that the posterior function for
where
So if we then choose different scipy.stats.gamma(alpha, scale=1/beta)
to manipulate the posterior in Python. You will explore this process on your own in the following sub-problems.
MVD Set
and and plot the prior distributions corresponding to these parameters. Describe these curves qualitatively.PCS-1, TS-2, MVD Using
, , and then , , use Equation (35) to plot posterior PDFs and CDFs for and (using the two lists and , respectively). Calculate and annotate the mean, median, and mode on these plots. Compare these values with the MLEs from earlier in the problem.MVD, NQP-1 The mean of a gamma distribution can be calculated as the ratio of the shape to the rate parameters (
), or in the case of our exponentially distributed data and a gamma prior, we get that the posterior mean isAs in 1.1.4, show how the mean changes as you change the number of samples. Assess whether the choice of prior impacts the mean. Annotate the MLE
. Do the MLE and posterior mean agree?MVD, NQP-1 Using
scipy.stats.gamma.ppf
, annotate the median, , and percentiles. How many data points do you need to add before the MLE is inside the credible interval specified by these percentiles?PCS-2, MVD, NQP-1 Use bootstrapping to estimate the variance of each of the sets of intervals
and . You should report a point estimate and a confidence interval. You are free to choose how many bootstrap resamplings to do, how to make the point estimate, and what confidence level you want to report, but you should clearly indicate your decisions.
Problem 2: Linear Regression 1#
In this problem, you’re going to explore how changing the “noise” in your data impacts your ability to do linear regression. In this problem, you’ll practice using linear regression routines and parameter estimation methods to make estimates of regression coefficients. You will also explore how using synthetic data allows you to compare outcomes to ground truth.
2.1: Generate Data#
MVD Generate two sets of
samples, and , where is generated from using the formula . In the first set, use (the noise is normally distributed with a *variance} of 2), and in the second set use (the noise is uniformly distributed from to ). Plot these two datasets.PCS-2, TS-2, MVD Use linear regression methods (see Worksheet 2.2) to fit the model
to both your synthetic data sets. Compare the regression coefficients and plot the fitted model over the data.
TS-2, MVD, NQP-2 Calculate the residuals of the data
for each of the models. Plot the distribution of these residuals. Are they normally distributed? Considering your plots of the raw data before you did any fitting, could you have determined “by eye” whether the data had normally distributed noise or not?
2.2: Estimating Regression Coefficients#
Select
PCS-1,3 We won’t discuss the theory here, but just as we learned that exponential rate parameters are
-distributed, it can be shown that regression coefficients are -distributed, so that the formula for their confidence interval is given bywhere
In these formulas,
is the mean of , and is the model prediction and is the \ts{th} percentile of the Student’s -distribution with degrees of freedom (scipy.stats.t.ppf(p, nu)
).Calculate the 95% confidence interval for your estimates
from both data sets using this formula.PCS-1,3 Alternately, we can pose the problem in a Bayesian way. Use the priors in the notes for
with and . Use . Then use Equation 29 from the notes andscipy.stats.norm.ppf
to calculate 95% credible intervals for from both data sets.PCS-2,3 Bootstrap your
data points from each synthetic data set many times. Use your preferred linear regression method to generate an estimate from each bootstrapped sample. Report the 95% confidence interval from the distribution of bootstrapped estimates.MVD, NQP-2 Illustrate these intervals with a relevant figure. Discuss the similarities and differences in these intervals. Is
in any of these intervals?
2.3:#
PCS, TS, MVD, NQP Choose one of the three methods for generating intervals of confidence and make a figure that demonstrates how this interval depends on the amount of data. Use either synthetic data set. Indicate approximately how many data points you need before you can be “sure” that
Problem 3: Linear Regression II (BONUS)#
(This problem is not required for completion of the assignment. Do not stop reading the assignment here! There is another problem!)
In this problem you’re going to deal with real data! You’re going to move past generating intervals of confidence and instead think about a model’s predictive ability.
3.1:#
MVD, NQP-2 BONUS - The Data: Consider the dataset here. As described here, the data consist of some measurements of abalones, which are a type of ocean mollusk. Explore the data set and describe the measurements. We’re going to use the variable rings
as our response variable; pick two of the continuous measurements (not sex
or rings
) and standardize your response variable,
3.2: Testing and Training Sets#
PCS, TS BONUS: Break your data into two sets, a training set containing 90% of the data and a test set containing the rest. Fit the model
rings
using OLS to the training data. Give estimates and confidence intervals for the parameters (therings
column is a proxy for the age of the abalone). If you’re using a specific package or formula to make these calculations, indicate it. Using 10 distinct training-test splits, compute the average prediction error of your model on the testing data.MVD, NQP-2 BONUS: Make a plot of
vs and a plot of vs . Indicate training data in one color and testing data in another using one of your training-test splits from the previous problem. Plot some best-fit lines on these figures. (Since the model has two parameters, the fitting isn’t of a line, but of a plane, so on the vs plot, you’ll have to be creative!)MVD, NQP-2 BONUS: For one of the training-testing splits, show the distribution of residuals from the training data and the distribution of residuals from the testing data. Are they similar in shape?
3.3: Where is the error?#
MVD BONUS: Make a scatter plot of
vs and overlay a grid that covers your data.PCS-3, TS-2 BONUS: Calculate the average prediction error of the fitted model on the test points that lie in each of the 100 grid cells using 10-fold cross-validation. If there are no test points in the cell for a particular training-test split, don’t incorporate that split into the average prediction error.
MVD, NQP-2 BONUS: Display these prediction errors with an appropriate figure. Describe where in
- space the model seems to perform better or worse.
3.4 Bootstrapping#
PCS-2 BONUS: Generate many bootstrapped estimates of the regression coefficients,
and . Only use the training data!PCS-2, MVD BONUS: Plot the distributions of
and on separate plots. Show your confidence/credible interval from earlier in the problem as well as the and percentiles of the bootstrapped distributions.TS, MVD, NQP-2 BONUS: Make a scatter plot of the bootstrapped
vs . Overlay the confidence/credible intervals and the and percentiles as vertical and horizontal lines. Based on this figure, does it seem that and can be fit independently? How would this figure look if you could?
Problem 4: Fitting a Thermodynamic Model:#
As described in the course notes, in this 2011 paper by Garcia and Phillips, the authors developed a model based on the principles of thermodynamics in order to measure the number of LacI repressors present in different strains of E. coli, as well as the binding energy of those repressors. In Figure S8 of their paper, they show how given estimates for the number of repressors,
to generate an estimate for the binding energy of those repressors,
You may or may not be surprised to learn that a common experimental technique for measuring cellular activity is to take pictures of cells at specific wavelengths of light. This measures activity because the cells have been modified so that specific proteins fluoresce at specific wavelengths. What this looks like then is the images in this folder. In each of these images, a specific strain of bacteria has been imaged at a couple wavelengths (a yellow one and a red one), and the idea is that the amount of light (number of photons) measured at those wavelengths directly corresponds to how much of a desired protein (LacI) exists in the cells or their environment.
4.1:#
MVD The immediate trouble then is that bacteria don’t sort themselves nicely into grids or patterns, so we have to either manually indicate where bacteria are in the images or help the computer use the images to detect the bacteria. This might not seem worth it for this data set, but modern experiments regularly have hundreds or thousands of images, so automating this process is necessary! (It’s also more consistent!) To help you see what this process looks like, we have prepared a Python module for processing these specific images.
Download the module into the same location that you put the data (and that you are running your notebook). Use the assistance notebook here to learn how to extract the YFP values from the images. Describe the workflow of going from images to fold-change in YFP fluorescence in that notebook in your own words. Use an appropriate figure to show the distribution of YFP fluorescence in each of the 7 strains. Predict which non-control strain (not Auto
or Delta
) contains the most repressors and which contains the least.
4.2.#
PCS-3, TS, MVD, NQP-2 Using the direct measurements of the repressor copy number,
Describe and interpret this result. Does the curve increase or decrease? Does it have an interesting shape? Why might this be consistent or inconsistent with what you know about the underlying biology? Finally, comment on why OLS might not provide the best estimates for the “regression coefficient,”
4.3: (BONUS)#
PCS-3, TS, MVD, NQP-2 BONUS: You may have noted that the model in Equation (36) (Equation 5 in the paper) is non-linear! However, we can sometimes apply transformations or approximations to make things look linear. In particular, it is often useful to fit the log-transformed model, as the logarithm “smooths” out large changes in our responses.
Fit the log-transformed model using OLS or Bayesian Regression and compare to the earlier results. Plot both results on a figure with log-scaled axes.
4.4: (BONUS)#
TS-2 NQP-2 BONUS: In this problem, it might not seem that bootstrapping is a useful tool, since we only have 5 data points. Thinking especially about how we generated