## 这个作业是用R语言完成模型选择和偏差方差折衷有关的统计

R Midterm 2019

Name:

Please complete this take-home midterm in this R Markdown file. When finished, please rename it “RMidterm-LASTNAME-FIRSTNAME.Rmd” and submit both the R Markdown file and the HTML rendering to me via email. And don’t forget to put your name above within the document. It is due by 11:59pm on Sunday December 15, 2019. Do your own work.

Everything in this exam is to be done in the base R installation. Do not import any libraries.
A Learning Exercise

We will consider basic notions of statistical learning related to model selection and the bias-variance trade-off. Our general setting is that we observe a quantitative response Y
and p different predictors, X1,X2,…,Xp. We assume there is some true relationship between Y and X=(X1,X2,…,Xp) described by a fixed but unknown function f:

Y=f(X)+ϵ.

Here, ϵ
is a random error term that is independent of X and has mean zero. A familiar example might be linear regression, in which f(X)=a+∑pi=1bixi.

In this context, a learning algorithm is a method for finding an estimate fˆ
of f. A typical application is prediction, in which a set of inputs X is available but the output Y is not available. Because the error term has mean zero, our estimate fˆ of f can be used to predict Y with Yˆ, where

Yˆ=fˆ(X).

The error of Yˆ as a prediction for Y can be assessed using a quantity known as the mean squared error (MSE). Consider a given estimate fˆ and a fixed set of predictors X, which together yields a prediction Yˆ=fˆ(X). Then,
E(Y−Yˆ)2=E[f(X)+ϵ−fˆ(X)]2=[f(X)−fˆ(X)]2reducible+Var(ϵ)irreducible

The first term is called reducible because we can potentially improve the accuracy of fˆ
by using a more appropriate learning algorithm. After all, if f^=f, then the reducible error is zero. But even if we had the correct function f, our predictions Yˆ=f(X) would still have error because of ϵ. The error involved with ϵ is called irreducible because we cannot reduce or eliminate it no matter how well we estimate f.

The training data is the set of {(yi,xi)}
used to fit the model, i.e. come up with the estimate fˆ. The MSE on the training data can be used to fit the model (some algorithms explicitly minimize MSE) or to assess its fit. But a low MSE on the set of training data is not necessarily what we are interested in. Rather, in a prediction context, we are interested in the accuracy of the predictions that we obtain when we apply our algorithm to test data, which is data assumed to be generated the same way as the training data but that was not used to train the model. We want to choose a method that gives the lowest test MSE, as opposed to the lowest training MSE.

If no test observations are available, then one might simply select the method that minimizes the training MSE. Unfortunately, there is no guarantee that the method with the lowest training MSE will also have the lowest test MSE. Many statistical methods specifically estimate coefficients to minimize the training set MSE. For these methods, the training set MSE could be quite small (as a result of overfitting, i.e. fitting the errors too well), while the test MSE could be much bigger. This is what we will explore in this exercise.
Exercise 1

In our example the true function f

is nonlinear and is given by

f(x)=5−0.002×2+0.2x−7sin(πx/50).

The noise ϵ
is modeled as a zero-mean normal random variable with standard deviation σ.

Write an R function G that takes two arguments, an arbitrary vector x and a numeric keyword argument eps_sd that is by default set to 4, and returns a vector with elements given by f(xi)+ϵi, where xi is an element of x and ϵi is normally distributed with mean zero and standard deviation equal to eps_sd. The ϵ=(ϵi) should be independent.

Use your definition of G to graph f

for values of x between 1 and 100, inclusive. (Hint: set eps_sd=0 to remove ϵ and isolate f.) Please graph as a curve/line, not as a scatterplot. The graph you find is the true f that we will be trying to estimate with f^ using three learning methods. In the real world, of course, you don’t know f!

Exercise 2

Next, we will train three simple learning models on our data and plot their fits. First, we will fit a linear model, and then we will fit two smoothing splines with different smoothing parameters. You can read about smoothing splines here and here. Note that linear regression is a special case of a smoothing spline.

First we form the training set (ytr,xtr)

that we will use to estimate the models. For reproducibility, set the seed using set.seed(1234). Next, use the sample() function to randomly sample N=70

integers between 1 and 100 (without replacement); put this into the vector x_tr. Then set y_tr to be G applied to x_tr (keep the default eps_sd=4). Finally, create a scatter plot of x_tr and y_tr. Note that it’s not immediately obvious that the data are generated from a nonlinear model, though you of course know that it is.

Fit a linear model to the training data using lm(), storing the result in lmfit. Create a scatterplot of the training data (again), and use abline(lmfit, …) to add the estimated regression line of best fit to the plot. Make the line orange and its line width parameter equal to 2. Comment on the fit of the linear model. Are the errors approximately homoskedastic?

Next, we will estimate a smoothing spline with a high number of degrees of freedom df. The degrees of freedom df and the smoothing parameter λ

from the Wikipedia article are inversely related (the exact relationship is not important for our purposes). For example, a linear model corresponds to a smoothing spline with degrees of freedom df=2 and a smoothing parameter λ=∞

(there is an infinite penalty to any curvature, forcing a line). Here, we will compare the line from a linear regression to a smoothing spline with df=20, ten times the degrees of freedom of the line. Use smooth.spline() to fit a smoothing spline with degrees of freedom df=20, storing the result in ss20. (Recall that one can type ?smooth.spline to get the help documentation.) Create a scatterplot of the training data (again), and use lines(ss20, …) to add the estimated fitted spline to the plot. Make the fitted curve blue and its line width parameter equal to 2. Comment on the fit of the smoothing spline with df=20. Are the errors approximately homoskedastic?

Finally, we will fit a smoothing spline where the degrees of freedom (equivalently, the smoothing parameter λ) is chosen using leave-one-out cross validation. Cross validation is an essential technique in statistical learning. In R, leave-one-out cross validation can be used in a smoothing spline by setting cv=TRUE. Estimate such a smoothing spline, storing the result in sscv. Create a scatterplot of the training data (again), and use lines(sscv, …) to add the estimated fitted spline to the plot. Make the fitted curve red and its line width parameter equal to 2. Print the degrees of freedom associated with the spline by accessing one of the list elements in sscv. Comment on the fit of the smoothing spline with this lower number of degrees of freedom. Are the errors approximately homoskedastic?

On the same set of axes, plot the true f

from 1(B) (in black), the estimated regression line (in orange) from 2(B), and the two smoothing splines from (C) and (D) in blue and red, respectively. Set the line width parameter to 2 for all the lines/curves. Ensure that all lines/curves are visible by adjusting the axes limits if necessary. Based on this graph, discuss your views about which of the three learning methods we’ve tried produce the best estimate f^ of f.

Exercise 3

Next we compute the training mean squared error of our 3 models. Recall that if yi
denotes a y-value from our training set and y^i

denotes the corresponding predicted value from a learning model, then the training MSE is given by MSEtr=1N∑i=1N(yi−y^i)2.

The predicted/fitted values for the linear model lmfit can be found in the vector lmfit\$fitted.values. Compute the training MSE for the estimated linear model lmfit.

The predicted values for a smoothing spline s can be found in the vector predict(s\$fit, x_tr)\$y. Compute the training MSE for the estimated smoothing splines ss20 and sscv.

How do your MSE measures calculated in (A) and (B) compare? If you were selecting a model based solely on the training MSE, then which would you select? What is the irreducible error in this model, i.e. what is Var(ϵ)

? Is the training MSE for some models below the irreducible error? If so, what do you make of this?

Exercise 4

Now we take advantage of the fact that we know the true data-generating process, i.e. the true f
and the true Var(ϵ), to compute the true test MSE as we vary the degrees of freedom of the smoothing splines and compare it to the associated training MSE. Minimizing the test MSE over the degrees of freedom represents the optimal model within the class of smoothing splines.

Given that the smoothing splines are non-parametric in nature, we cannot derive the test MSE analytically despite knowing the data generating process. Instead, we will essentially perform a Monte Carlo analysis to achieve precise results. To do this we create a very large test sample.

Create x_te as a vector of size 10,000 of random sample of integers (with replacement) between 1 and 100. Then let y_te be G applied to x_te (keep esp_sd at its default value). The vectors x_te and y_te comprise the test sample.

Next, we will create two scatterplots next to each other. Save the default plot parameters in par() as opar and create a 1×2-grid using mfrow(). On the left panel, create a normal scatterplot of the test data with plot(x,y), and on the right panel use smoothScatter() to create a smooth scatterplot of the test data The smooth scatterplot colors areas based on the density of points, with darker colors corresponding to higher density. This helps us see that around each point we have a normal distribution with standard deviation of 2. After you create your plots, be sure to set the paramters par() back to opar.

Next we will compute the training MSE (on our training sample x_tr and y_tr) and test MSE (on the test sample x_te and y_te) for a sequence of smoothing splines with degrees of freedom df ranging from 2 to 25. We will use a for loop. First, use seq() to define a vector dof starting at 2 and ending at 25 in increments of 0.5. Instantiate mse_tr and mse_te as empty vectors to store our the training and test MSEs, respectively. Then create a for loop in which, for each degrees of freedom df iterate, you (1) estimate a smoothing spline with df degrees of freedom using the training data set, and (2) compute the training and test MSEs, storing them in mse_tr and mse_te. In the end, you will have populated the three vectors dof, mse_tr and mse_te.

On the same set of axes, plot the training and test MSEs from mse_tr and mse_te as a function of the degrees of freedom dof. Plot them as curves/lines (not scatterplots) with line width parameters equal to 2, the training MSE in grey, and test MSE in red. Use abline() to plot a dashed horizontal line corresponding to the irreducible error, the variance of ϵ. Use points() to plot the (x,y)-coordinates for the degrees of freedom and training and test MSEs corresponding to the three learning models in Exercise 2: linear regression (df=2) in orange, a smoothing spline with 20 degrees of freedom in blue, and a smoothing spline estimated using cross validation in red (the degrees of freedom are also estimated, accessible in sscv\$df; you might have to round to the nearest value in dof). Make the points filled-in circles by adjusting the parameters as appropriate. Ensure that all points and curves are visible by adjusting the axes limits if necessary.

Recall that the objective of the learning problem is to minimize the test MSE. Which method should you select based on this criterion? Which method would you select if your objective was to minimize the training MSE? Comment on the value of cross validation in this setting. 