Statistical methods timed test (UK)

 

SECTION A

1. Consider the linear regression model $y_i = x_i^T \beta + \epsilon_i$, $i = 1, \ldots, n$, with $\beta \in \mathbb{R}^p$ and independent errors $\epsilon_i \sim N(0, \sigma^2)$. Denote by $\hat{\beta}$ the least squares estimate of $\beta$. We are now given a new vector $x_0 \in \mathbb{R}^p$ of predictor values, and we are interested in the uncertainty associated with the prediction

$\hat{y}_0 = x_0^T \hat{\beta}$

of the true, unknown response value $y_0$. You can use without proof that $\mathrm{Var}(\hat{\beta}) = \sigma^2 (X^T X)^{-1}$, with $X$ denoting the design matrix.

(a) Derive an expression for the variance of prediction, that is, $\mathrm{Var}(y_0 - \hat{y}_0)$. Hence, by replacing $\sigma^2$ with its unbiased estimate $s^2$ and taking the square root of the resulting expression, give an expression for the prediction error.

(b) Write down the expression for a $(1 - \alpha)$ prediction interval for $y_0$.

(c) Explain qualitatively why, in the case $n \to \infty$, the prediction interval from part (b) can be approximated by

$x_0^T \hat{\beta} \pm z_{\alpha/2}\, s$ (1)

where $z_{\alpha/2}$ is the Gaussian quantile with right tail probability $\alpha/2$.

(d) Is the interval defined by (1) generally smaller, or generally wider, than its original counterpart from part (b)? Explain your answer carefully, and also explain your reasoning if no general statement can be made.
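As a hedged sketch of how parts (a) and (b) might be approached (not a model answer): assume the new response satisfies $y_0 = x_0^T \beta + \epsilon_0$ with $\epsilon_0 \sim N(0, \sigma^2)$ independent of the errors in the sample, and write $t_{n-p,\,\alpha/2}$ (notation introduced here, not in the question) for the quantile of the $t$ distribution with $n - p$ degrees of freedom and right tail probability $\alpha/2$.

```latex
\begin{align*}
\operatorname{Var}(y_0 - \hat{y}_0)
  &= \operatorname{Var}(y_0) + \operatorname{Var}(x_0^T \hat{\beta})
     && \text{($y_0$ is independent of $\hat{\beta}$)} \\
  &= \sigma^2 + x_0^T \operatorname{Var}(\hat{\beta})\, x_0 \\
  &= \sigma^2 \left( 1 + x_0^T (X^T X)^{-1} x_0 \right), \\[4pt]
\text{prediction error}
  &= s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}, \\[4pt]
\text{prediction interval:}\quad
  & x_0^T \hat{\beta} \pm t_{n-p,\,\alpha/2}\; s \sqrt{1 + x_0^T (X^T X)^{-1} x_0}.
\end{align*}
```

Under these assumptions, the two ingredients that separate this interval from (1) are the $t$ quantile, which exceeds $z_{\alpha/2}$ and approaches it as $n - p$ grows, and the factor $\sqrt{1 + x_0^T (X^T X)^{-1} x_0} \ge 1$, which typically approaches 1 as the entries of $X^T X$ grow with $n$.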

2. (a) Let $A$ be a factor with $a$ levels $\{1, \ldots, a\}$. We consider a regression problem of the type

$E[y \mid A] = \beta_0 + \beta_1 x_1^A + \cdots + \beta_{a-1} x_{a-1}^A$, (2)

using dummy coding $x_j^A = 1_{\{A = j\}}$, and with the last category $a$ serving as the reference category. Let $\mu_j = E(y \mid A = j)$.

Write the expectations $\mu_j$ as functions of the $\beta_j$. From this, derive expressions for the parameters $\beta_j$, $j = 0, \ldots, a - 1$, as functions of the expectations $\mu_j$, $j = 1, \ldots, a$. Interpret the result briefly.
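A short sketch of the relationship asked for, obtained by substituting the dummy variables into (2): for level $j < a$ only the $j$-th dummy equals 1, and for the reference level $a$ all dummies are 0.

```latex
\begin{align*}
\mu_j &= \beta_0 + \beta_j, \quad j = 1, \ldots, a - 1, & \mu_a &= \beta_0, \\
\beta_0 &= \mu_a, & \beta_j &= \mu_j - \mu_a, \quad j = 1, \ldots, a - 1.
\end{align*}
```

Read this way, the intercept is the expected response in the reference category, and each remaining coefficient is the difference between the expected response at level $j$ and that at the reference level.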

(b) An experiment was conducted to assess the potency of various constituents of orchard sprays in repelling honeybees. Individual cells of dry comb were filled with a measured amount of lime sulphur emulsion in sucrose solution. Seven different concentrations of lime sulphur (labelled A, B, ..., G), ranging from 1/100 to 1/1,562,500, were used, in addition to a solution containing no sulphur (labelled H). [In the notation of part (a), the factor levels $1, \ldots, a$ correspond to the concentration types A, ..., H.]

The responses for the eight different solutions were obtained by releasing 100 bees into the chamber for two hours, and then measuring the decrease in volume of the solutions in the individual cells. Eight replicates were obtained per concentration type, resulting in the following data set:

Concentration   Reduction of volume
A                 2    2    5    4    5   12    4    3
B                 8    6    4   10    7    4    8   14
C                15   84   16    9   17   29   13   19
D                57   36   22   51   28   27   20   39
E                95   51   39  114   43   47   61   55
F                90   69   87   20   71   44   57  114
G                92   71   72   24   60   77   72   80
H                69  127   72  130   81   76   81   86

For this data set, a linear model of type (2) is fitted. Give the estimates $\hat{\beta}_0$, $\hat{\beta}_1$ and $\hat{\beta}_7$. Explain your working.
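The following is a minimal sketch of one way to check the requested estimates numerically. It assumes that, for a model with a single factor and one dummy per non-reference level, the least squares fitted value in each group is the group sample mean, so the estimates follow from the group means with H as the reference category. The variable names (`reduction`, `group_means`, `beta0`, and so on) are illustrative and not part of the question.

```python
# Reduction of volume for the eight concentration types (data from the table above).
reduction = {
    "A": [2, 2, 5, 4, 5, 12, 4, 3],
    "B": [8, 6, 4, 10, 7, 4, 8, 14],
    "C": [15, 84, 16, 9, 17, 29, 13, 19],
    "D": [57, 36, 22, 51, 28, 27, 20, 39],
    "E": [95, 51, 39, 114, 43, 47, 61, 55],
    "F": [90, 69, 87, 20, 71, 44, 57, 114],
    "G": [92, 71, 72, 24, 60, 77, 72, 80],
    "H": [69, 127, 72, 130, 81, 76, 81, 86],
}

reference = "H"  # last category, used as the reference level

# Sample mean of each group (assumed to equal the fitted value for that group).
group_means = {level: sum(values) / len(values) for level, values in reduction.items()}

# Intercept: mean of the reference category.
beta0 = group_means[reference]

# Remaining coefficients: group mean minus reference mean.
betas = {
    level: group_means[level] - beta0
    for level in reduction
    if level != reference
}

beta1 = betas["A"]  # level 1 corresponds to concentration A
beta7 = betas["G"]  # level 7 corresponds to concentration G

print(beta0, beta1, beta7)
```

Working from group means avoids building the design matrix explicitly; a general least squares routine applied to the dummy-coded data should give the same coefficients.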

3. Consider a standard linear model $Y = X\beta + E$ with $p$ predictors and an intercept, with $\mathrm{E}[E] = 0$ and $\mathrm{Var}(E) = \sigma^2 I$, where $I$ is the $n \times n$ identity matrix. In what follows you may use the result

$\mathrm{Cov}(AY, BY) = A\, \mathrm{Var}(Y)\, B^T$

where $Y$ is any random vector and $A$ and $B$ are known matrices.

(a) Given that the least squares estimator of $\beta$ is $\hat{\beta} = (X^T X)^{-1} X^T Y$,

(i) show that the vector of residuals $\hat{E} = Y - X\hat{\beta}$ and the vector of fitted values $\hat{Y} = X\hat{\beta}$ can be written as $\hat{E} = (I - H)Y$ and $\hat{Y} = HY$, where $H = (h_{ij})$ is the hat matrix $X (X^T X)^{-1} X^T$;

(ii) show that $\mathrm{Var}(\hat{E}_i) = (1 - h_{ii})\sigma^2$ and $\mathrm{Cov}(\hat{E}_i, \hat{y}_i) = 0$, and discuss briefly practical uses of these results.

You may assume that $H^2 = H = H^T$.
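The algebra behind part (a) can be sketched as follows, using only the stated results $\mathrm{Var}(Y) = \sigma^2 I$, $\mathrm{Cov}(AY, BY) = A\,\mathrm{Var}(Y)\,B^T$ and $H^2 = H = H^T$. This is an outline of the steps rather than a complete answer.

```latex
\begin{align*}
\hat{Y} &= X\hat{\beta} = X (X^T X)^{-1} X^T Y = H Y, \\
\hat{E} &= Y - \hat{Y} = (I - H) Y, \\
\operatorname{Var}(\hat{E})
  &= (I - H)\,\sigma^2 I\,(I - H)^T = \sigma^2 (I - H)^2 = \sigma^2 (I - H), \\
\operatorname{Var}(\hat{E}_i) &= (1 - h_{ii})\,\sigma^2
  \quad \text{(the $i$-th diagonal entry)}, \\
\operatorname{Cov}(\hat{E}, \hat{Y})
  &= (I - H)\,\sigma^2 I\,H^T = \sigma^2 (H - H^2) = 0.
\end{align*}
```

The $i$-th diagonal entry of the last matrix gives $\mathrm{Cov}(\hat{E}_i, \hat{y}_i) = 0$.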
(b) Discuss briefly the terms “influential”, “potentially influential” (leverage), and “outlier” in the context of a standard linear regression model with $p$ predictors including an intercept, distinguishing carefully between model diagnosis and robustness of conclusions. In your discussion refer to the terms in the following expression for Cook’s distance $D_i$ (for case $i = 1, 2, \ldots, n$)

$D_i = \dfrac{r_i^2}{p} \cdot \dfrac{h_{ii}}{1 - h_{ii}}$

where $h_{ii}$ is the leverage and $r_i$ is the “internally studentised residual”.
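To make the two factors in Cook's distance concrete, here is a small sketch for simple linear regression (an intercept and one slope, so $p = 2$ in the counting used above). The data are invented for illustration, the function name `cooks_distances` is hypothetical, and the leverage formula $h_{ii} = 1/n + (x_i - \bar{x})^2 / S_{xx}$ used here applies only to this one-predictor case.

```python
from math import sqrt


def cooks_distances(x, y):
    """Return (leverages, studentised residuals, Cook's distances)
    for a simple linear regression of y on x with an intercept."""
    n = len(x)
    p = 2  # number of regression parameters: intercept and slope

    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Least squares estimates.
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar

    # Raw residuals and the unbiased variance estimate s^2 = RSS / (n - p).
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    s2 = sum(e * e for e in residuals) / (n - p)

    # Leverages: diagonal of the hat matrix for simple linear regression.
    leverages = [1 / n + (xi - x_bar) ** 2 / sxx for xi in x]

    # Internally studentised residuals r_i = e_i / (s * sqrt(1 - h_ii)).
    studentised = [e / sqrt(s2 * (1 - h)) for e, h in zip(residuals, leverages)]

    # Cook's distance D_i = (r_i^2 / p) * h_ii / (1 - h_ii).
    cooks = [r * r / p * h / (1 - h) for r, h in zip(studentised, leverages)]
    return leverages, studentised, cooks


# Invented data: the last point lies far from the others in x
# and does not follow the trend of the first five points.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 12.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 6.0]

leverages, studentised, cooks = cooks_distances(x, y)
for i, (h, r, d) in enumerate(zip(leverages, studentised, cooks), start=1):
    print(f"case {i}: leverage={h:.3f}  r={r:.3f}  D={d:.3f}")
```

In this example the last case combines high leverage with a fit that it has pulled towards itself, so its Cook's distance dominates even though its studentised residual is not the largest in absolute value. That is the distinction between an outlier (large $r_i$), a potentially influential case (large $h_{ii}$), and an influential case (large $D_i$).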