## This homework uses R to complete coding exercises on linear regression and design matrices

STAT 3701 Homework 5

Show all work. Submit your solutions in a pdf document on Canvas. Include your R code (which must be commented and properly indented) in the pdf file. Also submit one text file with all your R code (comments and all) clearly labeled with the problem it goes with. This must be properly indented. Before every solution with random sampling use set.seed(3701).

Question 1 (10 points)

Consider a linear regression with two explanatory variables {age, treatmentType}, where age is numerical and treatmentType is categorical with three levels {A, B, C}. The response is an exam score. The design matrix is generated in the same way as in Section 2.2 of the notes Regression Part 2, except that there are no interaction terms in this case. More specifically, we have

• n = 30 subjects.

• The first third received treatment A, the second third treatment B, and the last third treatment C.

• Age is integer-valued and is uniformly distributed over 18 to 35.

• The true regression coefficient vector is β = (50, 0, 10, 0), so that age is not relevant.

• The random errors are iid N(0, 5²).

The model can be written as

Y = Xβ + ϵ,

where X is an n × 4 design matrix whose first column is all ones, whose second column holds the observations of age, whose third column is the dummy variable for level B of treatmentType, and whose last column is the dummy variable for level C of treatmentType.
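The setup above can be sketched in R as follows. This is only an illustration under stated assumptions: Section 2.2 of the notes is not reproduced here, so the use of `model.matrix` with treatment A as the baseline level, and the variable names `age`, `treatmentType`, `X`, `beta`, `sigma`, and `y`, are choices made for this sketch rather than the notes' own code.

```r
set.seed(3701)

n <- 30
# First third gets treatment A, second third B, last third C.
treatmentType <- factor(rep(c("A", "B", "C"), each = n / 3))
# Integer ages drawn uniformly from 18, 19, ..., 35.
age <- sample(18:35, size = n, replace = TRUE)

# Columns: intercept, age, dummy for B, dummy for C (A is the baseline).
X <- model.matrix(~ age + treatmentType)

beta <- c(50, 0, 10, 0)
sigma <- 5  # error standard deviation, so the error variance is 5^2
y <- as.vector(X %*% beta + rnorm(n, mean = 0, sd = sigma))

head(X)
```

Building the matrix with `model.matrix` keeps the column order (intercept, age, B dummy, C dummy) aligned with the order of the entries of β.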

We are interested in testing whether age is associated with the response, i.e., if we let the regression coefficient of age be β2, we want to test the hypothesis

H0 : β2 = 0

Ha : β2 ≠ 0.

We’ve talked about two ways to conduct this test: the t-test and the F-test. In this question, we are interested in comparing those two tests.
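For a single realization, the two tests might be computed as in the sketch below. It assumes the t-test is the usual coefficient test reported by `summary(lm(...))` and that the F-test compares the full model against the reduced model without age; the object names (`full`, `reduced`, `t.stat`, `f.stat`, `p.t`, `p.f`) are illustrative and not taken from the notes.

```r
set.seed(3701)
n <- 30
treatmentType <- factor(rep(c("A", "B", "C"), each = n / 3))
age <- sample(18:35, size = n, replace = TRUE)
# beta = (50, 0, 10, 0): only the intercept and the B dummy contribute.
y <- 50 + 10 * (treatmentType == "B") + rnorm(n, sd = 5)

full <- lm(y ~ age + treatmentType)  # model under Ha
reduced <- lm(y ~ treatmentType)     # model under H0 (age dropped)

# t-test: estimate / standard error, compared with a t distribution on n - 4 df.
t.stat <- coef(summary(full))["age", "t value"]
p.t <- 2 * pt(abs(t.stat), df = full$df.residual, lower.tail = FALSE)

# F-test: scaled drop in residual sum of squares, compared with F(1, n - 4).
rss.full <- sum(resid(full)^2)
rss.reduced <- sum(resid(reduced)^2)
f.stat <- (rss.reduced - rss.full) / (rss.full / full$df.residual)
p.f <- pf(f.stat, df1 = 1, df2 = full$df.residual, lower.tail = FALSE)

c(t.stat = t.stat, f.stat = f.stat, p.t = p.t, p.f = p.f)
```

Printing the statistics and p-values side by side makes it easy to compare the two tests on the same data before scaling up to many realizations.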

(a) (5 points) Describe how you will test the hypotheses above using the t-test and the F-test. For each test, write down the test statistic, the distribution of the test statistic under H0, and how the p-value is calculated.

(b) (5 points) Now we will use simulation to compare those two tests. Set reps = 1000 and significance level α = 0.05. We will generate reps realizations of data. For each realization, we will test the above hypotheses using both the t-test and the F-test and record whether H0 is rejected by each test.

In how many realizations do the two tests give different conclusions, i.e., only one of the tests rejects H0?

What do you conclude about the two tests?

Question 2 (15 points)

We may be interested in testing if linear combinations of the regression coefficients are equal to zero. The code currently in the notes only accounts for cases when multiple regression coefficients are equal to zero.

For example, we may be interested in the two-sided hypothesis test

H0 : β2 + β3 = 0

Ha : β2 + β3 ≠ 0

In this problem you will write code to handle such a hypothesis test.

(a) (5 points) Using the formulas from Section 1.2 of the notes Regression Part 2, write a function called gen.pvals.linear.combination that simulates hypothesis tests and outputs the list of observed p-values. Let the errors be distributed N(0, σ²). The function should take as inputs:

• X, the design matrix

• beta, the true regression coefficients

• sigma, the true standard deviation

• C, the matrix defining the linear combinations

• reps, the number of independent replications

The function should output pval.list, a list of realizations of p-values.
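One possible shape for such a function is sketched below. The formulas of Section 1.2 are not reproduced in this document, so the sketch assumes the standard F statistic for the general linear hypothesis H0 : Cβ = 0, namely F = (Cβ̂)′[C(X′X)⁻¹C′]⁻¹(Cβ̂) / (q σ̂²) with q = number of rows of C, compared with an F(q, n − p) distribution. It also assumes that a numeric vector is an acceptable form for pval.list.

```r
gen.pvals.linear.combination <- function(X, beta, sigma, C, reps) {
  n <- nrow(X)
  p <- ncol(X)
  C <- matrix(C, ncol = p)          # accept a plain vector as a one-row C
  q <- nrow(C)                      # number of linear constraints tested
  XtX.inv <- solve(crossprod(X))    # (X'X)^{-1}
  # Middle matrix of the quadratic form: [C (X'X)^{-1} C']^{-1}
  M <- solve(C %*% XtX.inv %*% t(C))
  mean.y <- as.vector(X %*% beta)
  pval.list <- numeric(reps)
  for (r in seq_len(reps)) {
    y <- mean.y + rnorm(n, mean = 0, sd = sigma)
    beta.hat <- XtX.inv %*% crossprod(X, y)   # least-squares estimate
    rss <- sum((y - X %*% beta.hat)^2)        # residual sum of squares
    c.beta <- C %*% beta.hat
    f.stat <- as.numeric(t(c.beta) %*% M %*% c.beta) / q / (rss / (n - p))
    pval.list[r] <- pf(f.stat, df1 = q, df2 = n - p, lower.tail = FALSE)
  }
  pval.list
}

# Small usage example with a stand-in design matrix (not generate.X).
set.seed(3701)
X <- cbind(1, matrix(rnorm(20 * 3), nrow = 20))
C <- matrix(c(0, 1, 1, 0), nrow = 1)
pvals <- gen.pvals.linear.combination(X, beta = c(10, 1, -1, 0),
                                      sigma = 0.5, C = C, reps = 200)
mean(pvals <= 0.05)  # proportion of rejections at alpha = 0.05
```

Precomputing (X′X)⁻¹ and the middle matrix outside the loop avoids repeating the same linear algebra in every replication, since X and C do not change.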

(b) (5 points) Generate a design matrix using the generate.X function defined on page 10 of the notes of Regression Part 2. Use n = 20, mu = 10, σX = 1 and ρ = 0.8. Use your function from part (a) to simulate p-values for the hypothesis test

H0 : β2 + β3 = 0

Ha : β2 + β3 ≠ 0

Use β = (10, 1, −1, 0), σ = 0.5 and reps = 5000. Your C matrix should have one row and four columns. Use these realizations of p-values to give a 95% score CI for the Type I error probability of the test when α = 0.05.
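A score interval for a rejection probability could be computed as in the sketch below. It assumes that "score CI" refers to the Wilson score interval for a binomial proportion; the helper name `score.ci` is hypothetical. The p-values in the usage lines are uniform draws used purely as a stand-in for the output of a simulation under H0.

```r
score.ci <- function(x, n, conf.level = 0.95) {
  # x = number of rejections, n = number of replications
  z <- qnorm(1 - (1 - conf.level) / 2)
  p.hat <- x / n
  denom <- 1 + z^2 / n
  center <- (p.hat + z^2 / (2 * n)) / denom
  half <- z * sqrt(p.hat * (1 - p.hat) / n + z^2 / (4 * n^2)) / denom
  c(lower = center - half, upper = center + half)
}

# Stand-in p-values (uniform on [0, 1]); replace with simulated p-values.
set.seed(3701)
pvals <- runif(5000)
rejections <- sum(pvals <= 0.05)
score.ci(rejections, length(pvals))

# Cross-check: prop.test without continuity correction gives the same interval.
prop.test(rejections, length(pvals), correct = FALSE)$conf.int
```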

(c) (5 points) Using the same design matrix from part (b), use your function to simulate p-values for the hypothesis test

H0 : β2 = β3 = β4

Ha : β2, β3, β4 are not all equal

Use β = (10, 1, 1, 0), σ = 0.5 and reps = 5000. Your C matrix should have two rows and four columns. Use these realizations of p-values to give a 95% score CI for the power of the test when α = 0.05.
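One way to encode this hypothesis is to rewrite it as two contrasts, β2 − β3 = 0 and β3 − β4 = 0. The sketch below shows such a matrix; other choices of two linearly independent contrasts (for example β2 − β4 in place of one row) describe the same null hypothesis, so this is one valid option rather than the required answer.

```r
# Each row encodes one contrast that equals zero under H0.
C <- rbind(c(0, 1, -1, 0),   # beta2 - beta3 = 0
           c(0, 0, 1, -1))   # beta3 - beta4 = 0

beta <- c(10, 1, 1, 0)
C %*% beta  # a nonzero entry means H0 is false for this beta
```

Evaluating Cβ at the true β is a quick check of whether the simulation is estimating a Type I error probability (all entries zero) or power (some entry nonzero).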

Question 3 (25 points)

In this question, we will compare AIC and BIC as the degree of multicollinearity, the standard deviation of the random errors, and the sample size vary.

We will use the generate.X function defined on page 10 of the notes of Regression Part 2 to generate the design matrix X.

(a) (5 points) We know generate.X will return a matrix whose first column stands for the intercept, second for X1, third for X2 and last for X3. List all the eligible subset models.
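The candidate models can also be enumerated programmatically. The sketch below assumes that "eligible" means every model keeps the intercept and includes any subset of X1, X2, X3; if the notes define eligibility differently, the enumeration would change accordingly.

```r
predictors <- c("X1", "X2", "X3")
# One row per model; TRUE means the predictor is included.
inclusion <- expand.grid(X1 = c(FALSE, TRUE),
                         X2 = c(FALSE, TRUE),
                         X3 = c(FALSE, TRUE))

# Turn each inclusion pattern into a model formula string.
formulas <- apply(inclusion, 1, function(keep) {
  terms <- predictors[keep]
  if (length(terms) == 0) "y ~ 1" else paste("y ~", paste(terms, collapse = " + "))
})
formulas
```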

(b) (6 points) Now let ρ ∈ {0.25, 0.5, 0.98}. Let σX = 1, n = 50, µ = 10. For each ρ, generate the design matrix with generate.X. Then use reps=1000 realizations of data to estimate (1) the probability that AIC chooses the true model and (2) the probability that BIC chooses the true model. In each realization, use β = (1, 1, 0, 1)′ and use N(0, 4) to generate the random errors. Create a 95% score CI for each of those two probabilities.
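A single realization of the model-selection step might look like the sketch below. Several pieces are assumptions: `generate.X.sketch` is a hypothetical stand-in for generate.X (the notes' version is not reproduced here) that draws three normal predictors with mean mu, standard deviation sigma.x, and common pairwise correlation rho; N(0, 4) is read as variance 4, i.e. standard deviation 2; and AIC and BIC are taken from R's built-in `AIC()` and `BIC()`, which may differ from the notes' definitions by additive constants that do not affect the chosen model.

```r
# Hypothetical stand-in for generate.X: intercept plus three correlated predictors.
generate.X.sketch <- function(n, mu, sigma.x, rho) {
  Sigma <- sigma.x^2 * ((1 - rho) * diag(3) + rho * matrix(1, 3, 3))
  Z <- matrix(rnorm(n * 3), nrow = n)
  preds <- mu + Z %*% chol(Sigma)   # rows have covariance Sigma
  colnames(preds) <- c("X1", "X2", "X3")
  cbind(Intercept = 1, preds)
}

set.seed(3701)
X <- generate.X.sketch(n = 50, mu = 10, sigma.x = 1, rho = 0.5)
beta <- c(1, 1, 0, 1)               # true model uses X1 and X3 only
y <- as.vector(X %*% beta + rnorm(50, mean = 0, sd = 2))

dat <- data.frame(y = y, X[, -1])
subsets <- expand.grid(X1 = c(FALSE, TRUE),
                       X2 = c(FALSE, TRUE),
                       X3 = c(FALSE, TRUE))

# Fit every subset model (intercept always included).
fits <- lapply(seq_len(nrow(subsets)), function(i) {
  terms <- names(subsets)[unlist(subsets[i, ])]
  rhs <- if (length(terms) == 0) "1" else paste(terms, collapse = " + ")
  lm(as.formula(paste("y ~", rhs)), data = dat)
})
aic <- sapply(fits, AIC)
bic <- sapply(fits, BIC)

# Did each criterion pick the true subset {X1, X3} in this realization?
is.true <- subsets$X1 & !subsets$X2 & subsets$X3
c(AIC.picks.true = is.true[which.min(aic)],
  BIC.picks.true = is.true[which.min(bic)])
```

Repeating the response generation and selection steps reps times, while holding X fixed, yields the counts needed for the score intervals.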

(c) (7 points) Now let n ∈ {10, 20, . . . , 100}. Let σX = 1, ρ = 0.5, µ = 10. For each n, generate the design matrix with generate.X. Then use reps=1000 realizations of data to estimate (1) the probability that AIC chooses the true model and (2) the probability that BIC chooses the true model.

In each realization, use β = (1, 1, 0, 1)′ and use N(0, 4) to generate the random errors. Create a 95% score CI for each of those two probabilities. Create a plot of the estimated probability against n for each information criterion and include the CIs in the plot.
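A plot of estimates with interval bars can be drawn in base R along the lines of the sketch below. The helper name `draw.prob.ci` and its arguments are hypothetical, and the numbers passed to it are placeholders chosen only to exercise the function; they are not simulation results.

```r
draw.prob.ci <- function(x, est, lower, upper, xlab, col = "black", add = FALSE) {
  if (!add) {
    plot(x, est, type = "b", pch = 19, col = col, ylim = c(0, 1),
         xlab = xlab, ylab = "Estimated P(select true model)")
  } else {
    lines(x, est, type = "b", pch = 19, col = col)
  }
  # Vertical bars with flat caps mark each confidence interval.
  arrows(x, lower, x, upper, angle = 90, code = 3, length = 0.03, col = col)
  invisible(NULL)
}

# Placeholder inputs only, not simulation output.
n.grid <- seq(10, 100, by = 10)
est <- rep(0.5, length(n.grid))
draw.prob.ci(n.grid, est, lower = est - 0.03, upper = est + 0.03, xlab = "n")
```

Calling the helper a second time with `add = TRUE` and a different `col` overlays the second information criterion on the same axes, and `legend()` can then label the two curves.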

(d) (7 points) Now let σ ∈ {1, 1.2, 1.4, . . . , 4}. Let σX = 1, ρ = 0.5, n = 50, µ = 10. Generate the design matrix with generate.X. For each σ, use reps=1000 realizations of data to estimate (1) the probability that AIC chooses the true model and (2) the probability that BIC chooses the true model.

In each realization, use β = (1, 1, 0, 1)′ and use N(0, σ²) to generate the random errors. Create a 95% score CI for each of those two probabilities. Create a plot of the estimated probability against σ for each information criterion and include the CIs in the plot.