This assignment consists of short statistics exercises completed in R.

STA302/1001 – Assignment # 2

Instructions:

Assignments must be submitted electronically through Crowdmark. Each student will receive a
personalized link to view the assignment (this is where you will submit your assignment when
finished). If you do not receive this email from Crowdmark, check your spam/junk folder.
Instructions for how to upload completed assignments can be found here:
https://crowdmark.com/help/completing-and-submitting-an-assignment/. Note that only PDF, PNG or
JPG file types are accepted by Crowdmark. You will need to upload certain questions into certain
places, so make sure you are submitting pages in the right place.

Students may work in groups of no more than 2 people, with only one assignment submitted to
Crowdmark per group. When you receive your personalized link to the assignment, you may then
enter your group member's name. A shared submission link will be sent to both group members, so
you can both submit the assignment or edit submissions. Only one assignment should be submitted
per group.

The assignment is divided into four questions, each with subparts. Each question needs to be
uploaded under the correct section in Crowdmark, otherwise it may be overlooked when graded. One
question is a calculation-type question, one is a theoretical/derivation/proof-type question, and
two involve using R. You should make sure to show all your work for the first two questions,
while the R questions should be presented in a report-type format (i.e. include output and graphs
with written explanations of answers in the main document, with R code placed in an appendix at
the end). If you are comfortable with RMarkdown, it is recommended to complete your assignment
with it. Otherwise, any word processing document will suffice for the R questions. You may submit
handwritten answers for questions 1 and 2, but they must be legible and neat.

Note that there is a 20% per day late penalty on assignments. After 48 hours of being late, the
assignment will no longer be accepted. This means that you should submit your assignment no
later than Tuesday February 18 at 11:59PM to avoid receiving a grade of zero.

Question 1 (10 points) – Derivations/Proof Questions

(a) (4 points) Show that in simple linear regression, for a given significance level α, the
squared T test statistic for testing the slope is equivalent to the F test statistic for testing
significance of the regression line.

(b) (6 points) Prove that the sum of squares decomposition holds, i.e. that SST = SSreg + RSS.
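
A numerical check is not a proof, but it can help confirm what part (a) claims before deriving it. The sketch below fits a simple linear regression to simulated data and compares the squared t statistic for the slope with the overall F statistic. The data, the seed, and the variable names are illustrative assumptions and are not part of the assignment.

```r
# Sketch: numerically compare t^2 for the slope with the overall F statistic.
# The simulated data below are an arbitrary illustration, not assignment data.
set.seed(1)                          # arbitrary seed, for reproducibility only
x <- 1:20
y <- 3 + 2 * x + rnorm(20)           # any simple linear model would do here

fit <- lm(y ~ x)
s   <- summary(fit)

t_slope <- s$coefficients["x", "t value"]   # t statistic for H0: slope = 0
f_stat  <- s$fstatistic[["value"]]          # F statistic for the regression

all.equal(t_slope^2, f_stat)         # compares the two up to numerical tolerance
```

The comparison uses `all.equal` rather than `==` because the two quantities are computed along different floating-point paths.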

Question 2 (17 points) – Hand Calculations

We have data on both the Fat content (in grams) and the Protein content (in grams) of items
on the Burger King menu. We wish to use these 122 pairs of data in an effort to predict the Fat
content of a new menu item from the Protein content. Suppose that a simple linear regression
model was fit to these data, but we know only that the residual standard error is 10.56 and that
the sample variability in the Fat content (in grams) is 262.30. Using this information, answer
the following:

(a) (1 point) What proportion of the variation in Fat content can be explained by Protein
content?

(b) (6 points) Test whether the regression line significantly explains the variation in Fat
content by completing an ANOVA table. Show all your work.

(c) (4 points) Suppose now we have the following deviations from the mean of Protein content
for 5 menu items in our dataset, as well as SXX = 22011.97:

Observation ID (i)       12       17       84       92      115
(x_i − x̄)             52.98    48.98     1.98   -13.02   -12.02

Determine whether each of these observations is a leverage point.

(d) (6 points) Now suppose we also have the estimated residuals for the same 5 observations.
Using these and the information in part (c), determine which of these observations are
influential.

Observation ID (i)       12       17       84       92      115
Residual (ê_i)          8.78     5.43    10.36    31.06    -5.85
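
For reference, the leverage of observation i in simple linear regression is h_ii = 1/n + (x_i − x̄)² / SXX. The sketch below evaluates that expression for the five deviations listed in part (c), using n = 122 and SXX = 22011.97 from the question. The 4/n threshold is only a common rule of thumb and is an assumption here; the cutoff taught in the course should take precedence. All variable names are illustrative.

```r
# Sketch: leverage values for the five listed observations.
n   <- 122                                   # number of (Protein, Fat) pairs, from the question
sxx <- 22011.97                              # SXX, from part (c)
dev <- c("12" = 52.98, "17" = 48.98, "84" = 1.98,
         "92" = -13.02, "115" = -12.02)      # (x_i - xbar), from the table in part (c)

h <- 1 / n + dev^2 / sxx                     # h_ii = 1/n + (x_i - xbar)^2 / SXX
cutoff <- 4 / n                              # ASSUMPTION: rule-of-thumb threshold

data.frame(id = names(dev), leverage = round(h, 4), flagged = h > cutoff)
```

The same leverage values feed into any influence measure used for part (d), together with the residuals given there.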

Question 3 (10 points) – R simulation question

Consider a simple linear regression model y = 50 + 10x + e, where e ∼ LogNormal(0, 2²). Suppose
that 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations,
drawing one observation at each level of x = 0.5, 1, 1.5, . . . , 10 for each sample. (Hint: to
generate the errors according to a log-Normal distribution, use rlnorm(20, 0, 2))

(a) (4 points) For each sample, compute the least squares estimates of the slope and intercept,
and present them in a histogram. Discuss the shape of these histograms.

(b) (2 points) For each sample, compute the estimate of E(y | x = 5) and present it in a
histogram. Discuss the shape.

(c) (4 points) For each sample, find the 95% confidence interval for both the slope and
intercept. How many of these intervals contain the true value of β1 = 10 and β0 = 50
respectively? Do the same for the 95% confidence interval for E(y | x = 5). How many of these
contain the true value E(y | x = 5) = 100? Explain why we see these results and what it means for
the usefulness of the regression line.
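
The sampling scheme described in this question can be sketched as follows. This is a minimal outline of the data-generating step and the repeated fits, not a complete answer: the seed and the object names (`x`, `one_fit`, `est`) are assumptions made for illustration, and the histograms and interval checks are left out.

```r
# Sketch: simulate from y = 50 + 10x + e with log-Normal errors and refit repeatedly.
set.seed(302)                         # arbitrary seed, for reproducibility only
x <- seq(0.5, 10, by = 0.5)           # the 20 fixed levels of x
n_sims <- 500                         # number of simulated samples

# One sample: draw errors as in the hint, build y, fit by least squares.
e <- rlnorm(20, meanlog = 0, sdlog = 2)
y <- 50 + 10 * x + e
one_fit <- lm(y ~ x)
confint(one_fit, level = 0.95)        # 95% intervals for intercept and slope
predict(one_fit, newdata = data.frame(x = 5),
        interval = "confidence", level = 0.95)   # interval for E(y | x = 5)

# Repeat the fit 500 times and keep the coefficient estimates.
est <- t(replicate(n_sims, {
  y_sim <- 50 + 10 * x + rlnorm(length(x), meanlog = 0, sdlog = 2)
  coef(lm(y_sim ~ x))
}))
colnames(est) <- c("intercept", "slope")
head(est)
```

From here, a call such as `hist(est[, "slope"])` would give the histogram asked for in part (a), and the same loop can be extended to store the interval endpoints needed for part (c).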

Question 4 (20 points) – R data analysis question

The dataset for this question is called “grades.csv” and can be found on Quercus under
Assignment 2. It contains information on the first and second midterm test grades, as well as the
overall homework grades of 64 fictional students. We are interested in studying the relationship
between grades on the first midterm (X) and the second midterm (Y).

(a) (2 points) Fit a simple linear model relating the effect of midterm 1 grades on midterm 2
grades and interpret the coefficients.

(b) (2 points) Run an ANOVA test to determine whether there exists a statistically significant
relationship between the two midterm grades. Test your hypothesis at α = 0.05.

(c) (2 points) Produce all the residual plots required for checking the validity of the model
assumptions.

(d) (3 points) Comment on whether each assumption is satisfied.

(e) (2 points) Determine whether there are any leverage or influential observations in the data.

(f) (2 points) Now suppose we are interested in whether the midterm grades have improved from
the first test to the second test. Create a new dataset that combines both midterm grades
into a single response variable and create a predictor variable that indicates whether each
grade came from the first or second midterm (and present the code). Provide a box plot
summarizing the midterm grades for each test using the new dataset.

(g) (2 points) Using the data created in part (f), fit a linear model relating the midterm grades
to the indicator variable representing the first or second test. Do we have sufficient evidence
to conclude that the midterm grades on the second test are higher than those on the first test?
Use a formal hypothesis test.

(h) (5 points) Check the model assumptions using residual plots and comment on their validity.
Do you think it’s a good idea to merge the midterm grades into a single response variable?
Explain.
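
The restructuring described in parts (f) and (g) can be sketched as below. The actual column names in grades.csv are not given in this handout, so `midterm1` and `midterm2` are hypothetical, and the four-row data frame is a made-up stand-in used only so the code runs on its own; it is not the assignment data.

```r
# Sketch: stack two grade columns into one response with a test indicator.
# ASSUMPTION: column names midterm1/midterm2 are hypothetical, and the values
# below are placeholder numbers, not the contents of grades.csv.
grades <- data.frame(midterm1 = c(62, 75, 81, 58),
                     midterm2 = c(70, 72, 88, 65))

long <- data.frame(
  grade = c(grades$midterm1, grades$midterm2),                      # single response
  test  = factor(rep(c("first", "second"), each = nrow(grades)))    # indicator
)

fit <- lm(grade ~ test, data = long)   # "first" is the reference level
coef(fit)                              # intercept = mean of first test; testsecond = difference in means
```

With the real data, `boxplot(grade ~ test, data = long)` would give the plot requested in part (f), and `summary(fit)` reports the t test on the `testsecond` coefficient, which is two-sided by default.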
