STA302/1001 – Assignment #2
Assignments must be submitted electronically through Crowdmark. Each student will receive a
personalized link to view the assignment (this is where you will submit your assignment when
finished). If you do not receive this email from Crowdmark, check your spam/junk folder. Instructions
for how to upload completed assignments can be found here:
https://crowdmark.com/help/completing-and-submitting-an-assignment/. Note that only PDF, PNG or JPG
file types are accepted by Crowdmark. You will need to upload certain questions into certain
places, so make sure you are submitting pages in the right place.
Students may work in groups of no more than 2 people, with only one assignment submitted to
Crowdmark per group. When you receive your personalized link to the assignment, you may then
enter your group member's name. A shared submission link will be sent to both group members, so
you can both submit the assignment or edit submissions. Only one assignment should be submitted per group.
The assignment is divided into four questions, each with subparts. Each question needs to be uploaded
under the correct section in Crowdmark; otherwise, it may be overlooked when graded. One
question is a calculation-type question, one will be a theoretical/derivation/proof type question, and
two will involve using R. You should make sure to show all your work with the first two questions,
while the R questions should be presented in a report-type format (i.e. include output and graphs
with written explanations of answers in the main document, with the R code placed in an appendix at the
end). If you are comfortable with RMarkdown, it is recommended to complete your assignment
with it. Otherwise, any word processing document will suffice for the R questions. You may submit
handwritten answers for questions 1 and 2, but they must be legible and neat.
Note that there is a 20% per day late penalty on assignments. After 48 hours of being late, the
assignment will no longer be accepted. This means that you should submit your assignment no
later than Tuesday February 18 at 11:59PM to avoid receiving a grade of zero.
Question 1 (10 points) – Derivations/Proof Questions
(a) (4 points) Show that in simple linear regression, for a given significance level α, the
squared T test statistic for testing the slope is equivalent to the F test statistic for testing
significance of the regression line.
(b) (6 points) Prove that the sum of squares decomposition holds, i.e. that SST = SSreg + RSS.
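Before attempting the proofs, it can help to see both identities hold numerically. The following is a minimal sketch on made-up data (the seed, sample size, and coefficients are arbitrary assumptions, not part of the assignment). It illustrates the claims but does not replace a derivation.

```r
# Illustrative check on simulated data (hypothetical values, not assignment data).
set.seed(302)
x <- 1:20
y <- 3 + 2 * x + rnorm(20, sd = 4)
fit <- lm(y ~ x)

# t statistic for the slope and the overall F statistic
t_slope <- summary(fit)$coefficients["x", "t value"]
f_stat  <- unname(summary(fit)$fstatistic["value"])

# Sums of squares computed directly from their definitions
sst   <- sum((y - mean(y))^2)            # total
ssreg <- sum((fitted(fit) - mean(y))^2)  # regression
rss   <- sum(resid(fit)^2)               # residual

# Critical values at alpha = 0.05 with n - 2 = 18 degrees of freedom
t_crit_sq <- qt(0.975, df = 18)^2
f_crit    <- qf(0.95, df1 = 1, df2 = 18)

cat("t^2 =", t_slope^2, " F =", f_stat, "\n")
cat("t_crit^2 =", t_crit_sq, " F_crit =", f_crit, "\n")
cat("SST =", sst, " SSreg + RSS =", ssreg + rss, "\n")
```

Each pair of printed values should agree up to floating-point error, which is what parts (a) and (b) ask you to establish algebraically.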
Question 2 (17 points) – Hand Calculations
We have data on both the Fat content (in grams) and the Protein content (in grams) of items
on the Burger King menu. We wish to use these 122 pairs of data in an effort to predict the Fat
content of a new menu item from the Protein content. Suppose that a simple linear regression
model was fit to these data, but we know only that the residual standard error is 10.56 and that
the sample variability in the Fat content (in grams) is 262.30. Using this information, answer the
following questions.
(a) (1 point) What proportion of the variation in Fat content can be explained by Protein content?
(b) (6 points) Test whether the regression line significantly explains the variation in Fat content
by completing an ANOVA table. Show all your work.
(c) (4 points) Suppose now we have the following deviations from the mean of Protein content
for 5 menu items in our dataset, as well as SXX = 22011.97:
Observation ID (i)      12       17       84       92      115
(x_i − x̄)            52.98    48.98     1.98   -13.02   -12.02
Determine whether each of these observations is a leverage point.
(d) (6 points) Now suppose we also have the estimated residuals for the same 5 observations. Using
these and the information in part (c), determine which of these observations are influential.
Observation ID (i)      12       17       84       92      115
Residual (ê_i)         8.78     5.43    10.36    31.06    -5.85
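For reference, the quantities needed in parts (c) and (d) can be written as follows for simple linear regression with n observations. The cutoffs in the last line are common rules of thumb and are an assumption here; use whichever thresholds were given in lecture if they differ.

```latex
% Leverage of observation i in simple linear regression
h_{ii} = \frac{1}{n} + \frac{(x_i - \bar{x})^2}{SXX}

% Standardized residual, with \hat{\sigma} the residual standard error
r_i = \frac{\hat{e}_i}{\hat{\sigma}\sqrt{1 - h_{ii}}}

% Cook's distance (two estimated coefficients: intercept and slope)
D_i = \frac{r_i^2}{2} \cdot \frac{h_{ii}}{1 - h_{ii}}

% Assumed rule-of-thumb cutoffs
h_{ii} > \frac{4}{n} \quad \text{(leverage point)}, \qquad
D_i > \frac{4}{n - 2} \quad \text{(influential point)}
```

Here n = 122, SXX = 22011.97, and the residual standard error is 10.56, all taken from the question.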
Question 3 (10 points) – R simulation question
Consider a simple linear regression model y = 50 + 10x + e, where e ∼ LogNormal(0, 2). Suppose
that 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations,
drawing one observation at each level of x = 0.5, 1, 1.5, . . . , 10 for each sample. (Hint: to generate
the errors according to a log-Normal distribution, use rlnorm(20, 0, 2))
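The sampling scheme described above can be sketched as follows. This is one possible setup, not a required structure: the seed, the helper name `simulate_fit`, and the choice to store the fitted models in a list are all assumptions made for illustration.

```r
set.seed(302)                 # arbitrary seed, only for reproducibility
n_sims <- 500
x <- seq(0.5, 10, by = 0.5)   # the 20 fixed levels of x

# Hypothetical helper: draw one sample and return the fitted model
simulate_fit <- function() {
  e <- rlnorm(length(x), meanlog = 0, sdlog = 2)  # log-Normal errors, as in the hint
  y <- 50 + 10 * x + e
  lm(y ~ x)
}

# Keep all 500 fitted models so later parts can reuse them
fits <- replicate(n_sims, simulate_fit(), simplify = FALSE)

# One row per sample: least squares intercept and slope
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
head(coefs)
```

Keeping the fitted `lm` objects means the same list can be passed to `confint()` or `predict(..., interval = "confidence")` for the later parts, without regenerating the data.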
(a) (4 points) For each sample, compute the least squares estimates of the slope and intercept,
and present them in a histogram. Discuss the shape of these histograms.
(b) (2 points) For each sample, compute the estimate of E(y | x = 5) and present it in a histogram.
Discuss the shape.
(c) (4 points) For each sample, find the 95% confidence interval for both the slope and intercept.
How many of these intervals contain the true values of β1 = 10 and β0 = 50 respectively? Do
the same for the 95% confidence interval for E(y | x = 5). How many of these contain the true value
E(y | x = 5) = 100? Explain why we see these results and what it means for the usefulness of
the regression line.
Question 4 (20 points) – R data analysis question
The dataset for this question is called “grades.csv” and can be found on Quercus under Assignment
2. It contains information on the first and second midterm test grades, as well as the overall homework
grades of 64 fictional students. We are interested in studying the relationship between grades
on the first midterm (X) and the second midterm (Y ).
(a) (2 points) Fit a simple linear model relating midterm 2 grades to midterm 1 grades
and interpret the coefficients.
(b) (2 points) Run an ANOVA test to determine whether there exists a statistically significant
relationship between the two midterm grades. Test your hypothesis at α = 0.05.
(c) (2 points) Produce all the residual plots required for checking the validity of the model.
(d) (3 points) Comment on whether each assumption is satisfied.
(e) (2 points) Determine whether there are any leverage or influential observations in the data.
(f) (2 points) Now suppose we are interested in whether the midterm grades have improved from
the first test to the second test. Create a new dataset that combines both midterm grades
into a single response variable and create a predictor variable that indicates whether each
grade came from the first or second midterm (and present the code). Provide a box plot
summarizing the midterm grades for each test using the new dataset.
(g) (2 points) Using the data created in part (f), fit a linear model relating the midterm grades to
the indicator variable representing the first or second test. Do we have sufficient evidence to
conclude that the midterm grades on the second test are higher than those on the first test?
Use a formal hypothesis test.
(h) (5 points) Check the model assumptions using residual plots and comment on their validity.
Do you think it’s a good idea to merge the midterm grades into a single response variable?
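The reshaping step described in part (f) can be sketched as below. The data frame is a made-up stand-in for grades.csv, and the column names `midterm1` and `midterm2` are hypothetical; check the actual names in the file before adapting this.

```r
# Made-up stand-in for grades.csv; real column names and values will differ.
grades <- data.frame(
  midterm1 = c(62, 75, 81, 58, 90),
  midterm2 = c(68, 72, 85, 65, 88)
)

# Stack both midterms into one response, with a factor marking the test
long <- data.frame(
  grade = c(grades$midterm1, grades$midterm2),
  test  = factor(rep(c("first", "second"), each = nrow(grades)))
)

# Side-by-side box plots of grades by test
boxplot(grade ~ test, data = long, ylab = "Midterm grade")

# With a two-level factor, the slope is the difference in group means
fit <- lm(grade ~ test, data = long)
coef(fit)
```

Note that stacking the grades this way puts two rows from the same student into the data, which is worth keeping in mind when answering part (h).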