STA302/1001 – Assignment # 2
Assignments must be submitted electronically through Crowdmark. Each student will receive an email from Crowdmark with a personalized link to view the assignment (this is also where you will submit your assignment when finished). If you do not receive this email, check your spam/junk folder. Instructions for uploading completed assignments can be found here: https://crowdmark.com/help/completing-and-submitting-an-assignment/. Note that Crowdmark accepts only PDF, PNG, or JPG files. Each question must be uploaded to its designated place, so make sure you are submitting pages in the right place.
Students may work in groups of no more than 2 people, with only one assignment submitted to Crowdmark per group. When you receive your personalized link to the assignment, you may then enter your group member's name. A shared submission link will be sent to both group members, so either of you can submit the assignment or edit submissions.
The assignment is divided into four questions, each with subparts. Each question needs to be uploaded under the correct section in Crowdmark, otherwise it may be overlooked when graded. One question is a theoretical/derivation/proof-type question, one is a calculation-type question, and two involve using R. Make sure to show all your work for the first two questions, while the R questions should be presented in a report-type format (i.e. include output and graphs with written explanations of answers in the main document, with R code placed in an appendix at the end). If you are comfortable with RMarkdown, it is recommended that you complete your assignment with it. Otherwise, any word processing document will suffice for the R questions. You may submit handwritten answers for questions 1 and 2, but they must be legible and neat.
Note that there is a 20% per day late penalty on assignments. After 48 hours of being late, the assignment will no longer be accepted. This means that you should submit your assignment no later than Tuesday February 18 at 11:59PM to avoid receiving a grade of zero.
Question 1 (10 points) – Derivations/Proof Questions
(a) (4 points) Show that in simple linear regression, for a given significance level α, the squared T test statistic for testing the slope is equivalent to the F test statistic for testing the significance of the regression line.
(b) (6 points) Prove that the sum of squares decomposition holds, i.e. that SST = SSreg + RSS.
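For reference, the quantities named in parts (a) and (b) are usually defined as below. The notation is an assumption (fitted values ŷ_i, sample size n, standard error se) and may differ slightly from the course notes, so treat it as a reminder of the definitions rather than as part of the required proof.

```latex
\begin{align*}
SST &= \sum_{i=1}^{n} (y_i - \bar{y})^2, &
SS_{reg} &= \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2, &
RSS &= \sum_{i=1}^{n} (y_i - \hat{y}_i)^2, \\
T &= \frac{\hat{\beta}_1}{se(\hat{\beta}_1)}, &
F &= \frac{SS_{reg}/1}{RSS/(n-2)}.
\end{align*}
```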
Question 2 (17 points) – Hand Calculations
We have data on both the Fat content (in grams) and the Protein content (in grams) of items on the Burger King menu. We wish to use these 122 pairs of data in an effort to predict the Fat content of a new menu item from the Protein content. Suppose that a simple linear regression model was fit to these data, but we know only that the residual standard error is 10.56 and that the sample variability in the Fat content (in grams) is 262.30. Using this information, answer the following:
(a) (1 point) What proportion of the variation in Fat content can be explained by Protein content?
(b) (6 points) Test whether the regression line significantly explains the variation in Fat content by completing an ANOVA table. Show all your work.
(c) (4 points) Suppose now we have the following deviations from the mean of Protein content for 5 menu items in our dataset, as well as SXX = 22011.97:
Observation ID (i)    12       17       84       92       115
x_i − x̄               52.98    48.98    1.98     -13.02   -12.02
Determine whether each of these observations is a leverage point.
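As a rough illustration of the calculation involved, the sketch below computes the leverage of an observation in simple linear regression, h_ii = 1/n + (x_i − x̄)²/SXX, and compares it with the cutoff 4/n. The cutoff is a common rule of thumb and is an assumption here, so check it against the course notes. The function names and the numbers are hypothetical and are not the assignment data; the sketch is in Python, whereas the assignment's computing questions expect R.

```python
def leverage(dev, sxx, n):
    """Leverage h_ii of one observation in simple linear regression.

    dev is the deviation x_i - x_bar, sxx is the sum of squared
    deviations of x, and n is the sample size.
    """
    return 1.0 / n + dev ** 2 / sxx


def is_leverage_point(dev, sxx, n):
    # Rule of thumb (an assumption here): flag the point if h_ii > 4/n.
    return leverage(dev, sxx, n) > 4.0 / n


# Hypothetical numbers, not the assignment data.
n = 50
sxx = 1000.0
for dev in (20.0, 2.0):
    print(dev, round(leverage(dev, sxx, n), 4), is_leverage_point(dev, sxx, n))
```

Only the deviation from the mean, SXX, and n enter the formula, which is why the response values are not needed to assess leverage.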
(d) (6 points) Now suppose we also have the estimated residuals for the same 5 observations. Using these and the information in part (c), determine which of these observations are influential.
Observation ID (i)    12       17       84       92       115
Residual (ê_i)        8.78     5.43     10.36    31.06    -5.85
Question 3 (10 points) – R simulation question
Consider a simple linear regression model y = 50 + 10x + e, where e ∼ LogNormal(0, 2²). Suppose that 20 pairs of observations are used to fit this model. Generate 500 samples of 20 observations, drawing one observation at each level of x = 0.5, 1, 1.5, ..., 10 for each sample. (Hint: to generate the errors according to a log-Normal distribution, use rlnorm(20, 0, 2))
(a) (4 points) For each sample, compute the least squares estimates of the slope and intercept, and present them in a histogram. Discuss the shape of these histograms.
(b) (2 points) For each sample, compute the estimate of E(y | x = 5) and present it in a histogram. Discuss the shape.
(c) (4 points) For each sample, find the 95% confidence interval for both the slope and intercept. How many of these intervals contain the true value of β1 = 10 and β0 = 50 respectively? Do the same for the 95% confidence interval for E(y | x = 5). How many of these contain the true value E(y | x = 5) = 100? Explain why we see these results and what it means for the usefulness of the regression line.
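The data-generating step described in this question can be sketched for a single sample as follows. This is a Python illustration using only the standard library, whereas the assignment expects R; the function names and the seed are hypothetical choices. Here random.lognormvariate(0, 2) plays the role of the rlnorm(20, 0, 2) hint, drawn one error at a time.

```python
import random


def simulate_sample(rng):
    """Draw one sample from y = 50 + 10x + e at x = 0.5, 1, ..., 10."""
    xs = [0.5 * k for k in range(1, 21)]
    # One log-normal error per x level (log-scale mean 0, log-scale sd 2).
    ys = [50 + 10 * x + rng.lognormvariate(0, 2) for x in xs]
    return xs, ys


def least_squares(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


rng = random.Random(302)  # arbitrary seed, used only for reproducibility
xs, ys = simulate_sample(rng)
b0, b1 = least_squares(xs, ys)
print(f"intercept = {b0:.2f}, slope = {b1:.2f}")
```

Repeating the two calls inside a loop and storing each pair of estimates gives the collection of values that the histograms in parts (a) and (b) summarize.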
Question 4 (20 points) – R data analysis question
The dataset for this question is called “grades.csv” and can be found on Quercus under Assignment 2. It contains information on the first and second midterm test grades, as well as the overall homework grades of 64 fictional students. We are interested in studying the relationship between grades on the first midterm (X) and the second midterm (Y).
(a) (2 points) Fit a simple linear model for the effect of midterm 1 grades on midterm 2 grades and interpret the coefficients.
(b) (2 points) Run an ANOVA test to determine whether there exists a statistically significant relationship between the two midterm grades. Test your hypothesis at α = 0.05.
(c) (2 points) Produce all the residual plots required for checking the validity of the model assumptions.
(d) (3 points) Comment on whether each assumption is satisfied.
(e) (2 points) Determine whether there are any leverage or influential observations in the data.
(f) (2 points) Now suppose we are interested in whether the midterm grades have improved from the first test to the second test. Create a new dataset that combines both midterm grades into a single response variable and create a predictor variable that indicates whether each grade came from the first or second midterm (and present the code). Provide a box plot summarizing the midterm grades for each test using the new dataset.
(g) (2 points) Using the data created in part (f), fit a linear model relating the midterm grades to the indicator variable representing the first or second test. Do we have sufficient evidence to conclude that the midterm grades on the second test are higher than those on the first test? Use a formal hypothesis test.
(h) (5 points) Check the model assumptions using residual plots and comment on their validity. Do you think it’s a good idea to merge the midterm grades into a single response variable? Explain.
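The restructuring described in part (f) amounts to stacking two columns into one response and adding a 0/1 indicator. The sketch below shows the idea in Python on three made-up students. The column names midterm1, midterm2, grade, and second_test are assumptions, since the actual column names in grades.csv are not stated here, and the assignment itself expects R code.

```python
# Hypothetical grades for three students, not taken from grades.csv.
wide = [
    {"midterm1": 62.0, "midterm2": 70.0},
    {"midterm1": 75.0, "midterm2": 71.0},
    {"midterm1": 58.0, "midterm2": 66.0},
]

# Stack into one row per grade: the response is the grade, and the
# indicator is 0 for the first midterm and 1 for the second.
long = []
for row in wide:
    long.append({"grade": row["midterm1"], "second_test": 0})
    long.append({"grade": row["midterm2"], "second_test": 1})


def group_mean(rows, indicator):
    """Mean grade among rows with the given indicator value."""
    values = [r["grade"] for r in rows if r["second_test"] == indicator]
    return sum(values) / len(values)


# With a single 0/1 predictor, the least squares intercept is the mean of
# the 0 group and the slope is the difference between the two group means.
intercept = group_mean(long, 0)
slope = group_mean(long, 1) - group_mean(long, 0)
print(intercept, slope)
```

Each student contributes two rows to the stacked data, which is worth keeping in mind when assessing the model assumptions.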