This assignment uses R to analyze a cosmetics company's sales and advertising expenditure data.
Homework 6
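An assistant in the district sales office of a national cosmetics firm obtained data on advertising
expenditures and sales last year in the district's 44 territories. The data are in cosmetics.csv.
X1: expenditures for point-of-sale displays in beauty salons and department stores (× $1000).
X2: local media advertising expenditures.
X3: prorated share of national media advertising expenditures.
Y: sales (× $1000).
1. (5) Test the regression relation between sales and the three predictor variables. State the hypotheses,
test statistic and degrees of freedom, p-value, and conclusion in words.
H0: beta1 = beta2 = beta3 = 0 (all slope parameters equal zero)
Ha: at least one slope parameter is not equal to zero
F test statistic = 38.28 with 3 and 40 degrees of freedom (p - 1 = 3, n - p = 44 - 4 = 40), p-value = 7.821e-12
We reject the null hypothesis. At least one beta is not equal to zero, so sales are related to at least one of the predictors.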
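2. (5) Check the model assumptions using the "usual" plots (scatterplots,
residual plots, histogram/QQ plot). Explain in detail whether each assumption appears to be
seriously violated.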
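Based on the appearance of the residual plots, the linearity and constant-variance assumptions appear to hold.
There is no clear evidence that the normality assumption is violated, because the QQ plot shows no
obvious abnormal pattern.
3. (5) Prepare a partial regression (added-variable) plot for each predictor variable. Do your plots suggest
that the regression relationships in the fitted regression function are inappropriate for any of the predictor
variables? Explain.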
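Based on the added-variable plots, even though the plot of e(y | others) against e(x2 | others) is relatively flat
compared with the other plots, each predictor variable still appears to be a useful addition to the regression model.
4. (5) Are there any outlying Y observations? (Show a diagnostic plot and test based on the studentized
deleted residuals.)
g = 44, alpha = 0.05, p = 4
Bonferroni critical value = 3.450183
If alpha = 0.1
Bonferroni critical value = 3.2667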
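For case 30, |-3.329| > 3.2667
At alpha = 0.05, no absolute ti value exceeds 3.450183, so we conclude that there are no
outlying Y observations.
However, at alpha = 0.1, case 30 is an outlying Y observation.
5. (5) Are there any outlying X observations? (Show a diagnostic plot and test based on the hat values.)
The cutoff is 2p/n = 2(4)/44 = 0.181818.
For case 24, hii = 0.21853 > 0.181818
For case 37, hii = 0.23044 > 0.181818
There are two outlying X observations: case 24 and case 37.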
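The leverage rule used in question 5 can be sketched in a few lines. This is a minimal illustration in Python rather than the R session the homework used; only the two hat values reported above are included, and the variable names are assumptions made for the example.

```python
# Leverage cutoff 2p/n for the cosmetics model (n = 44 cases, p = 4 parameters).
n, p = 44, 4
threshold = 2 * p / n

# Only the two hat values reported in the answer; the other 42 are not listed there.
hat_values = {24: 0.21853, 37: 0.23044}

# A case is flagged as outlying in X when its hat value exceeds the cutoff.
flagged = [case for case, h in hat_values.items() if h > threshold]
print(round(threshold, 6), flagged)
```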
6. (5) Are there any influential cases? (Show a plot of Cook's distance and test based on Cook's
distance.)
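The Cook's distance plot flags three potentially influential cases: case 6, case 16,
and case 30.
MSE = 3.33,
e6 = 9.34 - (1.0233 + 0.9657 * 6.1 + 0.6292 * 5.8 + 0.6754 * 3.4) = -3.51979 (by hand, with rounded coefficients)
e6 = -3.52159888 (from R output)
e16 = -2.79435 (from R output)
e30 = -5.42165061 (from R output)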
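The hand calculation of e6 above can be reproduced as a quick check. This is a small Python sketch, not the R code behind the reported output. It uses the rounded coefficients shown in the hand calculation, which is why it lands near -3.5198 rather than on the R value -3.52159888.

```python
# Rounded coefficients and case-6 values taken from the hand calculation above.
b0, b1, b2, b3 = 1.0233, 0.9657, 0.6292, 0.6754
x1, x2, x3 = 6.1, 5.8, 3.4
y_observed = 9.34

# Residual = observed response minus fitted value.
fitted = b0 + b1 * x1 + b2 * x2 + b3 * x3
residual = y_observed - fitted
print(residual)
```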
$D_6 = \frac{(-3.52159888)^2}{4(3.33)} \cdot \frac{0.08230536}{(1 - 0.08230536)^2} = 0.09099$
$D_{16} = \frac{(-2.79435)^2}{4(3.33)} \cdot \frac{0.12790933}{(1 - 0.12790933)^2} = 0.09859$
$D_{30} = \frac{(-5.42165061)^2}{4(3.33)} \cdot \frac{0.04798828}{(1 - 0.04798828)^2} = 0.1168$
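All three Cook's distances are below the 20th-percentile critical value of the F(4, 40) distribution. This means
there are no influential outliers.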
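The three Cook's distances follow the pattern D_i = e_i^2 / (p · MSE) · h_ii / (1 - h_ii)^2. Below is a minimal Python sketch of that arithmetic, using the residuals, hat values, MSE and p quoted above. The function name `cooks_distance` is an assumption for the example, not something from the R output.

```python
def cooks_distance(residual, leverage, mse, p):
    """Cook's distance from an ordinary residual, its hat value, MSE and p."""
    return (residual ** 2 / (p * mse)) * leverage / (1.0 - leverage) ** 2

# (residual, hat value) pairs for the three flagged cases, as reported above.
cases = {
    6: (-3.52159888, 0.08230536),
    16: (-2.79435, 0.12790933),
    30: (-5.42165061, 0.04798828),
}

distances = {
    case: cooks_distance(e, h, mse=3.33, p=4) for case, (e, h) in cases.items()
}
for case, d in distances.items():
    print(case, round(d, 5))
```

Each value would then be compared with a low percentile of the F(p, n - p) distribution, which this sketch does not compute.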
7. Is there a serious multicollinearity problem?
a) (4) Include an appropriate scatterplot and correlation values between the explanatory variables.
There is a multicollinearity problem between x1 and x2 based on the correlation and scatterplot.
b) (4) Judging by VIF, do you think there is a problem with multicollinearity? (Hint: VIF or tolerance)
The VIF for X1 (given X2 and X3) is 20.07.
The VIF for X2 (given X1 and X3) is 20.7.
The VIF for X3 (given X1 and X2) is 1.2.
Pairwise, the VIF for X3 is 1.2 with X1 alone and 1.2 with X2 alone.
The VIF for X1 with X2 alone is 19.8.
The VIFs for X1 and X2 are well above the usual cutoff of 10, so there is a multicollinearity problem involving X1 and X2.
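The link between a VIF and a correlation can be made explicit through VIF = 1 / (1 - R^2), with tolerance = 1 / VIF. The Python sketch below back-solves the pairwise R^2 implied by the reported VIF of 19.8 for X1 with X2. The helper names are assumptions for the example, and the sign of the correlation cannot be recovered this way.

```python
def vif_from_r2(r2):
    """Variance inflation factor for a predictor whose auxiliary regression has this R^2."""
    return 1.0 / (1.0 - r2)

def r2_from_vif(vif):
    """Invert VIF = 1 / (1 - R^2)."""
    return 1.0 - 1.0 / vif

pairwise_vif = 19.8            # reported VIF for X1 with X2 alone
r2 = r2_from_vif(pairwise_vif)
abs_corr = r2 ** 0.5           # |correlation| between X1 and X2 implied by that VIF
tolerance = 1.0 / pairwise_vif

print(round(r2, 4), round(abs_corr, 4), round(tolerance, 4))
```

An implied absolute correlation of roughly 0.97 is what ties the VIF result to the scatterplot and correlation evidence in part a).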
(2) Compare your answers in parts a) and b). Are your conclusions the same or different? Please explain
your answer.
Both parts lead to the same conclusion: x1 and x2 are highly correlated.
8. Instead of removing variables, we are going to use the Ridge Regression to determine the parameter values.
a) (5) Make a ridge trace plot. What value of the ridge parameter (k) do you believe is best? Explain your
choice.
Based on the output, k = 1 is the best value.
b) (5) Using the VIF factors, what value of the parameter do you believe should be used? (Hint: Look at both
the graph and the printed numbers.) Explain your choice
Among all k values, VIF values of the parameters are closest to 1 when k = 0.1.
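To see why the VIFs move toward 1 as k grows, here is a rough Python sketch. It assumes the textbook ridge VIF definition, the diagonal of (R + kI)^-1 R (R + kI)^-1 for a predictor correlation matrix R. It is deliberately simplified to X1 and X2 only, with their correlation back-solved from the pairwise VIF of 19.8. The actual homework model has three predictors and used R output, so these numbers illustrate the mechanism and are not the reported values.

```python
def matmul(a, b):
    """Product of two 2x2 matrices stored as nested lists."""
    return [[sum(a[i][t] * b[t][j] for t in range(2)) for j in range(2)]
            for i in range(2)]

def inverse_2x2(m):
    """Inverse of a 2x2 matrix via the adjugate formula."""
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return [[m[1][1] / det, -m[0][1] / det],
            [-m[1][0] / det, m[0][0] / det]]

def ridge_vif(r, k):
    """Ridge VIF of either predictor in a two-predictor model with correlation r."""
    corr = [[1.0, r], [r, 1.0]]
    shrunk_inv = inverse_2x2([[1.0 + k, r], [r, 1.0 + k]])
    return matmul(matmul(shrunk_inv, corr), shrunk_inv)[0][0]

# Assumption: |r| implied by the pairwise VIF of 19.8 reported for X1 and X2.
r = (1.0 - 1.0 / 19.8) ** 0.5

for k in (0.0, 0.01, 0.05, 0.1, 0.5, 1.0):
    print(k, round(ridge_vif(r, k), 3))
```

At k = 0 the function returns the ordinary VIF of 19.8. Under this two-predictor simplification it is about 1.04 at k = 0.1 and drops below 1 for larger k.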
c) (5) Report your model.
y = 1.333 + 0.7763 x1 + 0.7571 x2 + 0.6565 x3
9.(25) A personnel officer in a governmental agency administered four newly developed aptitude tests
to each of the 25 applicants for entry level clerical positions in the agency. For purpose of study, all 25
applicants were accepted for positions irrespective of their test scores. After a probationary period, each
applicant was rated for proficiency on the job. The scores on the four tests (X1, X2, X3, X4) and the job
proficiency score (Y) for the 25 employees were recorded in proficiency.csv
a). (5) Obtain the scatter plot matrix and the correlation matrix of the X variables. What do the scatter
plots suggest about the nature of the functional relationship between the response variable and each of the
predictor variables?
x3 and x4 may have a multicollinearity problem.
Y appears to be more strongly related to x3 and x4 than to x1 or x2.
b). (5) Fit the multiple regression function containing all four predictors as first-order terms. Does it appear that
all predictor variables should be retained?
Based on the output, the p values are all smaller than 0.05 except for x2. Therefore, not all predictors
should be retained, and x2 can be dropped.
c). (10) Use the proficiency data, select the best subset regression models according to the
$R^2_{a,p}$, $C_p$, $AIC_p$, $SBC_p$, and $PRESS_p$ criteria and discuss your selection.
Model with the smallest Cp: y ~ x1 + x3 + x4
Model with the largest adjusted R2: y ~ x1 + x3 + x4
The smallest AICp is 73.847.
Its corresponding model is y ~ x1 + x3 + x4
The smallest SBCp is 78.723.
Its corresponding model is y ~ x1 + x3 + x4
The smallest PRESS is 471.452.
Its corresponding model is y ~ x1 + x3 + x4
All five criteria select the same subset. Therefore, the best subset regression model is y = β0 + β1 x1 + β3 x3 + β4 x4
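The reported AICp and SBCp values can be checked against each other. This Python sketch assumes the common textbook definitions AICp = n ln(SSEp) - n ln(n) + 2p and SBCp = n ln(SSEp) - n ln(n) + p ln(n). The homework does not report SSE for the selected model, so the sketch back-solves an implied SSE from the reported AICp instead of inventing one. The function and variable names are assumptions for the example.

```python
import math

def aic_p(sse, n, p):
    """AIC_p under the assumed definition n*ln(SSE) - n*ln(n) + 2p."""
    return n * math.log(sse) - n * math.log(n) + 2 * p

def sbc_p(sse, n, p):
    """SBC_p under the assumed definition n*ln(SSE) - n*ln(n) + p*ln(n)."""
    return n * math.log(sse) - n * math.log(n) + p * math.log(n)

n, p = 25, 4                     # 25 applicants; intercept plus x1, x3, x4
reported_aic, reported_sbc = 73.847, 78.723

# SSE implied by the reported AIC_p (back-solved, not taken from any R output).
sse_implied = n * math.exp((reported_aic - 2 * p) / n)

# The two criteria differ only in their penalty: p*ln(n) - 2p.
penalty_gap = sbc_p(sse_implied, n, p) - aic_p(sse_implied, n, p)
print(round(penalty_gap, 3), round(reported_sbc - reported_aic, 3))
```

The gap between the two criteria depends only on n and p, so its agreement with the reported difference is a quick consistency check on the two numbers.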
d). (5) Run a 5 fold cross validation on the model identified in c).
The model selected from c has the smaller RMSE of 4.23.
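The mechanics of 5-fold cross-validation can be sketched as follows. This is a simplified Python illustration, not the R procedure behind the reported RMSE. It fits a one-predictor least-squares line instead of the three-predictor model, it pools squared errors over all held-out cases (some packages average per-fold RMSEs instead), and the data are a made-up exact line used only to show that the code runs.

```python
import math
import random

def kfold_indices(n, k, seed=0):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_line(xs, ys):
    """Least-squares intercept and slope for one predictor."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def cv_rmse(xs, ys, k=5, seed=0):
    """Pooled out-of-fold RMSE: fit on k-1 folds, predict the held-out fold."""
    squared_errors = []
    for fold in kfold_indices(len(xs), k, seed):
        held_out = set(fold)
        train = [i for i in range(len(xs)) if i not in held_out]
        b0, b1 = fit_line([xs[i] for i in train], [ys[i] for i in train])
        squared_errors.extend((ys[i] - (b0 + b1 * xs[i])) ** 2 for i in fold)
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical toy data (an exact line), NOT the proficiency data.
xs = [float(i) for i in range(25)]
ys = [1.0 + 2.0 * x for x in xs]
folds = kfold_indices(25, 5)
print([len(f) for f in folds], cv_rmse(xs, ys))
```

With 25 cases and 5 folds, each fit uses 20 cases and is scored on the remaining 5.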