This assignment uses R to complete several short exam questions.
STAC51 (Winter 2020): Final Exam
Note: In any question, if you are using R, all R code and R output must be included in your
[Figure: Venn diagram with three sets labelled Crash Day, Before, and After. Region counts in extraction order: 193, 49, 8, 23, 21, 55, 28, 57.]
1. The numbers in the Figure above indicate the weather (overcast or not) of 434 location-matched
triplets of days: one day on which a traffic accident took place, and two control days
without an accident (the day before the accident and the day after the accident). This dataset
could be analyzed as a 1:2 matched case-control study. The Venn diagram presentation of the data
is rather unconventional. In matched case-control studies the data could alternatively be
presented in $2 \times 2$ tables
          exposed   unexposed
case      $a_i$     $b_i$
control   $c_i$     $d_i$
for each matched set $i = 1, \ldots, 434$. We denote $n_i = a_i + b_i + c_i + d_i$.
(a) [10 Marks] We note that there are six types of location-specific $2 \times 2$ tables with the
same exposure-case configuration. List these tables (i.e. the different combinations of the
numbers $a_i$, $b_i$, $c_i$ and $d_i$) and their counts.
(b) [6 Marks] The null hypothesis assumes that there is no relationship between being a
case and being exposed. Under the null hypothesis the distribution of the cell count $a_i$
conditional on the row and column marginals is hypergeometric. Find $E(a_i \mid a_i + c_i)$
and $\mathrm{Var}(a_i \mid a_i + c_i)$ under the null.
(c) [6 Marks] Test the null hypothesis of no association between weather and accidents
using the Cochran-Mantel-Haenszel (CMH) test statistic, given by
$$\frac{\left(\sum_{i=1}^{434} a_i - \sum_{i=1}^{434} E(a_i \mid a_i + c_i)\right)^2}{\sum_{i=1}^{434} \mathrm{Var}(a_i \mid a_i + c_i)},$$
which is asymptotically distributed as $\chi^2$ with one degree of freedom.
Note: $\chi^2_{0.95}(1) = 3.84$.
(d) [4 Marks] Recall that for 1:1 matching there exist 4 unique types of CMH tables. For
1:2 matching there exist 6 unique types of tables. If we have a 1:k matched case-control
study, how many unique types of tables exist? Here $k < \infty$.
(e) [10 Marks] Assume that the following table is a triplet-specific contingency
table from a 1:2 matched case-control study.
          exposed   unexposed   Total
case      $a$       $b$         1
control   $c$       $d$         2
Total     $a + c$   $b + d$     3
The odds of being exposed in the case group are $\theta$ times the odds of being exposed in
the control group. Also, assume that $P(a = 1) = \dfrac{\theta\Omega}{1 + \theta\Omega}$
and $P(c = 1) = \dfrac{\Omega}{1 + \Omega}$.
Show that $P(a = 1 \mid a + c = 1) = \dfrac{\theta}{2 + \theta}$.
(Hint: The 2 in the denominator comes from the 2 controls.)
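The quantities in Question 1 can be checked numerically. The R sketch below is only an illustration, not a model answer: the helper names (hyper_moments, cmh_stat, cond_prob) and the toy inputs are hypothetical, and the check for part (e) assumes that the two controls are exposed independently, each with probability Ω/(1 + Ω).

```r
# Mean and variance of a_i given the margins, by enumerating the
# hypergeometric distribution (1 case, 2 controls, m = a_i + c_i exposed).
hyper_moments <- function(m, n_case = 1, n_control = 2) {
  support <- max(0, m - n_control):min(n_case, m)
  # dhyper(x, m, n, k): x white balls drawn from an urn with m white and
  # n black balls when k balls are drawn; here "white" = case, "drawn" = exposed.
  p <- dhyper(support, n_case, n_control, m)
  mu <- sum(support * p)
  c(mean = mu, var = sum((support - mu)^2 * p))
}

# CMH statistic from per-stratum exposed-case counts a and exposed totals m.
cmh_stat <- function(a, m) {
  mom <- sapply(m, hyper_moments)  # 2 x K matrix with rows "mean" and "var"
  (sum(a) - sum(mom["mean", ]))^2 / sum(mom["var", ])
}

# Toy strata (hypothetical values, NOT the exam data).
toy_stat <- cmh_stat(a = c(1, 0, 1, 1, 0), m = c(1, 1, 2, 3, 0))
toy_reject <- toy_stat > qchisq(0.95, df = 1)

# Part (e): P(a = 1 | a + c = 1), assuming independent controls.
cond_prob <- function(theta, omega) {
  p_case <- theta * omega / (1 + theta * omega)
  p_ctrl <- omega / (1 + omega)
  num <- p_case * (1 - p_ctrl)^2                          # a = 1, c = 0
  den <- num + (1 - p_case) * 2 * p_ctrl * (1 - p_ctrl)   # or a = 0, c = 1
  num / den
}
```

Under these assumptions, cond_prob(theta, omega) can be compared with theta / (2 + theta) for any positive theta and omega; the value of omega should cancel.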
2. For this question you need to use the warpbreaks dataset from the datasets package. That
is, you need to run the following code:
## Run this code to get the warpbreaks dataset ##
library(datasets)
data(warpbreaks)
You can find the details about the dataset by running ‘?warpbreaks’. We are interested
in the count of warp breaks per loom (i.e., variable = ‘breaks’) by wool type and tension level.
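Before any model is fitted, a quick look at the data can be useful. The sketch below summarises breaks by wool and tension; comparing cell means with cell variances is one informal way to judge whether a Poisson assumption (variance equal to the mean) looks plausible. The object names cell_means and cell_vars are illustrative.

```r
library(datasets)
data(warpbreaks)

# Structure: breaks (count), wool (factor A/B), tension (factor L/M/H)
str(warpbreaks)

# Mean and variance of breaks within each wool x tension cell
cell_means <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = mean)
cell_vars  <- aggregate(breaks ~ wool + tension, data = warpbreaks, FUN = var)

print(cell_means)
print(cell_vars)
```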
(a) [8 Marks] Execute a Poisson regression to estimate the mean number of breaks by wool
type and tension level.
(b) [8 Marks] Execute a negative binomial regression to estimate the mean number of
breaks by wool type and tension level.
(c) [6 Marks] Compare the models using the AIC values. Interpret the dispersion parameter
of the negative binomial regression. Which model performed better?
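One possible way to set up the models in Question 2 is sketched below. It assumes a main-effects specification (breaks ~ wool + tension); an interaction term is an equally legitimate choice that the question does not rule out. The object names (pois_fit, nb_fit, grid, aic_table, theta_hat) are illustrative, and glm.nb() comes from the MASS package.

```r
library(datasets)
library(MASS)  # provides glm.nb()

data(warpbreaks)

# Poisson and negative binomial fits (main effects only: an assumption)
pois_fit <- glm(breaks ~ wool + tension, family = poisson(link = "log"),
                data = warpbreaks)
nb_fit <- glm.nb(breaks ~ wool + tension, data = warpbreaks)

# Estimated mean number of breaks for every wool/tension combination
grid <- expand.grid(wool = levels(warpbreaks$wool),
                    tension = levels(warpbreaks$tension))
grid$pois_mean <- predict(pois_fit, newdata = grid, type = "response")
grid$nb_mean   <- predict(nb_fit, newdata = grid, type = "response")

# Side-by-side AIC values (lower is better)
aic_table <- AIC(pois_fit, nb_fit)

# glm.nb() parameterises the variance as mu + mu^2 / theta,
# so a smaller theta means stronger overdispersion.
theta_hat <- nb_fit$theta
```

Calling summary(pois_fit) and summary(nb_fit) shows the coefficient tables; printing grid and aic_table gives the estimated means and the AIC comparison.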
3. For this question you have to simulate a dataset.
(a) [5 Marks] Perform the following simulations.
• Generate 500 random values from each of $X_1, X_2, X_3, X_4, X_5 \sim \text{Uniform}[0, 1]$.
• Generate $f(X) = 4[\sin(\pi x_1 x_2) + 8(x_3 - 0.5)^3 + 1.5x_4 - x_5 - 0.77]$. Here, $\pi = 3.14\ldots$
• Generate $Y \sim \text{Bernoulli}(p(X))$, where $p(X) = \dfrac{\exp(f(X))}{1 + \exp(f(X))}$.
(b) [10 Marks] Fit a logistic regression where Y is the outcome and X1, X2, …, X5 are the
predictors. Show the coefficients table. Produce the ROC curve. State the AUC value
and interpret.
(c) [10 Marks] Now, instead of using the original X1, X2, …, X5 as predictors, transform the
variables in such a way that they resemble the individual terms in f(X). That is, create
new variables from X1, X2, …, X5 in such a way that f(X) becomes a linear predictor.
Now run a logistic regression using the new variables. Show the coefficients
table. Produce the ROC curve. State the AUC value.
(Hint: You have to create 4 new variables from X1, X2, …, X5.)
(d) [7 Marks] Compare your results in (b) and (c): how did your coefficients and AUC
change from (b) to (c)? Explain why you think this happened.
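The workflow in Question 3 could be sketched in base R roughly as follows. Several things here are assumptions rather than part of the question: the seed, the reading of the fourth term of f(X) as 1.5 * x4, the particular transformed predictors, and the helper names auc and roc_points. The AUC is computed with the rank (Mann-Whitney) formula so that no extra package is needed.

```r
set.seed(51)  # arbitrary seed, for reproducibility only
n <- 500
X <- as.data.frame(matrix(runif(n * 5), ncol = 5,
                          dimnames = list(NULL, paste0("x", 1:5))))

# f(X) as read from the question (the 1.5 * x4 term is an assumed reading)
f <- with(X, 4 * (sin(pi * x1 * x2) + 8 * (x3 - 0.5)^3 + 1.5 * x4 - x5 - 0.77))
X$y <- rbinom(n, size = 1, prob = plogis(f))  # plogis(f) = exp(f) / (1 + exp(f))

# (b) logistic regression on the raw predictors
fit_raw <- glm(y ~ x1 + x2 + x3 + x4 + x5, family = binomial, data = X)

# (c) predictors transformed to mirror the terms of f(X) (one possible choice)
fit_tr <- glm(y ~ sin(pi * x1 * x2) + I((x3 - 0.5)^3) + x4 + x5,
              family = binomial, data = X)

# AUC from ranks: probability that a random case outscores a random non-case
auc <- function(score, y) {
  r <- rank(score)
  n1 <- sum(y == 1)
  n0 <- sum(y == 0)
  (sum(r[y == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# ROC coordinates obtained by sweeping the threshold from high to low scores
roc_points <- function(score, y) {
  ord <- order(score, decreasing = TRUE)
  data.frame(fpr = c(0, cumsum(y[ord] == 0) / sum(y == 0)),
             tpr = c(0, cumsum(y[ord] == 1) / sum(y == 1)))
}

auc_raw <- auc(fitted(fit_raw), X$y)
auc_tr  <- auc(fitted(fit_tr), X$y)
roc_raw <- roc_points(fitted(fit_raw), X$y)
roc_tr  <- roc_points(fitted(fit_tr), X$y)
```

The coefficient tables are available through summary(fit_raw)$coefficients and summary(fit_tr)$coefficients, and a call such as plot(roc_tr$fpr, roc_tr$tpr, type = "l") draws the ROC curve. Both AUC values here are in-sample, so they are likely to be somewhat optimistic.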