这个作业是完成鸟类和体重相关的统计问题

STATS 330

Part A
1. [32分]一位保护主义者对探索保护区的影响感兴趣。
栖息地结构的变化对低层觅食鸟类密度的影响
私人储备。以前在
储备以限制野生动物的活动。栅栏的北侧
经常目击野生动物被分类为严重吃草,而
栅栏的南侧被分类为中度玻璃,很少
目睹野生动物。
在同一地点的多个地点进行了鸟类调查
库存栅栏的两侧。在20分钟内,
记录在每个站点观察或听到的鸟类。
然后发生了一次野生动物的淘汰。
然后在相同的地点进行了20分钟的第二次鸟类调查
期。环保主义者想探讨最初的影响
放牧条件和剔除对目击鸟类数量的影响。
放牧站点的放牧状态。 “沉重”或“中等”
剔除剔除状态。第一次鸟测是“之前”,或者是“之后”
第二次鸟类调查
在20分钟的鸟类调查中观察到或听到的鸟类数量
上述变量的特定组合
数据集中总共有62行。前五行显示如下:
放牧号
1 0中度之前
2 3中度之前
3 1中度
4 19中度之前
5 8中度之前
以下模型适合这些数据:
> servation.fit1 <-glm(数字〜剔除*放牧,泊松,servation.df)
>摘要(conservation.fit1)

呼叫:
glm(公式=数量〜剔除*放牧,家庭=泊松,数据=守恒.df)
偏差残差:
最小值1Q中位数3Q最大值
-4.805 -2.452 -1.020 0.234 8.936
系数:
估计标准误差z值Pr(> | z |)
(拦截)1.8563 0.0884 21.00 <2e-16 ***
CullBefore -0.6776 0.1523 -4.45 8.6e-06 ***
放牧中度0.4463 0.1300 3.43 6e-04 ***
剔除前:放牧中度0.8213 0.2004 4.10 4.2e-05 ***

签名编码:0’***’0.001’**’0.01’*’0.05’。’ 0.1”1
(泊松族的色散参数取为1)
零偏差:61个自由度上为528.07
残余偏差:在58个自由度上为437.23
AIC:624.7
Fisher计分迭代次数:6
(a)写出方程式以充分描述拟合模型的守恒1。
[5分]
(b)解释模型Eserv 1.fit中的Cull效应。虽然
首选置信区间解释,此信息尚未
已提供。因此,您的解释需要以
点估计。 [5分]
(c)基于servation.fit1的偏差,有证据表明
该模型不适合数据?解释你的答案。 [2分]
由于过度分散的问题,因此在这些模型上安装了第二个模型
数据:
>图书馆(“ MASS”)
> servation.fit2 <-glm.nb(Number〜Cull * Grazing,data = conservation.df)
(d)servation.fit2模型如何处理过度分散的问题? [3分]
(e)考虑一下summary(conservation.fit2)-故意未显示。
(i)估计数与来自
servation.fit1?证明你的答案。
(ii)标准错误与来自
servation.fit1?证明你的答案。

[5分]
> logLik(conservation.fit2)
“日志库”。 -180.06(df = 5)
(f)计算模型守恒的AIC和BIC。 [4分]
(g)使用AIC作为标准,您希望在哪种模型之间
servation.fit1和servation.fit2?证明你的答案。
[3分]
(h)如果在同一地点进行的第二次鸟类调查是在30分钟而不是20分钟内进行的,那将会带来什么后果
期?建议解决方案。 [5分]

PART B
2. [8 Marks] Approximately, the weights of adult USA women has a mean
of 70.3 kg with a SD of 16.8 kg. Similarly, the weights of adult USA men
has a mean of 78.0 kg with a SD of 13.2 kg. Suppose that exactly half of the
population is of each gender and that weights are normally distributed. Use R
to solve the following problems—copy-and-paste your R command and output.
(a) Early last year Donald Trump weighed 243 pounds upon a physical examination. Compared to adult males from USA, he is in the upper what
percentile? That is, compute the proportion of adult males from USA
heavier than him. Use 1 kg = 2.2046 pounds. [2 marks]
(b) An adult person from USA happens to be 80kg. Approximately, how many
times is it more likely to be a male relative to a female? [2 marks]
(c) What is the probability the weight of a randomly chosen man exceeds the
weight of a randomly chosen woman, both adults from USA, by 20kg or
more? [2 marks]
(d) If a woman is in the lower quartile of weights of the adult female USA
population, what is the upper bound for her weight? [2 marks]

3. [9 Marks] Behind each of the following R code chunks is a purpose, e.g.,
it illustrates a statistical concept or result or idea.
(a) Consider the following R code.
N <- 3
Nsim <- 1e2
lambda <- 4
means <- numeric(Nsim)
for (i in 1:Nsim) {
xvec <- rpois(N, lambda)
means[i] <- mean(xvec)
}
hist(means)
(i) What statistical concept is illustrated by the output of the last line,
hist(means)? [1 mark]
(ii) What happens when only N is allowed to increase? Give the name of
any theoretical result associated with it. [2 marks]
(iii) What happens when only Nsim is allowed to increase? [1 mark]
(iv) What happens when only lambda is allowed to increase? [2 marks]
(b) Consider the following R code.
Nsim <- 1e2
N <- 10
sigma <- 0.5
xy0.df <- data.frame(x2 = runif(N))
meanfun <- function(x) -1.5 + 4 * x
xy0.df <- transform(xy0.df, y = meanfun(x2) + rnorm(N, 0, sigma))
myvec <- numeric(Nsim)
for (i in 1:Nsim) {
xy0.df <- transform(xy0.df, ysim = meanfun(x2) + rnorm(N, 0, sigma))
mod_i <- lm(ysim ~ x2, data = xy0.df)
myvec[i] <- sum(resid(mod_i)^2) / df.residual(mod_i)
}
mean(myvec)
Comment on the output of the last line, mean(myvec)—what statistical
concept does it illustrate?—and is it successful? [3 marks]

4. [17 Marks] To study junior school children in New Zealand, a random
sample of 500 boys and 500 girls was selected from all possible New Zealand
primary school children. The data are in a data frame called kids.df, having
the following columns.
age in years
sex 1 = female, 0 = male
height in metres
weight in kg
tv number of televisions at home
dbp diastolic blood pressure in mm Hg
siblings number of other brothers and sisters living at home
bothparents both parents living at home? 1 = yes, 0 = no
bovs born overseas? 1 = yes, 0 = no
Write a one or two line R LM or GLM to investigate the following research
questions, e.g.,
lm(weight ~ siblings, data = kids.df)
(a) Of interest is to examine whether the number of televisions at home is
related to how many people there are living at home. Briefly comment on
your model, e.g., the assumptions made on the variable or variables used.
[4 marks]
(b) Of interest is to find if any variables can explain (any) obesity in the child,
as measured by body mass index (BMI; kg/m2). [3 marks]
(c) Does the e↵ect of the family size on the probability of both parents living
at home depend on whether or not the children have an overseas influence
or connection? [3 marks]
Now suppose we are interested in fitting some GAMs using gam.
(d) Write your answer to (b) as a GAM instead of a GLM. Which model would
you fit first—the GLM or GAM—and why? [4 marks]
(e) Comment on the appropriateness of fit1 below. [3 marks]
fit1 <- gam(height ~ tv + s(age, df = 15) + sex + s(bovs), poisson,
data = kids.df)

5. [6 Marks] A statistical model has likelihood function L(p) / exp( 3
2 p2)
where p is the number of variables in the model (up to a maximum of 10).
Given a data set, we want to choose the number of variables using a penalty
function approach. Suppose the balancing parameter to be used is = 2 and
the penalty is B = (p 1)2. Using the negative log-likelihood as A, find the
optimal number of variables based on the quantity A + B.
6. [7 Marks] A large data set is separated into a training set and a test set.
(a) Is it necessary to do this randomly? Why or why not? [3 marks]
(b) In R how might this separation be done in a reproducible way? [1 mark]
(c) The statistician chooses 20% of the data for training and 80% for testing.
Comment briefly on this—2 or 3 lines would be plenty. [3 marks]

7. [14 Marks] Consider the following code concerning a small study of children who have had corrective spinal surgery. The response was whether kyphosis
(a type of deformation) was present or absent after the operation. The variable
Start measures the number of the first (topmost) vertebra operated on.
> data(kyphosis, package = “rpart”)
> ooo <- with(kyphosis, order(Start))
> kyphosis <- kyphosis[ooo, ]
> kfit1 = glm(Kyphosis ~ Start, binomial, data = kyphosis)
> summary(kfit1)
Call:
glm(formula = Kyphosis ~ Start, family = binomial, data = kyphosis)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.473 -0.518 -0.421 -0.341 2.131
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8901 0.6300 1.41 0.15769
Start -0.2179 0.0604 -3.61 0.00031 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 68.072 on 79 degrees of freedom
AIC: 72.07
Number of Fisher Scoring iterations: 5
> vcov(kfit1)
(Intercept) Start
(Intercept) 0.396849 -0.0332694
Start -0.033269 0.0036529
> plot(jitter(kfit1$y) ~ Start, data = kyphosis, col = “blue”,
ylab = “Response or probability”,
main = “Jittered Response of kyphosis Data”, las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = “darkgreen”)

5 10 15
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Jittered Response of kyphosis Data
Start
Response or probability
(a) Why was jitter() used? [1 mark]
(b) Why was the data frame sorted by Start? [1 mark]
(c) Is there any evidence of overdispersion with respect to the binomial? Justify your answer, e.g., by some numerical evidence. [3 marks]
Now consider the following code. We want to estimate ✓ = the value of Start
such that there is a 25% probability of presence of kyphosis, as well as obtain
an approximate 95% confidence interval for ✓.
> cfs <- coef(kfit1)
> eta = cfs[1] + cfs[2] * with(kyphosis, Start)
> n.obs = nrow(kyphosis)
> Nsim = 1e3
> betas = matrix(0, Nsim, length(cfs))
> for (i in 1:Nsim) {
ysim = rbinom(n.obs, size = 1, prob = 1 / (1 + exp(-eta)))
mod_i = glm(cbind(ysim, 1-ysim) ~ Start, binomial, data = kyphosis)
betas[i, ] = coef(mod_i)
}
> est = (-1.0986 – cfs[1]) / cfs[2]
> myci <- c(2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.975),
2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.025))
> plot(jitter(kfit1$y) ~ Start, data = kyphosis, col = “blue”,
ylab = “Response or probability”,
main = “Jittered Response of kyphosis Data”, las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = “darkgreen”)
> abline(v = c(est, myci), col = “red3”, lty = “dashed”)
> abline(h = 0.25, col = “gray”, lty = “dashed”)

5 10 15
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Jittered Response of kyphosis Data
Start
Response or probability
(d) Why was size = 1 used? [1 mark]
(e) Is this parametric bootstrapping, nonparametric bootstrapping or neither?
[1 mark]
(f) Explain where the -1.0986 came from. [1 mark]
(g) Obtain an approximate 95% confidence interval for the probability of
kyphosis when Start = 10. [4 marks]
(h) Using mgcv what commands would you use to fit a GAM to replace the
GLM kfit1. Don’t run it. [2 marks]

8. [7 Marks] The following are the performance characteristics of B-type natriuretic peptide as a diagnostic test for congestive heart failure (CHF). The
study was done around the year 2000, and the new diagnostic test was considered positive if Serum BNP > 100 pg/mL. The diagnostic test was trialled on
1586 patients. Of 744 patients that were disease positive, 670 tested positive.
Of 842 patients that were disease negative, 640 tested negative.
(a) What is the overall prevalence of the disease? [1 mark]
(b) What is the sensitivity of the test? [1 mark]
(c) What is the specificity of the test? [1 mark]
(d) Given somebody with a negative test, estimate the probability of having
no disease. [2 marks]
(e) For this example, from a patient’s point of view, is a false positive better
or worse, or neither, than a false negative? Why? [2 marks]