STATS 330

Part A
1. [32分]一位保护主义者对探索保护区的影响感兴趣。

1 0中度之前
2 3中度之前
3 1中度
4 19中度之前
5 8中度之前

> servation.fit1 <-glm（数字〜剔除*放牧，泊松，servation.df）
>摘要（conservation.fit1）

glm（公式=数量〜剔除*放牧，家庭=泊松，数据=守恒.df）

-4.805 -2.452 -1.020 0.234 8.936

（拦截）1.8563 0.0884 21.00 <2e-16 ***
CullBefore -0.6776 0.1523 -4.45 8.6e-06 ***

（泊松族的色散参数取为1）

AIC：624.7
Fisher计分迭代次数：6
（a）写出方程式以充分描述拟合模型的守恒1。
[5分]
（b）解释模型Eserv 1.fit中的Cull效应。虽然

（c）基于servation.fit1的偏差，有证据表明

>图书馆（“ MASS”）
> servation.fit2 <-glm.nb（Number〜Cull * Grazing，data = conservation.df）
（d）servation.fit2模型如何处理过度分散的问题？ [3分]
（e）考虑一下summary（conservation.fit2）-故意未显示。
（i）估计数与来自
servation.fit1？证明你的答案。
（ii）标准错误与来自
servation.fit1？证明你的答案。

[5分]
> logLik（conservation.fit2）
“日志库”。 -180.06（df = 5）
（f）计算模型守恒的AIC和BIC。 [4分]
（g）使用AIC作为标准，您希望在哪种模型之间
servation.fit1和servation.fit2？证明你的答案。
[3分]
（h）如果在同一地点进行的第二次鸟类调查是在30分钟而不是20分钟内进行的，那将会带来什么后果

PART B
2. [8 Marks] Approximately, the weights of adult USA women has a mean
of 70.3 kg with a SD of 16.8 kg. Similarly, the weights of adult USA men
has a mean of 78.0 kg with a SD of 13.2 kg. Suppose that exactly half of the
population is of each gender and that weights are normally distributed. Use R
to solve the following problems—copy-and-paste your R command and output.
(a) Early last year Donald Trump weighed 243 pounds upon a physical examination. Compared to adult males from USA, he is in the upper what
percentile? That is, compute the proportion of adult males from USA
heavier than him. Use 1 kg = 2.2046 pounds. [2 marks]
(b) An adult person from USA happens to be 80kg. Approximately, how many
times is it more likely to be a male relative to a female? [2 marks]
(c) What is the probability the weight of a randomly chosen man exceeds the
weight of a randomly chosen woman, both adults from USA, by 20kg or
more? [2 marks]
(d) If a woman is in the lower quartile of weights of the adult female USA
population, what is the upper bound for her weight? [2 marks]

3. [9 Marks] Behind each of the following R code chunks is a purpose, e.g.,
it illustrates a statistical concept or result or idea.
(a) Consider the following R code.
N <- 3
Nsim <- 1e2
lambda <- 4
means <- numeric(Nsim)
for (i in 1:Nsim) {
xvec <- rpois(N, lambda)
means[i] <- mean(xvec)
}
hist(means)
(i) What statistical concept is illustrated by the output of the last line,
hist(means)? [1 mark]
(ii) What happens when only N is allowed to increase? Give the name of
any theoretical result associated with it. [2 marks]
(iii) What happens when only Nsim is allowed to increase? [1 mark]
(iv) What happens when only lambda is allowed to increase? [2 marks]
(b) Consider the following R code.
Nsim <- 1e2
N <- 10
sigma <- 0.5
xy0.df <- data.frame(x2 = runif(N))
meanfun <- function(x) -1.5 + 4 * x
xy0.df <- transform(xy0.df, y = meanfun(x2) + rnorm(N, 0, sigma))
myvec <- numeric(Nsim)
for (i in 1:Nsim) {
xy0.df <- transform(xy0.df, ysim = meanfun(x2) + rnorm(N, 0, sigma))
mod_i <- lm(ysim ~ x2, data = xy0.df)
myvec[i] <- sum(resid(mod_i)^2) / df.residual(mod_i)
}
mean(myvec)
Comment on the output of the last line, mean(myvec)—what statistical
concept does it illustrate?—and is it successful? [3 marks]

4. [17 Marks] To study junior school children in New Zealand, a random
sample of 500 boys and 500 girls was selected from all possible New Zealand
primary school children. The data are in a data frame called kids.df, having
the following columns.
age in years
sex 1 = female, 0 = male
height in metres
weight in kg
tv number of televisions at home
dbp diastolic blood pressure in mm Hg
siblings number of other brothers and sisters living at home
bothparents both parents living at home? 1 = yes, 0 = no
bovs born overseas? 1 = yes, 0 = no
Write a one or two line R LM or GLM to investigate the following research
questions, e.g.,
lm(weight ~ siblings, data = kids.df)
(a) Of interest is to examine whether the number of televisions at home is
related to how many people there are living at home. Briefly comment on
your model, e.g., the assumptions made on the variable or variables used.
[4 marks]
(b) Of interest is to find if any variables can explain (any) obesity in the child,
as measured by body mass index (BMI; kg/m2). [3 marks]
(c) Does the e↵ect of the family size on the probability of both parents living
at home depend on whether or not the children have an overseas influence
or connection? [3 marks]
Now suppose we are interested in fitting some GAMs using gam.
you fit first—the GLM or GAM—and why? [4 marks]
(e) Comment on the appropriateness of fit1 below. [3 marks]
fit1 <- gam(height ~ tv + s(age, df = 15) + sex + s(bovs), poisson,
data = kids.df)

5. [6 Marks] A statistical model has likelihood function L(p) / exp( 3
2 p2)
where p is the number of variables in the model (up to a maximum of 10).
Given a data set, we want to choose the number of variables using a penalty
function approach. Suppose the balancing parameter to be used is = 2 and
the penalty is B = (p 1)2. Using the negative log-likelihood as A, find the
optimal number of variables based on the quantity A + B.
6. [7 Marks] A large data set is separated into a training set and a test set.
(a) Is it necessary to do this randomly? Why or why not? [3 marks]
(b) In R how might this separation be done in a reproducible way? [1 mark]
(c) The statistician chooses 20% of the data for training and 80% for testing.
Comment briefly on this—2 or 3 lines would be plenty. [3 marks]

7. [14 Marks] Consider the following code concerning a small study of children who have had corrective spinal surgery. The response was whether kyphosis
(a type of deformation) was present or absent after the operation. The variable
Start measures the number of the first (topmost) vertebra operated on.
> data(kyphosis, package = “rpart”)
> ooo <- with(kyphosis, order(Start))
> kyphosis <- kyphosis[ooo, ]
> kfit1 = glm(Kyphosis ~ Start, binomial, data = kyphosis)
> summary(kfit1)
Call:
glm(formula = Kyphosis ~ Start, family = binomial, data = kyphosis)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.473 -0.518 -0.421 -0.341 2.131
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 0.8901 0.6300 1.41 0.15769
Start -0.2179 0.0604 -3.61 0.00031 ***

Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ‘ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 83.234 on 80 degrees of freedom
Residual deviance: 68.072 on 79 degrees of freedom
AIC: 72.07
Number of Fisher Scoring iterations: 5
> vcov(kfit1)
(Intercept) Start
(Intercept) 0.396849 -0.0332694
Start -0.033269 0.0036529
> plot(jitter(kfit1\$y) ~ Start, data = kyphosis, col = “blue”,
ylab = “Response or probability”,
main = “Jittered Response of kyphosis Data”, las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = “darkgreen”)

5 10 15
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Jittered Response of kyphosis Data
Start
Response or probability
(a) Why was jitter() used? [1 mark]
(b) Why was the data frame sorted by Start? [1 mark]
(c) Is there any evidence of overdispersion with respect to the binomial? Justify your answer, e.g., by some numerical evidence. [3 marks]
Now consider the following code. We want to estimate ✓ = the value of Start
such that there is a 25% probability of presence of kyphosis, as well as obtain
an approximate 95% confidence interval for ✓.
> cfs <- coef(kfit1)
> eta = cfs + cfs * with(kyphosis, Start)
> n.obs = nrow(kyphosis)
> Nsim = 1e3
> betas = matrix(0, Nsim, length(cfs))
> for (i in 1:Nsim) {
ysim = rbinom(n.obs, size = 1, prob = 1 / (1 + exp(-eta)))
mod_i = glm(cbind(ysim, 1-ysim) ~ Start, binomial, data = kyphosis)
betas[i, ] = coef(mod_i)
}
> est = (-1.0986 – cfs) / cfs
> myci <- c(2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.975),
2*est-quantile((-1.0986-betas[, 1])/betas[, 2], prob=0.025))
> plot(jitter(kfit1\$y) ~ Start, data = kyphosis, col = “blue”,
ylab = “Response or probability”,
main = “Jittered Response of kyphosis Data”, las = 1)
> lines(fitted(kfit1) ~ Start, data = kyphosis, col = “darkgreen”)
> abline(v = c(est, myci), col = “red3”, lty = “dashed”)
> abline(h = 0.25, col = “gray”, lty = “dashed”)

5 10 15
−0.2
0.0
0.2
0.4
0.6
0.8
1.0
1.2
Jittered Response of kyphosis Data
Start
Response or probability
(d) Why was size = 1 used? [1 mark]
(e) Is this parametric bootstrapping, nonparametric bootstrapping or neither?
[1 mark]
(f) Explain where the -1.0986 came from. [1 mark]
(g) Obtain an approximate 95% confidence interval for the probability of
kyphosis when Start = 10. [4 marks]
(h) Using mgcv what commands would you use to fit a GAM to replace the
GLM kfit1. Don’t run it. [2 marks]

8. [7 Marks] The following are the performance characteristics of B-type natriuretic peptide as a diagnostic test for congestive heart failure (CHF). The
study was done around the year 2000, and the new diagnostic test was considered positive if Serum BNP > 100 pg/mL. The diagnostic test was trialled on
1586 patients. Of 744 patients that were disease positive, 670 tested positive.
Of 842 patients that were disease negative, 640 tested negative.
(a) What is the overall prevalence of the disease? [1 mark]
(b) What is the sensitivity of the test? [1 mark]
(c) What is the specificity of the test? [1 mark]
(d) Given somebody with a negative test, estimate the probability of having
no disease. [2 marks]
(e) For this example, from a patient’s point of view, is a false positive better
or worse, or neither, than a false negative? Why? [2 marks] EasyDue™ 支持PayPal, AliPay, WechatPay, Taobao等各种付款方式!

E-mail: easydue@outlook.com  微信:easydue

EasyDue™是一个服务全球中国留学生的专业代写公司