MATH4007/G14CST COMPUTATIONAL STATISTICS
Assessed Coursework 2

Questions
1. In general, suppose a method exists for constructing a confidence interval for some quantity of
interest $\theta$. The coverage of the method is the actual probability (often expressed
equivalently as a percentage) that, given a random sample of data, the confidence interval
constructed from this sample will contain the true value of $\theta$. The nominal coverage is the
confidence level we choose, e.g. we often choose to construct 95% confidence intervals.
We’d like the coverage of the procedure to be equal to the nominal coverage – e.g. when
we construct 95% confidence intervals, we’d like the coverage to be 0.95 (or 95%), so that
intervals will contain the true value of $\theta$ 95% of the time. If the coverage and the nominal
coverage are equal, then the procedure is said to have exact coverage.
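The definitions above can be written compactly. The notation here is introduced for illustration only and is not part of the coursework sheet: $[L(X), U(X)]$ denotes the interval computed from a random sample $X$, and $1 - \alpha$ denotes the nominal level (0.95 for 95% intervals).

```latex
\[
  \text{coverage} \;=\; \Pr\bigl( L(X) \le \theta \le U(X) \bigr),
  \qquad
  \text{exact coverage:}\quad \Pr\bigl( L(X) \le \theta \le U(X) \bigr) \;=\; 1 - \alpha .
\]
```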
In some specific cases, we can construct intervals which have exact coverage. For instance,
when data are normally distributed, the usual confidence interval for the mean is derived
straight from the true sampling distribution of the sample mean (the estimator), and hence
has exact coverage. Usually though, intervals are constructed from approximations, using
e.g. asymptotic theory. We can investigate the coverage properties of a proposed method
using simulation.
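As a rough illustration of checking coverage by simulation, the sketch below estimates the coverage of a confidence interval for a normal mean. It is written in Python with the standard library only, not in R. To keep it dependency-free it assumes the variance is known and uses the z-interval, which is a simplification of the usual t-interval mentioned above. The function names and the values of mu, sigma, n and n_sims are illustrative choices, not taken from the coursework.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, level=0.95):
    """Known-variance confidence interval for the mean of normal data."""
    z = NormalDist().inv_cdf(0.5 + level / 2)      # e.g. about 1.96 for 95%
    half_width = z * sigma / len(sample) ** 0.5
    centre = fmean(sample)
    return centre - half_width, centre + half_width


def estimate_coverage(mu=1.0, sigma=2.0, n=20, n_sims=5000, seed=1):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma)
        hits += lower <= mu <= upper               # True counts as 1
    return hits / n_sims


coverage = estimate_coverage()
print(f"estimated coverage: {coverage:.3f}")
```

Because this interval is exact under the stated assumptions, the estimate should sit close to 0.95, differing only by Monte Carlo error that shrinks as n_sims grows.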
Here, we’ll investigate the coverage of bootstrap confidence intervals, where $\theta$ is the true
value of a probability density function and $\hat{\theta}$ is the kernel density estimator.
Consider the mixture density
$$f(x; \mu_0, \mu_1, p) = (1 - p) f_0(x; \mu_0) + p f_1(x; \mu_1), \quad x \in \mathbb{R},$$
where $p \in (0, 1)$, $\mu_0 \in \mathbb{R}$, $\mu_1 \in \mathbb{R}$ are parameters and $f_0$ and $f_1$ are normal densities with
variance 1 and mean $\mu_0$ and $\mu_1$ respectively. Let $p = 0.3$, $\mu_0 = 0$, $\mu_1 = 3$. Now, let $\theta$
be the true value of the density at $x = 0$ (i.e. $\theta = f(0; \mu_0 = 0, \mu_1 = 3, p = 0.3)$). Then,
given a sample of $n$ data points from $f$, $\hat{\theta}(h)$ is the kernel density estimate of $\theta$ using a
bandwidth of $h$ (we’ll use the standard normal kernel throughout, which is the default for
the density function in R).
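To make these quantities concrete, here is a minimal Python sketch (standard library only, not R) that evaluates the true value $\theta$, draws a sample from the mixture, and computes a normal-kernel density estimate at $x = 0$ directly from the textbook formula $\hat{\theta}(h) = \frac{1}{nh} \sum_i \phi((0 - x_i)/h)$. The helper names, the seed and the bandwidth h = 0.5 are illustrative assumptions. R's density command evaluates the estimate on a grid, so its output may differ slightly from this direct formula. With the parameter values above, $\theta$ works out to about 0.2806.

```python
import math
import random


def normal_pdf(x, mean=0.0, sd=1.0):
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def mixture_pdf(x, mu0=0.0, mu1=3.0, p=0.3):
    """f(x) = (1 - p) f0(x; mu0) + p f1(x; mu1), both components with variance 1."""
    return (1 - p) * normal_pdf(x, mu0) + p * normal_pdf(x, mu1)


def sample_mixture(n, rng, mu0=0.0, mu1=3.0, p=0.3):
    """Draw n points: each comes from component 1 with probability p."""
    return [rng.gauss(mu1 if rng.random() < p else mu0, 1.0) for _ in range(n)]


def kde_at(x0, data, h):
    """Normal-kernel density estimate at the single point x0 with bandwidth h."""
    return sum(normal_pdf((x0 - xi) / h) for xi in data) / (len(data) * h)


theta = mixture_pdf(0.0)                 # true density at x = 0
rng = random.Random(0)                   # fixed seed, illustrative only
data = sample_mixture(100, rng)
theta_hat = kde_at(0.0, data, h=0.5)     # h = 0.5 is an arbitrary example value

print(f"theta     = {theta:.4f}")
print(f"theta_hat = {theta_hat:.4f}")
```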
The objective is to perform a simulation study to investigate how the coverage of the
bootstrap confidence interval for $\theta$ depends on $h$, by repeatedly simulating samples from
$f$, constructing bootstrap intervals for $\theta$, and computing how many intervals contain the
true value of $\theta$. Specifically, do this as follows:

• Use a sample size of $n = 100$ throughout.
• For a fixed value of $h$, do the following:
  – Simulate $n = 100$ data points from $f$.
  – Use the nonparametric bootstrap, with statistic $\hat{\theta}(h)$, to produce a 95% confidence
    interval for $\theta$ using this simulated sample. Store this interval.
  – Repeat these two steps to produce an appropriately large number of intervals,
    each based on a different simulated sample from $f$.
  – Using the true value of $\theta$ to test whether each interval contains the true value,
    determine the (estimated) coverage of the procedure for this value of $h$.
• Do this for various $h$, and report and discuss your findings. You may use the density
  command to compute $\hat{\theta}$ within the bootstrap algorithm, but you must code the actual
  bootstrap procedure yourself (and not use boot or similar).
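The steps above can be sketched as follows. This is a hedged outline in Python (standard library only), not the R solution the coursework asks for. Several choices are assumptions rather than requirements from the sheet: the percentile method is used for the bootstrap interval (the sheet does not say which interval type to use), the kernel estimate is computed from the direct formula instead of R's density command, and the bandwidths, n_sims and n_boot are small illustrative values chosen for speed. All function names are hypothetical.

```python
import math
import random


def normal_pdf(x, mean=0.0, sd=1.0):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2 * math.pi))


def sample_mixture(n, rng, mu0=0.0, mu1=3.0, p=0.3):
    return [rng.gauss(mu1 if rng.random() < p else mu0, 1.0) for _ in range(n)]


def quantile(sorted_vals, q):
    """Empirical quantile with linear interpolation between order statistics."""
    pos = q * (len(sorted_vals) - 1)
    lo = math.floor(pos)
    hi = min(lo + 1, len(sorted_vals) - 1)
    frac = pos - lo
    return sorted_vals[lo] * (1 - frac) + sorted_vals[hi] * frac


def bootstrap_interval(data, h, rng, x0=0.0, n_boot=300, level=0.95):
    """Percentile bootstrap interval (an assumed choice) for the density at x0."""
    n = len(data)
    # The kernel estimate at x0 is the mean of these per-observation terms, so
    # resampling the terms with replacement is the same as resampling the data
    # with replacement and recomputing the estimate.
    contrib = [normal_pdf((x0 - xi) / h) / h for xi in data]
    estimates = sorted(sum(rng.choices(contrib, k=n)) / n for _ in range(n_boot))
    alpha = 1 - level
    return quantile(estimates, alpha / 2), quantile(estimates, 1 - alpha / 2)


def bootstrap_coverage(h, n=100, n_sims=100, n_boot=300, seed=0):
    """Estimated coverage: share of simulated intervals containing the true theta."""
    rng = random.Random(seed)
    theta = 0.7 * normal_pdf(0.0, 0.0) + 0.3 * normal_pdf(0.0, 3.0)
    hits = 0
    for _ in range(n_sims):
        data = sample_mixture(n, rng)
        lower, upper = bootstrap_interval(data, h, rng, n_boot=n_boot)
        hits += lower <= theta <= upper
    return hits / n_sims


# Illustrative bandwidths only; a real study would use a finer grid of h
# and larger n_sims and n_boot.
coverage_by_h = {h: bootstrap_coverage(h) for h in (0.1, 0.5, 2.0)}
for h, cov in coverage_by_h.items():
    print(f"h = {h}: estimated coverage = {cov:.2f}")
```

Each estimated coverage is itself a binomial proportion, so its Monte Carlo standard error is roughly sqrt(c(1 - c) / n_sims) for true coverage c. That gives one way to judge what counts as an "appropriately large" number of intervals.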

2. (a) The data in the file “CW2Data.txt” on Moodle give 100 observations $x_i$, $i = 1, \ldots, 100$,
which are the measurements of a certain protein linked to a form of cancer, with higher
levels linked to increased risk. It is proposed to fit a mixture of two normal distributions
to the data, with a view to using the measurements of the protein to cluster into
groups of high and low risk of the cancer. Separately, using other measurements,
the individuals have been labelled by experts as high and low risk, but this is time
consuming and costly so experts would like to know whether the protein measurement
can ultimately be used for classification. These labels are not to be used in any of the
model fitting, but will be used to assess classification performance in part (d) below.
Task for part (a): Produce suitable plots and comment on the proposal to use
mixture modelling to cluster the data.
