MATH4007/G14CST COMPUTATIONAL STATISTICS
Assessed Coursework 2

Questions
1. In general, suppose a method for constructing a confidence interval for some quantity of
interest $\theta$ exists. The coverage of the method is the actual probability (often expressed
equivalently as a percentage) that, given a random sample of data, the confidence interval
constructed from this sample will contain the true value of $\theta$. The nominal coverage is the
confidence level we choose, e.g. we often choose to construct 95% confidence intervals.
We’d like the coverage of the procedure to be equal to the nominal coverage – e.g. when
we construct 95% confidence intervals, we’d like the coverage to be 0.95 (or 95%), so that
intervals will contain the true value of $\theta$ 95% of the time. If the coverage and the nominal
coverage are equal, then the procedure is said to have exact coverage.
In some specific cases, we can construct intervals which have exact coverage. For instance,
when data are normally distributed, the usual confidence interval for the mean is derived
straight from the true sampling distribution of the sample mean (the estimator), and hence
has exact coverage. Usually though, intervals are constructed from approximations, using
e.g. asymptotic theory. We can investigate the coverage properties of a proposed method
using simulation.
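As an illustration of estimating coverage by simulation, the sketch below checks the exact-coverage case mentioned above: the usual t-interval for the mean of normal data. It is written in Python using only the standard library and is not part of the coursework specification. The sample size of 10, the true mean of 2, the seed, and the function names `t_interval` and `estimate_coverage` are all assumptions made for the example; the constant 2.262157 is the 97.5% point of the t distribution with 9 degrees of freedom.

```python
import math
import random
import statistics

# Assumption: n = 10 observations per sample, so the t quantile has 9 d.f.
N_OBS = 10
T_975_DF9 = 2.262157  # 97.5% point of the t distribution with 9 d.f.

def t_interval(sample):
    """Usual 95% confidence interval for a normal mean (valid for n = 10 only)."""
    centre = statistics.mean(sample)
    half_width = T_975_DF9 * statistics.stdev(sample) / math.sqrt(len(sample))
    return centre - half_width, centre + half_width

def estimate_coverage(true_mean, n_sims, rng):
    """Fraction of simulated intervals that contain the true mean."""
    hits = 0
    for _ in range(n_sims):
        sample = [rng.gauss(true_mean, 1.0) for _ in range(N_OBS)]
        lower, upper = t_interval(sample)
        hits += lower <= true_mean <= upper  # True counts as 1
    return hits / n_sims

rng = random.Random(1)  # fixed seed so the run is reproducible
coverage = estimate_coverage(true_mean=2.0, n_sims=4000, rng=rng)
print(round(coverage, 3))
```

The printed estimate should be close to the nominal 0.95, differing only by Monte Carlo error of roughly sqrt(0.95 × 0.05 / 4000) ≈ 0.003.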
Here, we’ll investigate the coverage of bootstrap confidence intervals, where $\theta$ is the true
value of a probability density function and $\hat{\theta}$ is the kernel density estimator.
Consider the mixture density
$$f(x; \mu_0, \mu_1, p) = (1 - p) f_0(x; \mu_0) + p f_1(x; \mu_1), \quad x \in \mathbb{R},$$
where $p \in (0, 1)$, $\mu_0 \in \mathbb{R}$, $\mu_1 \in \mathbb{R}$ are parameters and $f_0$ and $f_1$ are normal densities with
variance 1 and means $\mu_0$ and $\mu_1$ respectively. Let $p = 0.3$, $\mu_0 = 0$, $\mu_1 = 3$. Now, let $\theta$
be the true value of the density at $x = 0$ (i.e. $\theta = f(0; \mu_0 = 0, \mu_1 = 3, p = 0.3)$). Then,
given a sample of $n$ data points from $f$, $\hat{\theta}(h)$ is the kernel density estimate of $\theta$ using a
bandwidth of $h$ (we’ll use the standard normal kernel throughout, which is the default for
the density function in R).
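The quantities just defined can be sketched in a few lines. The example below is in Python (standard library only) rather than R, and evaluates the Gaussian-kernel estimate directly from its definition at the single point x = 0 instead of calling R's density function, which evaluates on a grid. The function names, the seed, and the bandwidth h = 0.5 are assumptions chosen for illustration.

```python
import math
import random

P, MU0, MU1 = 0.3, 0.0, 3.0  # parameter values given in the question

def normal_pdf(x, mean=0.0):
    """Density of a normal distribution with the given mean and variance 1."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def mixture_pdf(x):
    """f(x; mu0, mu1, p) = (1 - p) f0(x; mu0) + p f1(x; mu1)."""
    return (1.0 - P) * normal_pdf(x, MU0) + P * normal_pdf(x, MU1)

def sample_mixture(n, rng):
    """Draw n points from f: component 1 with probability p, else component 0."""
    return [rng.gauss(MU1 if rng.random() < P else MU0, 1.0) for _ in range(n)]

def kde_at(x, data, h):
    """Gaussian-kernel density estimate at x with bandwidth h."""
    return sum(normal_pdf((x - xi) / h) for xi in data) / (len(data) * h)

theta = mixture_pdf(0.0)             # true value of the density at x = 0
rng = random.Random(0)               # fixed seed, an arbitrary choice
data = sample_mixture(100, rng)
theta_hat = kde_at(0.0, data, h=0.5)  # h = 0.5 is an illustrative bandwidth

print(round(theta, 4))  # 0.2806
print(theta_hat)
```

The true value is 0.7 φ(0) + 0.3 φ(3) ≈ 0.2806, where φ is the standard normal density. The estimate `theta_hat` varies with the sample and with h.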
The objective is to perform a simulation study to investigate how the coverage of the
bootstrap confidence interval for $\theta$ depends on $h$, by repeatedly simulating samples from
$f$, constructing bootstrap intervals for $\theta$, and computing how many intervals contain the
true value of $\theta$. Specifically, do this as follows:

• Use a sample size of $n = 100$ throughout.
• For a fixed value of $h$, do the following:
– Simulate $n = 100$ data points from $f$.
– Use the nonparametric bootstrap, with statistic $\hat{\theta}(h)$, to produce a 95% confidence
interval for $\theta$ using this simulated sample. Store this interval.
– Repeat these two steps to produce an appropriately large number of intervals,
each based on a different simulated sample from $f$.
– Use the true value of $\theta$ to test whether each interval contains it, and hence
determine the (estimated) coverage of the procedure for this value of $h$.
• Do this for various $h$, and report and discuss your findings. You may use the density
command to compute $\hat{\theta}$ within the bootstrap algorithm, but you must code the actual
bootstrap procedure yourself (and not use boot or similar).
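The steps above can be sketched as follows. This is a minimal illustration in Python (standard library only), not a model answer: the question leaves the type of bootstrap interval open, so the percentile interval used here is an assumption, as are the quantile-index convention, the function names, the seed, the bandwidths tried, and the deliberately small numbers of simulations and bootstrap resamples (chosen so that the example runs quickly).

```python
import math
import random

P, MU0, MU1, N = 0.3, 0.0, 3.0, 100  # values given in the question

def normal_pdf(x, mean=0.0):
    """Density of a normal distribution with the given mean and variance 1."""
    return math.exp(-0.5 * (x - mean) ** 2) / math.sqrt(2.0 * math.pi)

def sample_mixture(n, rng):
    """Draw n points from the mixture density f."""
    return [rng.gauss(MU1 if rng.random() < P else MU0, 1.0) for _ in range(n)]

def kde_at_zero(data, h):
    """Gaussian-kernel density estimate at x = 0 with bandwidth h."""
    return sum(normal_pdf(xi / h) for xi in data) / (len(data) * h)

def percentile_interval(data, h, n_boot, rng, level=0.95):
    """Hand-coded nonparametric bootstrap; percentile interval (an assumption)."""
    n = len(data)
    # Resample the data with replacement and recompute the statistic each time.
    stats = sorted(kde_at_zero(rng.choices(data, k=n), h) for _ in range(n_boot))
    alpha = 1.0 - level
    lo_idx = math.floor((alpha / 2) * (n_boot - 1))
    hi_idx = math.ceil((1 - alpha / 2) * (n_boot - 1))
    return stats[lo_idx], stats[hi_idx]

def estimate_coverage(h, n_sims, n_boot, rng):
    """Fraction of bootstrap intervals that contain the true density at 0."""
    theta = (1 - P) * normal_pdf(0.0, MU0) + P * normal_pdf(0.0, MU1)
    hits = 0
    for _ in range(n_sims):
        data = sample_mixture(N, rng)  # a fresh sample from f each time
        lower, upper = percentile_interval(data, h, n_boot, rng)
        hits += lower <= theta <= upper
    return hits / n_sims

rng = random.Random(2024)  # fixed seed so the run is reproducible
for h in (0.1, 0.3, 1.0):  # illustrative bandwidths only
    print(h, estimate_coverage(h, n_sims=40, n_boot=100, rng=rng))
```

With only 40 simulated samples per bandwidth, each estimated coverage c has a Monte Carlo standard error of about sqrt(c(1 − c)/40), so a real study would need far larger values of `n_sims` and `n_boot`.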

2. (a) The data in the file “CW2Data.txt” on Moodle give 100 observations $x_i$, $i = 1, \ldots, 100$,
which are the measurements of a certain protein linked to a form of cancer, with higher
levels linked to increased risk. It is proposed to fit a mixture of two normal distributions
to the data, with a view to using the measurements of the protein to cluster into
groups of high and low risk of the cancer. Separately, using other measurements,
the individuals have been labelled by experts as high and low risk, but this is
time-consuming and costly, so experts would like to know whether the protein measurement
can ultimately be used for classification. These labels are not to be used in any of the
model fitting, but will be used to assess classification performance in part (d) below.
Task for part (a): Produce suitable plots and comment on the proposal to use
mixture modelling to cluster the data.