This assignment uses R to construct and evaluate classifiers that predict the species to which a specimen belongs.

STAT3006 Assignment 3, 2020

This assignment involves constructing and evaluating classifiers. We will first consider the
now-familiar Iris dataset, collected by Anderson (1936) and first statistically analysed by
Fisher (1936). It requires one to solve the species problem, that is, to predict the (known)
species to which a given specimen belongs.
The second dataset used to train classifiers is the Modified National Institute of Standards
and Technology (MNIST) database of handwritten digits. This contains 70,000 images, of
which 10,000 are held out for testing. In each case, you will use the given labelled data
and attempt to construct classifiers that can accurately classify unlabelled observations.
Tasks:
1. For one probability-based classifier and one non-probability-based classifier, to be used in
answering the later questions, describe the methods mathematically. Include a mathematically
oriented overview of how you propose to choose all parameters of each method. [3 marks]
2. Apply one probability-based classifier and one non-probability-based classifier to the Iris
dataset using R, report the results and interpret them.
The results for each classifier should include the following:
a) The characteristics of each class (including parameter estimates). [1 mark]
b) The apparent error rate, class-specific and overall: obtained by training the classifier on
the full dataset, then applying it to the same dataset and examining the error rate. [1 mark]
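As a minimal sketch of how the apparent error rates in b) can be computed, assuming linear discriminant analysis (MASS::lda) purely as an illustrative probability-based classifier (the assignment does not prescribe this choice):

```r
# Train on the full iris dataset, then apply the classifier to the same data.
library(MASS)

fit  <- lda(Species ~ ., data = iris)
pred <- predict(fit, iris)$class
tab  <- table(truth = iris$Species, pred = pred)

overall_apparent <- mean(pred != iris$Species)      # overall apparent error rate
class_apparent   <- 1 - diag(tab) / rowSums(tab)    # class-specific error rates
```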
c) Cross-validation (CV) based estimates of the overall and class-specific error rates: obtained
by training the classifier on a large portion of the full dataset, then applying it to the
remaining data and examining the error rate. You may use 5-fold, 10-fold or leave-one-out
cross-validation to assess performance, but you should give a statistical justification for your
choice. Also include an approximate 95% confidence interval for each error rate, along with a
description of how this was obtained. [2 marks]
One option for estimating the error rates is the R package ipred, which contains a function
called errorest. With some manipulation, errorest should be able to produce CV estimates
for most classifiers. However, many alternative packages are available.
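A sketch of how ipred::errorest might be invoked for 10-fold CV, again with MASS::lda as an illustrative classifier; the predict wrapper is needed because predict.lda returns a list rather than a factor of class labels:

```r
library(ipred)
library(MASS)

cv_est <- errorest(Species ~ ., data = iris,
                   model     = lda,
                   predict   = function(object, newdata)
                                 predict(object, newdata)$class,
                   estimator = "cv",
                   est.para  = control.errorest(k = 10))
cv_est$error   # overall 10-fold CV error rate estimate
```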
d) Plots of the predicted classes as applied to the data and to the data space (including a
visual representation of the decision boundaries), covering all unique pairs of explanatory
variables. Repeat this for an enlarged set of plots (with a wider range on each axis) to examine
how new outliers would be handled, and comment if you see anything surprising. [2 marks]
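One common way to draw the decision boundaries requested in d) is to predict the class over a fine grid and colour the background. The sketch below assumes MASS::lda fitted on a single illustrative variable pair, with axis ranges deliberately widened as the item suggests:

```r
library(MASS)

# Fit on one pair of explanatory variables so the boundary lives in 2D.
fit <- lda(Species ~ Petal.Length + Petal.Width, data = iris)

grid <- expand.grid(
  Petal.Length = seq(0, 9, length.out = 200),   # wider than the observed range
  Petal.Width  = seq(0, 4, length.out = 200)
)
grid$pred <- predict(fit, grid)$class

plot(grid$Petal.Length, grid$Petal.Width,
     col = as.integer(grid$pred), pch = 15, cex = 0.3,
     xlab = "Petal.Length", ylab = "Petal.Width")
points(iris$Petal.Length, iris$Petal.Width,
       col = as.integer(iris$Species), pch = 19)
```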
e) Find, list and discuss the iris observations that were misclassified in the apparent error
rate and CV checks. [1 mark]
f) Compare and contrast the decision boundaries between the classes produced by the two
methods, and try to explain their shapes. Which method do you believe is best for this
dataset? Explain. Describe any aspects of the two methods that you believe are not well
suited to this classification problem. [2 marks]
g) You are now asked to predict the classes of new observations collected from a region in
which the class proportions of setosa, virginica and versicolor are 0.2, 0.2 and 0.6,
respectively. Describe (in mathematical detail) how you would change or refit each
classifier to give it the best possible predictive error rate in these circumstances. Change
or refit the classifiers as needed, and report point estimates of the new classifiers'
parameters. Explain the nature of these changes. [3 marks]
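As a sketch of the kind of adjustment g) invites for a probability-based classifier: if \(\hat{f}_k\) are the estimated class-conditional densities and \(\hat{\pi}_k\) the training-data class proportions, a Bayes-style rule classifies by maximising the prior-weighted density, so the new proportions \(\pi'_k = (0.2, 0.2, 0.6)\) simply replace the old priors, equivalently reweighting the posterior probabilities:

```latex
\hat{y}(x) = \arg\max_{k} \; \pi'_k \, \hat{f}_k(x),
\qquad
p'(k \mid x) \;\propto\; p(k \mid x) \, \frac{\pi'_k}{\hat{\pi}_k} .
```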
Note: we will not compare our classification results with those of Fisher's 1936 paper, "The
Use of Multiple Measurements in Taxonomic Problems". However, the paper is worth reading
for background on the dataset and some of its intended analyses.
3. Choose two methods of classification that you have not used on the Iris dataset and apply
them to the MNIST dataset. You are welcome to
combine techniques if you wish. Leave the train and test split as it is, but feel free to use
some of the training data to help choose a model, if desired. Aim for best possible predictive
performance, but view this as primarily a learning exercise. I.e. you do not need to choose
methods with the best performance. However, you should aim to get reasonable performance
out of any method chosen, e.g. with reasonable choice of any hyperparameters. You should
not pre-process the data in a way which makes use of your knowledge of the digit recognition
problem, i.e. don’t try to produce new explanatory variables which represent image features,
even though this would likely help performance. You can use dimension reduction if you
wish (e.g. PCA).
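If PCA is used for dimension reduction, one minimal sketch with base R's prcomp follows; train_x and test_x are hypothetical matrices of flattened images (rows = images, columns = the 784 pixels), and 50 components is an arbitrary illustrative choice:

```r
# Fit the rotation on the training images only, then project both sets.
pca          <- prcomp(train_x, center = TRUE)
train_scores <- pca$x[, 1:50]                   # reduced training features
test_scores  <- predict(pca, test_x)[, 1:50]    # test set projected the same way
```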
(a) Give a brief introduction to the dataset, including quantitative aspects. [1 mark]
(b) Give a summary of the predictive performance on the test set for each classifier. Make
sure you do not use the test set at all before doing this. Include at least estimated overall error
rate and class-specific error rates, along with approximate 95% confidence intervals for these.
[2 marks]
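One standard way to obtain the approximate 95% intervals requested here is the normal approximation to a binomial proportion: if the classifier errs at estimated rate \(\hat{p}\) over \(n\) independent test cases (for a class-specific rate, \(n\) is the number of test cases in that class), an approximate 95% interval is

```latex
\hat{p} \;\pm\; 1.96 \sqrt{\frac{\hat{p}\,(1 - \hat{p})}{n}} .
```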
(c) For each classifier, also report error rates as estimated using the training set. Attempt to
explain any differences between the error rates estimated from the training and test sets. Note
that references to training and test sets here are to the labelling of the original data, not to how
you may have used them. [2 marks]
(d) Explain why you chose each classifier type and describe some of their apparent strengths
and weaknesses for this problem. [2 marks]
(e) For each classifier, show 1 example per digit (i.e. 20 in total) of handwritten digits which
were classified into the correct class with the most certainty, and quantify what you mean by
certainty. Explain why you think the classifiers were particularly successful at classifying
these correctly and with certainty. [3 marks]
(f) For each classifier, show 1 example per digit (i.e. 20 in total) of the worst errors made by
your classifier and quantify what you mean by worst. Explain why you think some of these
errors may have been made by your classifier and been among the worst seen. [2 marks]
(g) What is the difference between a handwritten 7 and a 1 according to each classifier? Try
to explain what each classifier is doing in this case, i.e. what are the main things the classifier
considers to make this decision and how are they used? [3 marks]