Consider a dataset related to red and white variants of the Portuguese “Vinho Verde”
wine. The data file winedata.csv can be downloaded from the course Canvas page. For
a description of the data and the variables, see the website
https://archive.ics.uci.edu/ml/datasets/wine+quality. Due to privacy and logistic issues, only
physicochemical (inputs) and quality (the output) variables are available (e.g. there is no data
about grape types, wine brand, wine selling price, etc.). For all of the data analysis, please
include the R code and the R output in an Appendix to the report.
1. In this dataset, the last column wine indicates whether the wine is a red or white wine
and the variable quality indicates the quality of the wine on a scale of 1-10. The first 11
variables can be used as predictors. Check whether the means of the predictors are significantly
different for the two types of wines. Present your results in a table with mean, standard
deviation, p-values or any other useful information you think is necessary. Discuss the
2. Divide the data randomly into a training set and a test set. State the seed used for the
random sampling so that the experiment can be reproduced. You must also summarise
the number of red and white wine samples in your training and test sets.
3. Compare different classification methods including logistic regression, linear discriminant
analysis, classification trees and any other classification methods of your choice. Discuss
the methods used, including the selection of any tuning parameters. If cross-validation is used,
specify exactly what you have done.
4. Write a concluding section with your comments about how you can classify different wines
using their physicochemical characteristics and whether there are any particular variables which
are more important than the others.
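As an illustration of tasks 1 to 3, the following is a minimal sketch in R, not a model answer. It runs on simulated stand-in data, because the column names of winedata.csv are not given here: the three predictor names, the seed 2024, the 70/30 split and the 0.5 cut-off are all assumptions made for the example. It uses Welch two-sample t-tests for the comparison of means, which is one reasonable choice among several.

```r
# Sketch only: wine_df is SIMULATED stand-in data, not the real winedata.csv.
# With the real file one would start from something like
#   wine_df <- read.csv("winedata.csv"); wine_df$wine <- factor(wine_df$wine)
library(MASS)   # lda()
library(rpart)  # rpart() classification trees

set.seed(2024)  # hypothetical seed; report whichever seed you actually use
n <- 400
wine_df <- data.frame(
  wine = factor(rep(c("red", "white"), each = n / 2)),
  alcohol = rnorm(n, mean = 10.5, sd = 1),
  chlorides = c(rnorm(n / 2, 0.09, 0.03), rnorm(n / 2, 0.05, 0.03)),
  total.sulfur.dioxide = c(rnorm(n / 2, 50, 30), rnorm(n / 2, 140, 30))
)
predictors <- setdiff(names(wine_df), "wine")

# Task 1: per-predictor mean, SD and Welch t-test p-value by wine type
is_red <- wine_df$wine == "red"
summarise_predictor <- function(v) {
  x <- wine_df[[v]]
  tt <- t.test(x[is_red], x[!is_red])  # var.equal = FALSE by default (Welch)
  data.frame(
    variable = v,
    mean_red = mean(x[is_red]), sd_red = sd(x[is_red]),
    mean_white = mean(x[!is_red]), sd_white = sd(x[!is_red]),
    p_value = tt$p.value
  )
}
summary_table <- do.call(rbind, lapply(predictors, summarise_predictor))
print(summary_table)

# Task 2: random 70/30 split and class counts in each part
train_idx <- sample(nrow(wine_df), size = round(0.7 * nrow(wine_df)))
train <- wine_df[train_idx, ]
test <- wine_df[-train_idx, ]
split_counts <- sapply(list(train = train, test = test),
                       function(d) table(d$wine))
print(split_counts)

# Task 3: fit three classifiers on the training set, predict on the test set
fit_glm <- glm(wine ~ ., data = train, family = binomial)
# "red" is the first factor level, so the fitted probability is P(white)
prob_white <- predict(fit_glm, newdata = test, type = "response")
pred_glm <- ifelse(prob_white > 0.5, "white", "red")

fit_lda <- lda(wine ~ ., data = train)
pred_lda <- predict(fit_lda, newdata = test)$class

fit_tree <- rpart(wine ~ ., data = train, method = "class")
pred_tree <- predict(fit_tree, newdata = test, type = "class")

# Test-set accuracy (proportion of correctly classified wines)
accuracy <- c(
  logistic = mean(pred_glm == test$wine),
  lda = mean(pred_lda == test$wine),
  tree = mean(pred_tree == test$wine)
)
print(accuracy)
```

For the tuning-parameter discussion, rpart stores a complexity table with cross-validated errors in fit_tree$cptable (also shown by printcp(fit_tree)), which can guide pruning with prune(). For the question of which variables matter, summary(fit_glm) and fit_tree$variable.importance are possible starting points.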
You must present the answer as a well-written report, possibly with different section
headings. The marking scheme is as follows:
• Introduction: [20 Marks] Description of the variables, the objective of your study and
what you are going to do in the rest of the report.
• Methods and Analysis: [50 Marks] Data analysis with different classification techniques.
• Conclusion: [10 Marks] Concluding remarks and discussion.
• Presentation: [10 Marks] Overall presentation with tables and plots.
• R Code: [10 Marks] Correctness and reproducibility of the results are important.