FIT3152 Data analytics: Assignment 2

Objective:
The objective of this assignment is to gain familiarity with classification models using R.

You will be using a modified version of the Kaggle competition data “Predict next-day rain in
Australia” (https://www.kaggle.com/jsphyg/weather-dataset-rattle-package), but you will predict
whether or not the following day will be cloudy. The data contains a number of
meteorological observations as attributes, and the class attribute “CloudTomorrow”. Details
of the decision attributes follow the assignment description.

You are expected to use R for your analysis, and may use any R package. Clear your
workspace, set the number of significant digits to a sensible value, and use ‘WAUS’ as the
default data frame name for the whole data set. Read your data into R using the following
code:

rm(list = ls())
# Read the supplied data file into the data frame WAUS at this point
L <- as.data.frame(c(1:49))
set.seed(88888888) # Your Student ID is the random seed
L <- L[sample(nrow(L), 10, replace = FALSE),] # sample 10 locations
WAUS <- WAUS[(WAUS$Location %in% L),]
WAUS <- WAUS[sample(nrow(WAUS), 2000, replace = FALSE),] # sample 2000 rows

We want to obtain a model that may be used to predict whether it is going to be cloudy
tomorrow for 10 locations in Australia.

Assignment questions:

1. Explore the data: What is the proportion of cloudy days to clear days? Obtain
descriptions of the predictor (independent) variables – mean, standard deviation,
etc. for real-valued attributes. Is there anything noteworthy in the data? Are there
any attributes you need to consider omitting from your analysis? (1 Mark)
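
As a rough illustration of the kind of summary Question 1 asks for, the sketch below uses a small made-up data frame in place of WAUS. The name toy, its column names (other than CloudTomorrow), its values, and the 0/1 coding of the class are all assumptions for illustration only.

```r
# Hypothetical toy data standing in for WAUS; all values are invented.
toy <- data.frame(
  MinTemp       = c(12.1, 8.4, NA, 15.0, 10.2, 9.7),
  Humidity9am   = c(70, 85, 64, NA, 90, 77),
  WindGustDir   = factor(c("N", "SW", "N", "E", "SW", "N")),
  CloudTomorrow = c(1, 0, 1, 1, NA, 0)   # assumed coding: 1 = cloudy, 0 = clear
)

# Class balance: proportion of cloudy (1) and clear (0) days, ignoring NAs.
class.props <- prop.table(table(toy$CloudTomorrow))

# Mean and standard deviation of each real-valued predictor.
num.cols <- sapply(toy, is.numeric) & names(toy) != "CloudTomorrow"
num.summary <- t(sapply(toy[, num.cols, drop = FALSE], function(x)
  c(mean = mean(x, na.rm = TRUE), sd = sd(x, na.rm = TRUE))))

# Missing values per attribute; heavily incomplete columns are
# candidates for omission.
na.counts <- colSums(is.na(toy))

print(class.props)
print(round(num.summary, 2))
print(na.counts)
```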

2. Document any pre-processing required to make the data set suitable for the model
fitting that follows. (1 Mark)
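
One possible shape for the pre-processing in Question 2 is sketched below on made-up data. The data frame toy, the Day column, and the choice to drop incomplete rows rather than impute are assumptions, not requirements of the assignment.

```r
# Hypothetical toy data standing in for WAUS; all values are invented.
toy <- data.frame(
  Day           = 1:6,   # hypothetical date component
  MinTemp       = c(12.1, 8.4, NA, 15.0, 10.2, 9.7),
  WindGustDir   = c("N", "SW", "N", "E", "SW", "N"),
  CloudTomorrow = c(1, 0, 1, 1, NA, 0),
  stringsAsFactors = FALSE
)

# Drop an attribute judged uninformative (here the hypothetical Day column).
clean <- toy[, names(toy) != "Day"]

# Keep only complete rows; imputing missing values would be an alternative.
clean <- clean[complete.cases(clean), ]

# Classification functions in R generally expect a factor response,
# and categorical predictors are usually stored as factors too.
clean$CloudTomorrow <- factor(clean$CloudTomorrow, levels = c(0, 1))
clean$WindGustDir   <- factor(clean$WindGustDir)

str(clean)
```

Whatever steps are chosen, recording how many rows and columns each one removes makes the pre-processing easy to document.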

3. Divide your data into a 70% training and 30% test set by adapting the following
code (written for the iris data). Use your student ID as the random seed.

set.seed(XXXXXXXX) #Student ID as random seed
train.row = sample(1:nrow(iris), 0.7*nrow(iris))
iris.train = iris[train.row,]
iris.test = iris[-train.row,]

4. Implement a classification model using each of the following techniques. For this
question you may use each of the R functions at their default settings, or with minor
adjustments to set factors etc. (5 Marks)

• Decision Tree • Naïve Bayes • Bagging • Boosting • Random Forest
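
A minimal sketch of fitting the five model types at near-default settings follows. It assumes the packages rpart, e1071, adabag and randomForest are installed (the assignment allows any R package, so other choices are equally valid), and it uses a two-class version of iris as a stand-in for the assignment data. The column IsVirginica, the seed, and mfinal = 10 are arbitrary choices for illustration.

```r
library(rpart)         # decision tree
library(e1071)         # naiveBayes()
library(adabag)        # bagging() and boosting()
library(randomForest)  # randomForest()

# Stand-in data: a two-class version of iris (not the assignment data).
dat <- iris
dat$IsVirginica <- factor(ifelse(dat$Species == "virginica", "yes", "no"))
dat$Species <- NULL

set.seed(1)  # arbitrary seed for this illustration
train.row <- sample(1:nrow(dat), 0.7 * nrow(dat))
dat.train <- dat[train.row, ]
dat.test  <- dat[-train.row, ]

# The response must be a factor for all five functions to classify.
fit.tree  <- rpart(IsVirginica ~ ., data = dat.train, method = "class")
fit.nb    <- naiveBayes(IsVirginica ~ ., data = dat.train)
fit.bag   <- bagging(IsVirginica ~ ., data = dat.train, mfinal = 10)
fit.boost <- boosting(IsVirginica ~ ., data = dat.train, mfinal = 10)
fit.rf    <- randomForest(IsVirginica ~ ., data = dat.train)

# Predicted classes on the held-out rows; adabag returns a list with $class.
pred <- list(
  tree  = predict(fit.tree, dat.test, type = "class"),
  nb    = predict(fit.nb, dat.test),
  bag   = predict(fit.bag, dat.test)$class,
  boost = predict(fit.boost, dat.test)$class,
  rf    = predict(fit.rf, dat.test)
)
sapply(pred, function(p) mean(p == dat.test$IsVirginica))  # test accuracy
```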

5. Using the test data, classify each of the test cases as ‘cloudy tomorrow’ or ‘not
cloudy tomorrow’. Create a confusion matrix and report the accuracy of each model.
(1 Mark)
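
The confusion matrix and accuracy for one classifier could be computed along the following lines. This is only a sketch: it uses an rpart decision tree and a two-class version of iris as stand-ins, and the names dat, IsVirginica and conf.mat are illustrative.

```r
library(rpart)

# Stand-in data: a two-class version of iris (not the assignment data).
dat <- iris
dat$IsVirginica <- factor(ifelse(dat$Species == "virginica", "yes", "no"))
dat$Species <- NULL

set.seed(1)  # arbitrary seed for this illustration
train.row <- sample(1:nrow(dat), 0.7 * nrow(dat))
fit  <- rpart(IsVirginica ~ ., data = dat[train.row, ], method = "class")
test <- dat[-train.row, ]

# Rows are the actual classes, columns the predicted classes.
pred     <- predict(fit, test, type = "class")
conf.mat <- table(Actual = test$IsVirginica, Predicted = pred)

# Accuracy is the share of test cases on the diagonal.
accuracy <- sum(diag(conf.mat)) / sum(conf.mat)

print(conf.mat)
print(accuracy)
```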

6. Using the test data, calculate the confidence of predicting ‘cloudy tomorrow’ for
each case and construct an ROC curve for each classifier. You should be able to plot
all the curves on the same axis. Use a different colour for each classifier. Calculate
the AUC for each classifier. (1 Mark)
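
To show what an ROC curve and its AUC are built from, the sketch below computes both in base R from a vector of confidences and true labels. The values in conf and truth are invented, and the helper names roc.points and auc are hypothetical; a dedicated package such as ROCR could be used instead.

```r
# Hypothetical confidences for the positive class and the true labels.
conf  <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
truth <- c(1,   1,   0,   1,   0,    1,   0,   0)

# One (FPR, TPR) point per threshold, from strictest to most lenient.
roc.points <- function(conf, truth) {
  thr <- c(Inf, sort(unique(conf), decreasing = TRUE))
  tpr <- sapply(thr, function(th) sum(conf >= th & truth == 1) / sum(truth == 1))
  fpr <- sapply(thr, function(th) sum(conf >= th & truth == 0) / sum(truth == 0))
  data.frame(fpr = fpr, tpr = tpr)
}

# AUC by the trapezoidal rule over the ROC points.
auc <- function(roc) {
  sum(diff(roc$fpr) * (head(roc$tpr, -1) + tail(roc$tpr, -1)) / 2)
}

roc <- roc.points(conf, truth)
plot(roc$fpr, roc$tpr, type = "l", col = "blue",
     xlab = "False positive rate", ylab = "True positive rate")
abline(0, 1, lty = 2)  # chance line
# Further classifiers can be overlaid with lines(..., col = "red") and so on.

auc.value <- auc(roc)
print(auc.value)
```

With real models, conf would be the predicted probability of the positive class, for example the matching column of predict(fit, test, type = "prob") for an rpart tree.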

7. Create a table comparing the results in Parts 5 and 6 for all classifiers. Is there a
single “best” classifier? (1 Mark)

8. Examining each of the models, determine the most important variables in predicting
whether or not it will be cloudy tomorrow. Which variables could be omitted from the data
with very little effect on performance? Give reasons. (2 Marks)
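
As one way of reading variable importance off a fitted model, the sketch below inspects an rpart decision tree trained on a two-class version of iris (a stand-in for the assignment data; dat and IsVirginica are illustrative names). Other model types expose similar information, for instance the importance component of adabag fits and importance() for randomForest.

```r
library(rpart)

# Stand-in data: a two-class version of iris (not the assignment data).
dat <- iris
dat$IsVirginica <- factor(ifelse(dat$Species == "virginica", "yes", "no"))
dat$Species <- NULL

fit <- rpart(IsVirginica ~ ., data = dat, method = "class")

# Importance scores per predictor, largest first.
imp <- fit$variable.importance
print(round(imp, 2))

# Predictors that appear in an actual split of the tree.
used <- unique(as.character(fit$frame$var[fit$frame$var != "<leaf>"]))

# Predictors never split on are candidates for omission.
predictors <- setdiff(names(dat), "IsVirginica")
unused <- setdiff(predictors, used)
print(used)
print(unused)
```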

9. Starting with one or some of the classifiers you created in Part 4, create a classifier
that is simple enough for a person to be able to classify whether it will be cloudy or
not tomorrow by hand. Describe your model, either with a diagram or written
explanation. How well does your model perform, and how does it compare to those
in Part 4? What factors were important in your decision, and why did you choose the
attributes you used? (2 Marks)
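
One possible route to a classifier simple enough to apply by hand is a very shallow decision tree, sketched below with rpart on a two-class version of iris. The stand-in data, the seed, and the depth limit of 2 are assumptions for illustration; a hand-written rule on one or two attributes would be another valid approach.

```r
library(rpart)

# Stand-in data: a two-class version of iris (not the assignment data).
dat <- iris
dat$IsVirginica <- factor(ifelse(dat$Species == "virginica", "yes", "no"))
dat$Species <- NULL

set.seed(1)  # arbitrary seed for this illustration
train.row <- sample(1:nrow(dat), 0.7 * nrow(dat))
dat.train <- dat[train.row, ]
dat.test  <- dat[-train.row, ]

# Limit the tree to two levels of splits so it has at most four leaves.
fit.small <- rpart(IsVirginica ~ ., data = dat.train, method = "class",
                   control = rpart.control(maxdepth = 2))
print(fit.small)  # the splits can be read off as if-then rules

n.leaves  <- sum(fit.small$frame$var == "<leaf>")
acc.small <- mean(predict(fit.small, dat.test, type = "class") ==
                    dat.test$IsVirginica)
print(n.leaves)
print(acc.small)
```

Comparing acc.small with the accuracy of the full models shows how much performance is given up in exchange for simplicity.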