Problem 1 KNN
In this problem, we will build a KNN classifier and use cross-validation to tune K.
(a) The file knnClassify3CTrain.txt contains 200 rows and 3 columns (separated by a space).
The first 2 columns contain the input features; the last column contains the class label. Load
this file into Python and split it into X_train and y_train. Similarly, convert
knnClassify3CTest.txt into X_test and y_test.
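A minimal loading sketch for part (a). The helper name load_xy is my own choice, and the snippet assumes the two files sit in the working directory:

```python
import numpy as np

def load_xy(path):
    """Load a space-separated file whose last column is the class label."""
    data = np.loadtxt(path)
    X, y = data[:, :-1], data[:, -1].astype(int)
    return X, y

# Assuming the assignment files are in the current directory:
# X_train, y_train = load_xy("knnClassify3CTrain.txt")
# X_test, y_test = load_xy("knnClassify3CTest.txt")
```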
(b) Write your own KNN prediction function def knnClassifier(Xtrain, ytrain, Xtest, K). Use
Euclidean distance to find neighbors. Do NOT use KNeighborsClassifier() from sklearn.
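One possible shape for the function in (b): a brute-force sketch that computes Euclidean distances and takes a majority vote over the K nearest training labels (ties resolved by np.argmax, i.e. toward the smaller label):

```python
import numpy as np

def knnClassifier(Xtrain, ytrain, Xtest, K):
    """Predict a label for each row of Xtest by majority vote among
    its K Euclidean-nearest rows of Xtrain."""
    Xtrain = np.asarray(Xtrain, dtype=float)
    Xtest = np.asarray(Xtest, dtype=float)
    ytrain = np.asarray(ytrain)
    preds = []
    for x in Xtest:
        # Euclidean distance from x to every training point
        d = np.sqrt(((Xtrain - x) ** 2).sum(axis=1))
        neighbors = ytrain[np.argsort(d)[:K]]
        labels, counts = np.unique(neighbors, return_counts=True)
        preds.append(labels[np.argmax(counts)])
    return np.array(preds)
```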
(c) Train your model from (b) with K = [1, 3, 5, 10, 15, 25, 50, 80]. Plot the test accuracy
against K. For each K, plot the decision boundary together with the training instances. What
can you find from these plots? Are the models with small K simpler or more complex? Do
they have high bias or high variance?
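A sketch of the accuracy sweep in (c), with the plotting step left to you. It uses synthetic two-cluster data as a stand-in for the real files, and repeats a compact brute-force KNN so the snippet runs on its own:

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, K):
    # Same brute-force Euclidean KNN as part (b), vectorized over Xte.
    d = np.sqrt(((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(-1))
    idx = np.argsort(d, axis=1)[:, :K]
    return np.array([np.bincount(ytr[i]).argmax() for i in idx])

# Synthetic stand-in: two Gaussian clusters (replace with the loaded data).
rng = np.random.default_rng(0)
Xtr = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
ytr = np.repeat([0, 1], 100)
Xte = rng.normal(size=(40, 2)) + np.repeat([[0, 0], [3, 3]], 20, axis=0)
yte = np.repeat([0, 1], 20)

Ks = [1, 3, 5, 10, 15, 25, 50, 80]
acc = {K: (knn_predict(Xtr, ytr, Xte, K) == yte).mean() for K in Ks}
# plt.plot(list(acc), list(acc.values()), marker="o") gives the accuracy curve
```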
(d) Use the KNeighborsClassifier and GridSearchCV methods in the sklearn package to tune
the hyperparameter K. Plot the validation-accuracy curve against K and find the K with the
highest validation accuracy.
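A sketch of the tuning step in (d), again on synthetic stand-in data (swap in X_train and y_train from part (a)). The per-K validation scores for the plot come from cv_results_:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the training data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2)) + np.repeat([[0, 0], [3, 3]], 100, axis=0)
y = np.repeat([0, 1], 100)

grid = GridSearchCV(KNeighborsClassifier(),
                    param_grid={"n_neighbors": [1, 3, 5, 10, 15, 25, 50, 80]},
                    cv=5)
grid.fit(X, y)

best_K = grid.best_params_["n_neighbors"]
val_acc = grid.cv_results_["mean_test_score"]  # one score per K, for the plot
```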
Problem 2 Classify high energy Gamma particles in atmosphere with Random Forest
In this problem we will classify high energy Gamma particles in the atmosphere using the
random forest algorithm.
The data are generated to simulate the registration of high energy gamma particles in a
ground-based atmospheric Cherenkov gamma telescope using the imaging technique. A
Cherenkov gamma telescope observes high energy gamma rays by taking advantage of the
radiation emitted by charged particles produced inside the electromagnetic showers that the
gammas initiate as they develop in the atmosphere. Depending on the energy of the primary
gamma, a total of about 10,000 Cherenkov photons is collected in patterns (called the
shower image), allowing one to statistically discriminate the images caused by primary
gammas (signal) from the images of hadronic showers initiated by cosmic rays in the upper
atmosphere (background).
(a) Load “telescope_data.csv” dataset. The first ten columns are features and the last one is
class label, “g” for gamma (signal) and “h” for hadron (background). Use LabelEncoder()
from sklearn to encode the class label. Split the dataset into training and test sets.
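A sketch of part (a). Since telescope_data.csv is not reproduced here, the snippet builds a tiny stand-in frame with the same shape (ten numeric feature columns plus a final "g"/"h" label column); replace the stand-in with pd.read_csv("telescope_data.csv"):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split

# Stand-in for df = pd.read_csv("telescope_data.csv")
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(100, 10)),
                  columns=[f"f{i}" for i in range(10)])
df["class"] = np.where(rng.random(100) < 0.5, "g", "h")

X = df.iloc[:, :10].values
y = LabelEncoder().fit_transform(df["class"])  # alphabetical: "g" -> 0, "h" -> 1
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42, stratify=y)
```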
(b) Use GridSearchCV to tune hyperparameters, for example, number of trees in the forest
and the maximum depth of the tree.
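A sketch of the grid search in (b) on synthetic stand-in data (substitute the training split from (a)); the parameter values shown are just example choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the telescope training set.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 10))
y = (X[:, 0] + X[:, 1] > 0).astype(int)

param_grid = {"n_estimators": [50, 100],   # number of trees in the forest
              "max_depth": [3, 5, None]}   # maximum depth of each tree
grid = GridSearchCV(RandomForestClassifier(random_state=0),
                    param_grid, cv=3, n_jobs=-1)
grid.fit(X, y)
best_rf = grid.best_estimator_
```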
(c) Use the test set to evaluate your model with a confusion matrix.
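A sketch of the evaluation in (c), again on synthetic stand-in data; in your solution, fit the tuned model from (b) on the real training split and evaluate on the held-out test split:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix

# Synthetic stand-in data.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
y = (X[:, 0] > 0).astype(int)
Xtr, Xte, ytr, yte = X[:150], X[150:], y[:150], y[150:]

clf = RandomForestClassifier(random_state=0).fit(Xtr, ytr)
# Rows are true labels, columns are predicted; with LabelEncoder,
# label 0 corresponds to "g" (gamma) and 1 to "h" (hadron).
cm = confusion_matrix(yte, clf.predict(Xte), labels=[0, 1])
```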