Credit-card payment default prediction using BRT

Please keep this material confidential within this class.

# The data are available at https://raw.githubusercontent.com/mitdbg/modeldb/master/data/credit-default.csv
# If the file is no longer available, let me know; I have a downloaded copy.

# Previous work has illustrated how logistic regression can be applied to this dataset.

# https://medium.com/@guaisang/credit-default-prediction-with-logistic-regression-b5bd89f2799f

x <- read.csv('https://raw.githubusercontent.com/mitdbg/modeldb/master/data/credit-default.csv', skip = 1)

It is recommended that you first review the boosted regression tree (BRT) vignette, which uses an ecology example.
https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf

Here we try a financial problem: use machine learning to predict whether a particular individual
will default on their credit card payment in the following month, given some attributes of that person.

In the given dataset, the 25th column, ‘default payment next month’ (Yes = 1, No = 0), is the response.
There are 24 other columns:
id: person ID
limit_bal: limit of the given credit
sex: 1 male and 2 female
education: 1 = graduate school; 2 = university; 3 = high school; 4 = others.
marriage: Marital status (1 = married; 2 = single; 3 = others)
age: Age (year)
pay_0 to pay_6: history of repayment in the past six months
* -1 = pay duly; 1 = payment delay for one month;
* 2 = payment delay for two months; . . .;
* 8 = payment delay for eight months;
* 9 = payment delay for nine months and above.
bill_amt1 to bill_amt6: bill statement in past six months
pay_amt1 to pay_amt6: payment made in past six months

The data file contains 30000 records (lines). Use 50% of the records as training data and the remaining 50% as testing data.
How is prediction accuracy defined in general? What is the prediction accuracy in this example? What is your evaluation of the performance of
the boosted regression tree method? Which factors are most important in this exercise? Write your answers in the comment box of the Blackboard submission.
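
As a minimal sketch of how accuracy is usually computed for a binary classifier, the toy vectors below (made-up labels, not taken from the credit data) are cross-tabulated into a confusion matrix. Accuracy is then the share of cases on the diagonal, that is (true positives + true negatives) / number of cases. The names actual, predicted, cm and accuracy are illustrative only.

```r
# Toy example: hypothetical labels, not from the credit-default dataset
actual    <- c(1, 0, 0, 1, 0, 0, 1, 0)
predicted <- c(1, 0, 1, 0, 0, 0, 1, 0)

# Confusion matrix: rows are predictions, columns are true labels
cm <- table(predicted = factor(predicted, levels = c(0, 1)),
            actual    = factor(actual,    levels = c(0, 1)))
print(cm)

# Accuracy = (true negatives + true positives) / number of cases
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

The same number is obtained with mean(predicted == actual), which is the form used at the end of the script.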

z<-x
str(z)
# convert sex, education and marriage (columns 3 to 5) to factors, since the numbers in these three columns are category codes, not quantities
for(i in 3:5) z[,i]<-as.factor(z[,i])
str(z)

# install the following two R packages first, e.g. install.packages(c("gbm", "dismo"))

require(gbm)
require(dismo)
# training the BRT model
i<-sample(1:30000,15000)
z.tc5.lr01 <- gbm.step(data = z[i,], gbm.x = 2:24, gbm.y = 25, family = "bernoulli", tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5)
# make prediction
preds <- predict.gbm(z.tc5.lr01, z[-i,], n.trees = z.tc5.lr01$gbm.call$best.trees, type = "response")
# convert to binary
p<-ifelse(preds>0.5,1,0)
# calculate the prediction accuracy
mean(p==z[-i,25])
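
When evaluating the result, it may help to read the accuracy against a simple baseline, because with an unbalanced response a model that always predicts the most common class can already score well. The sketch below uses made-up labels and probabilities (y_true and prob are hypothetical, not output from the model above) and only base R; it is one possible way to frame the comparison, not a prescribed method.

```r
# Hypothetical held-out labels and predicted probabilities (not from the credit data)
y_true <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 1)
prob   <- c(0.1, 0.2, 0.3, 0.1, 0.4, 0.6, 0.2, 0.7, 0.4, 0.3)

# Same 0.5 cutoff as in the script above
y_pred <- ifelse(prob > 0.5, 1, 0)

# Overall accuracy
acc <- mean(y_pred == y_true)

# Baseline: accuracy of always predicting the most common class
baseline <- max(table(y_true)) / length(y_true)

# Sensitivity: share of actual 1s (defaulters) that were predicted as 1
sensitivity <- sum(y_pred == 1 & y_true == 1) / sum(y_true == 1)

c(accuracy = acc, baseline = baseline, sensitivity = sensitivity)
```

In this toy case the accuracy is no better than the baseline and only one of the three positive cases is caught, which is the kind of pattern worth checking for in the real predictions.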