Credit-card payment default prediction using boosted regression trees (BRT)

Please keep this material confidential within this class.

# The data are available at https://raw.githubusercontent.com/mitdbg/modeldb/master/data/credit-default.csv
# If the file is no longer available, let me know; I have a downloaded copy.

# Previous work has illustrated how logistic regression can be applied to this dataset:

# https://medium.com/@guaisang/credit-default-prediction-with-logistic-regression-b5bd89f2799f

x <- read.csv('https://raw.githubusercontent.com/mitdbg/modeldb/master/data/credit-default.csv', skip = 1)

Before starting, it is recommended that you review the boosted regression tree (BRT) tutorial, which uses an ecology example:
https://cran.r-project.org/web/packages/dismo/vignettes/brt.pdf

Here we apply the same method to a financial problem: using machine learning to predict, from a set of attributes of a person, whether that person will default on their credit-card payment in the following month.

In the given dataset, the 25th column, 'default payment next month' (Yes = 1, No = 0), is the response.
The other 24 columns are:
id: person ID
limit_bal: limit of the given credit
sex: 1 male and 2 female
education: 1 = graduate school; 2 = university; 3 = high school; 4 = others.
marriage: Marital status (1 = married; 2 = single; 3 = others)
age: Age (year)
pay_0 to pay_6: repayment status in each of the past six months
* -1 = paid duly; 1 = payment delayed by one month;
* 2 = payment delayed by two months; . . .;
* 8 = payment delayed by eight months;
* 9 = payment delayed by nine months or more.
bill_amt1 to bill_amt6: bill statement amount in each of the past six months
pay_amt1 to pay_amt6: payment amount made in each of the past six months

The data file contains 30000 records (lines). Use 50% of the records as training data and the remaining 50% as test data.
How is prediction accuracy defined in general? What is the prediction accuracy in this example? How do you evaluate the performance of the boosted regression tree method? Which factors are most important in this exercise? Write your answers in the comment box of the Blackboard submission.
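
As a starting point for the first question, here is a minimal sketch of one common definition of accuracy: the share of cases in which the predicted label matches the true label, which can also be read off the diagonal of a confusion matrix. The two vectors are toy values invented for illustration and do not come from the credit data.

```r
# Toy labels, invented for illustration only (not taken from the credit data).
actual    <- c(0, 0, 1, 1, 0, 1, 0, 0)
predicted <- c(0, 1, 1, 0, 0, 1, 0, 0)

# Confusion matrix: rows are predictions, columns are true labels.
cm <- table(predicted = factor(predicted, levels = 0:1),
            actual    = factor(actual,    levels = 0:1))
print(cm)

# Accuracy = (true negatives + true positives) / all cases.
accuracy <- sum(diag(cm)) / sum(cm)

# Equivalent shortcut: the share of matching labels.
accuracy_short <- mean(predicted == actual)

accuracy
```

Both expressions give the same number; the confusion matrix additionally shows which kind of error (missed defaults or false alarms) occurs more often.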

z<-x
str(z)
# convert sex, education and marriage (columns 3 to 5) to factors, since the numbers in these three columns are category codes, not quantities
for (i in 3:5) z[, i] <- as.factor(z[, i])
str(z)
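
The effect of the factor conversion can be checked on a small example. The sketch below uses a toy data frame with invented values (an assumption for illustration, not rows from the real file) and converts the coded columns by name instead of by position.

```r
# Toy data frame with invented values; column names follow the list above.
toy <- data.frame(limit_bal = c(20000, 120000, 90000),
                  sex       = c(2, 2, 1),
                  education = c(2, 2, 3),
                  marriage  = c(1, 2, 2))

# Columns whose numbers are category codes rather than quantities.
cat_cols <- c("sex", "education", "marriage")
toy[cat_cols] <- lapply(toy[cat_cols], as.factor)

# limit_bal stays numeric; the three coded columns are now factors.
str(toy)
```

Selecting columns by name is one way to guard against converting the wrong column if the column order ever changes.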

# install the following two R packages first, e.g. install.packages(c("gbm", "dismo"))

library(gbm)
library(dismo)
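# train the BRT model on a random 50% of the records
set.seed(1)  # arbitrary seed, added so that the random split is reproducible
i <- sample(1:30000, 15000)
z.tc5.lr01 <- gbm.step(data = z[i, ], gbm.x = 2:24, gbm.y = 25, family = "bernoulli",
                       tree.complexity = 5, learning.rate = 0.01, bag.fraction = 0.5)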
# predict the default probability for the held-out test records
preds <- predict.gbm(z.tc5.lr01, z[-i, ], n.trees = z.tc5.lr01$gbm.call$best.trees, type = "response")
# convert the probabilities to binary labels with a 0.5 threshold
p <- ifelse(preds > 0.5, 1, 0)
# calculate the prediction accuracy on the test records
mean(p == z[-i, 25])
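
When evaluating the result, it may help to compare the accuracy with a simple baseline. The sketch below is one possible way to do this, not a required part of the exercise: `evaluate_binary` is a hypothetical helper (it is not part of gbm or dismo), and the labels are toy values invented for illustration.

```r
# Hypothetical helper (not part of gbm or dismo): summarise 0/1 predictions.
evaluate_binary <- function(pred, actual) {
  pred   <- factor(pred,   levels = 0:1)
  actual <- factor(actual, levels = 0:1)
  cm <- table(predicted = pred, actual = actual)
  list(
    confusion   = cm,
    accuracy    = sum(diag(cm)) / sum(cm),
    # accuracy obtained by always predicting the more common class
    baseline    = max(colSums(cm)) / sum(cm),
    # share of true defaulters (actual = 1) that were predicted as 1
    sensitivity = cm["1", "1"] / sum(cm[, "1"])
  )
}

# Toy labels, invented for illustration: 8 non-defaulters and 2 defaulters.
actual <- c(0, 0, 0, 0, 0, 0, 0, 0, 1, 1)
pred   <- c(0, 0, 0, 0, 0, 0, 0, 1, 1, 0)

res <- evaluate_binary(pred, actual)
res$confusion
res$accuracy
res$baseline
res$sensitivity
```

In this toy case the accuracy equals the majority-class baseline even though half of the defaulters are missed, which is why accuracy alone can be misleading when one class is much rarer than the other. With the script above, the corresponding call would be evaluate_binary(p, z[-i, 25]). For the question about important factors, calling summary() on the fitted gbm object reports the relative influence of each predictor.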
