Data Mining in R: Logistic Regression and Decision Tree Models
Problem Description
A bank has provided you with a dataset of mortgage loan customers collected over the past year. This dataset contains all the relevant information collected at the time of application for a total of 2000 customers. In total there are 14 explanatory variables and the class label variable indicating whether a customer proved to be good or bad. A bad customer is one that has missed three or more payments during the first year of the mortgage. Due to the nature of the problem, identifying whether a customer will be bad is very important for the bank, because the loss from each bad customer is on average ten times larger than the profit from a good customer. So even if the number of bad customers in the dataset is relatively small, they have a large impact on profits.
Your task is to undertake a thorough investigation of this dataset; to consider Logistic Regression and Decision Tree models; and finally to recommend the most appropriate model to identify customers with a high risk of being bad. The project objectives set by the bank are as follows:
• The bank faces a trade-off between accepting customers so that it retains its share in the mortgage loan market and incurring losses due to providing loans to customers that default. The bank managers understand that no model can perfectly identify all ‘bad’ customers, but using your recommended model they want you to answer the following questions:
– What is the maximum proportion of good customers that can be granted loans while ensuring that x% of the bad customers are correctly identified?
– What is the maximum proportion of the overall population that can be granted loans while ensuring that x% of the bad customers are correctly identified?
They would like to know the answer to these questions for the following values of x: 50, 75, and 90.
• They are also interested in understanding how the model you propose makes the final prediction to classify a customer as good or bad. In particular, what are the most important variables in this prediction and what is the role of each variable?
• At the end of your analysis you need to recommend one classification model for this problem and comment on its performance. Also comment on the contribution of each variable in the recommended model.
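The two acceptance questions in the first objective can be read as a threshold search on a model's predicted risk. The following is a minimal sketch in Python, used here purely for illustration, and not a prescribed method. It assumes a hypothetical fitted model that returns a score for each customer (higher means more likely to be bad) and a rule that rejects every customer whose score is at or above a threshold. The function name `acceptance_at_recall` and the toy scores and labels are invented for the example and are not taken from the bank's dataset.

```python
def acceptance_at_recall(scores, labels, x):
    """Return (threshold, share of goods accepted, share of all accepted).

    scores: predicted risk of being bad (higher = riskier), hypothetical model output
    labels: 1 = bad customer, 0 = good customer
    x:      integer percentage of bad customers that must be flagged (0 < x <= 100)
    Customers with score >= threshold are rejected; the rest are granted loans.
    """
    assert 0 < x <= 100
    bad_scores = sorted((s for s, y in zip(scores, labels) if y == 1), reverse=True)
    n_bad = len(bad_scores)
    n_good = len(labels) - n_bad
    # Smallest number of bads that reaches x% (integer ceiling, avoids float rounding).
    k = (x * n_bad + 99) // 100
    # The k-th highest bad score is the highest threshold that still flags k bads,
    # so it rejects as few customers as possible for a threshold rule.
    threshold = bad_scores[k - 1]
    goods_accepted = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    all_accepted = sum(1 for s in scores if s < threshold)
    return threshold, goods_accepted / n_good, all_accepted / len(scores)


# Toy data for illustration only (4 bad and 6 good customers).
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for x in (50, 75, 90):
    t, goods, overall = acceptance_at_recall(scores, labels, x)
    print(f"x={x}%: threshold={t}, goods accepted={goods:.2f}, overall accepted={overall:.2f}")
```

In practice the scores should come from data not used to fit the model, otherwise the reported proportions are likely to be optimistic.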
Data Description
You are provided with a sample of observations which contain information about past bank customers. The main variables in this dataset are described in Table 1. The class variable (i.e. the variable we want to predict) is called Good Bad. There is a total of 14 explanatory variables in this dataset.
Table 1: Data Description
Annual Income (interval): Annual gross income in £
Credit History (interval): Loan applications in past five years
Credit Cards (interval): Credit cards currently held
Amount (interval): Loan amount
Number of Dependants (interval)
Employment (nominal):
  1 = Other
  2 = Self Employment
  3 = Part time
  4 = Full time private sector
  5 = Full time public sector
Installment Percentage (interval): Monthly installment as percentage of monthly gross earnings
Time at Current Employment (interval): in years
Time at Address (interval): in years
Age (interval): in years
Delayed or Missed Payments (ordinal):
  0 = No missed/delayed payments over last 3 years
  1 = Delayed payments only over last 3 years
  2 = Missed payments over last 3 years
Residential Status (nominal):
  1 = Rent
  2 = Own
  3 = Live with Family
Existing Credits (interval): Additional lines of credits
Area indicator (nominal): Location of branch receiving application
Good Bad (target variable):
  0 = Good customer
  1 = Bad customer
Tasks
• Exploratory Data Analysis (40 marks).
Consider each variable and answer the following indicative list of questions (Please note that the list below is by no means exhaustive but is instead aimed to get you thinking about how to assess the importance of different variables):
– Does this variable appear to be important for the task at hand? (After discussing each variable separately provide a ranking of the importance of all explanatory variables.) Support your claims with appropriate visualisations that document whether and how important each variable is.
– Are different variables related, and which variables convey information similar to that provided in other variable(s)?
– Do you find evidence of “outliers” or other issues with data quality (e.g. incorrect observations)?
• Statistical Modelling (60 marks)
– What is the appropriate performance measure for this application and why? Relate this to the project objectives.
– For the two types of classifiers, logistic regression and decision trees, discuss the different settings you used and why you considered these important. (Also consider the choice of variable selection method as part of this question.)
– For each classification method develop one or a few candidate models that you think are promising before providing a final recommendation of the most appropriate model (for each question in the project objectives section). You do not need to discuss every model you tried in detail, but you must include the results for the important steps in the process that led you to the final recommendations. I am particularly interested in understanding the steps you followed and the justification for these. (Refer to the CRISP data mining process discussed during the lectures and in Chapter 1 of the Guide to Intelligent Data Analysis).
– Comment on the generalisation performance of the model(s) you recommend for each type of classifier.
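As one way to connect a performance measure to the bank's stated costs, the sketch below (in Python, for illustration only) scores an acceptance rule by net profit, measured in units of the average profit from one good customer. The 10:1 loss ratio comes from the problem description; the function names, the threshold rule, and the toy data are assumptions made for the example. The break-even calculation further assumes the model outputs well-calibrated probabilities of being bad, which is not guaranteed.

```python
def net_profit(scores, labels, threshold, loss_ratio=10):
    """Net profit of granting loans to every customer with score < threshold.

    Each accepted good customer (label 0) earns 1 unit; each accepted bad
    customer (label 1) costs loss_ratio units. Rejected customers contribute 0.
    """
    accepted = [y for s, y in zip(scores, labels) if s < threshold]
    n_bad = sum(accepted)
    n_good = len(accepted) - n_bad
    return n_good - loss_ratio * n_bad


def break_even_probability(loss_ratio=10):
    """Probability of being bad at which the expected profit of a loan is zero.

    Solves (1 - q) * 1 - loss_ratio * q = 0 for q, giving q = 1 / (1 + loss_ratio).
    """
    return 1 / (1 + loss_ratio)


# Toy data for illustration only (hypothetical model scores; 1 = bad, 0 = good).
scores = [0.95, 0.80, 0.70, 0.60, 0.40, 0.35, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0]

for threshold in (0.35, 0.5, 1.0):
    print(f"threshold={threshold}: net profit={net_profit(scores, labels, threshold)}")
print(f"break-even probability of bad: {break_even_probability():.4f}")
```

With a 10:1 loss ratio the break-even probability is 1/11, so under these assumptions a loan is only expected to be profitable when the predicted probability of being bad is below roughly 9%. This is why plain accuracy, which weights both error types equally, is unlikely to reflect the bank's objectives.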
The coursework requires you to write a report explaining your findings. This means that you need to explain each figure, table or number you include in the report (these need to have appropriate captions and numbers so that they can be referenced in your report). Including a relevant figure, table, or screenshot from RStudio without explaining the conclusions you draw from it will earn no marks.
• You do not need to write an executive summary, or include a cover page or a table of contents.
• You do need to include at the end of your coursework a Conclusions section which summarises your findings and clearly answers the questions posed in the project objectives section. In this section I also recommend discussing the relative advantages and limitations of the two types of classifiers for the problem at hand (and not only in abstract terms).
Please read the next section carefully to avoid misunderstandings.
Report Assessment
Your coursework will not be evaluated by the quality of the final model alone, or by whether you got a particular answer right. You will be primarily assessed by whether you are able to correctly justify the steps you took to complete the assignment. In other words, your report needs to document that you are able to intelligently analyse the provided data, that you draw correct conclusions from what you observe, and that these conclusions lead you either to the next logical step of the data mining process, or to the revision of decisions made in previous steps of the analysis. (Refer to the flowchart of data mining stages we covered in the first lectures, and in particular to the feedback loops.)
Therefore, don’t simply present the conclusions/ results of your analysis and expect to get a high mark. Reports that don’t document the steps followed and the reasons why these were chosen will receive minimal marks, even if the final answer is sensible. Explain your reasoning clearly and in good English. Don’t provide a list of bullet points, or unstructured sentences etc. Similarly, don’t include figures or any other output from R that you don’t comment/ explain in the text. I will not assume that you know how to interpret these correctly.
Do not include information about how you achieved particular tasks using R (or any other software). This is not a course on R programming, so I am not interested in the commands you used or what the output looks like.