- Title: Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.
- Author: Sharma, Dhruv
- Journal: Academy of Banking Studies Journal
- Print ISSN: 1939-2230
- Year: 2012
- Issue: January
- Language: English
- Publisher: The DreamCatchers Group, LLC
- Abstract: The aim of this paper is to outline an approach to improving credit risk scorecards using Random Forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how powerful random forests can be used to develop a model using 8 variables which is almost as good as the FICO[R] score. This highlights the fact that datasets with complex interaction terms and contents can benefit from random forest models in 2 ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which exploit predictive interactions and relationships in the data not detectable by traditional regression techniques.
- Keywords: Algorithms; Credit ratings; Software; United States economic conditions

**Improving the art, craft and science of economic credit risk scorecards using random forests: why credit scorers and economists should use random forests.**

Sharma, Dhruv

INTRODUCTION

The aim of this paper is to outline an approach to improving credit risk scorecards using Random Forests. We start with the benefits of random forests compared to logistic regression, the tool used most often for credit scoring systems. We then compare the out-of-the-box performance of random forests and logistic regression on a credit card dataset, a home equity loan dataset and a proprietary dataset. We outline an approach to improving logistic regression using the random forest. We conclude by demonstrating how powerful random forests can be used to develop a model using 8 variables which is almost as good as the FICO[R] score. This highlights the fact that datasets with complex interaction terms and contents can benefit from random forest models in 2 ways: 1) clear insight into the most predictive and valuable variables, and 2) robust models which exploit predictive interactions and relationships in the data not detectable by traditional regression techniques.

For the purpose of this study, model performance is compared using Receiver Operating Characteristic (ROC) curves, which plot the proportion of bad loans detected against the proportion of good loans incorrectly classified at each model cutoff. Numerically this is summarized by the area under the ROC curve (AUC). All performance discussed is out-of-sample performance on a 30% holdout sample, with the models built on the remaining 70% of the dataset. All investigations into the data are conducted using R and the Rattle tool.
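The AUC used throughout has a direct rank interpretation: the probability that a randomly chosen bad loan is scored higher than a randomly chosen good one. A minimal illustration of computing it that way (a sketch in Python rather than the paper's R, with made-up hold-out scores):

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney rank statistic:
    the fraction of (bad, good) pairs ranked correctly (ties count 1/2)."""
    pos = [s for yy, s in zip(labels, scores) if yy == 1]   # bad loans
    neg = [s for yy, s in zip(labels, scores) if yy == 0]   # good loans
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy hold-out sample: 1 = bad loan; scores = model's estimated risk.
y      = [0, 0, 0, 1, 1]
scores = [0.2, 0.3, 0.6, 0.5, 0.9]
print(auc(y, scores))  # 5 of 6 (bad, good) pairs ranked correctly -> ~0.833
```

An AUC of 0.5 corresponds to random ranking, 1.0 to a perfect separation of bad and good loans.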

TRADITIONAL CREDIT SCORING PITFALLS

The biggest problem with traditional credit scoring based on logistic regression is that, as a scientist or economist, one cannot reliably interpret the importance of the underlying variables to the probability of a borrower experiencing financial difficulty.

The p values of the regression are not reliable, because regression assumes no multicollinearity. As a result, variables which make sense from a theoretical point of view, such as cash flow surrogates, and which may have strong predictive power, can fail to appear statistically significant based on p value statistics. This is a problem because credit data is notoriously correlated and biased. It is well known that 'biased estimation in data ... [which] has been shown to predict and extrapolate better when predictor variables are highly correlated ...', a condition common in credit scoring (Overstreet, 1992).

Although modelers have used skill and judgment to work past this shortcoming, traditional scorecards offer no way to assess the predictive value of variables in a robust and reliable manner. Many promising variables and variable interactions may thus be lost given the current tool.

Also, from a human factors and organizational point of view, people are biased toward testing the theories they already hold and away from trying things that might not seem to make sense. Our ability to develop causal models is biased and arbitrary, despite the meanings we attach to things after the fact.

The history of the credit scoring literature is rife with contradictory studies, going back to Durand's first study in the 1930s, on whether income is predictive. Yet mortgage risk models have shown the debt ratio (monthly expenses/income) to be predictive, as well as months' reserves (liquid assets/monthly payment). The successes of credit scoring in the mortgage industry show that financial worth and ability-to-pay variables can be used effectively in models, along with loan to value (loan amount/property value), to assess risk. Stepping back, we can see that interaction variables of affordability and credit risk have proven to be valuable predictive tools. This is also consistent with the judgmental theory of the Cs of credit: character (willingness to pay), capacity (ability to pay), and collateral.

The next leap in improving credit scoring is to find ways to test interaction terms in a meaningful and principled way. It stands to reason econometrically that if any variable should have an impact on human behavior in spending, consumption, and financial distress, it is ability to pay. Its measures are income, current debt usage, and the reserves and assets one has saved to absorb shocks or life events.

Is there a statistically reliable way to test the importance of variables relative to their predictive power?

Importance of Random Forests to Credit Risk and Economics in general

To date the majority of credit scorecards used in industry are linear models, despite the known issues of the flat maximum effect and multicollinearity (Wainer, 1978; Overstreet et al., 1997). Random Forests are a powerful tool for economic science, as they successfully deal with correlated variables with complex interactions (Breiman, 2001).

A simple example of the power of Random Forests was given by Breiman in the binary prediction case of hepatitis mortality, in which Stanford medical school had identified variables 6, 12, 14 and 19 as most predictive of risk using logistic regression. Subsequently, using the bootstrap, Efron showed that none of these variables was significant in the random resampling trials he ran. The Random Forest variable importance measure, created by Breiman, showed variables 7 and 11 to be critical and improved the logit regression results, simplifying the model and reducing error from 17% to 12% (Breiman, 2002).

As Random Forests are non-parametric, the linear restrictions of the flat maximum do not come into play as such. That said, predictive models tend to approach Pareto-optimal trade-offs between true positive and false positive rates, which look like an asymptote much like the flat maximum effect. The interactions of economic variables such as macroeconomic forces and affordability are too complex to be studied with simple linear regression any more. Random Forests serve as a good estimate of the asymptote of possible predictive power and help us get past the psychological limits we may believe to exist, much as Roger Bannister did with the preconceived minimum time for running the mile. The way Random Forests work, building large quantities of weak classifiers on random selections of variables and growing them with out-of-sample testing, is analogous to the way humans make decisions in a marketplace (see Gigerenzer's work on "fast and frugal trees" in human judgment models). Humans each look at the data available to them, make quick inferences, and take actions based on those data. Random Forests then take votes from these large collections of predictors and use the decisions of all the predictors to make the final decision. The fact that diverse models, built on different variables and samples of data, outperform simple linear models when combined is profound.
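The vote-taking mechanism described above can be sketched in a few lines. Everything below is an illustrative assumption, not a model from this study: a toy grid of two risk factors where a loan is "bad" when the factors jointly exceed a budget, decision stumps as the weak classifiers, and a small ensemble. The point is only the mechanism: bootstrap samples, random feature selection, and majority voting.

```python
import random

random.seed(1)

# Toy portfolio: two risk factors on a grid; a loan is "bad" (1) when
# the factors jointly exceed a budget (an interaction effect no single
# axis-aligned rule captures perfectly).
X = [(i / 20, j / 20) for i in range(21) for j in range(21)]
y = [1 if x0 + x1 > 1 else 0 for x0, x1 in X]
data = list(zip(X, y))

def fit_stump(sample):
    """Best threshold rule on one randomly chosen feature (a weak classifier)."""
    feat = random.randrange(2)              # random feature selection, as in a forest
    best = (len(sample) + 1, feat, 0.0, 1)  # (error, feature, threshold, direction)
    for t in [k / 20 for k in range(21)]:
        for sign in (1, -1):
            err = sum((1 if sign * (x[feat] - t) > 0 else 0) != yy
                      for x, yy in sample)
            best = min(best, (err, feat, t, sign))
    return best[1:]

def predict(stump, x):
    feat, t, sign = stump
    return 1 if sign * (x[feat] - t) > 0 else 0

# Bag many stumps, each grown on a bootstrap sample, and let them vote.
forest = [fit_stump([random.choice(data) for _ in data]) for _ in range(25)]
votes = [sum(predict(s, x) for s in forest) for x in X]
pred = [1 if v > len(forest) / 2 else 0 for v in votes]
acc = sum(p == yy for p, yy in zip(pred, y)) / len(y)
print(round(acc, 3))
```

A real random forest grows full trees rather than stumps and scores importance out-of-bag, but the voting logic is the same.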

That said, the critical aspects of Random Forests of interest to economic scientists are the features Breiman intended, such as:

* Random Forests never overfit the data as they are built with out of sample testing for each submodel

* Variable importance (a measure of the contribution each variable makes to overall model accuracy, based on permutation tests that scramble each variable in turn)

* Being able to see the effects of variables on predictions (Breiman, 2002).

* Handling thousands of variables efficiently by sampling variables.
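The permutation idea behind variable importance can be illustrated with a toy sketch (in Python, with a hand-coded stand-in classifier rather than Breiman's implementation): shuffle one variable's column, re-score the model, and record the drop in accuracy. A predictive variable produces a large drop; a noise variable produces none.

```python
import random

random.seed(0)

# Toy data: the outcome depends only on feature 0 (say, an income
# surrogate); feature 1 is pure noise. Both are illustrative assumptions.
n = 400
X = [[random.random(), random.random()] for _ in range(n)]
y = [1 if x[0] < 0.4 else 0 for x in X]       # low income -> bad, say

def model(x):
    """Stand-in for a fitted classifier (here it matches the true rule)."""
    return 1 if x[0] < 0.4 else 0

def accuracy(rows, labels):
    return sum(model(x) == yy for x, yy in zip(rows, labels)) / len(labels)

base = accuracy(X, y)                          # 1.0 for this stand-in model

def permutation_importance(feature):
    """Drop in accuracy after shuffling one feature's column."""
    col = [x[feature] for x in X]
    random.shuffle(col)
    Xp = [list(x) for x in X]
    for row, v in zip(Xp, col):
        row[feature] = v
    return base - accuracy(Xp, y)

imp0 = permutation_importance(0)   # large drop: the income surrogate matters
imp1 = permutation_importance(1)   # zero drop: the noise feature is ignored
print(imp0, imp1)
```

Because the test scrambles values rather than inspecting coefficients, it remains meaningful even when variables are strongly correlated, which is exactly the situation where regression p values break down.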

Random Forests help us see the true impact of complex, interrelated variables. As Breiman argued in his Wald lectures, complex phenomena cannot be modeled well with simplified goodness-of-fit models. A more scientific approach is to build as complex a model as needed to fit the phenomenon being studied, and then to use tools like variable importance to understand the relationships inside it (Breiman, 2002). This is an important point, as economics deals with more and more complex realities.

Comparison of Random Forests to Logistic Regression

We now examine random forest performance out of the box on 3 datasets. The first is a private label credit card dataset from the 2010 PAKDD contest in Pacific Asia, the second is the widely used home equity loan dataset, and the third is a proprietary dataset.

1. Random Forest vs. Logistic Regression on Credit Card Data Set

Credit Card Dataset

The credit card dataset has 50,000 loans, of which 13,000 are bad (serious delinquency). Using this dataset, a random forest model and a logistic regression scorecard were compared out of the box. The source of the data is the Pacific-Asia Knowledge Discovery and Data Mining (PAKDD) conference: http://sede.neurotech.com.br/PAKDD2010/

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order.

According to the random forest plot, the majority of borrower delinquency on the card can be predicted by age, monthly income, phone, payment day, type of occupation, marital status, number of dependents, area code of profession, and type of residence. The remaining variables add some predictive power through interaction effects.

Logistic regression model

Insights

Note how the regression makes personal income appear statistically insignificant, although we know from the random forest that it has a great deal of predictive power.

[GRAPHIC OMITTED]

The AUC (area under the curve) for the random forest model was .629, versus .60 for the regression model: a 5% improvement in performance over the logistic regression. By adding interaction terms suggested by the random forest's important variables, logistic regression performance can be enhanced to match or slightly exceed that of the random forest.
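To see why an added interaction term can lift a logit toward the forest's level, consider an XOR-style toy pattern, the extreme case of a pure interaction: each input alone is uninformative, and only their product predicts the outcome. The data and the hand-rolled gradient-descent logit below are illustrative sketches, not the paper's models.

```python
import math

# XOR-style toy data: y depends only on the interaction of the inputs.
X = [(0, 0), (1, 0), (0, 1), (1, 1)]
y = [0, 1, 1, 0]

def train(features, iters=4000, lr=0.5):
    """Plain batch gradient descent on the logistic loss."""
    w = [0.0] * (len(features[0]) + 1)            # weights + bias (last entry)
    for _ in range(iters):
        grad = [0.0] * len(w)
        for f, yy in zip(features, y):
            z = w[-1] + sum(wi * fi for wi, fi in zip(w, f))
            p = 1 / (1 + math.exp(-z))
            for i, fi in enumerate(f):
                grad[i] += (p - yy) * fi
            grad[-1] += p - yy
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w

def acc(features, w):
    hits = 0
    for f, yy in zip(features, y):
        z = w[-1] + sum(wi * fi for wi, fi in zip(w, f))
        hits += (z > 0) == (yy == 1)
    return hits / len(y)

plain = [(a, b) for a, b in X]
inter = [(a, b, a * b) for a, b in X]             # add the interaction term
acc_plain = acc(plain, train(plain))              # stuck at chance: no linear fit exists
acc_inter = acc(inter, train(inter))              # separable once a*b is a feature
print(acc_plain, acc_inter)
```

Real credit interactions are far milder than XOR, but the mechanism is the same: the interaction term makes a relationship linearly expressible that the plain logit cannot represent at all.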

2. Random Forest vs. Logistic Regression on Home Equity Data Set

Home Equity Dataset

The home equity dataset has approximately 5,960 loans, of which 1,189 are bad (serious delinquency). Using this dataset, a random forest model and a logistic regression scorecard were compared out of the box. The source of the data is the popular SAS dataset: www.sasenterpriseminer.com/data/HMEQ.xls

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order. The debt ratio, age of credit history, value of the home, and delinquency history had the most predictive power according to the random forest.

[GRAPHIC OMITTED]

Insights

The regression shows Debt ratio and other variables suggested by random forests to be statistically significant.

[GRAPHIC OMITTED]

The random forest, however, greatly outperforms the logistic regression scorecard on the home equity dataset, showing that logistic regression is not extracting the maximum predictive value from the variables.

The AUC of the random forest was .92, versus .78 for the logistic regression: an 18% out-of-the-box advantage for random forests. A recent study of tuning logistic regression with neural network transformations reported a logistic regression AUC of .86 (Wallinga, 2009). Thus Wallinga's generalized additive neural network enhancement of logistic regression, though a powerful and well thought out approach, raised the AUC from .78 to .86 but still did not match the out-of-the-box performance of random forests.

3. Random Forest vs. Logistic Regression on Proprietary Data set

Proprietary Dataset

The proprietary dataset comprises credit data from 2008; bad loans are defined as loans which go 90 days past due or worse within 2 years on any account, tradeline or loan. The data has 293,421 loan applicants and 19,449 bad loans.

Models

Random Forest Variable Importance: The variable importance plot for random forests showed the following variables to be predictive in rank order.

The revolving line of credit utilization, debt ratio, income, applicant age, number of 30-day delinquencies in 2 years, number of active/open tradelines (activity within 6 months), number of 90-day delinquent tradelines in 2 years, number of 60-day delinquent tradelines in 2 years, and number of mortgage tradelines have the most power in predicting serious borrower delinquency over 2 years. The attributes excluded duplicate or invalid-status tradelines.

[GRAPHIC OMITTED]

Insights

The regression does not show revolving utilization to be statistically significant, while the random forest correctly identifies it as a very predictive variable and obtains maximal predictive value from the data.

Performance

Using these 8 variables, the AUC of the random forest exceeds that of logistic regression by a large margin: 0.8522 for the random forest versus 0.6964 for logistic regression.

[GRAPHIC OMITTED]

In addition, performance was also computed for the popular FICO[R] credit score. The credit score was superior to both the regression and the random forest, with an AUC of .865.

[GRAPHIC OMITTED]

IMPLICATIONS

The fact that a random forest with 8 variables can produce a model competitive with FICO[R] out of the box is remarkable. Logistic regression does not achieve that level of performance out of the box.

This example clearly shows the random forest's superiority in scientifically rank-ordering predictive variables and optimally extracting predictive value from data with multicollinearity and interactions. The advantage of random forests depends on the strength of the relationships between variables; in datasets with few interaction effects, random forests may not outperform. On large credit datasets, behavioral models, and application scoring, random forests can improve existing credit models by 5-10% by guiding the tuning of the regression. Once tuned with judgment and careful testing, logistic regression can outperform random forests. The example of building a random forest that is almost as predictive as the FICO[R] score (AUC .85 vs. .865) with only 8 variables dramatically shows the power of random forests for scientists and credit risk modelers seeking to maximize the predictive value of data.

All 8 variables conform to theoretical soundness, as they relate to borrower cash flow surrogates. Econometrically, credit scoring variables can be segmented into cash flow variables, stability variables, and payment history variables (Overstreet, 1992). Removing the revolving utilization and delinquency behavior variables greatly reduced random forest performance to be more in line with logistic regression, implying that most of the predictive value lies in the interaction of the utilization and delinquency behavior attributes with the other variables. Random forests will outperform when there are complex relationships and interactions between variables that a typical regression might miss.

Explaining the Advantage of Random Forests over Logistic Regression

One explanation of how such a simple model can be competitive with FICO[R] is that credit models are thought to suffer from the flat maximum effect, which implies that models with smaller data can perform close to larger, more sophisticated linear models like logistic regression because such regressors are insensitive to large variations in the size of the regression weights. The random forest advantage also seems to correlate with interaction effects and multicollinearity among variables, as the technique is able to determine complex relationships in the data using a bootstrap of variables and samples to build ensembles of models.

The power of random forests has profound implications for taking credit risk scorecards to the next level: optimizing credit score performance and leading to better, more robust scientific inferences about factors and how they impact phenomena ranging from financial risk to consumer behavior modeling to medical science, and perhaps even mimicking how humans think or behave, as in swarm intelligence.

Optimizing Credit Scorecards Using Random Forests: An approach

Updated Credit Card Random Forest Variable Importance with interaction terms

Mainstream credit scorers can benefit from random forest models as well. One approach to optimizing existing models is to test interaction terms built from the variables identified as most predictive by random forests. For example, using the credit card dataset discussed earlier, one can improve the AUC of the logistic regression to .626 by adding such interaction terms. Thus logistic regression can be tuned to match the out-of-the-box performance of the random forest model (and on some datasets, tuned logistic regression performs better than the random forest).

Overall process for Optimizing Existing Credit Scorecard

* SOAR (Specify data, observe data, analyze, and recommend) (Brown, 2005)

* Run Random Forest

* Take the top predictive fields, create interaction terms in the regression one at a time, and retain statistically significant interactions

* Rerun regression and compare until regression outperforms or closely matches random forest out of sample performance

* Run conditional inference trees to identify interactions and re-run random forest and logit models until maximal performance is achieved.

* Convert fields to factors for the logit, as binned data generally improves logit performance

* Multiply the scores from the random forest and logistic regression, sum them, take the max, and compare areas under the curve. As predicted by Hand's superscorecard literature, multiplying the 2 scores resulted in improved performance as well (Hand et al., 2002).
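The last step (multiplying, summing, and taking the max of two scores, then comparing AUCs) can be sketched as follows. The scores below are toy values chosen so that each model errs on a different segment, which is the situation where combining scores pays off; real score vectors would come from the fitted forest and logit.

```python
def auc(labels, scores):
    """Fraction of (bad, good) pairs the score ranks correctly (ties = 1/2)."""
    pos = [s for yy, s in zip(labels, scores) if yy == 1]
    neg = [s for yy, s in zip(labels, scores) if yy == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy hold-out scores from two models: each ranks one bad loan poorly,
# but never the same one, so their errors are complementary.
y  = [0, 0, 1, 1]
s1 = [0.1, 0.4, 0.3, 0.9]
s2 = [0.4, 0.1, 0.9, 0.3]

combos = {
    "multiply": [a * b for a, b in zip(s1, s2)],
    "sum":      [a + b for a, b in zip(s1, s2)],
    "max":      [max(a, b) for a, b in zip(s1, s2)],
}
print("s1:", auc(y, s1), "s2:", auc(y, s2))   # 0.75 each
for name, s in combos.items():
    print(name, auc(y, s))                     # 1.0 for all three on this toy data
```

On real data the three combinations need not all reach the same AUC; the point is to compute each and keep whichever wins out of sample.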

The method of iteratively using random forests, affordability variables, and logistic regression in combination with conditional inference trees to improve logistic regression until it matches or outperforms random forests is dubbed the Sharma method. For the most comprehensive review of the credit scoring literature and this approach, see Sharma, Overstreet & Beling (2009). The methods are also detailed in the Guide to Credit Scoring in R (Sharma, 2009). The pioneering work behind this was Overstreet et al.'s 1992 theory-based free cash flow model for credit scoring, together with Breiman's work on random forests, which allowed the importance of affordability data to be seen more clearly. Prior to this, most logistic regression scorecards showed income and cash flow data to be only marginally predictive, as the p values were too high and erroneous due to multicollinearity. For details on the checkered history of credit scoring, see Sharma, Overstreet and Beling (2009).

In terms of implementation, R was used along with the Rattle data mining software. Rattle greatly facilitated the speed and ease of running the algorithms and credit scoring once the interaction terms were added by hand-written code and run through Rattle (see Graham, 2008, for Rattle).

Extensions

On large datasets I have been able to improve logistic regressions to match the performance of random forests using trial and error, judgment, and random forest variable importance as a basis for adding interaction terms. This approach is painful and time consuming. A more viable approach would be to use random forest performance as a benchmark and automatically optimize logistic regression on out-of-sample error, testing interactions among the most predictive variables and formulas using a genetic algorithm.
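Such a genetic search over interaction terms might look like the following sketch. Everything here is hypothetical: the candidate pairs, the `USEFUL` set, and the `fitness` function are stand-ins for actually refitting the logit with the selected interactions and measuring out-of-sample AUC against the random forest benchmark.

```python
import random

random.seed(7)

# Candidate interaction terms: all pairs of the 6 variables a random
# forest importance plot might shortlist.
pairs = [(i, j) for i in range(6) for j in range(i + 1, 6)]   # 15 candidates

# Hypothetical stand-in for "out-of-sample AUC gain": in practice, refit
# the logistic regression with the chosen interactions and score a
# hold-out sample; the penalty discourages piling on terms (overfit risk).
USEFUL = {(0, 1), (2, 3)}
def fitness(mask):
    chosen = [p for p, bit in zip(pairs, mask) if bit]
    gain = sum(p in USEFUL for p in chosen)
    return gain - 0.1 * len(chosen)

def mutate(mask, rate=0.1):
    return [bit ^ (random.random() < rate) for bit in mask]

def crossover(a, b):
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

# Evolve bitmasks over the candidate interactions: keep the elite,
# breed and mutate the rest.
pop = [[random.randint(0, 1) for _ in pairs] for _ in range(20)]
for _ in range(40):
    pop.sort(key=fitness, reverse=True)
    elite = pop[:10]
    pop = elite + [mutate(crossover(random.choice(elite), random.choice(elite)))
                   for _ in range(10)]

best = max(pop, key=fitness)
print(sorted(p for p, bit in zip(pairs, best) if bit), round(fitness(best), 2))
```

With a real fitness function, the stopping criterion proposed above applies directly: halt once the tuned logit's out-of-sample AUC reaches the random forest's.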

Credit scoring is a search for meaningful interaction terms, and all financial ratios are interaction terms. Hand has shown that multiplying scores always produces a better or equivalent score, itself again an example of an interaction of variables (Hand, 2002). By viewing financial ratios as interactions, one can widen the lens and search for optimal interactions to obtain optimal predictive power from the affordability data. Traditional regression, with its failure to handle multicollinearity, has made searching for fruitful interaction terms in credit data problematic. Attempting too many interactions can also overfit logits. Thus a careful, knowledge-based approach is needed, which random forest variable importance measures provide. For an in-depth discussion, as well as the most comprehensive literature review of credit scoring and the overall approach, see Sharma, Overstreet and Beling (2009).

CONCLUSIONS

The best of both worlds can be achieved by finding ways to optimally enhance logistic regression using insights from random forest variable importance, which is a more reliable gauge of variable importance and relationships given the multicollinearity in all credit models and data. To date I have tuned logistic regression scorecards judgmentally, using random forest variable importance to outline the interaction terms to be added to the model, but the home equity dataset shows that this might not be enough: more transformations and binning of variables might be needed to squeeze optimal performance into logistic regression. This would best be accomplished by an automated algorithm which explores interaction terms and transformations via stochastic search optimization using genetic algorithms, within a bounded space of meaningful variables identified as predictive by the random forest, using random forest performance as a stopping criterion. Such variables are precisely the ones regression p values might miss. A common example of this oversight by traditional scorecards, since the time of Durand in the 1930s, is income and affordability data, which standard regressions have shown to be non-predictive in the face of common sense. The most successful predictive variables in the mortgage industry are all interaction terms (loan to value, months' reserves, and debt ratio; for an example of mortgage scoring see Avery et al., 1996). The history of credit scoring shows that finding optimal interaction terms is crucial to optimal predictive accuracy, and random forests play a vital role in testing meaningful variables which traditional scoring technologies such as regression failed to identify using p value tests of significance.

Human Values perspective

Credit scoring should be integrated with normative models that ensure borrower wellbeing instead of maximizing profit, as evidenced by the recent global recession. Credit score models, no matter how sophisticated, built to predict two years of data fail to assess the long-term impact on borrower wellbeing, and that is a challenge worth studying; such knowledge would surely lead to sustainable credit markets which do not threaten democracy and have a robust micro-foundation for macro-markets in credit. In the aggregate, proprietary models to predict behavior are all more suboptimal than a white box credit policy which ensures borrower financial wellbeing through constraints on borrower reserves, consumption, and expenses to income over time. Competition in credit modeling will not lead to better consumer welfare, as credit is a commodity; financial institutions should not compete on credit policy for sustainable advantage but should instead compete on convenience, safer products, and customization to fit borrower life stages.

Let's hope that in the future we won't need proprietary models and can live in an enlightened world where borrowers can choose safe products and know the implications of their behavior for their ability to obtain more credit, in an open white box world where behavior is regulated by a desire to conform to standards which make borrowers more fiscally responsible. Credit data is a social good and should be democratized and held by not-for-profit entities.

APPENDIX OF DATA DESCRIPTIONS AND OPEN DATA SETS

Credit Card Dataset Original Variable Descriptions

| Var_Title | Var_Description | Field_Content |
|---|---|---|
| ID_CLIENT | Sequential number for the applicant (to be used as a key) | 1-50000, 50001-70000, 70001-90000 |
| CLERK_TYPE | Not informed | C |
| PAYMENT_DAY | Day of the month for bill payment, chosen by the applicant | 1, 5, 10, 15, 20, 25 |
| APPLICATION_SUBMISSION_TYPE | Indicates if the application was submitted via the internet or in person/posted | Web, Carga |
| QUANT_ADDITIONAL_CARDS | Quantity of additional cards asked for in the same application form | 1, 2, NULL |
| POSTAL_ADDRESS_TYPE | Indicates if the address for posting is the home address or other; encoding not informed | 1, 2 |
| SEX | | M = Male, F = Female |
| MARITAL_STATUS | Encoding not informed | 1, 2, 3, 4, 5, 6, 7 |
| QUANT_DEPENDANTS | | 0, 1, 2, ... |
| EDUCATION_LEVEL | Educational level in gradual order; encoding not informed | 1, 2, 3, 4, 5 |
| STATE_OF_BIRTH | | Brazilian states, XX, missing |
| CITY_OF_BIRTH | | |
| NACIONALITY | Country of birth; encoding not informed, but Brazil is likely equal to 1 | 0, 1, 2 |
| RESIDENCIAL_STATE | State of residence | |
| RESIDENCIAL_CITY | City of residence | |
| RESIDENCIAL_BOROUGH | Borough of residence | |
| FLAG_RESIDENCIAL_PHONE | Indicates if the applicant possesses a home phone | Y, N |
| RESIDENCIAL_PHONE_AREA_CODE | Three-digit pseudo-code | |
| RESIDENCE_TYPE | Encoding not informed; in general the types are owned, mortgage, rented, parents, family, etc. | 1, 2, 3, 4, 5, NULL |
| MONTHS_IN_RESIDENCE | Time in the current residence in months | 1, 2, ..., NULL |
| FLAG_MOBILE_PHONE | Indicates if the applicant possesses a mobile phone | Y, N |
| FLAG_EMAIL | Indicates if the applicant possesses an e-mail address | 0, 1 |
| PERSONAL_MONTHLY_INCOME | Applicant's personal regular monthly income in Brazilian currency (R$) | |
| OTHER_INCOMES | Applicant's other incomes, monthly averaged, in Brazilian currency (R$) | |
| FLAG_VISA | Flag indicating if the applicant is a VISA credit card holder | 0, 1 |
| FLAG_MASTERCARD | Flag indicating if the applicant is a MASTERCARD credit card holder | 0, 1 |
| FLAG_DINERS | Flag indicating if the applicant is a DINERS credit card holder | 0, 1 |
| FLAG_AMERICAN_EXPRESS | Flag indicating if the applicant is an AMERICAN EXPRESS credit card holder | 0, 1 |
| FLAG_OTHER_CARDS | Despite being labeled "FLAG", this field presents three values, not explained | 0, 1, NULL |
| QUANT_BANKING_ACCOUNTS | | 0, 1, 2 |
| QUANT_SPECIAL_BANKING_ACCOUNTS | | 0, 1, 2 |
| PERSONAL_ASSETS_VALUE | Total value of personal possessions such as houses, cars, etc. in Brazilian currency (R$) | |
| QUANT_CARS | Quantity of cars the applicant possesses | |
| COMPANY | If the applicant has supplied the name of the company where he/she formally works | Y, N |
| PROFESSIONAL_STATE | State where the applicant works | |
| PROFESSIONAL_CITY | City where the applicant works | |
| PROFESSIONAL_BOROUGH | Borough where the applicant works | |
| FLAG_PROFESSIONAL_PHONE | Indicates if the professional phone number was supplied | Y, N |
| PROFESSIONAL_PHONE_AREA_CODE | Three-digit pseudo-code | |
| MONTHS_IN_THE_JOB | Time in the current job in months | |
| PROFESSION_CODE | Applicant's profession code; encoding not informed | 1, 2, 3, ... |
| OCCUPATION_TYPE | Encoding not informed | 1, 2, 3, 4, 5, NULL |
| MATE_PROFESSION_CODE | Mate's profession code; encoding not informed | 1, 2, 3, ... |
| EDUCATION_LEVEL (mate) | Mate's educational level in gradual order; encoding not informed | 1, 2, 3, 4, 5 |
| FLAG_HOME_ADDRESS_DOCUMENT | Flag indicating documental confirmation of home address | 0, 1 |
| FLAG_RG | Flag indicating documental confirmation of citizen card number | 0, 1 |
| FLAG_CPF | Flag indicating documental confirmation of tax payer status | 0, 1 |
| FLAG_INCOME_PROOF | Flag indicating documental confirmation of income | 0, 1 |
| PRODUCT | Type of credit product applied for; encoding not informed | 1, 2, 7 |
| FLAG_ACSP_RECORD | Flag indicating if the applicant has any previous credit delinquency | Y, N |
| AGE | Applicant's age at the moment of submission | |
| RESIDENCIAL_ZIP_3 | Three most significant digits of the actual home zip code | |
| PROFESSIONAL_ZIP_3 | Three most significant digits of the actual professional zip code | |
| TARGET_LABEL_BAD | Target variable | BAD = 1, GOOD = 0 |

Source: http://sede.neurotech.com.br/PAKDD2010/

HOME EQUITY DATA SET ORIGINAL VARIABLES

| Name | Model Role | Measurement Level | Description |
|---|---|---|---|
| BAD | Target | Binary | 1 = defaulted on loan, 0 = paid back loan |
| REASON | Input | Binary | HomeImp = home improvement, DebtCon = debt consolidation |
| JOB | Input | Nominal | Six occupational categories |
| LOAN | Input | Interval | Amount of loan request |
| MORTDUE | Input | Interval | Amount due on existing mortgage |
| VALUE | Input | Interval | Value of current property |
| DEBTINC | Input | Interval | Debt-to-income ratio |
| YOJ | Input | Interval | Years at present job |
| DEROG | Input | Interval | Number of major derogatory reports |
| CLNO | Input | Interval | Number of trade lines |
| DELINQ | Input | Interval | Number of delinquent trade lines |
| CLAGE | Input | Interval | Age of oldest trade line in months |
| NINQ | Input | Interval | Number of recent credit inquiries |

Source: www.sasenterpriseminer.com/data/HMEQ.xls

APPENDIX OF R CODE

Credit Card Data Set and interactions

cc<-read.csv("C:/Documents and Settings//My Documents/cckdd2010.csv")

cc$TARGET_LABEL_BAD<-as.factor(cc$TARGET_LABEL_BAD)

cc$QUANT_DEPENDANTS<-ifelse(cc$QUANT_DEPENDANTS>=13,13,cc$QUANT_DEPENDANTS)

#cc$ZipDist<-as.numeric(cc$RESIDENCIAL_ZIP_3)- as.numeric(cc$PROFESSIONAL_ZIP_3)

#cc$StateDiff<-as.factor(ifelse(cc$RESIDENCIAL_STATE== cc$PROFESSIONAL_STATE,'Y','N'))

#cc$CityDiff<-as.factor(ifelse(cc$RESIDENCIAL_CITY==cc$PROFESSIONAL_CITY,'Y','N'))

#cc$BoroughDiff<-as.factor(ifelse(cc$RESIDENCIAL_BOROUGH==cc$PROFESSIONAL_BOROUGH,'Y','N'))

cc$MissingResidentialPhoneCode<-as.factor(ifelse(is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)==TRUE,'Y','N'))

cc$MissingProfPhoneCode<-as.factor(ifelse(is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)==TRUE,'Y','N'))

cc<-subset(cc,select=-ID_CLIENT)

cc<-subset(cc,select=-CLERK_TYPE)

cc<-subset(cc,select=-QUANT_ADDITIONAL_CARDS)

cc<-subset(cc,select=-EDUCATION_LEVEL)

#cc<-subset(cc,select=-STATE_OF_BIRTH)

cc<-subset(cc,select=-CITY_OF_BIRTH)

#cc<-subset(cc,select=-RESIDENCIAL_STATE)

cc<-subset(cc,select=-RESIDENCIAL_CITY)

cc<-subset(cc,select=-RESIDENCIAL_BOROUGH)

cc<-subset(cc,select=-PROFESSIONAL_STATE)

cc<-subset(cc,select=-PROFESSIONAL_CITY)

cc<-subset(cc,select=-PROFESSIONAL_BOROUGH)

cc<-subset(cc,select=-FLAG_MOBILE_PHONE)

cc<-subset(cc,select=-FLAG_HOME_ADDRESS_DOCUMENT)

cc<-subset(cc,select=-FLAG_RG)

cc<-subset(cc,select=-FLAG_CPF)

cc<-subset(cc,select=-FLAG_INCOME_PROOF)

cc<-subset(cc,select=-FLAG_ACSP_RECORD)

cc<-subset(cc,select=-TARGET_LABEL_BAD.1)

cc<-subset(cc,select=-RESIDENCIAL_ZIP_3)

cc$PROFESSIONAL_ZIP_3<-as.numeric(cc$PROFESSIONAL_ZIP_3)

cc$RESIDENCIAL_PHONE_AREA_CODE[is.na(cc$RESIDENCIAL_PHONE_AREA_CODE)] <- 0

cc$PROFESSIONAL_PHONE_AREA_CODE[is.na(cc$PROFESSIONAL_PHONE_AREA_CODE)] <- 0

cc$PROFESSION_CODE<-as.numeric(cc$PROFESSION_CODE)

cc$OCCUPATION_TYPE<-as.numeric(cc$OCCUPATION_TYPE)

cc$MATE_PROFESSION_CODE<-as.numeric(cc$MATE_PROFESSION_CODE)

cc$EDUCATION_LEVEL.1<-as.numeric(cc$EDUCATION_LEVEL.1)

cc$RESIDENCE_TYPE<-as.numeric(cc$RESIDENCE_TYPE)

cc$MONTHS_IN_RESIDENCE<-as.numeric(cc$MONTHS_IN_RESIDENCE)

cc$TotIncome<-cc$PERSONAL_MONTHLY_INCOME+cc$OTHER_INCOMES

cc$OthIncomePct<-cc$OTHER_INCOMES/cc$PERSONAL_MONTHLY_INCOME

cc$MnthsSavings<-cc$PERSONAL_ASSETS_VALUE/(.01+cc$MONTHS_IN_THE_JOB*cc$TotIncome)

cc$Afford<-cc$TotIncome+cc$PERSONAL_ASSETS_VALUE

cc$IncomeToAssets<-cc$TotIncome/(cc$PERSONAL_ASSETS_VALUE+.01)

cc$i1<-cc$QUANT_DEPENDANTS*cc$AGE

cc$i2<-cc$AGE*cc$PROFESSIONAL_ZIP_3

cc$i4<-cc$PROFESSION_CODE*cc$AGE

cc$i5<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$AGE

cc$i6<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_PHONE_AREA_CODE

cc$i7<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OthIncomePct

cc$i8<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$IncomeToAssets

cc$i9<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i1

cc$i10<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i2

cc$i11<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$i5

cc$i12<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$OTHER_INCOMES

cc$i13<-cc$QUANT_DEPENDANTS*cc$RESIDENCIAL_PHONE_AREA_CODE

cc$i14<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$RESIDENCE_TYPE

cc$i15<-cc$RESIDENCIAL_PHONE_AREA_CODE*cc$PROFESSIONAL_ZIP_3

cc$i16<-cc$PERSONAL_MONTHLY_INCOME*cc$PROFESSIONAL_ZIP_3

cc$i17<-cc$OTHER_INCOMES*cc$PROFESSIONAL_ZIP_3

cc$i18<-cc$PROFESSIONAL_ZIP_3*cc$IncomeToAssets

cc$i19<-cc$PROFESSIONAL_ZIP_3*cc$i2

cc$i20<-cc$PROFESSIONAL_ZIP_3*cc$i5

cc$j1<-cc$MONTHS_IN_RESIDENCE*cc$EDUCATION_LEVEL.1

cc$j2<-cc$MONTHS_IN_RESIDENCE*cc$QUANT_CARS

cc$j3<-cc$MARITAL_STATUS*cc$MONTHS_IN_RESIDENCE

cc$j4<-cc$QUANT_CARS*cc$i12

cc$j5<-cc$FLAG_MASTERCARD*cc$i5

cc$j6<-cc$QUANT_CARS*cc$i2

cc$j7<-cc$FLAG_MASTERCARD*cc$i10

cc$j8<-cc$QUANT_CARS*cc$i19

cc$j9<-cc$QUANT_CARS*cc$OthIncomePct

cc$j10<-cc$NACIONALITY*cc$QUANT_CARS

cc$j11<-as.factor(ifelse(cc$FLAG_RESIDENCIAL_PHONE=='Y',cc$FLAG_MASTERCARD,'O'))

cc$j12<-cc$QUANT_CARS*cc$i7

cc$j13<-cc$MARITAL_STATUS*cc$j3

cc$j14<-cc$PAYMENT_DAY*cc$j5

cc$j15<-cc$PAYMENT_DAY*cc$j7

cc$j16<-cc$QUANT_CARS*cc$OCCUPATION_TYPE

cc$j17<-cc$OCCUPATION_TYPE*cc$j9

cc$j18<-as.factor(ifelse(cc$j11=='1',cc$OCCUPATION_TYPE,'O'))

cc$j19<-cc$AGE*cc$i2

cc$j20<-cc$OthIncomePct*cc$i2

cc$j21<-cc$i2*cc$i7

cc$j22<-cc$i2*cc$i10

cc$j23<-cc$i2*cc$i15

cc$j24<-cc$i2*cc$j1

cc$j25<-cc$i2*cc$j2

cc$j26<-cc$RESIDENCE_TYPE*cc$AGE

cc$j27<-cc$RESIDENCE_TYPE*cc$i4

cc$j28<-cc$RESIDENCE_TYPE*cc$i7

cc$j29<-cc$RESIDENCE_TYPE*cc$PROFESSION_CODE

cc$j30<-cc$PROFESSION_CODE*cc$PRODUCT

cc$j31<-cc$PRODUCT*cc$i6

cc$k1<-as.factor(ifelse(cc$AGE<=18 & cc$PAYMENT_DAY<=15,'Y','N'))

cc$k2<-as.factor(ifelse(cc$AGE>18 & cc$PAYMENT_DAY<=15,'Y','N'))

cc$k3<-as.factor(ifelse(cc$AGE>21 & cc$PAYMENT_DAY>15,'Y','N'))

cc$k4<-as.factor(ifelse(cc$AGE<=21 & cc$PAYMENT_DAY>15,'Y','N'))

cc$k5<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX!='F','Y','N'))

cc$k6<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY<=10 & cc$SEX=='F','Y','N'))

cc$k7<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX!='F','Y','N'))

cc$k8<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30<=40,'Y','N'))

cc$k8a<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11!='O' & cc$PAYMENT_DAY>10 & cc$SEX=='F' & cc$j30>40,'Y','N'))

cc$k9<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode!='N','Y','N'))

cc$k10<-as.factor(ifelse(cc$AGE<=46 & cc$AGE>32 & cc$j11=='O' & cc$MissingProfPhoneCode=='Y','Y','N'))

cc$k11<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='Y','Y','N'))

cc$k12<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY<=20,'Y','N'))

cc$k13<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16<=0 & cc$PAYMENT_DAY>20,'Y','N'))

#cc$k14<-as.factor(ifelse(cc$AGE>46 & cc$j11=='O' & cc$FLAG_PROFESSIONAL_PHONE=='N' & cc$j16>0,'Y','N'))

cc$k15<-as.factor(ifelse(cc$AGE>46 & cc$AGE<=52 & cc$j11!='O' ,'Y','N'))

cc$k16<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5<=1220 ,'Y','N'))

cc$k17<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11<=271633 & cc$j5>1220 ,'Y','N'))

cc$k18<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY<=15 & cc$i11>271633,'Y','N'))

cc$k19<-as.factor(ifelse(cc$AGE>52 & cc$j11!='O' & cc$PAYMENT_DAY>15,'Y','N')) #printed as k18 in the source, which would overwrite the previous segment

#logit

m<-glm(TARGET_LABEL_BAD~.,data=cc,family=binomial)

#drop the interaction terms again (note: i3 was never created above, so it is not dropped here)

cc<-subset(cc,select=-c(j1,j2,j3,j4,j5,j6,j7,j8,j9,j10,j11,j12,j13,j14,j15,j16,j17,j18,j19,j20,j21,j22,j23,j24,j25,j26,j27,j28,j29,j30,j31))

cc<-subset(cc,select=-c(i1,i2,i4,i5,i6,i7,i8,i9,i10,i11,i12,i13,i14,i15,i16,i17,i18,i19,i20))

Most of the remaining modeling work was done in Rattle.
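The Rattle steps themselves are not reproduced here; a minimal sketch of an equivalent randomForest call on the prepared frame is shown below (the seed, ntree value, and na.roughfix imputation are assumptions, not settings taken from the paper):

```r
# sketch only: Rattle generates code along these lines under the hood
library(randomForest)

set.seed(42)                                # assumed seed for reproducibility
rf <- randomForest(TARGET_LABEL_BAD ~ ., data = cc,
                   ntree = 500,             # randomForest's default
                   importance = TRUE,       # collect variable importance
                   na.action = na.roughfix) # median/mode imputation for NAs

print(rf)      # OOB confusion matrix and error rate
varImpPlot(rf) # ranks the most predictive variables, as discussed in the paper
```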

Home Equity Data Set R

#sas home equity data set

#www.sasenterpriseminer.com/data/HMEQ.xls

#Wielenga, D., Lucas, B. and Georges, J. (1999), Enterprise Miner(TM): Applying Data Mining Techniques Course Notes, SAS Institute Inc., Cary, NC.

cc<-read.csv("C:/Documents and Settings/ My Documents/HMEQ.csv")

cc$BAD<-as.factor(cc$BAD)

cc$LTV<-(cc$LOAN+cc$MORTDUE)*100/cc$VALUE

cc$JOB<-as.factor(cc$JOB)
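The appendix does not list the home equity model fit itself; a minimal sketch consistent with the home equity logistic regression output reported in this appendix (formula and family only, no tuning) is:

```r
# sketch: logit fit on the prepared HMEQ frame; glm's default na.action
# (na.omit) drops the 1,740 incomplete rows noted in the printed output
m <- glm(BAD ~ ., data = cc, family = binomial)
summary(m)
```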

REFERENCES

Avery, Robert B., Raphael W. Bostic, Paul S. Calem, and Glenn Canner (1996) "Credit risk, credit scoring, and the performance of home mortgages," The Federal Reserve Bulletin, Vol. 82, No. 7, pp. 621-648.

Breiman, L. (2002) Wald 2: Looking Inside the Black Box. Retrieved from www.stat.berkeley.edu/users/breiman/wald2002-2.pdf

Brown, Don (2005) Linear Models. Unpublished manuscript, University of Virginia.

Overstreet, G.A.; Kemp, R.S. (1986) Managerial Control in Credit Scoring Systems. Journal of Retail Banking.

Overstreet, G.A.J., Bradley, E.L., Kemp, R.S., 1992. The flat-maximum effect and generic linear scoring models: a test, IMA Journal of Mathematics Applied in Business & Industry, 4 (1) 97-109

Sharma, D (2009) Guide to Credit Scoring in R. Retrieved from http://cran.r-project.org/doc/contrib/SharmaCreditScoring.pdf

Sharma, D; Overstreet, George; Beling, Peter (2009) Not If Affordability data adds value but how to add real value by Leveraging Affordability Data: Enhancing Predictive capability of Credit Scoring Using Affordability Data. CAS (Casualty Actuarial Society) Working Paper. Retrieved from http://www.casact.org/research/wp/index.cfm?fa=workingpapers

Williams, Graham. Desktop Guide to Data Mining. Retrieved from http://www.togaware.com/datamining/survivor/

Wielenga, D., Lucas, B. and Georges, J. (2009), Enterprise Miner(TM): Applying Data Mining Techniques, SAS Institute Inc., Cary, NC. http://www.crc.man.ed.ac.uk/conference/archive/2009/presentations/Paper-11-Paper.pdf

Dhruv Sharma, Independent Scholar

Logistic Regression Model (Credit Card Data Set)

Variable                          Estimate     Std. Error   z value   Pr(>|z|)
(Intercept)                       0.367793     1.008929     0.365     0.715457
PAYMENT_DAY                       0.019525     0.001859     10.505    < 2e-16 ***
APPLICATION_SUBMISSION_TYPECarga  -0.30733     0.093499     -3.287    0.001012 **
APPLICATION_SUBMISSION_TYPEWeb    -0.09412     0.059745     -1.575    0.115164
POSTAL_ADDRESS_TYPE               0.028979     0.151555     0.191     0.848362
SEXF                              -1.01841     0.611657     -1.665    0.095913
SEXM                              -0.83388     0.611716     -1.363    0.172827
SEXN                              -0.96122     0.717716     -1.339    0.180481
MARITAL_STATUS                    -0.01168     0.009773     -1.195    0.231949
QUANT_DEPENDANTS                  0.020479     0.010485     1.953     0.050805
NACIONALITY                       0.06682      0.071782     0.931     0.351919
FLAG_RESIDENCIAL_PHONEY           -0.82105     0.720136     -1.14     0.254232
RESIDENCIAL_PHONE_AREA_CODE       0.000984     0.000398     2.473     0.013386 *
RESIDENCE_TYPE                    -0.01853     0.010731     -1.727    0.084166
FLAG_EMAIL                        0.017635     0.046639     0.378     0.705338
PERSONAL_MONTHLY_INCOME           5.77E-07     1.48E-06     0.39      0.69655
OTHER_INCOMES                     1.78E-05     1.83E-05     0.971     0.331516
FLAG_VISA                         0.074469     0.042835     1.738     0.082123
FLAG_MASTERCARD                   -0.2261      0.046838     -4.827    1.38E-06 ***
FLAG_DINERS                       0.284333     0.33461      0.85      0.395467
FLAG_AMERICAN_EXPRESS             -0.06287     0.303154     -0.207    0.835701
FLAG_OTHER_CARDS                  -0.05443     0.299613     -0.182    0.85585
QUANT_BANKING_ACCOUNTS            -0.00642     0.058358     -0.11     0.912419
QUANT_SPECIAL_BANKING_ACCOUNTS    NA           NA           NA        NA
PERSONAL_ASSETS_VALUE             -4.4E-08     3.3E-07      -0.134    0.893551
QUANT_CARS                        -0.02769     0.101239     -0.274    0.784451
COMPANYY                          -0.06863     0.031724     -2.163    0.030512 *
FLAG_PROFESSIONAL_PHONEY          0.714282     0.617823     1.156     0.247629
PROFESSIONAL_PHONE_AREA_CODE      -0.00056     0.000721     -0.782    0.434309
MONTHS_IN_THE_JOB                 -0.06383     0.055293     -1.154    0.24833
OCCUPATION_TYPE                   0.026602     0.007269     3.66      0.000252 ***
MATE_PROFESSION_CODE              -0.00679     0.004226     -1.606    0.108175
EDUCATION_LEVEL.1                 0.000307     0.019259     0.016     0.987279
PRODUCT                           0.034652     0.011976     2.894     0.00381 **
AGE                               -0.01968     0.000975     -20.193   < 2e-16 ***
MissingResidentialPhoneCodeY      -0.17667     0.720139     -0.245    0.806201
MissingProfPhoneCodeY             0.804464     0.619538     1.298     0.194119

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Residual deviance: 39312 on 34964 degrees of freedom
AIC: 39384
Number of Fisher Scoring iterations: 4
Log likelihood: -19655.757 (36 df)
Null/Residual deviance difference: 906.696 (35 df)
Chi-square p-value: 0.00000000

Logistic Regression Model (Home Equity Data Set)

Variable         Estimate        Std. Error   z value   Pr(>|z|)
(Intercept)      -17.07851715    524.8953     -0.033    0.974044
LOAN             0.000001803     1.58E-05     0.114     0.909386
MORTDUE          0.000019897     1.26E-05     1.576     0.115104
VALUE            -0.000016881    1.13E-05     -1.501    0.133474
REASONDebtCon    -0.621936258    0.635508     -0.979    0.327756
REASONHomeImp    -0.753124539    0.647779     -1.163    0.244982
JOBMgr           14.79633915     524.8937     0.028     0.977511
JOBOffice        14.22345444     524.8938     0.027     0.978382
JOBOther         14.67559173     524.8937     0.028     0.977695
JOBProfExe       14.83695589     524.8937     0.028     0.97745
JOBSales         15.91157826     524.8939     0.03      0.975817
JOBSelf          15.92432142     524.8939     0.03      0.975797
YOJ              -0.005838696    0.012261     -0.476    0.63394
DEROG            0.802276888     0.123378     6.503     7.89E-11 ***
DELINQ           0.817538124     0.085566     9.554     < 2e-16 ***
CLAGE            -0.00580995     0.001307     -4.445    8.79E-06 ***
NINQ             0.155918992     0.042991     3.627     0.000287 ***
CLNO             -0.027956215    0.009666     -2.892    0.003827
DEBTINC          0.101303396     0.012958     7.818     5.38E-15 ***
LTV              -0.020306186    0.011472     -1.77     0.076707

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 1472.1 on 2431 degrees of freedom
Residual deviance: 1090.8 on 2412 degrees of freedom (1740 observations deleted due to missingness)
AIC: 1130.8
Number of Fisher Scoring iterations: 16
Log likelihood: -545.408 (20 df)
Null/Residual deviance difference: 381.3 (19 df)
Chi-square p-value: 0.0000000

Logistic Regression Model (Proprietary Data Set)

Variable                       Estimate    Std. Error   z value   Pr(>|z|)
(Intercept)                    -1.33216    0.031749     -41.959   < 2e-16 ***
trades                         -0.00552    0.001959     -2.819    0.00482 **
30 number dlq (not worse)      0.501922    0.008535     58.807    < 2e-16 ***
60 number dlq (not worse)      -0.94516    0.014058     -67.233   < 2e-16 ***
90 day number dlq (not worse)  0.478619    0.012085     39.605    < 2e-16 ***
mtg_trd_lines                  0.095229    0.00804      11.844    < 2e-16 ***
monthly income                 -3.6E-05    2.29E-06     -15.642   < 2e-16 ***
age                            -0.02729    0.000655     -41.631   < 2e-16 ***
revolving balance util         2.55E-05    2.95E-05     0.865     0.38731
DebtRatio                      -0.00015    3.61E-05     4.222     2.42E-05 ***

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 118040 on 235272 degrees of freedom
Residual deviance: 108889 on 235263 degrees of freedom (58148 observations deleted due to missingness)
AIC: 108909
Number of Fisher Scoring iterations: 6
Log likelihood: -54444.557 (10 df)
Null/Residual deviance difference: 9151.24 (9 df)
Chi-square p-value: 0.00000000