Credit Scoring and Data Mining
Computer Lab Workshop 1 using Excel
Using Solver in Excel to build scorecard by linear programming
We will use the data on 100 customers that is found on cc100.xls
Open that Excel file and note what the data means.
Most of the data describes the customer attributes but some describe the subsequent history.
DL1 is number of times in year 1 month overdue, DL2 is number of times 2 months overdue etc. Definition of bad=1 is at least once 1 month overdue (i,e. 1 or more in DL1).
We will do the analysis in the reverse order to what is normally done. In that we will assume we already have split the variable into their classes ( groupings of the values of a variable) and this section will see how we apply linear programming and regression to find the scores for these characteristic classes. Again in real scorecard building one would use dedicated statistical software like SAS or SPSS or specific credit scoring software like Paragon’s DSS..
1.Move to Sheet 2 of the spreadsheet.
Here we have taken the age, number of children and employment status variables and coarse classified them. So Date of Birth has been split into bands of 1918-1932, 1933-1948, 1949-1954 , 1955-1963 ( and of course those with no entry in any of these are born after 1963- the data was obtained in 1988 ). The set has been reordered so all the bads ( those where the good=1 variable has value 0 ) are the first 39 and the goods the remaining 61. The P column has a 1 for the bads and –1 for the goods and the Q column will be the errors. The entries in E102 to O102 are the weights of the corresponding attributes in the scorecard
For the bads we want w1 x1+…wp xp- error to be less than the cutoff;
For the goods we want w1 x1+…wp xp+ error to be greater than the cutoff;
To make the distinction clearer we will want the bads to be less than 80 and the goods to be greater than 100. To do this proceed as follows
In cell A102 type score
In cell e102 type 1 and then copy this across f102 to o102.
This means we start the LP with each attribute having a score weight of 1 (any value is ok).
In cell q1 type error
In cell q2 type 0 and copy this down to q101
This means we will start the LP with all the errors zero, which will be infeasible . (Again any values can be put in here initially)
In cell r1 type sum
In cell r2 type =sumproduct ( e2:o2,e$102:o$102) – p2*q2
Copy cell r2 down the column to r101
( Check you understand what sumproduct does. This gives the score for each person, if the individual attribute weights are those in row 102 , less any errors for the bads and plus any errors for the goods)
In cell s2 type 80
Copy cell s2 down to s40
In cell s41 type 100
Copy cell s41 down the column to s102
This gives the cut-offs we will use. We want bads to be below 80 and goods to be above 100.#p#分页标题#e#
In cell q102 type = sum(q2:q101)
This is the sum of the errors.
We are now in a position to construct and solve the linear programme
Go to Solver in the tools menu ( If it is not there go to Add –ins in the tool menu and click on solver . Solver will then appear in the Tools menu)
When the dialogue box appears
In “set target cell” type or highlight q102
Click on the minimise button
Either by typing in or clicking on the appropriate cells, put in the” by changing cells” box
E102:O102, Q2:Q101
Using the add button put in the following constraints
R2:R40 <= S2:S40
R41:R101 => S41:S101
Q2:Q101=> 0
Check that you understand why these are the constraints you would expect in the LP formulation of a credit scorecard.
Click on the options button and then on the ”assume linear model” button”. We do not want the non-negative button as the variables E102:O102 might be negative.
Click on OK and then solve. Interpret solution.
2. What happens if bads have to be less than 60 and goods have to be greater than 120.
3. Go to Sheet 3 where residential status has been split into three classes “O”, “U” and the rest and the cheque guarantee card variables is also used. Repeat the process to build a scorecard using those two variables only.
Using Regression to build a scorecard by linear regression
4. Having got the variables we can use regression rather than linear programming to obtain the scores.
Return to Sheet 2.
Go to Data Analysis in the Tools menu and click on Regression.
When the Dialogue box comes up enter ( either by typing or by highlighting) as follows
In Input Y Range D1:D101
Input X range E1:O101
Click on labels, confidence levels and residuals
Let output be in a new worksheet Then click on OK.
The result will be that
100Y= 4 +13.5(18-32) -15.5X(33-48) -44X(49-54) +10X(55 -63) + 69X(0child) +66X(1child) + 74X(2child) + 27X(B,Y) –16X(Z, no response) +17X(P,E,L)
So what score does some one born in 1948 with 2 children and occupation type B get?
What is the score for someone born in 1923 with 4 children who did not respond to the employment question?
How good a fit do you feel the scorecard is?
How do the relative scores compare with that obtained by linear programming?
5. Build a regression scorecard using the residential status and cheque guarantee card variables you have analysed
|