Article
The Population Accuracy Index: A new measure of population stability for model monitoring

Ross Taplin 1,* and Clive Hunt 2,*
1 Curtin University; [anonimizat]
2 [anonimizat]
* Correspondence: [anonimizat]
Received: date; Accepted: date; Published: date
Abstract: Risk models developed on one dataset are often applied to new data, and in such cases it is prudent to check that the model is suitable for the new data. An important application is the banking industry, where statistical models are applied to loans to determine provisions and capital requirements. These models are developed on historical data, and regulations require their monitoring to ensure they remain valid on current portfolios, often years after the models were developed. The Population Stability Index (PSI) is an industry standard to measure whether the distribution of the current data has shifted significantly from the distribution of data used to develop the model. This paper explores several disadvantages of the PSI and proposes the Prediction Accuracy Index (PAI) as an alternative. The superior properties and interpretation of the PAI are discussed, and it is concluded that the PAI can more accurately summarise the level of population stability, helping risk analysts and managers determine whether the model remains fit-for-purpose.

Keywords: Population Stability Index (PSI); Basel Accord; IFRS 9; model monitoring; model validation.

1. Introduction
For banks, loans are not only income-producing assets but also liabilities when customers default and do not repay their debt. In many jurisdictions these liabilities are measured by procedures set out in regulations such as the Basel Accord (Basel Committee on Banking Supervision, 2006) for capital and the International Financial Reporting Standards (IFRS 9) for provisioning (International Accounting Standards Board, 2014). Capital is required in case of a severe economic downturn while provisions reflect losses expected in current economic conditions. As these valuations form part of the value of the company, their accuracy is important to many stakeholders. These stakeholders include: the bank itself (for example, to make profitable acquisition decisions for new loans); external auditors (who assess the accuracy and reliability of financial statements); regulators (who assess the sustainability of the bank); and investors (who rely on this information to make investment decisions).
Both the Basel Accord and IFRS 9 adopt a standard approach of assessing the risk of loans with three components: probability of default (PD), exposure at default (EAD) and loss given default (LGD). Thus three models are required to predict, respectively, the likelihood of a loan defaulting (being unable to meet its contractual obligations, typically 90 days overdue in payments); the balance owing at the time of default; and the monetary loss to the bank in the case of default (expressed as a fraction of the EAD). Expected loss might be estimated with the product PD × EAD × LGD.
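As a simple worked illustration of this product (all figures hypothetical):

```python
# Expected loss for a single loan under the three-component approach.
# All inputs are hypothetical illustrative values.
pd_, ead, lgd = 0.02, 250_000.0, 0.40  # PD, EAD (monetary units), LGD
expected_loss = pd_ * ead * lgd
print(expected_loss)  # 2000.0
```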

Model development in the banking industry is well covered in the literature (Siddiqi, 2005), but an equally important regulated activity is the continual monitoring of whether the model remains suitable (fit-for-purpose). For example:
Banks that have adopted or are willing to adopt the Basel II A-IRB approach are required to put in place a regular cycle of model validation that should include at least monitoring of the model performance and stability, review of the model relationships and testing of model outputs against outcomes (i.e., backtesting). Sabato (2010, p. 40)

and also:

Stability and performance (i.e., prediction accuracy) are extremely important as they provide information about the quality of the scoring models. As such, they should be tracked and analyzed at least on monthly basis by banks, regardless of the validation exercise. Sabato (2010, p. 40)
This aspect is typically performed both internally and externally: by bankers, auditors and regulators. Monitoring is important because a model developed years earlier may no longer be fit-for-purpose for the current portfolio. One reason for this is that the types of customers within the portfolio may differ from the types of customers available to develop the model.
Population stability refers to whether the characteristics of the portfolio (especially the distribution of explanatory variables) are changing over time. When this distribution changes (low population stability) there is more concern over whether the model is currently fit-for-purpose, since the data used to develop the model differs from the data the model is being applied to. Applying the model to these new types of customers might involve extrapolation and hence lower confidence in model outputs.
There are other characteristics of a model that require monitoring to ensure the model is fit-for-purpose. These include calibration (whether the model is unbiased) and discrimination (whether the model correctly rank orders the loans from best to worst). While these measures are important, they require known outcomes. For example, a PD model predicting defaults in a one-year window must evaluate loans at least one year old to determine calibration and discrimination. Therefore conclusions from these measures are at least one year out of date compared to the current portfolio.
Population stability is important as it requires no lag; it can be measured with the current portfolio since the outcome is not required. It is therefore important to monitor population stability to gain insights into whether the model remains fit-for-purpose for the current portfolio (rather than the portfolio of one year ago).
This paper focuses on the measurement of population stability, especially the Population Stability Index (PSI), which has become an industry standard. Deficiencies in the PSI are explored and an alternative that has superior properties and whose values are more directly interpretable is introduced.

1.1 Models and Notation
Model development tasks are extensive and well covered in the literature, of which Siddiqi (2005) is particularly relevant to the banking industry. Briefly, empirical historical data is used to estimate relationships between an outcome (such as default in the case of a PD model) and explanatory variables (such as employment status of the customer). PD models typically estimate probabilities of default within one year, so for model development the explanatory variables must be at least one year old (so the outcome is known). The model development then looks for, and captures in mathematical form, relationships in the data between the explanatory variables and the outcome. For example, this may take the form of a logistic or probit regression model predicting default. This mathematical form often takes the form of a regression where some (possibly transformed) measure of the outcome equals

$$\beta_0 x_{i0} + \beta_1 x_{i1} + \cdots + \beta_k x_{ik} \qquad (1)$$

where $\beta_0$ to $\beta_k$ are estimated coefficients and $x_{i0}$ to $x_{ik}$ are the values of the explanatory (numerical) variables for the $i$th observation (typically $x_{i0}$ is defined to always equal 1, in which case $\beta_0$ is an intercept).
The explanatory variables have several basic types whose treatments are summarized here because these affect the details presented later (see Pyle (1999) for further details of these issues and treatments). In particular, variables might be categorical or numerical. Categorical variables (such as occupation category) take values from a list (such as trade, professional, retired, student, etc.) and typically have no natural ordering or numerical value. Modelling might create n−1 dummy variables (taking numerical values of 0 or 1, where n is the number of categories) or use numeration, where a numerical value (the weight of evidence) is assigned to each category (Siddiqi, 2005). For example, numerical values might be determined from the observed default rate within each category.
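The paper leaves the numeration formula implicit; the sketch below uses the common weight-of-evidence convention from the credit-scoring literature, WoE_c = ln(share of non-defaults in category c / share of defaults in category c), with hypothetical data and Python (numpy/pandas) assumed:

```python
import numpy as np
import pandas as pd

def weight_of_evidence(categories, defaulted):
    """Numerate a categorical variable from observed default outcomes.

    Uses the common credit-scoring convention
    WoE_c = ln(distribution of non-defaults in c / distribution of defaults in c).
    """
    df = pd.DataFrame({"cat": categories, "bad": defaulted})
    n = df.groupby("cat")["bad"].count()
    bads = df.groupby("cat")["bad"].sum()
    goods = n - bads
    return np.log((goods / goods.sum()) / (bads / bads.sum()))

# Hypothetical portfolio: occupation category and a 0/1 default flag
rng = np.random.default_rng(0)
cats = rng.choice(["trade", "professional", "retired", "student"], size=1000)
rates = {"trade": 0.05, "professional": 0.02, "retired": 0.04, "student": 0.10}
bad = rng.binomial(1, [rates[c] for c in cats])
print(weight_of_evidence(cats, bad))  # higher WoE = lower observed default rate
```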
Numerical variables are defined in numerical terms. For example, the loan to value (LVR) ratio is defined as the value of the loan divided by the value of the asset securing the loan. Modelling might use this numerical value directly, after a simple numerical transformation (such as logarithms or Winsorizing), or by bucketing into a small number of categories (such as 0 to 0.5; 0.5 to 0.8; 0.8 to 1; and >1). Thus bucketing transforms a numerical variable into a categorical variable (which in turn may be numerated with weight of evidence or dummy variables during model development). As expanded on below, this is a key issue not only because bucketing is a common practice in banking but because the PSI is only defined for categorical variables (or bucketed numerical variables).
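A minimal sketch of this bucketing step, using the LVR ranges quoted above (values hypothetical; numpy assumed):

```python
import numpy as np

# Bucket LVR into the four categories quoted above: 0-0.5, 0.5-0.8, 0.8-1, >1
lvr = np.array([0.35, 0.62, 0.95, 1.20, 0.48])
buckets = np.digitize(lvr, bins=[0.5, 0.8, 1.0])  # 0, 1, 2, 3 label the ranges
print(buckets)  # [0 1 2 3 0]
```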
1.2 The Population Stability Index (PSI)
The PSI is closely related to well-established entropy measures, and essentially is a symmetric measure of the difference between two statistical distributions. The index specifically called 'Population Stability Index' (PSI) is found in Karakoulas (2004), as a "diagnostic technique for monitoring shifts in characteristics distributions". It is also described in Siddiqi (2005), who explains its use to either monitor overall population score stability ("System stability report") or, as a likely follow-up, the stability of individual explanatory variables ("Characteristic analysis report") in credit risk modelling scorecards for the banking industry. The same formulation has appeared in the statistical literature as the "J divergence" (Lin, 1991, who in turn references Jeffreys, 1946), and is closely related to the Jensen–Shannon divergence.
The formula for the PSI assumes there are K mutually exclusive categories, numbered 1 to K, with:

$$\mathrm{PSI} = \sum_{i=1}^{K} (O_i - E_i) \times \ln\left(\frac{O_i}{E_i}\right) \qquad (2)$$

where
$O_i$ is the observed relative frequency of accounts in category $i$ at review;
$E_i$ is the relative frequency of accounts in category $i$ at development (the review relative frequency is expected to be similar to the development relative frequency);
$i$ is the category, taking values from 1 to K; and
$\ln()$ is the natural logarithm.
A PSI value of 0 implies the observed and expected distributions are identical, with the PSI increasing in value as the two distributions diverge. Siddiqi (2005) interpreted PSI values as follows: values less than 10% show no significant change; values between 10% and 25% show a small change requiring investigation; and values greater than 25% show a significant change. Note the PSI is large when a category has either the observed or expected relative frequency close to zero and is not defined if either relative frequency equals 0. Therefore a limit argument suggests the PSI might be interpreted as having an infinite value when one of the relative frequencies equals zero.
The calculation of the PSI is illustrated with a hypothetical example in Table 1. A PSI of 0.25 results primarily from the high observed frequencies of 21% in categories 1 and 10. Thus the interpretation recommended by Siddiqi (2005) suggests the distribution of the data has changed significantly from development to review.
Table 1. Calculation of the PSI (example 1).

Category                 1     2     3     4     5     6     7     8     9    10   Total
O_i                    21%    9%    7%    7%    6%    6%    7%    7%    9%   21%    100%
E_i                    10%   10%   10%   10%   10%   10%   10%   10%   10%   10%    100%
(O_i−E_i)×ln(O_i/E_i) .082  .001  .011  .011  .020  .020  .011  .011  .001  .082     .25
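A minimal sketch of equation (2) (Python with numpy assumed), reproducing the Table 1 calculation:

```python
import numpy as np

def psi(expected, observed):
    """Population Stability Index, equation (2).

    expected: relative frequencies per category at development (E_i)
    observed: relative frequencies per category at review (O_i)
    Returns inf when any frequency is 0, following the limit argument above.
    """
    e = np.asarray(expected, dtype=float)
    o = np.asarray(observed, dtype=float)
    if np.any(e == 0.0) or np.any(o == 0.0):
        return float("inf")
    return float(np.sum((o - e) * np.log(o / e)))

E = [0.10] * 10                                                   # development
O = [0.21, 0.09, 0.07, 0.07, 0.06, 0.06, 0.07, 0.07, 0.09, 0.21]  # review
print(round(psi(E, O), 2))  # 0.25: a significant change on Siddiqi's scale
```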

Table 2 shows the calculation of the PSI for a second hypothetical example that also results in a value of 0.25 for the PSI. The similar PSI values are interpreted to mean the deviations from development in the two examples are similar in magnitude.
Table 2. Calculation of the PSI (example 2).

Category                 1     2     3     4     5     6     7     8     9    10   Total
O_i                     3%    8%   11%   14%   15%   15%   14%   11%    8%    3%   100%¹
E_i                    10%   10%   10%   10%   10%   10%   10%   10%   10%   10%   100%
(O_i−E_i)×ln(O_i/E_i) .084  .004  .001  .013  .020  .020  .013  .001  .004  .084    .25

¹ Observed values at review are rounded but sum to 100%.
The similar interpretations based on the PSI from these examples might be reasonable if the 10 categories represent categorical divisions, such as industry sector. However, this is questionable if the categories represent a division of a continuous scale used in model development. For example, the explanatory variable might be the loan to value (LVR) ratio: the value of the loan divided by the value of the asset securing the loan (a common and intuitive predictor of loss). This continuum is divided into 10 categories as this is required for calculation of the PSI (it may or may not have been a modelling choice). Failing to take this information into account and instead treating the categories as unordered can lead to misleading conclusions concerning whether the model remains fit-for-purpose.
The categories in Tables 1 and 2 were constructed from underlying development data with a standard normal distribution, with intervals defined to each capture 10% of the distribution: −infinity to −1.28; −1.28 to −0.84; −0.84 to −0.52, etc. The review data in Table 1 was also normally distributed, but with a standard deviation of 1.6 instead of 1, creating more observations in the extreme categories (below −1.28 and above 1.28 respectively). Observed frequencies were rounded to the nearest percent to ensure the calculations use the exact observed and expected values in Table 1. Similarly, the review data in Table 2 was constructed with a standard deviation of 0.674, creating fewer observations in the extreme categories. This is illustrated in Figure 1.
Figure 1. Distribution of the continuous explanatory variable used to generate the development (solid line) and review data for example 1 (dashed line) and example 2 (dotted line). Boundaries of categories 1 to 10 are defined by the vertical lines, dividing the scale into 10 intervals each containing 10% of the development data.
Although the PSI values in Tables 1 and 2 both equal 0.25, the extent to which the model is fit-for-purpose for the corresponding review data is not the same. In Table 1, the model is being applied to more extreme data than was available at development. Confidence that the model is suitable for this review data should be low, especially because the model is being extrapolated from the development data to the more extreme review data. Not only will a small change in estimated coefficients have a larger impact on the predicted value for these observations, but we have less confidence in the validity of assumptions such as linearity of relationships between response and explanatory variables. In contrast, the review data in Table 2 suggests no extrapolation is involved. If the model was considered fit-for-purpose at development then this change in distribution gives no reason to suggest the model is no longer fit-for-purpose: if it was fit for standard normal data (95% of which is within −1.96 and +1.96) then it should be fit for the review data (95% of which is within −1.35 and +1.35). These examples illustrate how the PSI captures any differences between the development and review data rather than focussing on those differences that suggest the model is not fit for the purpose of estimation on the review data.
The PSI is typically calculated for each independent variable in the model. It can also be computed for variables not in the model, such as variables considered serious candidates during modelling. However, since a separate PSI value is obtained for each variable, this can result in numerous quantitative results when a single value summarizing stability is desirable. To avoid this issue of multiple values summarising population stability, the PSI can be computed on the model output (or score) instead; however, this requires placing the typically numeric model output into categories before calculation.
Finally, the value of the PSI can be influenced by the number and choice of categories. With too many categories the PSI can detect minor differences in the distribution; with too few categories it may miss differences (for example, if two categories, one with a high frequency and one with a low frequency, are combined to form a single category). This can create interpretation issues as it is not always clear whether the categories used are determined a priori or whether they are chosen to smooth out differences in the distributions. This is an important issue in practice as the categories for the PSI often have to be chosen after inspection of the data. In particular, the PSI has unreliable properties when frequencies for a category approach 0. Furthermore, due to the necessity to create categories for numerical variables, extreme outliers have minimal impact on the PSI even though they may have significant impact on model accuracy: if the model uses a numerical variable then assessing population stability using a categorical (bucketed) version may not capture changes in stability appropriately.
2. The Prediction Accuracy Index (PAI)
The Prediction Accuracy Index (PAI) is defined as the average variance of the estimated mean response at review divided by the average variance of the estimated mean response at development. As with the PSI, in this definition it is the values of the explanatory variables (the design space) that are important; the values of the response are irrelevant and not required. The PAI is high when, at review, the explanatory variables take values that result in a variance of the predicted response that is higher than the corresponding variance at development. The cases of a single numeric variable, multiple regression and a categorical variable are considered in the following three sections. Note that these sections are presented for ordinary least squares regression where the response is normally distributed; however, the above definition of the PAI can be applied to any model (e.g. a neural network) where variances of estimated mean responses are available (by techniques such as bootstrapping if necessary).
Unlike the PSI, which is defined on a scale with no obvious interpretation, the PAI measures the increase in variance of the estimated mean response since development. For example, a PAI value of 2 is directly interpretable: the variance of the predicted mean response at review is double the variance of the mean response at development (on average). It is recommended that PAI values are interpreted as follows: values less than 1.1 indicate no significant deterioration; values from 1.1 to 1.5 indicate a deterioration requiring further investigation; and values exceeding 1.5 indicate the predictive accuracy of the model has deteriorated significantly. Note that these guidelines are more stringent than the interpretations by Siddiqi (2005) for the PSI (in Table 1 the PSI was 0.252, the boundary of a significant change, but the PAI equals 1.78, well above the recommended boundary of 1.5). These more

stringent recommendations are based on several factors: a value of PAI equal to 1.5 corresponds to review data having a standard deviation 1.4 times the standard deviation of the development data (if distributions are normal), which is a significant increase; a PSI greater than 0.25 is rare; and since the PAI is more focussed on model predictive accuracy it has more power at detecting deterioration in this important characteristic specific to the model.
2.1 Simple regression
In the case of simple linear regression (equation (1) with $k = 1$ and $x_{i0}$ defined to always equal 1), the variance of the estimated mean response when the explanatory variable $x_i$ is equal to $z$ is given by (Ramsey and Schafer, 2002, p. 187):

$$\mathrm{MSE} \times \left( \frac{1}{n} + \frac{(z - \bar{x})^2}{\sum (x_i - \bar{x})^2} \right) \qquad (3)$$

where
MSE is the mean squared error (of the residuals) from model development;
$\bar{x}$ is the mean value of the explanatory variable $x$ at development;
$n$ is the sample size at development;
and the summation is over all values $x_i$ of the explanatory variable used during scorecard development. Note the average value of equation (3), averaging over values of $z$ equal to the development data $x_i$, is equal to $\mathrm{MSE} \times 2/n$.
The PAI for simple linear regression equals equation (3) averaged over all values of $z$ equal to the review data (denoted $r_j$; $j = 1$ to $N$) divided by equation (3) averaged over all values of $z$ equal to the development data (denoted $x_i$; $i = 1$ to $n$):

$$\mathrm{PAI} = \frac{1}{2} \times \left( 1 + \frac{\sum_{j=1}^{N} (r_j - \bar{x})^2 / N}{\sum_{i=1}^{n} (x_i - \bar{x})^2 / n} \right). \qquad (4)$$
Note that the sums of squares in both the numerator and the denominator are centred on the average of the explanatory variable in the development data (not the average of the review data).

Applying equation (4) to the normally distributed review data in Table 1 gives a value for the PAI of 1.78. That is, the variance of the estimated mean response is, on average, 78% higher when calculated on the review data than when calculated on the development data. This is directly interpretable as the model being 78% less precise on the review data than on the development data. In contrast, the PAI equals 0.73 for the review data in Table 2 and hence the model is on average more accurate on the review data in Table 2 than it was on the development data. This is interpretable as the model being 27% more precise on the review data than on the development data.
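A minimal sketch of equation (4) (numpy assumed), which recovers these values approximately by simulating the normal distributions described in Section 1.2:

```python
import numpy as np

def pai_simple(x_dev, r_rev):
    """Prediction Accuracy Index for simple linear regression, equation (4).

    Both sums of squares are centred on the development mean x-bar.
    """
    x = np.asarray(x_dev, dtype=float)
    r = np.asarray(r_rev, dtype=float)
    xbar = x.mean()
    num = np.mean((r - xbar) ** 2)  # review sum of squares / N
    den = np.mean((x - xbar) ** 2)  # development sum of squares / n
    return 0.5 * (1.0 + num / den)

rng = np.random.default_rng(1)
x_dev = rng.normal(0.0, 1.0, 100_000)  # development: standard normal
print(pai_simple(x_dev, rng.normal(0.0, 1.6, 100_000)))    # approx. 1.78 (Table 1)
print(pai_simple(x_dev, rng.normal(0.0, 0.674, 100_000)))  # approx. 0.73 (Table 2)
```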
2.2 Multiple regression
In the case of a multiple regression model given by equation (1), the estimated variance of the mean response when the explanatory variables take values $z_{j1}, z_{j2}, \ldots, z_{jp}$ is given by (Johnson and Wichern, 2007, p. 378):

$$\mathrm{MSE} \times z_j^T (X^T X)^{-1} z_j \qquad (5)$$

where
$z_j^T = (z_{j1}, z_{j2}, \ldots, z_{jp})$ is the row vector of explanatory variables ($z_{j1} = 1$ when an intercept is included);
$X$ is the matrix of explanatory variables at development;
MSE is the mean squared error (of the residuals) from model development;
$T$ indicates transpose; and
$()^{-1}$ denotes matrix inverse.
The columns of $X$ equal the values of the explanatory variables of the development data (the rows are similar to $z_j^T$ for each observation in the model development data). In practice, equation (5) can be calculated with:

$$z_j^T V z_j \qquad (6)$$

where

$V = \mathrm{MSE} \times (X^T X)^{-1}$ is the variance-covariance matrix of the estimated regression coefficients $(\beta_1, \beta_2, \ldots, \beta_p)$ and is available from most regression software.
The PAI for multiple regression is defined as the average of equation (6) calculated at the values of the explanatory variables $z_j$ at review divided by the average of equation (6) calculated at the values of the explanatory variables $z_j$ at development:

$$\mathrm{PAI} = \frac{\sum_{j=1}^{N} r_j^T V r_j / N}{\sum_{i=1}^{n} x_i^T V x_i / n}. \qquad (7)$$
where
$r_j$ is the vector of explanatory variables for the $j$th observation of the review data ($j = 1$ to $N$);
$x_i$ is the vector of explanatory variables for the $i$th observation of the development data ($i = 1$ to $n$).

The following sections apply this formula to the cases where a single categorical variable has more than two categories (requiring multiple regression with dummy variables to estimate the mean response for each category) and a multiple regression where the model contains several explanatory variables.
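A minimal sketch of equation (7) (numpy assumed); the MSE factor in V appears in both the numerator and the denominator, so it cancels and is omitted:

```python
import numpy as np

def pai_multivariate(X_dev, X_rev):
    """Prediction Accuracy Index for multiple regression, equation (7).

    X_dev, X_rev: design matrices (rows are observations; include a leading
    column of ones when the model has an intercept). The MSE factor in
    V = MSE * (X'X)^{-1} cancels in the ratio and is omitted.
    """
    X = np.asarray(X_dev, dtype=float)
    R = np.asarray(X_rev, dtype=float)
    V = np.linalg.inv(X.T @ X)                        # proportional to V
    num = np.mean(np.einsum("ij,jk,ik->i", R, V, R))  # average of r_j' V r_j
    den = np.mean(np.einsum("ij,jk,ik->i", X, V, X))  # average of x_i' V x_i
    return num / den

# Sanity check: review data identical to development data gives a PAI of 1
rng = np.random.default_rng(2)
z = rng.multivariate_normal([0, 0], [[1.0, 0.8], [0.8, 1.0]], size=5000)
X = np.column_stack([np.ones(len(z)), z])
print(pai_multivariate(X, X))  # 1.0
```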
2.3 One categorical variable
Applying equation (7) to a categorical variable requires the construction of dummy variables to model the differences between the categories; the multiple regression requires a number of parameters (including the intercept) equal to the number of categories. This results in the PAI taking the value of 1 for the data in both Table 1 and Table 2. Indeed, for any distribution of review data across these 10 categories the PAI will always equal 1 if the development data is equally distributed across the categories. This is because in this case the response can be measured with the same precision for each category; a shift in customers from one category at the time of development to another category at review has no impact on model precision if both categories were estimated with equal precision at development. The PSI does not share this property, instead capturing the extent to which the distribution of the review data deviates from the equal frequency distribution at development.
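A minimal sketch checking this invariance property via equation (7), with the dummy-variable design built explicitly (numpy assumed; function names are illustrative):

```python
import numpy as np

def pai_from_designs(X, R):
    """Equation (7) with the MSE factor cancelled (it appears in both averages)."""
    V = np.linalg.inv(X.T @ X)
    num = np.mean(np.einsum("ij,jk,ik->i", R, V, R))
    den = np.mean(np.einsum("ij,jk,ik->i", X, V, X))
    return num / den

def dummy_design(counts):
    """Design matrix for one categorical variable: intercept plus K-1 dummies."""
    K = len(counts)
    rows = []
    for cat, n_cat in enumerate(counts):
        row = np.zeros(K)
        row[0] = 1.0        # intercept column
        if cat > 0:
            row[cat] = 1.0  # dummy for categories 2..K
        rows.extend([row] * n_cat)
    return np.array(rows)

# Development data equally distributed over the 10 categories; review data as
# in Table 1 (counts per 100 accounts). The PAI equals exactly 1.
X_dev = dummy_design([100] * 10)
X_rev = dummy_design([21, 9, 7, 7, 6, 6, 7, 7, 9, 21])
print(pai_from_designs(X_dev, X_rev))  # 1.0
```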
This invariance property does not hold if the categories are not equally frequent at development. A shift in customers from one category at the time of development to another category at review will give a higher value for the PAI if the customers move into a category that, at development, had a lower frequency. To illustrate, if the roles of development and review data are reversed in the examples, so Table 2 now involves extrapolation (from 3% of development data to 10% of review data in categories 1 and 10), the PAI is 1.60. The PAI for Table 1 with reversed data is 0.70. This asymmetry property of the PAI is arguably desirable, as extrapolation and interpolation are not equivalent with regard to model accuracy. The PSI, however, was designed to possess this symmetry as there was no distinction between development and review in its conception (reversing the roles of review and development data gives the same value for the PSI).
3. The Multivariate Predictive Accuracy Index (MPAI)
The Multivariate Predictive Accuracy Index (MPAI) is defined as equation (7) using all the explanatory variables in the model. While this is mathematically equivalent to the case of multiple regression, equation (7), it is discussed separately in this section because considering all explanatory variables is important and not feasible with the PSI. The PSI cannot easily be applied to multivariate distributions of many variables because it requires categories, and ensuring there are enough categories to capture the multidimensional space typically results in too many categories, many of which will have frequencies of development or review data too close to zero.
To illustrate the importance of the MPAI, consider a model with two explanatory variables that are positively correlated at development (circles) and at review (stars) in Figure 2. While the review data does not contain extreme values for either variable (the most extreme observation for either variable is always in the development data), there is a visual pattern whereby the review data tends to be in the lower right corner (high x1 and low x2) where no development data exists. Thus in a multivariate sense extrapolation is involved with this data, and the model estimated using the development data may not be fit-for-purpose for the review data.

Figure 2. Hypothetical development data (circles) and review data (stars) for two explanatory variables x1 and x2.
Applying the MPAI to the data in Figure 2 (three parameters are estimated: one for each of the two explanatory variables and one for the intercept) produces a PAI value of 5.43. This suggests the estimated mean responses have a variance at review that is over 5 times higher than the variance at development. The univariate PAI values from equation (4) are 0.93 for variable x1 and 1.02 for variable x2 (similarly acceptable values are obtained with the PSI if reasonable categories are created). Thus the multivariate PAI is a significant contribution in its own right as it enables deviations between the development and review multivariate distributions to be detected that univariate statistics may not.¹

¹ Univariate statistics such as the PSI may detect the pattern in Figure 2 if the score combining these variables is analysed; however, there is no guarantee of this. For example, a score equal to the sum of x1 and x2 in Figure 2 will produce similar distributions in score for development and review data.
To avoid confusion, it is recommended that the term Multivariate Predictive Accuracy Index (MPAI) is used when all variables are included and the term Univariate Predictive Accuracy Index (UPAI) is used when only one variable is included at a time. The term PAI can be used for either case. Note that the UPAI may involve multiple regression even though only one variable is considered; examples include the treatment of a categorical variable or when a quadratic term is included with a numerical variable to avoid making assumptions of linearity (as discussed in the next section).
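A sketch of the kind of check Figure 2 motivates, with hypothetical correlated development data and review data shifted into the empty corner of the design space; the resulting value depends on the simulated data and will not reproduce the paper's 5.43:

```python
import numpy as np

def pai_from_designs(X, R):
    """Equation (7) with the MSE factor cancelled (as in Section 2.2)."""
    V = np.linalg.inv(X.T @ X)
    num = np.mean(np.einsum("ij,jk,ik->i", R, V, R))
    den = np.mean(np.einsum("ij,jk,ik->i", X, V, X))
    return num / den

rng = np.random.default_rng(3)
n = 2000
# Development: x1 and x2 positively correlated, as in Figure 2
dev = rng.multivariate_normal([0.0, 0.0], [[1.0, 0.8], [0.8, 1.0]], size=n)
# Review: individually unremarkable values, but concentrated in the empty
# lower-right corner of the joint distribution (high x1, low x2)
rev = np.column_stack([rng.normal(1.0, 0.3, n), rng.normal(-1.0, 0.3, n)])

X = np.column_stack([np.ones(n), dev])  # intercept plus two variables
R = np.column_stack([np.ones(n), rev])
print(pai_from_designs(X, R))  # well above 1: multivariate extrapolation flagged
```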
4. Discussion
This paper considers the requirement of stability: that the data used to develop a model is representative of the data the model is currently applied to.
The industry standard in banking for measuring stability, the Population Stability Index (PSI), interprets this requirement as the distribution of development data and the distribution of review data being similar. This paper introduces the Prediction Accuracy Index (PAI), which takes a different perspective on this requirement. The PAI requires that the predictive ability of the model on current data is not significantly worse than the predictive ability at development. This perspective more suitably answers the key question of whether a model is still fit-for-purpose. Both indices only examine the distribution of the inputs to a model, so do not address concerns such as calibration or discrimination of the model. While calibration and discrimination are also important, they have the disadvantage that values of the response are required. This can make the model input data old: at least one year old if an outcome such as default over a one-year outcome window is used. In contrast, the PAI and PSI can be calculated on the characteristics of today's portfolio. This can provide an earlier warning that the model may no longer be suitable for the current customers. A high value of the PAI may therefore require consideration of overlays above model outputs to prudently provision for expected losses or capital, even if historical calibration and discrimination were considered satisfactory.
The PAI has several advantages over the PSI. First, the PAI measures the predictive accuracy of the model when applied to the review data rather than a generic difference in the distribution of review and development data. The PAI penalises a model when it is applied to review data beyond the boundary of the development data (extrapolation) but not when the review data is more concentrated in the regions suited to the model. The latter is not uncommon; an example is when a new application scorecard replaces an old inferior application scorecard and thereby reduces the number of poor applications accepted. This can reduce the variation in the types of customers accepted without introducing customers that were not accepted previously (so no extrapolation is involved).
Second, the PAI is directly applicable to explanatory variables that are numeric or categorical (ordered or unordered). While the PAI could be applied to variables bucketed into categories (for example, when the model applies this transformation of a numeric variable into a few categories), there is no need to do so. Applying the PAI to a raw, untransformed variable might also give important insights.
Third, the PAI does not suffer from calculation problems when categories have frequencies close to zero in either development or review data. Both indices will give an infinite (or undefined) value for categorical data if a category has no observations at development but at least one observation at review, but unless there is good reason to combine this category with another category then this conclusion is not unreasonable, as the model cannot provide a prediction for such an observation. This issue does not arise when the PAI is applied to numeric variables.
Fourth, the PAI can be applied to many explanatory variables simultaneously, thus revealing the extent to which the review data involves extrapolation in a multivariate sense. This is arguably more important than assessing the univariate distributions of each variable one at a time. A common attempt to include multiple explanatory variables in the PSI is to use the model output (score) rather than the model inputs. This is similar to the simple linear regression case, equation (4), as the score is typically a numerical value. In order to apply the PSI the scores must first be assigned to categories. While this approach does take into account all explanatory variables, it has several of the above disadvantages: it requires creation of arbitrary categories; extreme scores will be placed into a category without taking into account how extreme they are (possible extrapolation); and the PSI considers these as unordered categories when the score is clearly ordered. If too many categories are used the PSI can be high due to minor differences in frequencies, and if too few categories are used important differences in the distributions are not captured. More significantly, the PSI only examines one dimension of the multivariate design space (the one defined by the model coefficients) but deviations in other directions are just as important from a model fit-for-purpose perspective.
Fifth, the PAI is directly applicable to most model structures. For example, regression models that include non-linear terms such as logarithmic transformations, quadratics or interactions between two variables are handled naturally by the MPAI. The PSI requires some manipulation of variables that may be unnatural, require careful thought, or have undesirable consequences. For example, it is difficult to interpret the PSI when applied to both a variable and its square in a quadratic regression model, and it is unclear how two variables involved in an interaction should be categorised for application of the PSI.
Sixth, the PAI can be applied without making any linearity assumptions that were considered appropriate at model development but may no longer be valid. For example, including the square of each numeric variable (as well as the variable itself) in the calculation of the PAI, even though the quadratic is not in the model, will estimate population stability in a way that considers the possibility that relationships are non-linear. Due to the extra uncertainty of model predictions when the linearity assumption is relaxed and extrapolation is involved, this will increase the PAI when the review data has significant outliers compared to the development data.
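A minimal sketch of this augmentation (numpy assumed; it presumes a leading intercept column and the pai_from_designs helper sketched earlier):

```python
import numpy as np

def augment_with_squares(X):
    """Append the square of each numeric column (the intercept column excluded).

    Computing the PAI on the augmented design relaxes the linearity
    assumption: review outliers then inflate the index even though the
    fitted model itself contains no quadratic terms.
    """
    X = np.asarray(X, dtype=float)
    return np.column_stack([X, X[:, 1:] ** 2])  # assumes column 0 is the intercept

# Usage with the pai_from_designs helper sketched earlier:
# pai_from_designs(augment_with_squares(X_dev), augment_with_squares(X_rev))
```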
It is recommended that the MPAI is used as the primary diagnostic for population stability. This provides a single value to measure population stability and hence more concise reporting than would be the case if a UPAI (or PSI) value was presented for each variable. UPAI values may add insights concerning which variables are responsible for instability should the MPAI indicate population stability is low. Following these guidelines will produce monitoring reports that more concisely and accurately assess whether a lack of population stability suggests the model under review is no longer fit-for-purpose.
5. Conclusion
The auditing and monitoring of models to assess whether they remain fit-for-purpose is important, and regulations within the banking industry make it clear this is essential to ensure banks, auditors, regulators and investors have confidence in model outputs. The Population Stability Index is an industry standard to assess stability: whether the data the model is currently applied to differs from the data used to develop the model. This paper introduces the Prediction Accuracy Index, which addresses many deficiencies in the PSI and assesses more precisely whether the model remains fit-for-purpose by considering when review data is inappropriate for the model rather than just different to the development data. Adoption of the Prediction Accuracy Index as an industry standard will simplify reporting and improve confidence in the use of credit models.
Author Contributions: Conceptualization, methodology, and original draft preparation, Ross Taplin; all other aspects including software, validation, formal analysis, investigation, and writing (review and editing), Ross Taplin and Clive Hunt.

Funding: This research received no external funding.

Conflicts of Interest: The authors declare no conflict of interest.

References
Basel Committee on Banking Supervision (2006). Basel II: International Convergence of Capital Measurement and Capital Standards, A Revised Framework - Comprehensive Version. Bank for International Settlements, retrieved 4 February 2018, from https://www.bis.org/publ/bcbs128.htm.

International Accounting Standards Board (2014). IFRS 9 - Financial Instruments, retrieved 4 February 2018, from http://www.aasb.gov.au/admin/file/content105/c9/AASB9_12-14.pdf

Jeffreys, H. (1946). An Invariant Form for the Prior Probability in Estimation Problems. Proceedings of the Royal Society of London. Series A, Mathematical and Physical Sciences, 186(1007), 453-461.

Johnson, R.A. and Wichern, D.W. (2007). Applied Multivariate Statistical Analysis. 6th edition. Prentice Hall.

Karakoulas, G. (2004). Empirical Validation of Retail Credit-Scoring Models. The RMA Journal, 87(1), 56-60.

Lin, J. (1991). Divergence Measures Based on the Shannon Entropy. IEEE Transactions on Information Theory, 37(1), 145-151.

Pyle, D. (1999). Data Preparation for Data Mining. Academic Press.

Ramsey, F.L. and Schafer, D.W. (2002). The Statistical Sleuth: A Course in Methods of Data Analysis, 2nd edition, Duxbury Press, Pacific Grove, USA.

Sabato, G. (2010). Assessing the Quality of Retail Customers: Credit Risk Scoring Models. IUP Journal of Financial Risk Management, 7(1), 35-43.

Siddiqi, N. (2005). Credit Risk Scorecards: Developing and Implementing Intelligent Credit Scoring. Wiley.
© 2019 by the authors. Submitted for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).
