Master’s thesis
Author: Aurelian-Tudor Dumitrescu
Study no.: 201305912
MSc. Business Intelligence, School of Business and Social Sciences, Aarhus University
Supervisor: Ana Alina Tudoran
Department of Economics and Business Economics
Exam no.:
Number of characters: 127,000
Max no. of characters: 132,000
THE STUDY OF CHURN IN A
FINANCIAL (MERCHANT
ACQUIRING) INSTITUTION: A
STATISTICAL/DATA MINING
APPROACH
AARHUS UNIVERSITY BSS, DEPARTMENT OF ECONOMICS AND BUSINESS ADMINISTRATION
Abstract
The object of this paper was to study churn prediction in a B2B scenario involving a merchant acquiring company. The approach taken was both a statistical and a data mining one. Namely, the paper used two statistical methods to predict churn: a decision tree and logistic regression. The data used for the prediction consisted of 6458 observations across 25 variables, including the classic RFM variables and several others that have never been experimented with in the specialised literature. The data mining approach consisted of performing in-depth processing and refinement of the data in order to build the parametric model. The Fully Conditional Specification method with the Predictive Mean Matching model was used to impute the missing data. After the data refinement process took place, the models were applied on an evenly balanced training set and evaluated on a test set which displayed the true distribution of the classes in the dependent variable. The logistic regression outperformed the decision tree on every criterion, especially on the customised loss function created specifically for this case, displaying robust results. The variables that had the highest importance in predicting churn were the customer's account currency, gateway provider, business model, customer country and refund number.
CONTENTS OF THE THESIS
CHAPTER I – INTRODUCTION
1. INTRODUCTION .................................................. 1
2. RESEARCH QUESTION .................................................. 3
3. DELIMITATIONS ..................................................
CHAPTER II – THEORETICAL FRAMEWORK
1. LITERATURE REVIEW .................................................. 5
CHAPTER III – METHODOLOGY
1. LOGISTIC REGRESSION .................................................. 8
2. DECISION TREE .................................................. 9
CHAPTER IV – DATA PROCESSING
1. DATA OVERVIEW .................................................. 17
2. DESCRIPTIVE STATISTICS .................................................. 20
3. DATA CLEANING & REFINEMENT .................................................. 22
4. VARIABLE TRANSFORMATIONS .................................................. 30
5. CLASS IMBALANCE .................................................. 30
CHAPTER V – ANALYSIS & RESULTS
1. LOGISTIC REGRESSION MAIN EFFECTS .................................................. 32
2. DECISION TREE C5.0 .................................................. 38
3. MODEL COMPARISON & SELECTION .................................................. 42
CHAPTER VI – FINAL OVERVIEW
1. MANAGERIAL IMPLICATIONS .................................................. 50
2. LITERATURE IMPLICATIONS .................................................. 51
3. LIMITATIONS .................................................. 52
4. FURTHER RESEARCH .................................................. 52
REFERENCES .................................................. 53
APPENDICES .................................................. 61
LIST OF FIGURES AND TABLES
Figure 1. ROC chart. Source: output in SAS Enterprise Miner.
Figure 2. Lift. Source: output in SAS Enterprise Miner.
Figure 3. Cumulative lift. Source: output in SAS Enterprise Miner.
Table 1. Dependent variable.
Table 2. Category 1 of independent variables.
Table 3. Category 2 of independent variables.
Table 4. Category 3 of independent variables.
Table 5. Training and test set distributions.
Table 6. Logistic regression main effects output.
Table 7. Significance of variables for logistic regression with main effects, ordered by coefficient estimate.
Table 8. Variable importance in C5.0 decision tree.
Table 9. Variable importance in C5.0 decision tree.
Table 10. Model comparison.
APPENDICES
Appendix 1. Variable description.
Appendix 2. MCC names.
Appendix 3. Tabulated pattern.
Appendix 4. Little's test.
Appendix 5. Pattern of missing values.
Appendix 6. MI data sets.
Appendix 7. Groups and pooled data.
Appendix 8. Groups coefficient.
Appendix 9. Variable distribution.
Appendix 10. VIF & Durbin-Watson.
Appendix 11. Logit vs independent variables.
Appendix 12. Decision Tree C5.0.
CHAPTER I – INTRODUCTION
1. INTRODUCTION
“B2B” consumers, or Business-to-Business consumers, are private or public commercial entities (companies) which purchase goods or services from other commercial institutions that offer those goods or services, in order to satisfy a business or commercial need. These transactions are considered to be part of the Business-to-Business sphere of commerce. Classic examples of B2B commercial scenarios are logistics companies supplying supermarket chains and consulting firms offering advice to other companies, among others (Brennan, Canning & McDowell, 2014).
The B2B area of commerce is conceptually separate from the “B2C” one. The “B2C” scenario, which stands for Business-to-Consumer, represents commercial instances where private individuals purchase goods or services from commercial entities to fulfil a personal need (Brennan, Canning & McDowell, 2014). Examples include a person buying medicine from a pharmacy or food from a local grocery. There are, however, instances where a company can have a mixed B2B/B2C business model, for example automobile leasing companies which lease their cars to both commercial and private customers, or telecommunication companies offering mobile phone subscription services to private customers and broadband and telephone infrastructure for enterprises, just to name a few.
Besides the philosophical difference between B2B and B2C, one aspect that really separates the two areas of business is the impact of churn. Glady, Baesens & Croux (2009) define the marketing phenomenon called churn, or attrition, also known as customer defection, as the action of a customer, whether a B2B or B2C customer, of discontinuing the business relationship it or he has with the current supplier of goods or services. A person or company which has ended their relationship with a company is called a churner. While churn exists in both the B2B and B2C areas, it plays a more significant role for B2B providers because, according to the papers of Gordini & Veglio (2017) and Rauyruen & Miller (2007), B2B customers tend to:
1. Be fewer in number;
2. Have more frequent transactions i.e. greater number of transactions;
3. Have more financially valuable transactions.
The loss of a customer in a B2B context can lead to reduced sales and revenues and increased costs for acquiring customers (Athanassopoulos, 2000; Risselada, Verhoef & Bijmolt, 2010; Tamaddoni Jahromi, Stakhovych & Ewing, 2014). In fact, the cost of acquiring a new customer is considered to be approximately (approx.) five times higher than that of maintaining a current one (Colgate & Danaher, 2000; Benoit & Van den Poel, 2012). Reichheld and Sasser (1990) studied churn for a financial institution and concluded that decreasing the number of defections by 5% can produce 85% more profits (Benoit & Van den Poel, 2012). Van den
Poel & Larivière (2004) made a similar discovery to that of Reichheld and Sasser (1990) and argued that even a 1% decrease in churn can lead to important profits (Benoit & Van den Poel, 2012). Furthermore, again in the case of companies offering financial services, stopping customers from churning, in other words retaining them, is important because it is considered that the longer the relationship a customer has with the firm, the more valuable the customer becomes to the company (Benoit & Van den Poel, 2009; Benoit & Van den Poel, 2012). Long-term customers purchase more, require less time and fewer resources to be served, have a lower sensitivity to price fluctuations, and are also thought to help attract new customers (Ganesh, Arnold & Reynolds, 2000; Reichheld, 1996; Benoit & Van den Poel, 2012). The reason why long-term customers require fewer resources to be served is that the firm has already gained substantial knowledge of the customer, making the process of serving them smoother (Ganesh, Arnold & Reynolds, 2000; Benoit & Van den Poel, 2012).
Based on the above findings, it can be seen how important it is for companies to detect which customers are at high risk of churning in order to retain them. Customer retention is considered to be the building block of maintaining and further expanding business relationships (Gordini & Veglio, 2017; Eriksson & Vaghult, 2000; Kalwani & Narayandas, 1995). Customer retention is usually achieved via the implementation of appropriate customer retention strategies. These are usually incentives created by the marketing or sales department of a company in the form of discounts, advisory information, and the like. Despite their costs, retention campaigns are believed to generate more revenue and cost less than customer acquisition campaigns (Lam et al. 2004).
On the subject of applying retention strategies, a very important fact to account for is risk.
Those merchants that have the highest risk of churning should receive retention strategies that are specific to their needs. Consequently, predicting which merchants are more likely to churn can help a firm manage its marketing and sales activities and stimulate revenues.
In order to accurately study churn, a prediction model must be designed which takes into account relevant data such as the behaviour of the customer. Such a model should be able to provide information on the customers most likely to churn. One B2B company that was interested in studying churn among its customers was Clearhaus.
2. Company description
Clearhaus A/S is a Danish merchant acquiring company based in Aarhus, Denmark, that was established in 2011. Merchant acquiring is the service of offering e-commerce businesses (e-shops)1 the opportunity to accept payments online when selling on the Internet. Acquirers are sometimes called “banks” because they process the transactions of businesses, keep the funds for a few days in the customer's account at the company, and then release/fund the merchant. It is necessary for a merchant to use the services of an acquirer in order to have their payments processed (Clearhaus, 2018; Dumitrescu, 2018). The reasons why it is necessary for merchants to have an acquirer are:
1 Alternatively referred to as “merchants” from now on.
1. Merchants usually do not possess the data infrastructure to process card data;
2. The risk of money laundering decreases by using an acquirer .
Clearhaus operates mainly in Denmark, where it represents the second largest competitor after Nets A/S. Denmark is a tech-savvy as well as a very competitive market; as a result, companies tend to have lower switching costs than in other countries due to the increasing number of merchant acquirers ready to offer their services at competitive prices and record service times (Clearhaus, 2018; Dumitrescu, 2018).

Despite the fierce competition, Clearhaus holds a market share of roughly 20%. As a result, the company has steadily made its transition from a start-up to a small-medium enterprise. In order to increase its market share, it is important for Clearhaus not only to attract new customers, but also to keep them. Recognising the importance of maintaining good business relationships, Clearhaus was interested in knowing which merchants are more likely to churn and what actions could be taken in order to stop clients from leaving the company (Clearhaus, 2018; Dumitrescu, 2018). As a result, this paper has the following research question:

Research question: What factors can be used to predict churn?
This paper takes an approach which combines statistical learning and data mining concepts in order to answer the research question. Statistical learning will be used to select the model(s) that can be used for making the inferences required by the research question. Practices from data mining will be used to process and refine the data used for building the model(s), as well as for evaluating them.
3. Delimitations
First, this paper analyses strictly the data from the period 1st June 2016 to 1st June 2018 that was made available from Clearhaus' data warehouse. Data on the merchants from previous periods will not be taken into account. As a result, the generalisation of the results of this study will most likely face several challenges, due to the fact that other possible pieces of data could have been used in the analysis but were not accessible or usable.

Second, there are different definitions of churn in the literature. Van den Poel and Larivière (2004) said that churn takes place when a customer has closed their account at a company. Also, Buckinx and Van den Poel (2005) discuss the concept of partial defection, which refers to customers steadily moving parts of their business from the company to another provider over time before total defection. In this paper, a merchant is considered to have churned if the status of the merchant's account is set to “inactive” (closed). However, churning does not always take place on the date the account was closed. Recording that a merchant has churned is mostly a manual process performed by the employees of the support group in Clearhaus. It is very possible for a delay to exist between the date the merchant churned and the registration in Clearhaus' system. Furthermore, merchants do not always notify Clearhaus that they have stopped using its services. Despite there being a
contract signed between Clearhaus and the merchant, the merchant is not obligated to produce any transactions. Also, Clearhaus reserves the right to terminate the contract with any merchant if more than 180 days (6 months) have passed since the merchant last incurred a transaction. However, this practice does not always take place. As a result, the churn analysis in this paper can be said to take place in a semi-contractual setting, although in the specialised literature the scenario would have been considered non-contractual (Tamaddoni Jahromi, Stakhovych & Ewing, 2014).
CHAPTER II – THEORETICAL FRAMEWORK
Despite the analysis of churn in B2C settings being a very popular subject in the research literature, performing it in a B2B scenario encounters the following challenge: the scarce number of specialised research papers dedicated to the topic (Gordini & Veglio, 2017; Tamaddoni Jahromi, Stakhovych & Ewing, 2014). This limitation has been brought to light by several past literature reviews, such as the ones of Martínez-López & Casillas (2013), Tamaddoni Jahromi, Stakhovych & Ewing (2014), Wiersema (2013), Yu et al. (2011) and Gordini & Veglio (2017). This fact leads to difficulties in selecting appropriate churn prediction models (Gordini & Veglio, 2017). One possible reason behind the lack of research on the topic is considered to be the scarcity of developed data provided to researchers (Tamaddoni Jahromi et al. 2014; Wiersema, 2013; Gordini & Veglio, 2017). Moreover, the literature on the analysis of churn in a merchant acquiring firm can be considered to be even more scarce. At the time this paper was written, it was not possible to locate any papers written on such a topic.

In spite of the above limitation, this paper will contribute to the research literature not only by studying churn in a B2B scenario, but also by doing so in a setting which has not been written about before. Furthermore, while this study will rely on methods and types of data already proven to be useful in churn analysis in B2C scenarios, the second contribution will be that of developing a statistical model with a collection of new data specific to an acquirer.
Literature review
Recency, Frequency, Monetary (RFM)
RFM represents a collection of traditional variables used in defining customer behaviour. According to Blattberg, Kim & Neslin (2008), each variable represents the following:

Recency – the time (usually measured in days) between the last transaction and the end of a specific period (the observation period);

Frequency – the number of transactions/purchases made by a customer during a specific time period;

Monetary – the total value of transactions/purchases made by a customer during a specific time period.

The RFM framework has been extended to also include length (Blattberg, Kim & Neslin, 2008):

Length – the number of days that a customer has been with the provider.

The RFM-L variables have consistently been proven to be important predictors of churn (Buckinx & Van den Poel, 2005; Coussement & De Bock, 2013; Jahromi, Stakhovych & Ewing, 2014). However, these variables have also been used in combination with other types of variables. On the subject of customer churn prediction, the categories of demographic and behavioural variables have been used extensively throughout the specialised literature. The data behind demographics such as location, name etc. and those regarding actions can be transposed into the B2B context in the form of variables which describe the customer's profile, e.g. headquarters, business model etc., as well as their financial information from their account (Chen et al. 2015; Verbeke et al. 2012).
1. Decision tree
Decision trees represent some of the most popular models for studying churn, especially in the B2C context, due to their low sensitivity to the distributions of the dependent and independent variables.

Chen et al. (2015) used logistic regression as well as a C4.5 decision tree to study churn in a B2B setting at a logistics company from Taiwan. In their analysis, besides the classic RFM variables, which were part of the transaction behaviour category, and L (length), as well as the customer profile category (an equivalent of the demographic category used in B2C; an example of such a variable was location), the authors created a third category which reflected the quality of service of the company. This third category, though specific to the industry, represented a novelty in the data used for predicting churn. The variables that were found to carry the greatest importance in predicting churn were recency, length and monetary (measured as an average). Namely, the greater the number of days since the last transaction, the shorter the relationship, and the smaller the value provided to the
logistics company during the observation period, the higher the probability of that customer churning.
Chye & Gerry (2002) found that a decision tree was the best performing model in their analysis of churn for an unnamed bank. They compared the model against a logistic regression and a neural network. The data used in the models consisted of 14 variables, ranging from demographics and account balance information to the types of payment cards used. The performance of the models was tested using the AUC (area under the curve). The demographic variable for the gender of the person had the highest importance in predicting churn, while card information was not found to be significant at all. On the subject of socio-demographics, Benoit & Van den Poel (2012) discovered that, besides the RFM variables, personal profiles are very important predictors of churn.
Kumar & Ravi (2008) studied the prediction of credit card churn in a Latin American bank. Besides demographic information, they also included transaction information such as card details and log-in and log-out timestamps. They used a C5.0 decision tree and logistic regression, as well as random forests, support vector machines and others. Using AUC as the criterion, they observed that the decision tree outperformed logistic regression. The decision tree's performance was just as solid as that of a random forest.
Within the industry of electronic banking services, Keramati, Ghaneei & Mirmohammadi (2016) used a decision tree to predict churn among a bank's users. The variables used were customer demographics, transaction history (such as the total value and frequencies of several types of transactions), and the length of the relationship, which was considered to reflect the level of satisfaction among users. The evaluation measures were accuracy and precision, as well as recall and the F-measure. The decision tree registered an accuracy of almost 0.93. The most important variables for predicting churn were the several types of frequencies and the length of the customer relationship.
2. Logistic regression
Logistic regression is a classification technique that has been used extensively throughout the literature for studying churn, both in B2C and B2B contexts, due to its high level of interpretability, operability and robustness (Buckinx & Van den Poel, 2005; Tamaddoni Jahromi, Stakhovych & Ewing, 2014). For example, Tamaddoni Jahromi, Stakhovych & Ewing (2014) used, among others, logistic regression to predict churn using transactional data in the form of RFM from an Australian online Fast Moving Consumer Goods (FMCG) company. In comparison to the several decision trees created, the logistic regression outperformed them on both criteria: AUC and cumulative lift. The most important variables said to influence the probability of churning among customers were recency and frequency; the monetary variable was significant but had a smaller impact. They concluded that a greater number of days since the last transaction (recency) and a larger number of transactions increase the likelihood of customers churning.
Nie et al. (2011) predicted churn in the context of a Chinese bank. The variables used by the researchers were customer information, card information, risk information and transactional information (the monetary value of transactions across several months, the frequency of transactions etc.). At the time, according to them, Nie et al. (2011) were among the first to study the effect of card information on customer churn. Besides logistic regression, they also utilised a C4.5 tree. Logistic regression slightly outperformed the decision trees in terms of accuracy and AUC. The variables found to be the most significant for predicting churn were those pertaining to card information and transaction information, e.g. the lower the value of monetary (expressed as totals or averages), the more likely the customer was to churn. Most of the demographic variables, e.g. location, age etc., were found to be statistically significant but contributed very little to explaining churn.
Gordini & Veglio (2017) also used logistic regression as a benchmark against neural networks and support vector machines in a study of churn in the B2B context of an Italian e-commerce company. The variables used were the RFM ones, as well as demographics and length. The models were evaluated using accuracy, AUC and top-decile lift. Despite being slightly outperformed by the support vector classifier, which was better at handling non-linearity, logistic regression had a very high score across the criteria. The logistic regression, just as the other models, indicated that higher levels of recency led to greater probabilities of churning. Also, the demographic variable “age” was found to be the second most important variable in predicting churn, while frequency, monetary and length played smaller roles.
3. Miscellaneous and combinations
Verbeke et al. (2012) studied churn in a B2B context within the telecommunications industry, using over 15 types of methods to predict churn, including C4.5 decision trees and logistic regression. Using a multinational data set, they discovered that socio-demographic variables as well as financial information on the accounts of customers represent important predictors of churn.

Larivière & Van den Poel (2005) studied customer defection within the context of a financial services provider in Belgium. In their study, logistic regression was used alongside random forests on variables such as customer demographics and transactional information. The results of the study indicated that socio-demographic and card-related variables have a high capability of predicting defection.

Hadden et al. (2005) state in their paper that decision trees and regression are the most popular statistical (data mining) techniques used for predicting churn. As can be seen in the literature review above, these two methods have frequently been used together in papers predicting churn. The next phase of this paper is to present the theory behind these two consolidated techniques. Then, these methods will be used in a B2B context, i.e. merchant acquiring, in order to answer the research question.
Chapter III – METHODOLOGY
Classification
Classification represents the statistical task of predicting a dependent class variable, e.g. a binary variable, while regression predicts numeric dependent variables (Hastie, Tibshirani & Friedman, 2009). Since the task of this paper is to predict which customers of Clearhaus are more likely to churn, the problem can be reconfigured into a binary variable which takes on the value “1” if a merchant is a churner and “0” if it is not. There are several methods used for classification, yet this paper will discuss only two.
1. Logistic regression
Logistic regression is a parametric statistical technique used in classification problems. In comparison to other techniques, logistic regression does not predict an observation's qualitative response or, in other words, does not assign an observation to a specific class or category, but rather predicts the probability of each category of the qualitative variable. Dependent variables in classification are usually dichotomous, i.e. they have two categories as response values (James et al. 2013). The dependent variable in this study is the status of the merchant (churner/non-churner) and is represented by Y. Y can take on only one of two classes at a time:

Y = 1 – account status “closed”, i.e. the customer is a churner/has churned;
Y = 0 – account status “live”, i.e. the customer is not a churner/has not churned.
If simple regression were used, it would be possible for the model to predict negative probabilities for independent variables with values close to zero; also, for very large independent variable values, simple regression could return probability values greater than 1 (James et al. 2013). Since Y can strictly take only 1s and 0s, logistic regression is chosen. The inappropriateness of using simple regression instead of logistic regression can also be seen by analysing the shape of the curve of the function. The curve of a linear function is a straight line, which is why it can return, for example, negative probability values for the dependent variable, whereas a logistic function's curve has the shape of an “S”, a curve which does not cross the 0-1 probability boundaries (James et al. 2013). However, in simple logistic regression, the variance in the dependent variable is explained with the help of only one predictor (independent) variable. Since this paper uses a variety of independent variables, the method of multiple logistic regression is used. The formula for multiple logistic regression is:

$$\log\!\left(\frac{p(X)}{1-p(X)}\right) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \dots + \beta_p X_p \qquad \text{(Equation 1)}$$

The left-hand side of the expression, $\log\big(p(X)/(1-p(X))\big)$, represents the log-odds or logit. It can be seen that the logit in Equation 1 is linear in $X$; this represents one of the assumptions of logistic regression (James et al. 2013).
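As an illustration only (the thesis itself builds its models in SAS Enterprise Miner), a multiple logistic regression of this form could be fitted in Python roughly as sketched below. The file name "merchants.csv" and the restriction to the RFM-L predictors are assumptions made for the sketch, not the thesis' actual variable selection.

```python
# Minimal sketch, not the thesis' SAS Enterprise Miner workflow.
# "merchants.csv" and the choice of predictors are hypothetical.
import pandas as pd
import statsmodels.api as sm

df = pd.read_csv("merchants.csv")
X = sm.add_constant(df[["recency", "frequency", "monetary", "length"]])
y = df["account_status"]                # 1 = churner, 0 = non-churner

logit_model = sm.Logit(y, X).fit()      # maximum likelihood estimation of Equation 1
print(logit_model.summary())            # coefficients (log-odds scale), p-values, etc.
```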
There are several assumptions regarding logistic regression that must be met.
First of all, multicollinearity, i.e. the degree to which the independent variables are correlated with each other, must not be present. At very high levels of multicollinearity, it becomes increasingly difficult to assess the effect that one individual variable has on the response variable. To detect multicollinearity, the correlation coefficient matrix can be used, or the variance inflation factor (VIF), among others. The correlation coefficient matrix displays the correlations between variables, and any values above 0.9 signal multicollinearity (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014). The VIF is the ratio between the variance of the coefficient of a variable when the entire model is fitted and the variance of the coefficient if the variable is fit alone. The minimum value of the VIF is 1, a value which specifies that there is zero multicollinearity. Realistically, there is almost always a small degree of correlation between the variables. The rule of thumb is that if a VIF value is greater than 10 (sometimes 5), then the level of multicollinearity present is too high and no good information on the effect of individual variables can be obtained (James et al. 2013).
Second, the assumption regarding the independence of cases/errors must be fulfilled. It refers to how properly the data was collected. The independence of cases/errors can be assessed by calculating the Durbin-Watson test. For the cases/errors to be considered independent, the Durbin-Watson test should return a score between 1 and 3 (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014).
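A minimal sketch of how these first two checks could be computed outside SAS is given below; the data file and column names are the same hypothetical ones used earlier, and the VIF and Durbin-Watson thresholds follow the sources cited in the two paragraphs above.

```python
# Sketch of the multicollinearity (VIF) and independence (Durbin-Watson) checks.
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

df = pd.read_csv("merchants.csv")       # hypothetical extract of the merchant data
X = sm.add_constant(df[["recency", "frequency", "monetary", "length"]])

# VIF per predictor: values above 10 (sometimes 5) signal problematic multicollinearity
for i, col in enumerate(X.columns):
    if col != "const":
        print(col, variance_inflation_factor(X.values, i))

# Durbin-Watson on the residuals of a preliminary binomial fit: roughly 1-3 is acceptable
resid = sm.GLM(df["account_status"], X, family=sm.families.Binomial()).fit().resid_pearson
print("Durbin-Watson:", durbin_watson(resid))
```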
Third, in logistic regression it is necessary to have a sufficient ratio of outcome cases to independent variables; generally, a minimum ratio of 10:1 is used in the specialised literature. As a result, a proper sample size and collection of predictors must be used in an experiment (Ottenbacher et al. 2004).
Fourth, significant outliers and influential observations must not be present in order to obtain a stable, good model. To check for significant outliers, the standardized residuals must be inspected: their value must not be greater than 3 in absolute terms. However, not all outliers are influential points. Both, however, can create unstable models and need to be removed. Influential points can be determined by calculating Cook's distance values (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014; Field, 2000; Sarkar, Midi & Rana, 2011).
Fifth, in the case of continuous independent variables, there has to be a linear relationship between them and the log-odds of the dependent variable; otherwise, the simple logistic regression must be adapted to include customizations such as interactions and/or polynomials, or other methods must be used (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014; James et al. 2013).
2. Decision trees
The decision tree, in comparison to logistic regression, is a non-parametric classification technique, i.e. it does not make any assumptions regarding the distribution of the data, used for dividing a large collection of data into progressively smaller and more homogeneous groups with regards to the dependent variable (decision trees can also be used for numeric prediction, in which case they are called regression trees) (Berry & Linoff, 2004).
A decision tree is constructed out of the following parts: nodes, branches and leaves. A decision tree is grown starting from the root node. The root node represents the input (independent) variable which best classifies the dependent variable or, in other words, is the node that has a large collection of records with regards to the dependent variable. Then, the root node determines the best split. By “best split” is meant the division of the data that performs best in separating the records into groups in which one class predominates. Best splits continue and lead to new nodes, also called child nodes, until the terminal nodes have been reached and the decision tree has been fully grown (Berry & Linoff, 2004). Nodes are connected to each other via branches; besides connecting the nodes, branches also express the splitting criteria of the nodes (James et al. 2013). From the root node to each leaf/terminal node there is a distinctive path. This path describes the rule(s) used for classifying the dependent variable (Berry & Linoff, 2004). Rules extracted from decision trees have a high degree of interpretability, and this is one of the main reasons why decision trees are so popular among researchers (Nie et al. 2011; Mitchell, 1997). Another likely reason is that decision trees are non-parametric, are not affected by skewed distributions or outliers, have no difficulties handling variables with missing values, and are very easy to use (Hastie, Tibshirani & Friedman, 2009; Berry & Linoff, 2004).
Growing a decision tree is done by using successive binary splitting with the purpose of making the best splits. The best splits are those that increase the purity in a node the most. Low purity means that a node holds observations from multiple classes, while high purity means that a node contains observations mostly from one class. In addition, the best splits are also the ones that produce nodes of equivalent size. Splitting continues until the division process stops improving the purity in the nodes significantly, or when the amount of data in a node reaches a predetermined lower limit, or when the tree reaches a certain depth limit set in advance (Berry & Linoff, 2004; James et al. 2013).
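For illustration, the growing controls just described (a split-quality criterion, a lower limit on node size and a depth limit) map directly onto the parameters of an off-the-shelf tree learner. Note that scikit-learn grows a CART-style binary tree, not the C5.0 tree used later in the thesis, and the data file, predictors and parameter values below are assumptions.

```python
# Illustrative sketch of growing a classification tree with explicit stopping rules.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

df = pd.read_csv("merchants.csv")                      # hypothetical extract
X = df[["recency", "frequency", "monetary", "length"]]
y = df["account_status"]

tree = DecisionTreeClassifier(
    criterion="entropy",      # information gain, the splitting measure C5.0 also uses
    min_samples_leaf=50,      # lower limit on the amount of data in a node
    max_depth=6,              # depth limit set in advance
).fit(X, y)
print(f"grown tree has {tree.get_n_leaves()} leaves")
```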
Regression trees use the RSS (residual sum of squares) as a basis for performing binary splits. Classification trees cannot use the RSS and instead use the classification error rate as a criterion for executing the splits. Because classification trees allocate an observation located in a certain region to the class of training observations that occurs most frequently in that specific region, the classification error rate is the percentage of observations present in that region that are not part of the most common class. However, the classification error rate is not sensitive enough for growing decision trees, and instead the Gini index (population diversity) or cross-entropy (information gain) are the procedures used for assessing the quality of the split(s) (James et al. 2013).
The Gini index represents a measure of the variance across the K classes. If all of the class proportions $\hat{p}_{mk}$ are in the vicinity of zero or one, the Gini index will have a small value. A small value of the Gini index represents high node purity (a good split), while a large value represents low purity (a not-so-good split) (Berry & Linoff, 2004; James et al. 2013). Cross-entropy is an alternative to the Gini index for which a low score is likewise a sign of high purity in the node (James et al. 2013).
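For reference, the two split-quality measures can be written as in James et al. (2013), with $\hat{p}_{mk}$ denoting the proportion of training observations in node $m$ that belong to class $k$:

$$G_m = \sum_{k=1}^{K} \hat{p}_{mk}\left(1-\hat{p}_{mk}\right), \qquad D_m = -\sum_{k=1}^{K} \hat{p}_{mk}\log \hat{p}_{mk}$$

Both measures approach zero when one class dominates the node, which is why a low value indicates a pure node and a good split.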
As mentioned earlier, splitting continues until the full decision tree has been developed. However, a fully grown decision tree built on the training data set is not necessarily the one that performs best in classifying new data. From the root node to the leaf nodes, nodes get smaller, and during this reduction smaller nodes tend to adopt more of the particularities of the data entries in the training sample. In other words, as nodes get smaller, the tree starts to overfit the data, which may lead to erroneous predictions. Instead of having a very big decision tree with many nodes, an alternative is to merge the smaller terminal nodes using the pruning process (Berry & Linoff, 2004). Despite pruning being able to reduce the level of complexity in a decision tree, thereby reducing its variance, it also brings a small increase in bias. Regardless, pruning improves the performance of the tree on the test set (James et al. 2013).
A tree can be pruned using the CART or C5.0 algorithms (Berry & Linoff, 2004; Hastie,
Tibshirani & Friedman, 2009; James et al. 2013).
CART repeatedly prunes the decision tree in order to identify possible subtrees. It first prunes the branches that provide the smallest supplementary predictive power for each leaf node. CART does this using the adjusted error rate, which increases the misclassification rate of every node on the training sample by setting a penalty score for the complexity derived from the number of leaves present in the decision tree. The branches that have misclassification rates greater than the penalty score are set for pruning. By pruning the branches, subtrees that contain the root node are produced, and those subtrees' adjusted error rates are compared to that of the fully grown tree. The performance of the trees is tested using the test data set, and the tree with the highest performance is selected as the best model. The CART algorithm uses the Gini index as its splitting criterion (Berry & Linoff, 2004; Patil, Lathi & Chitre, 2012).
C5.0 represents the successor of the C4.5 algorithm. As with CART, C5.0 starts by growing a tree that overfits the data, after which it prunes the tree to create a better model. However, this algorithm employs pessimistic pruning. It prunes the decision tree by checking each node's error rate and assuming that the true error rate is in fact higher. Assuming that A/B is the error rate for a node (where B is the number of observations that reach the node and A is the number of observations that have been incorrectly classified), C5.0 presumes that this error rate is the lowest one possible. The algorithm estimates the highest error rate that can be expected at a terminal node (leaf) by using an equivalent of statistical sampling. There are several formulas that determine the significance of observing A events in B trials; most notably, such a formula expresses a confidence interval as the range of values A is expected to take. C5.0 makes the assumption that the number of errors observed in the training data set lies at the low end of the above-mentioned range, and replaces it with the high end to obtain the predicted error rate of a leaf, A/B, on new data. As a node decreases in size, the estimated error rate increases. Pruning of child nodes takes place when their estimated errors are higher than the parent node's high-end error estimate (Berry & Linoff, 2004). C5.0 performs multiway splits on categorical variables, and its splitting criterion is cross-entropy (Berry & Linoff, 2004; Patil, Lathi & Chitre, 2012).
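The pessimistic estimate can be illustrated with a small sketch. The exact confidence formula used by C5.0 is not reproduced here, so a Clopper-Pearson-style upper bound with an assumed confidence level stands in for it; the point is only the principle that small nodes receive much higher error estimates than their observed error rates.

```python
# Sketch of a pessimistic error estimate: replace the observed error rate A/B at a
# node with an upper confidence bound on the true binomial error rate.
from scipy.stats import beta

def pessimistic_error(errors: int, cases: int, confidence: float = 0.75) -> float:
    """Upper bound on the true error rate, given `errors` misclassified out of `cases`."""
    if errors >= cases:
        return 1.0
    # Clopper-Pearson style one-sided upper bound (confidence level is an assumption)
    return float(beta.ppf(confidence, errors + 1, cases - errors))

# A node misclassifying 1 of 6 cases has a low observed error (~0.17) but a much
# higher pessimistic estimate, which is what drives the pruning of small nodes.
print(pessimistic_error(2, 40))   # roughly 0.10
print(pessimistic_error(1, 6))    # roughly 0.39
```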
In conclusion, when using the decision tree method, one must choose among the ways in which a decision tree is grown (splitting and pruning).

Based on the literature review and as mentioned earlier, the methods selected for answering the research question are 1. logistic regression and 2. the C5.0 decision tree. The reason why the CART method was omitted was its scarce utilisation in the churn literature (Chen et al. 2015; Chye & Gerry, 2002; Huang, Kechadi & Buckley, 2012; Wei & Chiu, 2002). Furthermore, in order to be able to relate this paper to other studies on churn, the C5.0 algorithm for growing a tree was selected as the model to be compared to the logistic regression.
3. Model performance evaluation
In order to select a model from a collection of models, their performances must be evaluated using one or several common measures. In the literature, there is a variety of metrics that can be used for evaluation, and it is at the latitude of the researcher to decide which criteria are appropriate in the context of the problem at hand. In this paper, six metrics have been used for evaluating the models' performance. Before presenting each criterion, it is important to introduce the concepts, notations and formulas that are needed in order to define the metrics in this paper (a short computational sketch follows the list below):
1. TP (true positives) – the number of correctly classified positive cases; in this context, the number of churners that were correctly classified as having churned;
2. TN (true negatives) – the number of correctly classified negative cases; in this context, the number of non-churners that were correctly classified as not having churned;
3. FN (false negatives, Type I error) – the number of incorrectly classified negative cases; in this context, the number of churners that were incorrectly classified as being non-churners;
4. FP (false positives, Type II error) – the number of incorrectly classified positive cases; in this context, the number of non-churners that were incorrectly classified as being churners;
5. P = TP + FN;
6. N = TN + FP;
7. Churn rate = P/(P+N);
8. Specificity = TN/N;
9. False Positive Rate = 1 − Specificity = FP/N;
10. Sensitivity (recall) = True Positive Rate = TP/P;
11. Yrate = (TP+FP)/(P+N);
12. Precision = TP/(TP+FP) (Burez & Van den Poel, 2009, p. 4627).
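As a quick illustration of the notation, all of the quantities above can be computed directly from the four confusion-matrix cells; the numbers in the example call are made up, not results from the thesis.

```python
# Sketch computing the evaluation quantities defined above from TP, TN, FP, FN.
def churn_metrics(tp: int, tn: int, fp: int, fn: int) -> dict:
    p, n = tp + fn, tn + fp                       # actual churners / non-churners
    return {
        "churn rate": p / (p + n),
        "specificity": tn / n,
        "false positive rate": fp / n,
        "sensitivity (recall)": tp / p,
        "yrate": (tp + fp) / (p + n),
        "precision": tp / (tp + fp),
        "accuracy": (tp + tn) / (p + n),
        "misclassification rate": (fp + fn) / (p + n),
    }

print(churn_metrics(tp=120, tn=900, fp=100, fn=80))   # illustrative values only
```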
1. Percentage of correctly classified/accuracy
The percentage of correctly classified (PCC), also known as accuracy, is the ratio of the cases
classified correctly to the sum of all cases that needed to be classified. Researchers have to
strive to have models with increased accuracy scores. Across the literature, accuracy represents
the most used criterion for evaluating the performance of a classifier model (Nie et al. 2011;
Coussement & Van den Poel, 2008; Drummond & Holte, 2005).
The formula for accuracy is:
Accuracy = (TP+TN)/(TP+TN+FP+FN) (Chen, Hu & Hsieh, 2014, p. 485).
2. Misclassification Error Rate
Misclassification error rate represents the ratio between the total number of cases classified incorrectly and the sum of all cases. Researchers should strive to use models that reduce the misclassification error rate. Based on Burez & Van den Poel (2009, p. 4627), the formula for the misclassification error rate is:
Misclassification Error Rate = (FP+FN)/(FP+FN+TP+TN) = 1 − Accuracy (Burez & Van den Poel, 2009, p. 4627).
3. Mean Squared Error (MSE)
The mean squared error measures how well the predicted value of an observation compares to its true value after using a statistical model such as logistic regression. Its formula is the following:
$$MSE = \frac{1}{m}\sum_{j=1}^{m}\left(y_j - \hat{g}(x_j)\right)^2 \qquad \text{(James et al. 2013, p. 30)}$$

where $\hat{g}(x_j)$ represents the prediction that $\hat{g}$ returns for the $j$th observation. The lower the value of the MSE, the closer the predicted values are to the real responses. In a model, a low MSE is an indication of low prediction variance. A model should strive to have low values of MSE, both on the training and the test set, to be considered a good model in terms of variance (James et al. 2013).
4. ROC and AUC
In order to measure the performance of classification models, the receiver operating characteristic (ROC) curve is calculated and plotted on its graph. The ROC represents the trade-off between the number of False Positives and the number of True Positives (Miguéis et al. 2012). The ROC graph has two dimensions: the False Positive Rate (1 − Specificity) is located on the X-axis and the True Positive Rate (Sensitivity) is located on the Y-axis. The classifier models are placed on the ROC graph according to their results (Kumar & Ravi, 2008). Afterwards, the area under the ROC curve (AUC) is used as a performance criterion. If a model has an AUC close to 1.0, then the model discriminates between the classes very well; a model with an AUC close to 0.5 is considered to have unsatisfactory discrimination. Basically, the greater the AUC value a classifier has, the better it performs (Miguéis et al. 2012; Hanley & McNeil, 1982). The formula for the AUC is the following:
$$AUC = \int_{0}^{1} \frac{TP}{P}\, d\frac{FP}{N} = \frac{1}{P \cdot N}\int_{0}^{N} TP\, dFP \qquad \text{(Burez \& Van den Poel, 2009, p. 4628)}$$
AUC has been used extensively throughout the customer relationship management (CRM) literature to measure a model's performance, as in the works of Coussement, Benoit & Van den Poel (2010), Hill, Provost & Volinsky (2006) and Lemmens & Croux (2006). AUC has been used in additional data mining contexts as well, such as Takahashi, Takamura & Okumura (2009) and Benoit & Van den Poel (2012). One of the possible reasons why AUC is so popular is that models with different predictors can be compared (Burez & Van den Poel, 2009).
5. Lift and cumulative lift
Lift is a measure of how well a model performs in comparison to a random selection and can be represented with a curve. The formula for lift is:

Lift = Precision / (P/(P+N)) = Sensitivity / Yrate (Burez & Van den Poel, 2009, p. 4628).
Regarding the score, a good model should have a lift value greater than 1, with 1 representing the score of a random selection produced by a random model. Lift is considered to be a derivation of the ROC curve (Tufféry, 2001). While AUC and lift bring complementary information, AUC can be used to assess the accuracy of a model without taking the churn rate into consideration, unlike lift. Nevertheless, by using both metrics, and others, the performance of models can be thoroughly investigated (Burez & Van den Poel, 2009; Benoit & Van den Poel, 2012).
Lift can also concentrate only on the x percent (percentile) of customers that have the highest risk of churning. The top x percentile, which differs from context to context, is considered to be a very important segment on which a company should perform retention actions. The formula for the top-percentile lift is:

$$\text{Top-percentile lift} = \frac{\hat{\vartheta}_{x\%}}{\vartheta} \qquad \text{(Lemmens \& Croux, 2006, p. 280)}$$

where $\hat{\vartheta}_{x\%}$ represents the proportion of churners found in the top x percent segment of customers most likely to churn and $\vartheta$ denotes the proportion of churners found in the entire test set (Berry & Linoff, 2004; Tamaddoni Jahromi, Stakhovych & Ewing, 2014; Lemmens & Croux, 2006).
As with normal lift, the higher the lift score the better. As an example, a classifier that returns a top-10% lift (i.e. top-10-percentile lift) with a score of 2 recognises two times more churners in the top 10% riskiest segment of customers than a random assignment/model (Benoit & Van den Poel, 2012; Coussement & Van den Poel, 2008).

In addition, the top-percentile lift is considered appealing due to the fact that it assumes that marketing budgets are constrained. Thus, initiatives to decrease churn, such as direct marketing campaigns, are reserved for the segment of customers that have the highest risk of churning (Benoit & Van den Poel, 2012; Coussement & Van den Poel, 2008).

Lift and top-percentile lift are considered to be some of the most popular performance measures in data mining, alongside accuracy and AUC (Neslin et al. 2006).
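Top-percentile lift is simple to compute from predicted probabilities, as the sketch below shows; the toy data is invented, and the 10% cut-off corresponds to the top-decile lift mentioned above.

```python
# Sketch of top-x% (e.g. top-decile) lift from predicted churn probabilities.
import numpy as np

def top_percentile_lift(y_true, y_score, pct=0.10):
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    k = max(1, int(len(y_true) * pct))               # size of the riskiest x% segment
    top = np.argsort(y_score)[::-1][:k]              # indices of the highest-risk customers
    return y_true[top].mean() / y_true.mean()        # churn rate in segment / overall churn rate

# A value of 2 would mean the top decile contains twice as many churners as a random selection.
print(top_percentile_lift([1, 0, 0, 1, 0, 1, 0, 0, 0, 1],
                          [.9, .2, .4, .7, .1, .6, .3, .5, .2, .8]))
```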
6. The loss function
Another method for comparing the performance of classifiers is the loss function. The loss function quantifies the losses (costs) a company may suffer in case of bad predictions (Glady, Baesens & Croux, 2009). In the context of churn, the costs associated with misclassification are related to the marketing budget of a firm. The cost function stands in contrast to the accuracy measure, because PCC considers all misclassifications to carry the same costs, which is seldom true. In addition, PCC displays high sensitivity to the class distribution, as well as to the cut-off value chosen for mapping the classifier output to classes (Glady, Baesens & Croux, 2009). Returning to the loss function, it can easily be argued that not all misclassifications carry the same cost. In business contexts, it is quite common for a Type I error (false negatives) to carry more weight than a Type II error, and the reverse in certain instances. Moreover, the difference in loss between the two has even been approximated by researchers: Nie et al. (2011), citing Lee et al. (2006), stated that the losses produced by a Type II error are usually 5-20 times greater than the losses from a Type I error. However, the opposite can take place too.
In the context of churn, this would be equivalent to stating that misclassifying potential churners as non-churners, and thus not attempting any retention actions, is costlier than performing marketing retention campaigns on customers who were predicted to be in danger of churning but did not have such an intention. This type of scenario appeals to reason and is most likely the rule rather than the exception.
A loss function can be computed in different ways, for example using the customer lifetime value (CLV). According to Gupta, Lehmann & Stuart (2004), customer lifetime value is equal to the expected total of discounted earnings to be made in the future. Glady, Baesens & Croux (2009) developed a loss function using CLV. They determined that their loss function was equal to the earnings a firm obtained without attempting to stop customers from churning, minus the sums that would have been secured if the firm had been able to stop the customers from churning (Glady, Baesens & Croux, 2009). Another slightly different approach to modelling the loss function is that of constructing cost curves in order to visualise the performance of the models over a range of misclassification costs (Drummond & Holte, 2006). However, loss functions are usually created for the purpose of measuring the misclassification cost produced by the errors of the model, using a formula constructed out of basic economic figures such as the marketing budget of a company and the average profit generated per customer, as in Nie et al. (2011).
The cost function in this research paper was inspired by the work of Nie et al. (2011) but adapted by assuming that Type I errors are costlier than Type II errors. In their study of churn, they constructed a loss function using the churn rate, the Type I and Type II errors, the marketing budget, and the average profit per customer. As a result, the formula for the loss function in this paper was the following:
$$Loss = C\,t_1\,P_{avg} + \frac{M_b}{N t_2 + C(1-t_1)}\, N t_2 \qquad \text{(Nie et al. 2011, p. 15275)}$$

where C represents the number of customers that churned (churners); N is the number of customers that did not churn (non-churners); $M_b$ constitutes the total marketing budget, calculated as the number of merchants multiplied by the average marketing budget per merchant; $P_{avg}$ denotes the average profit per merchant; $t_1$ is the Type I error rate, while $t_2$ is the Type II error rate.

The first component of the loss function, $C\,t_1\,P_{avg}$, represents the loss produced by the Type I error. This misclassification leads a company not to take any retention actions, because potential churn merchants were considered not to be at risk of churning.

The second component of the loss function, $\frac{M_b}{N t_2 + C(1-t_1)}\, N t_2$, is the loss generated by the Type II error. The fraction $\frac{M_b}{N t_2 + C(1-t_1)}$ represents the average marketing cost allocated to an individual customer, and $N t_2$ denotes the number of non-churners misclassified as churners.
For Clearhaus, the average profit per merchant was calculated to be 141.69 euros, and the average marketing budget per merchant was 14.85 euros.
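The adapted loss function can be sketched as a short routine using the two Clearhaus figures just reported (average profit 141.69 EUR and average marketing budget 14.85 EUR per merchant); the class counts and error rates in the example call are illustrative only, not the thesis' results.

```python
# Sketch of the adapted loss function: Loss = C*t1*P_avg + (M_b / (N*t2 + C*(1-t1))) * N*t2
def churn_loss(c, n, t1, t2, p_avg=141.69, mb_per_merchant=14.85):
    """c: churners, n: non-churners, t1/t2: Type I / Type II error rates."""
    mb = (c + n) * mb_per_merchant                    # total marketing budget
    type_i_loss = c * t1 * p_avg                      # churners missed, profit lost
    campaign_cost_per_target = mb / (n * t2 + c * (1 - t1))
    type_ii_loss = campaign_cost_per_target * n * t2  # non-churners needlessly targeted
    return type_i_loss + type_ii_loss

print(churn_loss(c=800, n=5658, t1=0.20, t2=0.10))    # illustrative inputs only
```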
Chapter IV – DATA PROCESSING
1. Data description
The merchant acquiring company Clearhaus provided the data for this paper, which was extracted from its internal data warehouse. All of the data is aggregated at the merchant level. The data set initially contained 6458 observations/customers (churners plus non-churners) distributed across 25 independent variables belonging to three categories, and one dependent variable. 19 were interval variables, 6 nominal and 2 binary (including the dependent variable). The categories for the variables were based on the papers of Chen et al. (2015), Gordini & Veglio (2017) and Tamaddoni Jahromi, Stakhovych & Ewing (2014).

The values for the variables were recorded in the period 1st of June 2016 to 1st of June 2018, i.e. an observation period of two years. The reason why earlier data has not been considered in this study is that the company's data warehouse underwent reconfigurations at the end of May 2016 (Dumitrescu, 2018). As a result, specific data was lost or altered, and it can be considered that consistent measurement and reporting took place from June 2016 onwards. The collection of variables for this study can be seen in the tables below.
Dependent variable

account_status (binary) – Takes the value 0 for merchants who were still active (non-churners) at the end of the study period, and 1 for merchants who were considered to have stopped being active (churners) by the end of the study period.

Table 1. Dependent variable. Source: own making.
Category 1. Merchant profile

1. merchant_mid (nominal) – Unique customer (merchant) identifier.
2. merchant_mcc (nominal) – The industry in which the merchant is active. Clearhaus' merchants are divided into groups; the exact list of groups can be seen in Appendix 2.
3. merchant_country (nominal) – The country of origin of the merchant, e.g. DK (Denmark), GB (Great Britain) etc.
4. merchant_currency (nominal) – The currency of the merchant's account, e.g. euros, Danish crowns etc.
5. gateways_num (interval) – The total number of gateways used by a merchant. A gateway represents a partner Clearhaus uses to bring in customers.
6. gateways_name (nominal) – List of the names of gateways used by the merchant.
7. physical_delivery (binary) – Whether the merchant delivers goods/services physically.

Table 2. Category 1 of independent variables. Source: own making.
Category 2. Transaction behaviour

1. recency (interval) – Total number of days since the last transaction.
2. frequency (interval) – Total number of transactions performed by the merchant and processed by Clearhaus.
3. monetary (interval) – Total value of transactions.
4. length (interval) – Total number of days of the business relationship with the merchant.
5. rec_trans_num (interval) – Total number of recurring transactions.
6. 3d_trans_num (interval) – Total number of 3-D Secure transactions.
7. credit_num (interval) – Total number of credit cards used in transactions made at the merchant.
8. debit_num (interval) – Total number of debit cards used in transactions made at the merchant.
9. mastercard_num (interval) – Total number of Mastercard cards used in transactions at the merchant.
10. visa_num (interval) – Total number of Visa cards used in transactions at the merchant.
11. other_debit_credit_num (interval) – Total number of other types of payment cards than debit or credit.

Table 3. Category 2 of independent variables. Source: own making.
Category 3. Merchant quality

1. cb_num (interval) – Total number of chargebacks suffered by the merchant.
2. cb_val (interval) – Total value of chargebacks.
3. ref_num (interval) – Total number of refunds performed by the merchant.
4. ref_val (interval) – Total value of refunds performed by the merchant.
5. pay_rel_num (interval) – Total number of payments released to the merchant.
6. pay_held_num (interval) – Total number of payments held for the merchant.
7. total_fraud_cases (interval) – Total number of fraud cases.

Table 4. Category 3 of independent variables. Source: own making.
Further explanation

While the RFM-L variables (the first four variables in Category 2) are readily understood, it is necessary to say a few more words about the other variables in order to understand their significance:

Rec_trans_num, as mentioned above, refers to the number of recurring transactions. Recurring transactions are transactions in the form of subscriptions: customers of merchants are billed automatically if they use such types of transactions. Often, clients forget that they have active recurring transactions, or merchants forget or are late to disable them, and this usually leads to chargebacks. Chargebacks represent amounts of money that a merchant must pay back to a client if the client disputes the transaction at his or her bank.
3-D Secure transactions are transactions that require two-factor authentication. Authentication is the process of making sure that the person making the purchase is actually the owner of the card. This is usually done by asking the customer to enter in the payment window the string of characters received in an SMS sent by the client's bank. 3-D Secure transactions are relevant because they are a very well-known cause of customers abandoning their virtual carts: when faced with the window requesting them to authenticate themselves, customers often panic and leave the checkout. Clearhaus sometimes requests merchants to use 3-D Secure.
Cb_val represents the total value of the chargebacks that a merchant incurred. A chargeback is an amount of money the merchant has to return to a customer after a disputed transaction. High monthly chargeback levels, i.e. over 0.05 (total number of chargebacks divided by total number of transactions), lead to penalties for Clearhaus for supporting troublesome merchants.
Pay_held represents the total number of payments Clearhaus has reserved from the account of the merchant in order to protect itself from fraud.
2. Descriptive statistics
3. Data cleaning and refinement
This part reflects the data mining nature of the current research paper. Data cleaning and refinement is the longest and most important process in data mining, because it produces the main resource for statistical model development: "cleaned" data (Berry & Linoff, 2004).
3.1 Variable conversion
By inspecting the distribution of the variable gateways_num, it was seen that the only values taken on by this variable were 1 and 2, with 1 being by far the dominant value and 2 occurring in only 0.006% of the observations. Hence, the variable was converted from an interval variable to a nominal variable.
3.2 Removing outliers
3.2.1. Pre-imputation
Outliers are extreme observations, unusual occurrences that are not representative of the rest of the data set. They can constitute either recording errors or, in special instances, peculiar cases (Berry & Linoff, 2014). Outliers are especially problematic for methods like neural networks, linear regression or logistic regression, particularly if they are significant outliers. As a result, outliers are generally removed from data sets (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014).
To start with, observations can be considered outliers, and thus removed, if they have values that do not meet a certain threshold, e.g. observations that occur in less than x% of a data set. This is especially the case for observations that take on values belonging to a rare nominal level. For interval data, observations that take on values more than n standard deviations away from the mean can represent potential outliers. After screening, the prospect of deleting these observations needs to be in logical harmony with the case at hand. This practice is especially common in data mining (Berry & Linoff, 2004).
When working with interval data, and when one of the methods considered for model development is logistic regression, outliers can be removed by running a preliminary regression of the log-odds (logit) values of the dependent variable on the independent variables and plotting the residual values (i.e. the differences between the observed and fitted values in the logistic regression) (James et al. 2013). Outliers can then be removed after visually inspecting the plot. However, an outlier's importance cannot be assessed visually. To compensate for this, outliers can be evaluated by standardizing the residuals: observations with standardized residuals larger than 3 in absolute terms represent significant outliers, and observations with a Cook's distance greater than 1 represent influential observations and need to be removed (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014; Field, 2000; James et al. 2011).
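A minimal sketch of this residual-based screening is given below, assuming a numeric design matrix X and binary target y are already available. The thesis performed the equivalent step in SPSS/SAS; the statsmodels calls are an assumption about that library's current API rather than the author's code:

import numpy as np
import statsmodels.api as sm

# Preliminary logistic regression of the churn flag on the predictors.
logit_fit = sm.GLM(y, sm.add_constant(X), family=sm.families.Binomial()).fit()

influence = logit_fit.get_influence()
std_resid = influence.resid_studentized        # standardized residuals
cooks_d = influence.cooks_distance[0]          # Cook's distance per observation

# Rule of thumb used above: |standardized residual| > 3 or Cook's distance > 1.
to_drop = (np.abs(std_resid) > 3) | (cooks_d > 1)
X_clean, y_clean = X[~to_drop], y[~to_drop]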
The steps mentioned above were exactly the ones performed in this paper. Before any outlier removal took place, the distribution and frequency plots of the variables were visually inspected. For example, the nominal variable merchant_mcc had most of its observations (approx. 76%) concentrated in the Group 5000 level, while both Group 1000 and Group 2000 had less than 0.5% of the observations. A similar example, but for an interval variable, is cb_val: its mean was around 20,301 euros and 99.7% of its values fell below roughly 158,078 euros, but it had 2 values over 1.4 million euros. Thus, it was certain that outliers were present, especially since the majority of Clearhaus' merchants tend not to have such high levels of chargebacks. Initially, the interval variables were screened for potential outliers, and observations with values more than 3 standard deviations away from the mean were removed. This process not only removed outliers but also improved the level of skewness across the majority of the interval variables. For example, for monetary, skewness was roughly halved (from 21.97 to 10.75). Although filtering had a positive effect on skewness, the variables remained very skewed, so removing outliers did not mask the actual distribution of the data, and transformations could be performed both before and after outlier removal.
Nominal variables were filtered based on the percentage of each level in the data sample, i.e. levels representing less than 1% were removed. In the process, several levels were removed across three different nominal variables. For example, for gateways_num, the level "2" was removed because its occurrence was under the threshold. As a result, the variable became unary and, consequently, was removed from the dataset. Other examples of removed levels were Group 1000 and Group 2000 for the variable merchant_mcc, and NOK and SEK for merchant_currency. It was not a surprise that there were very few merchants with accounts in Swedish or Norwegian crowns, taking into consideration that Clearhaus' primary focus was on the Danish and English-speaking markets, and that the Swedish and Norwegian markets have some of the highest entry barriers in Europe.
The cut-off points used above were based on industry-standard values for data mining (SAS Institute, 2018). A total of 526 observations were removed as a result of the two steps described above.
3.2.2. Post-imputation
It has to be noted that, for the interval variables, the filtering based on standard deviations was not the last step in detecting and removing outliers. After imputing the missing values as described in section 3.4.7.4 below, a logistic regression of account_status was run in order to determine the residuals with absolute values greater than 3 and the observations with a Cook's distance greater than 1. After performing the regression, 118 observations were detected as significant outliers or influential points. Hence, they were removed, leaving the dataset with 5814 observations in total.
The reason the outliers were not all removed at once was that the imputations were necessary before the reliable logistic regression described in this section could be run.
3.3. Grouping levels
The nominal variables merchant_country, merchant_currency, merchant_mcc and gateways_name had more than two levels, which did not pose a problem for the decision tree; for the logistic regression, however, the levels of such variables need to be transformed into dummy variables, and grouped if necessary (James et al. 2013). For example, the variable merchant_mcc had 5 levels (Group 4000 to Group 8000), each of which was reconfigured into a binary dummy variable, e.g. dummy_group_4000, which took the value 1 if the merchant belonged to the category and 0 otherwise. merchant_country and gateways_name each had over 30 levels, and creating 60-plus dummy variables could have led to overfitting the model. Consequently, the levels were grouped by percentage frequency. For example, in merchant_country, the level DK, i.e. Denmark, accounted for 86.15% of the observations in the variable; as a result, this level received its own dummy variable, coded dummy_country_over_86. The biggest concern with this approach was that 28 countries, each of which represented less than 1% of the observations, were grouped into one category, dummy_country_1. However, the variable monetary was examined, and it was concluded that the value of the transactions was very similar across the 28 levels and distinct from the other four, i.e. CY (Cyprus), IE (Ireland), GB (Great Britain) and DK (Denmark).
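The grouping-by-frequency step can be sketched as follows; the band labels and thresholds are hypothetical stand-ins for the ones described above, and df is assumed to be the cleaned data frame holding merchant_country:

import pandas as pd

def group_levels_by_share(series, bands):
    # bands: ordered list of (label, minimum share); the first band whose
    # threshold a level's observed frequency reaches gives the group label.
    shares = series.value_counts(normalize=True)
    def band_for(level):
        for label, lower in bands:
            if shares[level] >= lower:
                return label
        return bands[-1][0]
    return series.map(band_for)

# Illustrative thresholds mimicking the country grouping described above.
country_band = group_levels_by_share(
    df["merchant_country"],
    [("over_86", 0.86), ("4_5", 0.04), ("2_5", 0.02), ("under_1_5", 0.01), ("under_1", 0.0)],
)
country_dummies = pd.get_dummies(country_band, prefix="dummy_country")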
3.4. Missing values
Missing values in a dataset represent the lack of recordings for some observations, distributed over various variables. The consequences of missing values in data mining are the loss of potentially important information and the difficulty of developing models that yield reliable results, thus complicating the process of inference (Feng et al. 2008).
According to Rubin (1976) and Dong & Peng (2013), missing values can occur under three scenarios (mechanisms): MAR (missing at random), MCAR (missing completely at random) and MNAR (missing not at random).
3.4.1. MAR
In relation to the previous paragraph, if the distribution of missing values depends solely on the observed values, then the scenario for the missing values is MAR (Dong & Peng, 2013; Rässler, Rubin & Zell, 2013).
3.4.2. MCAR
The MCAR scenario is the one in which the probability of missingness is independent of all data values, regardless of whether they are missing or observed. As a result, if missing values are considered to be part of the MCAR scenario, they can be considered a random sample of the data set. MCAR represents a subcase of MAR (Dong & Peng, 2013; Rässler, Rubin & Zell, 2013).
3.4.3. MNAR
MNAR is the scenario in which the probability of missing values depends on the missing values themselves, irrespective of the observed ones. MNAR is a more difficult mechanism to manage, as it is not ignorable (see below): it requires input from the researcher in order to account for the relationships of the missing values, thus making the process more difficult in comparison to MCAR and MAR (Dong & Peng, 2013; Rässler, Rubin & Zell, 2013).
3.4.4. Ignorability
Regarding the relationship between MAR, MCAR and MNAR: if the scenario for the missing values is MAR, and the data model's parameter and the parameter for missingness are independent, the mechanism for the missing data is considered ignorable (Dong & Peng, 2013; Little & Rubin, 2002). Because they tend to be independent in most cases, the question of the ignorability of the mechanism translates to whether the missing data are part of the MAR scenario or not (including MCAR) (Dong & Peng, 2013; Rässler, Rubin & Zell, 2013; Allison, 2001).
3.4.5. MCAR vs. MAR vs. MNAR
The type of missing data mechanism dictates how missing values can be treated, so it is important to identify the one at hand. The first step is to determine whether the mechanism is MNAR or MAR. This can be done indirectly by using Little's chi-square test, which tests whether the values are MCAR. For a p-value greater than 0.05, the null hypothesis, i.e. that the missing values are MCAR, cannot be rejected and the missing values are assumed to be MCAR (Dong & Peng, 2013). If, however, the p-value is lower than 0.05, the null hypothesis is rejected and the mechanism at hand can only be MAR or MNAR. It is, however, uncertain which of the two it is. The distinction between MAR and MNAR is ultimately based on assumption. One method that can aid decision making is a simple t-test of the difference in means between the sample with missing values and the sample without (Dong & Peng, 2013; Little & Schenker, 1995).
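Such a t-test comparison can be sketched as below. Little's MCAR test itself is not part of the standard Python scientific stack (the thesis obtained it from SPSS); df, the variable names and the Welch variant of the t-test are assumptions of this sketch:

from scipy import stats

# Split an observed variable (here cb_num) by whether ref_num is missing,
# then test the difference in means between the two groups.
missing_mask = df["ref_num"].isna()
group_missing = df.loc[missing_mask, "cb_num"].dropna()
group_observed = df.loc[~missing_mask, "cb_num"].dropna()

t_stat, p_value = stats.ttest_ind(group_missing, group_observed, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")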
3.4.6. Patterns of missing values
Missing data can exhibit three different patterns: univariate, monotone and random (Dong & Peng, 2013). If a fixed group of observations has missing values on at least one common variable, the pattern of missing values is univariate. Furthermore, a pattern is considered monotone if, by removing the variable(s) with missing values, the missing values in the remaining variables disappear; monotone patterns are said to have a staircase-like shape. If, however, missing values are placed arbitrarily in the data set, the pattern of missing data is called random or arbitrary (Dong & Peng, 2013).
After determining the mechanism and the pattern of the data, several methods can be used
to treat missing values.
3.4.7. Methods for treating missing values
3.4.7.1. Removal
First of all, if a variable displays missing values and it can be assumed that they are MCAR, then elimination of the cases from the data set (also known as listwise/case deletion) can be done without creating any bias because, as mentioned earlier, the missing values are assumed to represent a random sample of the data set. This method is quite popular in the research world, but it comes at the cost of decreased statistical power and lost information. Moreover, deleting cases with missing values is not an appropriate solution if the dataset is small (Acock, 2005).
3.4.7.2. Dummy variable adjustment (DVA)
If the missing data are considered to be MNAR, then the variables with missing values can be replaced with dummy variables which express whether an observation has a missing value for that variable ("1") or not ("0"). This approach has been suggested by Cohen & Cohen (1983) and Cohen et al. (2003). The benefit of DVA is that it can aid prediction, but it does not capture the uncertainty related to the missing values (Acock, 2005).
3.4.7.3. Single imputations (SI)
Missing values in a variable can be replaced with one unique value (single imputation) if it can be assumed that the mechanism is MAR (Rässler, Rubin & Zell, 2013). While methods such as replacing missing class values with the most commonly occurring value in the specific class (the mode), and missing interval values with the mean value of the respective variable, are traditional, they are problematic (Hwang, Jung & Suh, 2004; Acock, 2005). They bring a lot of bias into the study by increasing Type II errors, producing possibly high correlations, altering the strength of coefficients and affecting the level of variance (Rässler, Rubin & Zell, 2013; Acock, 2005). Luckily, there are alternatives. For example, if the pattern of missing data is either monotone or univariate, values can be replaced using the results obtained from regressing the variable with missing values on other variables. Also, if the pattern of the missing data is non-monotone, the iterative expectation-maximization algorithm represents a further alternative (Rässler, Rubin & Zell, 2013).
3.4.7.4. Multiple imputations (MI)
Again assuming MAR, instead of a single imputation resulting from using only one set of candidate replacement values, multiple sets can be generated and tested. The purpose of MI is to reduce the noise and uncertainty resulting from single imputations (Azur et al. 2011).
According to Dong & Peng (2013), a method that performs MI does so in three phases:
1. Impute the missing values several times to create n data sets with no missing values;
2. "Train" the data sets with the help of a statistical method, e.g. logistic regression, in order to obtain groups of parameter estimates;
3. Combine (pool) the results of all groups into the final estimate, using the formulas of Schafer (1997) or Rubin (1987) (sketched below).
These three phases represent the MI process.
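For completeness, Rubin's (1987) combining rules used in the pooling step can be written (in LaTeX notation) as follows, where \hat{Q}_i and \hat{U}_i denote the parameter estimate and its variance from the i-th of the m completed datasets:

\bar{Q} = \frac{1}{m}\sum_{i=1}^{m}\hat{Q}_i, \qquad
\bar{U} = \frac{1}{m}\sum_{i=1}^{m}\hat{U}_i, \qquad
B = \frac{1}{m-1}\sum_{i=1}^{m}\bigl(\hat{Q}_i-\bar{Q}\bigr)^2, \qquad
T = \bar{U} + \Bigl(1+\frac{1}{m}\Bigr)B

Here \bar{Q} is the pooled estimate, \bar{U} the within-imputation variance, B the between-imputation variance and T the total variance whose square root gives the pooled standard error.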
In order to use an MI method, it has to fit the proper setting. For example, the Markov Chain Monte Carlo (MCMC) method and the regression method both require the variables with missing values to be normally distributed. However, regression can be used only if the pattern of missing data is univariate or monotone, while MCMC is employed mainly when the pattern is arbitrary (Dong & Peng, 2013).
When the pattern of missing data is arbitrary, or the variable with missing data is either categorical or has a non-normal distribution, an iterative MCMC method named the fully conditional specification method (FCS) (also known as multiple imputation by chained equations or "MICE") can be used (Dong & Peng, 2013; Raghunathan et al. 2001; van Buuren, Boshuizen & Knook, 1999; van Buuren et al. 2006).
3.4.7.5. FCS/MICE
The FCS method works by using each variable with missing values as the dependent variable in a separate model in which the variables with no missing values serve as predictors. With the help of the predictors, the missing values are imputed, and this process can continue for several iterations; the standard number is 5-10 (IBM, 2018). As mentioned earlier, FCS is very flexible when it comes to working with variables that have missing values. The method has several imputation procedures (models) that can be used on variables with different distributions: linear regression (dependent variable is interval), logistic regression (dependent variable is binary) or predictive mean matching (dependent variable is interval and extremely skewed) (IBM, 2018).
PMM can be considered a semi-parametric procedure because it uses linear regression, but it replaces missing values in a random fashion, using a group of observed values whose predicted values are the closest to the predicted value attributed to the missing value by the simulated regression model (SAS, 2018; Heitjan & Little, 1991; Schenker & Taylor, 1996). PMM can also be thought of as a type of K-nearest neighbour method, because the missing values are determined from "close" complete observations; however, its matching function is not a true distance function because it can actually equal zero (Schenker & Taylor, 1996; Di Zio & Guarnera, 2008). The strong benefit of PMM is that it is able to impute "real" values, i.e. values which exist in the dataset and are not artificially created, like means (Di Zio & Guarnera, 2008).
PMM has been used across several different fields, from survey analysis to medicine and economics, for example in the works of Vink et al. (2014), Kleinke (2018), Lee & Carlin (2016) and Di Zio & Guarnera (2008).
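In R, the mice package is the usual FCS implementation; a rough Python equivalent, using the MICE utilities in statsmodels (whose default imputer matches against the k_pmm closest donors in a PMM-style fashion), could look like the sketch below. This is an assumption about that library's API and not the procedure actually run for the thesis, which relied on SPSS:

from statsmodels.imputation import mice

# df_numeric: data frame with the variables entering the imputation model,
# including the incomplete log_ref_num, log_ref_val and physical_delivery.
imp = mice.MICEData(df_numeric, k_pmm=20)

# One FCS pass updates every incomplete variable in turn; several passes per
# completed dataset mirror the 5-10 iterations mentioned above.
completed_sets = []
for _ in range(5):
    imp.update_all(n_iter=10)
    completed_sets.append(imp.data.copy())  # one completed dataset per cycle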
3.4.7.6. The case
In this paper's dataset, there were no missing values among the nominal variables. However, there were missing values in the binary variable physical_delivery, i.e. 1519 missing values, and the interval variables ref_num and ref_val also had missing values. A tabulated patterns table (Appendix 3) was created to see whether joint patterns of missing values occurred in more than 1% of the merchants. It was seen that the two interval variables displayed a clear pattern of missing data: ref_num and ref_val had joint missing values in 2695 cases. After also inspecting the pattern chart, it was concluded that ref_num and ref_val most likely had missing values in the same observations. This was not surprising, since both variables record the same phenomenon in different measures. physical_delivery shared 605 missing-value occurrences with both ref_val and ref_num and had missing values in 708 cases on its own. In total, 3403 merchants had some form of missing values.
With respect to the target variable account_status, when only ref_num's and ref_val's values were missing, the number of non-churners was approx. 4 times greater than the number of churners. When there were no missing values in the respective variables, the ratio was only slightly higher, i.e. approx. 4.6. Finally, all variables were compared with ref_val, ref_num and physical_delivery using simple t-tests, in order to check whether the differences in means between the samples with missing values and the samples without were statistically significant. The p-values for the differences were significant, as expected, with scores of 0.000. For example, the mean of cb_num when ref_num had missing values was almost 10 times lower (0.39) than when it had no missing data (3.54). For physical_delivery, all differences in means were significant, with the exception of pay_held_num's and pay_rel_num's means. In addition, the assumption that no MCAR mechanism was present was confirmed by Little's chi-square test (Appendix 4), which returned a p-value of 0.00, rejecting the null hypothesis that the missing values followed a MCAR mechanism. Thus, case deletion did not represent an option for dealing with the missing values. Single imputation via means was not considered as an alternative because of its inferiority to MI. As a result, the latter, MI, was selected as the appropriate means of dealing with the missing data.
The next step was to decide whether the MAR or the MNAR mechanism was at hand. Since there was no information available to determine why the data were missing, it was preferred to treat the mechanism as MAR. Even if this assumption was flawed, using techniques for MAR instead of MNAR is less biased than imputing means or removing cases (Rässler, Rubin & Zell, 2013). By assuming MAR and checking the pattern chart (Appendix 5), it was argued that the missing data leaned towards a monotone pattern. Then, the distributions of the variables ref_num and ref_val were inspected. Both variables were extremely right-skewed. Natural log transformations were applied to the variables in order to smoothen their distributions, and the variables were recoded as log_ref_num and log_ref_val. Despite the transformations, the variables maintained their non-normal distributions. Hence, using simple linear regression to impute the missing values would not have been appropriate, because of the violation of the normality assumption in the dependent variables. Instead, FCS with PMM was selected as the MI imputation method and model for log_ref_num and log_ref_val. FCS was also used for physical_delivery, but logistic regression was preferred as its imputation model, since the variable is dichotomous.
Step 1 of MI was completed by creating 5 new datasets with no missing values. The missing values were replaced with imputations set to resemble the distribution of the observed values in the variables. Across all datasets, the means of the imputed values slightly underestimated the mean of the original observed data (Appendix 6). The same was the case for the maximum imputed value, which fell below the maximum observed value in the original data set. In contrast, the minimum of the imputed values in all 5 groups respected the minimum of the observed values. In the end, the difference was not considered large enough to invalidate the groups, although it does serve as a limitation, especially taking into account that, when plotting the means and standard deviations of the imputations of the 5 groups, the pattern was not random (Appendix 7).
Step 2 of MI was executed by running 6 preliminary logistic regressions, one on each of the datasets: the original data set and the 5 data sets with imputations. Step 3 of MI took place by pooling the results for the 5 complete data sets into one final estimation of the parameters ("pooled"). The results of the regression on the original data represented the results a regression would have had if the dataset had undergone case deletion. When comparing the results of MI versus case deletion, it is worth looking at the standard error (SE) and the significance of the coefficients (Appendix 8). For example, the variable physical_delivery had an SE of 0.466 in the case deletion scenario. The SE for the variable in scenario 3 of MI was 0.213, and the result for the "pooled" regression was 0.373, i.e. a decrease in the variability of the parameter estimation. This meant that, by using MI, the respective regression estimated the coefficient with higher precision. Other variables which experienced a decrease in variability were the dummy variables for the categorical variables merchant_country, merchant_mcc, merchant_currency and gateways_name, the interval variables log_ref_num, other_debit_credit_num and pay_rel_num (a very small decrease), as well as the constant.
Regarding significance, most of the variables found to be significant/insignificant in the regression with case deletion were also significant/insignificant in the regressions on the other data sets. One exception was log_ref_val: its p-value in the regression with case deletion was 0.005, while in the regression on MI group 3 it had a p-value of 0.335. Also, the variable dummy_group_8000, which was insignificant at 0.439 with case deletion, became significant in all regressions with MI (Appendix 8).
Based on the results above, it was decided that MI was a better option than case deletion when performing logistic regression on any of the five data sets. The bias of imputing computed values came as a trade-off for the improved level of variability (variance). After obtaining the complete data sets, the next step was to continue the data cleaning and refinement process.
Theoretically, one should use all five of the datasets obtained, run new regressions after further cleaning the data, and pool the results again in order to continue the performance assessment of MI. However, as a limitation of this paper, the MI process was halted by selecting only one of the complete groups resulting from MI for statistical modelling. Group 3 was chosen because its imputed values were very close to the means of the original dataset.
4. Variable transformations
By analysing the distributions of the interval variables in the dataset, it was observed, as mentioned previously, that they were heavily right-skewed. This did not come as much of a surprise, taking into account the business nature of the company. For example, cb_num and cb_val are both highly right-skewed because Clearhaus is obligated by Visa and Mastercard, as well as by the financial authorities, to maintain as customers only merchants with low levels of chargebacks, both in number and in value. So, having the vast majority of the data distributed on the left side of the density graph (first bin) and very few examples on the right was expected. Returning to the variables above, it was observed that the majority of merchants with extreme values for cb_val also had large values for cb_num, and had their accounts closed by the end of the observation period. Because the majority of the interval variables were right-skewed, logarithmic (log) transformations were initially performed in an attempt to normalise the distribution of the data and obtain a better fit. The variables ref_num and ref_val had already been transformed in order to perform MI, despite the fact that ref_num is a discrete variable. Although log transformations are among the most popular techniques for smoothing/normalising data (Feng et al. 2014), they did not offer any substantial overall changes to the distributions of the variables (Appendix 9). More importantly, however, it was argued that keeping the variables in their initial form would yield a simpler interpretation and could provide a contrast between the effects of one-unit changes and k-fold changes in the independent variables on the dependent variable.
5. Class imbalance and the data for the train and test samples
In studying churn, it is most often the case that non-churners outnumber churners in a dataset. As a result, such datasets suffer from the class imbalance problem (Japkowicz, 2000; Huang, Kechadi & Buckley, 2012). A training set is balanced when the ratio between the number of churners and the number of non-churners is equal to 1. The class imbalance problem makes it more difficult for classifiers to learn from the data which customers are more likely to churn, which leads to models with reduced classification performance. To overcome this problem and ensure that classifiers learn properly, researchers can use sampling techniques like oversampling and undersampling to balance the data (Verbeke et al. 2012). The importance of using a balanced training sample has been well documented in the research world, for example in the work of Coussement and Van den Poel (2008), Verbeke et al. (2012) and Gordini & Veglio (2017). In conclusion, the practice of training models on sets with non-natural class distributions is fairly popular and is also supported by Weiss & Provost (2001).
In the present dataset, after the data processing, there were 5814 observations in total. Of these, only 924 (approx. 15.89%) were churners, while the remaining 4890 observations (approx. 84.11%) represented non-churners. The ratio of churners to non-churners was 1:5.3, or approximately 1:5. In this paper, random undersampling of the majority class (non-churners) and random oversampling of the minority class (churners) were used to balance the training dataset and offer enough input for the models to learn from the data. The test set used for assessing model performance reflected the natural distribution/proportion of churners and non-churners in the original data set.
The process of developing the balanced training set started by dividing the data into two equal sets, training and testing, each of which had 2907 observations. Since the training set had only 462 churners and 2445 non-churners, the churners were randomly oversampled, while the non-churners were undersampled. The proportions in the test set were not altered, taking into account that they reflected the "natural distribution" of the original data set. This process can also be seen in the works of Veglio & Gordini (2017) and Verbeke et al. (2012), among others. The samples used for the training and test sets can be seen below:
Training set:
  Non-churners: 1453 (49.98%)
  Churners: 1454 (50.02%)
  Total: 2907 (100%)
Test set:
  Non-churners: 2445 (84.11%)
  Churners: 462 (15.89%)
  Total: 2907 (100%)

Table 5. Training and test set distributions. Source: based on Veglio & Gordini (2017).
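A sketch of how such a balanced training set can be produced is given below. The thesis used SAS Enterprise Miner's sampling facilities; the helper function, its random seed and the churn coding are illustrative assumptions:

import pandas as pd

def balanced_training_set(train, target="account_status", churn_label=1, seed=42):
    # Random over-sampling of churners and under-sampling of non-churners
    # until each class makes up roughly half of the training sample.
    churners = train[train[target] == churn_label]
    non_churners = train[train[target] != churn_label]
    half = len(train) // 2
    churners_up = churners.sample(n=half, replace=True, random_state=seed)
    non_churners_down = non_churners.sample(n=len(train) - half, random_state=seed)
    return pd.concat([churners_up, non_churners_down]).sample(frac=1, random_state=seed)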
After obtaining the above datasets, the next phase was training the two proposed models on the training set and evaluating the results using the test set.
CHAPTER V – ANALYSIS AND RESULTS
SAS Enterprise Miner Workstation 14.2, RStudio version 1.1 and SPSS 25 were the main data mining and statistical software packages used in this paper.
1. Logistic regression
In this paper, in order to be able to relate to the methods used in past literature, a simple logistic regression with main effects was used. Since the use of logistic regression requires accounting for several assumptions, these were examined and are reported below.
1.1. Assumptions of logistic regression
1. Multicollinearity was analysed using the correlation coefficient matrix, in which a correlation score of 0.9 or greater, together with a VIF score of over 10, signals the presence of multicollinearity (Al-Ahmadi, Al-Ahmadi & Al-Amri, 2014). No independent variable recorded a VIF value greater than 4, the majority being in the 1-2 interval; thus, the assumption of no multicollinearity was fulfilled (Appendix 10).
2. Independence of errors was analysed using the Durbin-Watson statistic, which had a value of 2.01. Since this value was very close to 2, the assumption of independence of errors was fulfilled (Field, 2013) (Appendix 10).
3. With regard to the assumption of sufficient cases, the processed dataset used in this paper contained 5814 observations, of which 4890 were non-churners. Since the minority class contained 924 churners and the overall number of variables was 33, the minimum ratio of outcomes to independent variables of 10:1 was well respected (i.e. 330 outcomes was the bare minimum required to consider the experiment).
4. Significant outliers were analysed by inspecting the standardized residuals of the cases. As a rule of thumb, any standardized residual with an absolute value greater than 3 represents a significant outlier. In the dataset, there were 116 standardized residuals greater than 3 in absolute value, and thus 116 significant outliers. However, not all of these outliers represented influential points. The influential points were identified using Cook's distance: 2 observations had a Cook's distance greater than 1. To improve the fit of the model, the significant outliers and the influential points were removed.
5. The assumption of a linear relationship between the logit transformation of the target variable and the continuous independent variables was tested by calculating the log-odds and plotting them against each independent variable in separate scatterplots. Besides visual inspection, the assumption of linearity was also tested using the Box-Tidwell procedure. From the individual plots, it was observed that the continuous independent variables and the log-odds had approximately linear relationships, though violations were present and represent a limitation to the use of logistic regression in this study (Appendix 11).
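The first two diagnostics can be reproduced roughly as follows, assuming a numeric predictor matrix X and binary target y. The statsmodels helpers are used here as stand-ins for the SPSS/SAS output reported above:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.stats.stattools import durbin_watson

X_const = sm.add_constant(X)

# Variance inflation factors; the rule of thumb above treats VIF > 10 as a concern.
vif = {col: variance_inflation_factor(X_const.values, i)
       for i, col in enumerate(X_const.columns) if col != "const"}

# Durbin-Watson on the deviance residuals of the fitted logistic regression;
# values close to 2 suggest independent errors.
fit = sm.GLM(y, X_const, family=sm.families.Binomial()).fit()
dw = durbin_watson(fit.resid_deviance)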
1.2. Logistic regression with main effects results
The logistic regression with main effects was constructed using the training set, a balanced data sample with an overall total of 2907 observations (i.e. approx. 50%-50% churners and non-churners). The performance of the logistic regression, however, was measured on the random test set, which was calibrated to have proportions between the two outcomes similar to the original data set (i.e. approx. 84.11% non-churners and 15.89% churners). For the dummy variables representing either individual or grouped levels of the nominal variables, reference variables were selected in order to interpret their effects. The criterion for the reference variable was that it contained the highest percentage frequency of observations. For example, for the variables dummy_country_1, dummy_country_under_1_5, dummy_country_2_5 and dummy_country_4_5, the reference variable was dummy_country_over_86, because over 86% of all observations had values belonging to this level, which, incidentally, was "DK". The theoretical basis for this choice of reference variable was that it was interesting to see how merchants from niche markets were more likely to churn than the company's main target group.
The selection procedure used in the logistic regression was backward selection. The purpose of using a selection procedure was, firstly, to avoid overfitting the model by including only those variables that proved to be significant. A p-value below 0.05 was selected as the procedure criterion, i.e. all variables with p-values above 0.05 were dropped from the model. Secondly, from a business perspective, a model that returns only the most relevant variables decreases the time needed to understand the model, thus allowing the company to focus more rapidly on the core elements that may signal churn.
The logistic model with main effects (and backward selection) had the following statistics:

Likelihood Ratio Test for the Global Null Hypothesis
  -2 Log Likelihood, intercept only: 4029.957
  -2 Log Likelihood, intercept and covariates: 1894.511
  Likelihood Ratio Chi-Square: 2135.4461
  DF: 20
  Pr > ChiSq: <.0001

Table 6. Logistic regression main effects output. Source: own-making.
The likelihood ratio test compares the statistical significance of two separate models: the "null model", which in this case includes only the intercept, and the model fitted with the covariates. The null hypothesis states that the intercept-only model is better at predicting the dependent variable than the fitted model (Penn State Eberly College of Science, 2018; Tauscher, 2013).
The greater the value of the ratio, the larger the difference between the intercept-only model and the proposed model, and thus the lower the p-value in favour of rejecting the null hypothesis (Penn State, 2018).
With a large difference in the likelihoods (Likelihood Ratio Chi-Square, or LRCS, equal to 2135.4461), an expectedly low p-value was obtained. The p-value, denoted by Pr > ChiSq, was lower than 0.05, which confirmed that the fitted model was statistically significant. Hence, the fitted model with covariates was better at explaining the variance in the log-odds of the dependent variable than a null model would have been. Translated to the object of this paper: the variance in the log-odds of account_status depended on some of the independent variables, which was expected, taking into account some of the plots of the log-odds against the independent variables. In addition, DF denotes the degrees of freedom of the fitted model, corresponding to the final number of independent variables entered into the model.
Based on the analysis of the maximum likelihood estimates, the statistically significant variables in the logistic regression with main effects were the following:
No. | Variable name | Reference variable (if applicable) | Estimated coefficient | Significance (p-value)
1. | dummy_currency_usd | dummy_currency_dkk | 1.9937 | 0.0001
2. | dummy_currency_eur | dummy_currency_dkk | 1.2795 | 0.0001
3. | dummy_gateway_under_2 | dummy_gateway_56 | -1.2477 | 0.0001
4. | dummy_group_4000 | dummy_group_5000 | -0.9228 | 0.0001
5. | dummy_country_under_1_5 | dummy_country_over_86 | 0.9251 | 0.0372
6. | dummy_gateway_22 | dummy_gateway_56 | -0.7275 | 0.0001
7. | dummy_group_8000 | dummy_group_5000 | 0.6528 | 0.0003
8. | dummy_group_7000 | dummy_group_5000 | 0.5451 | 0.0001
9. | dummy_country_4_5 | dummy_country_over_86 | -0.4204 | 0.0295
10. | dummy_gateway_11 | dummy_gateway_56 | -0.3130 | 0.0008
11. | dummy_country_under_1 | dummy_country_over_86 | -0.3624 | 0.0456
12. | log_ref_num | NA (not applicable) | 0.1888 | 0.0003
13. | recency | NA | 0.0107 | 0.0001
14. | cb_num | NA | 0.00996 | 0.0303
15. | pay_held_num | NA | 0.00491 | 0.0001
16. | other_debit_credit_num | NA | -0.0139 | 0.0032
17. | debit_num | NA | 0.000867 | 0.0008
18. | credit_num | NA | 0.00140 | 0.0106
19. | monetary | NA | 0.0006 | 0.0001
20. | cb_val | NA | 0.000171 | 0.0001
* | intercept | NA | -4.9155 | 0.0001

Table 7. Significance of variables for the logistic regression with main effects, ordered by coefficient estimate. Source: own-making.
Logistic regression formally analyses the relationship between the log transformation of the odds of the dependent variable and the predictors. For example, the effect of a one-unit increase in recency, while keeping the other variables fixed, on the log-odds of account_status would be 0.0107. Correspondingly, the effect on the odds of a merchant churning is obtained by exponentiating the coefficient 0.0107. In other words, if a merchant increased the number of days since its last transaction by one day, its odds of churning would increase by approximately 1.1%.
Example.
While odds relate to probabilities in the sense that merchants with greater odds of churning also have greater probabilities (though not to the same extent), probabilities can be calculated directly using Equation 1. Consider, for example, the probability of churning for a merchant that:
1. Incurred a total of 20 chargebacks;
2. Had a total chargeback value of 1000 euros;
3. Experienced 50 transactions made with credit cards;
4. Experienced 60 transactions made with debit cards;
5. Was from Denmark;
6. Had its account in euros;
7. Was a customer of one of the gateways serving less than 2% of the other merchants;
8. Operated in Group 7000, e.g. offering laundry, cleaning and garment services (this is just one possible scenario);
9. Experienced a 2.72-fold increase in the number of refunds performed;
10. Had made transactions worth 10,000 euros;
11. Experienced 20 transactions with cards other than debit or credit;
12. Suffered 10 payments held;
13. Did not have any transactions in 30 days.
With the help of Equation 1:
EXP(-4.9 + 20*(-0.009) + 0.00017*1000 + 0.0014*50 + 0.0008*60 + 1.28 + 0.54 + (-1.25) + 0.18 + 20*(-0.0139) + 0.005*10 + 0.004*10 + 0.01*30) = 0.019
If, however, the merchant above had come from any of the under-1% minority of Clearhaus' countries, keeping the values of the predictors fixed, the probability of churn would have been:
EXP(-4.9 + 20*(-0.009) + 0.00017*1000 + 0.0014*50 + 0.0008*60 + 1.28 + 0.54 + (-1.25) + 0.18 + 20*(-0.0139) + 0.005*10 + 0.004*10 + 0.01*30 + (-0.3624)) = 0.013
Had recency been raised to 100 days, while keeping the values of the other covariates fixed, the churn probabilities would have roughly doubled, to 0.038 for the Danish merchant and 0.027 for the merchant from a minority country. Moreover, returning to the original example, had the account currency been switched from euros to Danish crowns, keeping the other values fixed, the churn probabilities of the two would have decreased almost two-fold, to 0.01 for the Danish merchant and 0.007 for the minority-country merchant.
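The same mechanics can be checked quickly in code. The sketch below uses only a handful of the coefficients from Table 7 for a hypothetical merchant profile (all other covariates are held at zero purely for illustration) and applies the full logistic transform, which for small odds gives almost the same figure as the odds themselves:

import numpy as np

def churn_probability(linear_predictor):
    # Logistic transform: convert log-odds into a probability.
    odds = np.exp(linear_predictor)
    return odds / (1.0 + odds)

# Hypothetical profile: 30 days of inactivity, euro-funded account,
# Group 7000 business model, 10 payments held.
eta = (-4.9155            # intercept
       + 0.0107 * 30      # recency
       + 1.2795           # dummy_currency_eur (vs. DKK reference)
       + 0.5451           # dummy_group_7000 (vs. Group 5000 reference)
       + 0.00491 * 10)    # pay_held_num
print(round(churn_probability(eta), 3))   # roughly 0.06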
From the above table and calculations, it can be seen that the top 6 variables (in order of coefficient size) that carried the most importance in influencing the change in the log-odds of account_status, and consequently the probability of a merchant churning, were:
1. merchant_currency
2. merchant_group
3. merchant_gateway
4. merchant_country
5. log_ref_num
6. recency.
1.2.1. Model performance
Observed \ Predicted | Churner | Non-churner | Sum | Percentage correct (%) | Error (%)
Churner | 408 | 54 | 462 | 88.31 | 11.69
Non-churner | 330 | 2115 | 2445 | 86.50 | 13.50
Sum | 738 | 2169 | 2907 | 86.79 | 13.21

Table 7. Logistic regression Type I and Type II errors, test set. Source: own-making, based on Nie et al. (2011).
1.2.2. Model interpretation
The fact that a high number of days of transactional inactivity (high recency) is a significant predictor of churn has been well documented in the literature. In the current case, however, the "nationality" of the merchant, its product/service offering and the currency in which the merchant account was funded had a greater impact on churn and constitute new insights into churn. These insights also have a basis in reality. In Clearhaus, most merchant accounts are funded in Danish crowns because most of the accounts are owned by Danish merchants. Danish merchants with accounts in euros are very rare and are considered a peculiarity in Clearhaus, a pattern to pay attention to. It can be that the merchant is selling goods or services which are not permitted in Denmark but are permitted in other countries, i.e. the Euro-zone, and the merchant needs its account to be funded in euros. For non-Danish merchants coming from the 20-plus countries in the <1% group, on the other hand, having accounts in euros is the standard. Thus, it was no surprise that Danish merchants with accounts in euros were found to have a higher probability of churning than merchants from the minority countries. On the subject of the goods or services sold, it was expected that merchants from Group 7000 and Group 8000 would be more likely to churn, because many of these merchants have high-risk product/service offerings, such as online gambling, dating services and others, and these business models are very volatile. Exponentiating the coefficients of Group 7000 and Group 8000 yields increases in the odds of churning of 72% and 92%, respectively. Comparing this to recency clearly communicates that the business model of the merchant plays an important role in merchant churn at Clearhaus. High-risk merchants usually churn for one of three reasons: 1. bankruptcy; 2. inability to fulfil Clearhaus' regulations; or 3. dissatisfaction with Clearhaus' regulations. So, these types of merchants churning frequently is a real phenomenon at Clearhaus. Regarding refund numbers, refunds are considered early signs of chargebacks. A refund is basically the scenario in which a customer of a merchant receives his or her money back from the merchant without having to rely on his or her issuing bank. However, not all refunds get resolved; sometimes they become chargebacks. It is a very common pattern that if the number of refunds increases, the number of chargebacks increases as well. Log_ref_num, cb_num and cb_val have positive coefficients, meaning they have a positive impact on churn. However, the refund number is measured as a 2.72-fold increase in the number of refunds, which, depending on the merchant, can range from very few to a lot. So it is not a surprise that having 2.72 times more refunds matters more for churn than just one extra chargeback. This can be a sign that the merchant is not performing well, having many of its goods or services returned. While this poses first and foremost a financial problem for Clearhaus, i.e. servicing unprofitable merchants, it can also become a legal one if merchants tend to convert refunds into chargebacks. Clearhaus is obliged by Visa and the Danish Financial Supervisory Authority to keep merchants' chargeback ratios (by number and value) under 0.05 (Dumitrescu, 2018).
On a final note, taking the log (base 10 or another base) of the variable cb_val could produce more valuable insight, most likely a noticeable increase in the effect the variable has on the odds and probability of merchants churning.
2. Decision trees
In this paper, the C5.0 tree algorithm was used to develop a decision tree. Before describing the process of developing the tree and the results, it is important to note that a decision tree is able to handle missing values on its own by incorporating them into a new category. Hence, the transformation of ref_num and ref_val would not have been necessary if the decision tree had been the only method used, or if another method such as random forest had been used. The decision tree below used the missing values as categories of their own.
Decision tree C5.0
The C5.0 decision tree was created in SAS Enterprise Miner 14.2. The splitting criterion was set to Entropy (information gain ratio). Then, the maximum number of branches was set to 2, due to the target variable being binary. Next, the number of surrogate rules was set to 5. Surrogate rules are used as backup rules when the principal splitting rule uses an input that has one or more missing values (SAS Institute, 2018). If the first surrogate rule also involves an input with a missing value, the second surrogate rule is applied, and so on (SAS Institute, 2018). If missing values do not allow any of the surrogate rules to be applied, the principal rule takes the observation and allocates it to the branch designated to receive missing values (SAS Institute, 2018). The branch set to receive the missing values was the largest branch. It has to be noted that these last two settings were relevant for growing the tree because the "raw", unprocessed dataset was used. The final tweaks before running the model were to specify the Decision assessment measure, in order to select the tree producing the lowest loss based on the profit/loss matrix. The decision tree was constructed using the balanced training set and tested on the random test set.
After the tree was constructed, the output report was analysed in order to determine the nature and performance of the tree, as well as any possible improvements that could be made. It was seen that increasing the Depth (i.e. the maximum number of node generations produced in the decision tree) beyond 5 would lead to increases in gain and lift. As a result, the default Depth of 6 was increased to 10. Also, the tree with the lowest misclassification error rate, average squared error, sum of squared errors and maximum absolute error, and the greatest simplicity, was one with a minimum of 10 observations per terminal node. This step was also beneficial for avoiding overfitting on the training set as much as possible, by removing the very small nodes that may have been specific to the training set. Following these changes, a tree with a total of 22 nodes was created (Appendix 12). The unpruned tree had a total of 30 leaves. Recency represented the root (input) node. 11 of the 22 nodes were leaf/terminal nodes; as a result, 11 rules were generated by the C5.0 decision tree (Appendix 13).
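C5.0 itself is proprietary to SAS/RuleQuest, so a faithful re-implementation is not sketched here; the closest commonly available analogue is a CART-style tree grown with the entropy criterion and the same depth and leaf-size settings, for example as below. X_train and y_train are assumed to be the balanced training sample, already numerically encoded, since scikit-learn does not share C5.0's native handling of categorical inputs and missing values:

from sklearn.tree import DecisionTreeClassifier, export_text

tree = DecisionTreeClassifier(
    criterion="entropy",     # information-based splitting, as above
    max_depth=10,            # depth increased from the default, as above
    min_samples_leaf=10,     # minimum of 10 observations per terminal node
    random_state=42,
)
tree.fit(X_train, y_train)

# Human-readable rules, loosely comparable to the C5.0 rule listing.
print(export_text(tree, feature_names=list(X_train.columns)))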
The variables with the highest importance in classifying merchants into churners/non-churners can be seen in Table 8.

No. | Variable | Importance (test set)
1. | recency | 1.0000
2. | merchant_country | 0.2762
3. | frequency | 0.1148
4. | merchant_mcc | 0.0880
5. | mastercard_num | 0.0788
6. | gateways_name | 0.0638
7. | pay_held_num | 0.0489

Table 8. Variable importance in the C5.0 decision tree. Source: own-making.
The ranking of the variables above is based on the fact that these variables are the best at gathering the largest collections of data records and have the highest power of dividing the data into groups in which one of the outcome classes (churner/non-churner) dominates, i.e. groups with the highest level of information gain, which is a measure of purity.
The purpose of a classification tree is to determine which customers are more likely to be churners than others. To interpret the decision tree, the rules generated by it need to be read. Some of the extracted rules are given below; the entire list of rules can be seen in Appendix 12:
Rule 1.
1. If a merchant did not have any transactions in the last 192 days (recency);
2. And was from either Denmark, the Netherlands, Great Britain, Spain or Cyprus (merchant_country);
3. And had more than 5 transactions in the last 485 days (frequency);
4. And was part of Group 5000, 6000 or 4000 (merchant_mcc);
5. Then the probability of the merchant churning would have been 55%.
Rule 2.
1. If a merchant was from either Denmark, the Netherlands, Great Britain, Spain or Cyprus (merchant_country);
2. And did not have any transactions in the last 485 days (recency);
3. And had more than 12 payments held (pay_held);
4. Then the probability of the merchant churning would have been 71%.
Rule 3.
1. If a merchant was from either Denmark, the Netherlands, Great Britain, Spain or Cyprus (merchant_country);
2. And did not have any transactions in the last 485 days (recency);
3. And had fewer than 12 payments held (pay_held);
4. And used as a gateway Pensopay, QuickPay, Clearsettle or WePay (gateways_name);
5. Then the probability of the merchant churning would have been 50.36%.
Decision tree preliminary evaluation
Observed \ Predicted | Churner | Non-churner | Sum | Percentage correct (%) | Error (%)
Churner | 377 | 85 | 462 | 81.61 | 18.39
Non-churner | 324 | 2121 | 2445 | 86.75 | 13.25
Sum | 701 | 2206 | 2907 | 85.93 | 14.07

Table 9. Type I and Type II errors for C5.0, test set. Source: own making, based on Nie et al. (2011).
2.2. Model interpretation
From a business perspective, the scenario in Rule 1 has both a literature and a real-life foundation. High values of recency have been heavily documented in the literature as a signal of customer churn. Business-wise, it can easily be argued that if a client has not had any transactions in almost six months, it is fairly safe to say that the client has either gone out of business, changed business model, or switched supplier (churned). Also, the fact that Danish, Dutch, British, Spanish or Cypriot merchants were considered more likely to churn was logical because, firstly, Danish merchants represent the bulk of Clearhaus' clients and the country is extremely fond of technology; merchants are therefore very accustomed to the concept of acquiring and it would be very easy for them to switch providers. Second, the Dutch and British markets have the highest level of competition when it comes to payment solutions, including acquiring, and again the most probable explanation for churn may have been the great number of alternatives. It is also important to notice that, naturally, the goods or services that the merchant was offering, e.g. periodicals, electronics or insurance selling (these are just a few examples), i.e. the merchant profile (merchant_mcc), influenced its probability of churning. This variable relates very well to the case of Cypriot merchants. Almost all of Clearhaus' high-risk merchants come from Cyprus. Since they operate volatile business models, they are very prone to bankruptcy or heavy regulation. So it could be deduced that, instead of competition, the reason behind merchants from Cyprus churning may actually have been involuntary churn ("going out of business") or voluntary churn ("cannot do business with you on your terms").
With respect to Rule 2, the most important inference concerns the effect of pay_held. Payments held represent safety measures Clearhaus takes when a merchant experiences many chargebacks or exhibits an unusual transactional pattern. By holding payments, Clearhaus creates a financial back-up for itself and does not fund the merchant until the case of concern is solved. The number of payments held can be used as a measure of the quality of a merchant. However, it is not always the case that the merchant has done something wrong; it may be that the merchant is changing its business model, relocating, or even wins a chargeback. By having payments held, merchants, especially small businesses, can suffer from not having the capital obtained from their transactions at hand, capital which could be used to pay salaries, rent etc. Regardless of the real reason, it is safe to argue that merchants in general shun not receiving their funds on time. Consequently, the greater the number of payments held, the greater the likelihood that the merchant will want to churn.
Finally, on the subject of Rule 3, another variable from the merchant profile category that surfaces as important is gateways_name. This, too, has a basis in reality. Some merchants do not even interact with Clearhaus directly. Gateways represent Clearhaus' main partners and are the most important sources of acquiring merchants. Merchants sign up with the gateway and can then select Clearhaus or other partners of the gateway as their acquirer. Clearhaus often relies on its partner gateways to bring in good-quality merchants, i.e. those that do not suffer many disputed transactions, have solid business models and so on. However, Clearhaus is often faced with poor-quality merchants and is forced to discontinue serving the merchant, and sometimes even the partner. For example, Clearhaus suffered a large number of merchants with high amounts of fraud brought in by Clearsettle. At the end of 2017, Clearhaus discontinued its partnership with Clearsettle and thus lost the portfolio of merchants brought in by the gateway (Dumitrescu, 2018).
3. Model comparison and selection
In order to select the best model for answering the research question, it is necessary to evaluate the logistic regression and the C5.0 decision tree using the selected criteria.
Model | Data set | Accuracy | Misclassification rate | Mean squared error | AUC | Lift | Loss (in euros)
Logistic regression, main effects | Train | 0.877537 | 0.122463 | 0.086275 | 0.95 | 1.890305 |
Logistic regression, main effects | Test | 0.867905 | 0.132095 | 0.106379 | 0.932 | 4.035692 | 26,935.11
C5.0 decision tree | Train | 0.899209 | 0.100791 | - | 0.918 | 1.8526299 |
C5.0 decision tree | Test | 0.863433 | 0.136567 | - | 0.894 | 3.914261 | 31,965.26

Table 10. Model comparison. Source: own-making.
3.1. Accuracy
Accuracy, or the percentage correctly classified (PCC), refers to the number of merchants predicted to be churners that actually were churners, plus the number of predicted non-churners that were in fact non-churners, over the entire sample of merchants. Both the logistic regression and the decision tree had high accuracy scores; however, the logistic regression was more robust and managed to classify merchants more correctly on unseen data. On the training set, the decision tree had a higher level of accuracy than the logistic regression. This may be due to the fact that decision trees have the advantage of not being affected by linearity violations, in contrast to logistic regression. The logistic regression, however, compensated on the test set, albeit by very little, registering a TP rate of 88.31% and a TN rate of 86.5%, while the decision tree had a TP rate of 81.61% and a TN rate of 86.75%.
3.2. Misclassification rate
Misclassification reports the "opposite" of accuracy, that is, it specifies the number of churners incorrectly classified as non-churners (FN) and the number of non-churners incorrectly classified as churners (FP) across the entire sample. Due to its lower accuracy on the training set in comparison to the decision tree, the logistic regression incurred a greater misclassification rate there (12.24%). However, just as expected, the decision tree was surpassed by the logistic regression on the test set, where the decision tree recorded a greater false negative rate (18.39%) than the logistic regression (FN = 11.69%), resulting in a lower misclassification rate for the logistic regression.
3.3. Mean squared error
While this measure was reported only for the logistic regression, it is worth commenting on
in order to assess the quality of the parametric model in other terms. The mean squared error
reports how close the predicted values were to the real observations in the data set.
Although the MSE was slightly larger on the test set, the difference was minimal
(approx. 0.02), which was a sign of a lack of overfitting and of a low level of variance
in the logistic regression model. This means that the logistic regression managed most of
the time to detect which customers were about to churn and which were not with a
high degree of accuracy, even on data whose class distribution did not resemble the one
used in the training set.
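When the predictions are churn probabilities, the mean squared error above corresponds to the Brier score and can be obtained in one line; pred_prob and actual are again assumed placeholder objects.

# Mean squared error (Brier score) of the predicted churn probabilities
mse <- mean((pred_prob - actual)^2)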
3.4. AUC
Both models had an AUC very close to 1, which was expected taking into account that each
model's TP rate was very high and its FP rate fairly low. This was also confirmed graphically
by the ROC curves in Figure 1. It can be seen that the curve of the logistic regression on the
test set was a little closer to touching the sensitivity line, with a value of 0.932 compared to
the decision tree's 0.894. On the training set the scenario was the opposite. All in all, the
logistic regression was the model better able to differentiate between churners and
non-churners so as to find the real churners while keeping the rate of error low, in
comparison to the decision tree.
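The AUC values and ROC curves reported here come from SAS Enterprise Miner; purely as an illustration, the same quantities could be computed in R with the pROC package, reusing the placeholder objects from the sketches above.

library(pROC)

# ROC curve and area under it for the test-set predictions
roc_obj <- roc(response = actual, predictor = pred_prob)
auc(roc_obj)    # area under the ROC curve
plot(roc_obj)   # ROC chart comparable to Figure 1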
3.5. Lift
The baseline model is represented as an imaginary horizontal line starting at the value 1 in
Figure 2. A lift value of 1 means that a model is no better at predicting churn than a
random guess, i.e. no concrete model. The logistic regression scored 4.04 and the decision tree
3.91. Had a random model been used to predict churn among a random 10% of
Clearhaus' customers, it would have identified 10% of the churners, while
the decision tree would have correctly identified about 39.1% of them and the logistic
regression 40.4%. This means the decision tree was 3.91 times better at predicting churn in
a random sample, while the logistic regression was 4.04 times better. However, had the two
developed models been used to predict churn among the top 10% (top decile) of customers at
risk of churning, the logistic regression would have correctly identified 41.3% of the
churners, while the decision tree would have identified 33%, the former thus outperforming
the latter. This can be seen in Figure 3.
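A top-decile lift of this kind can be sketched as follows; the calculation assumes the placeholder objects pred_prob and actual and only illustrates the criterion, it is not a reproduction of the SAS Enterprise Miner output.

# Top-decile lift: churn rate among the 10% of merchants with the highest
# predicted risk, divided by the overall churn rate in the test set
n_top     <- ceiling(0.10 * length(pred_prob))
top_idx   <- order(pred_prob, decreasing = TRUE)[1:n_top]
capture   <- mean(actual[top_idx])
base_rate <- mean(actual)
lift_10   <- capture / base_rate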
3.6. Loss function
The loss function was designed in order to select models using a cost-based method which
complements the other criteria, as well as to overcome the bias that may have come with
accuracy and misclassification. The function was constructed in such a way as to penalise the
model with the greatest amount of Type I error. Since the decision tree had a greater
percentage of this statistic on the test set, using it would have cost Clearhaus around
32,000 euros. The logistic regression, however, was better at reducing the number of
non-churners incorrectly classified as churners. Thus, the logistic regression represented the
most cost-effective model, with a loss of approx. 27,000 euros, a difference of 5,000 euros
compared to the decision tree.
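The exact per-merchant costs behind the loss function are not restated in this section, so the sketch below only illustrates its general cost-based form; cost_fp and cost_fn are hypothetical values chosen for illustration, not the figures used in the thesis.

# Hypothetical cost-based loss in euros, reusing FP and FN from the
# confusion-matrix sketch above (placeholder costs, not the thesis' values)
cost_fp <- 50     # e.g. a retention offer wasted on a loyal merchant
cost_fn <- 200    # e.g. revenue lost by failing to act on a churner

loss_in_euros <- FP * cost_fp + FN * cost_fn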
Figure 1. ROC chart. Source: output in SAS Enterprise Miner.

Figure 2. Lift. Source: output in SAS Enterprise Miner.
Note: the red line represents the logistic regression, the blue line the C5.0 decision tree, and the red dot a random 10% of the merchants in the training set.

Figure 3. Cumulative lift. Source: output in SAS Enterprise Miner.
Note: the red line represents the logistic regression, the blue line the C5.0 decision tree, and the red dot the top 10% of merchants most likely to churn in the test set.
3.7. Model selection and past findings
The aim of this paper was to develop a statistical model in order to answer the research
question, which was interested in what factors can be used to predict churn among
Clearhaus' merchants. Two well-studied models were developed in order to answer it:
1. A parametric model – a logistic regression with main effects;
2. A non-parametric model – a decision tree grown with the C5.0 algorithm.
Firstly, the logistic regression outperformed the decision tree on all five classic criteria
(accuracy, misclassification rate, AUC, lift and top-decile lift), and also represented the
most cost-effective model, producing a loss of 27,000 euros, 5,000 euros smaller than that of
the decision tree, a sum of money which is important for any company.
Secondly, besides the higher scores on all of the criteria, the logistic regression was more
robust than the decision tree, offering consistent predictions on both the training and the test
set, i.e. low variance. Decision trees tend to overfit the training data (Breiman, 2001; Saradhi
& Palshikar, 2011), which was observed in this paper when the decision tree displayed
important differences in the performance scores between the training and test sets.
The performance of the logistic regression came at the cost of an increase in the level of bias
due to the multiple imputations performed on the missing data. However, based on the fact
that the gains from decreased variance outweigh the liabilities of bias, and on the above two
paragraphs, the model selected as the best one for predicting churn among Clearhaus'
merchants was the logistic regression with main effects.
On the subject of variables, the factors that had the highest importance in predicting churn
among Clearhaus' merchants were:
1. Merchant_currency ;
2. Merchant_group ;
3. Merchant_country ;
4. Log_ref_num ;
5. Recency ;
While the less significant ones were:
6. Cb_num
7. Pay_held_num
8. other_debit_credit_num
9. debit_num
10. credit_num
11. monetary
12. cb_val
Before relating the most important variables to the literature, it has to be said that this
paper started with three generic variables for studying churn: recency, frequency and
monetary. Among these three, the only ones found to statistically predict churn, though
with different degrees of importance, were recency and monetary, i.e. a greater number of
days since the last transaction is a positive indication of churn. Similar findings have been
reported in the works of Chen et al. (2015), Gordini & Veglio (2017), Nie et al. (2011),
Tamaddoni Jahromi, Stakhovych & Ewing (2014), Buckinx & Van den Poel (2005) and
Benoit & Van den Poel (2012).
Next, while the location of the customer, its business model, transactional information from
accounts, partnership information and card information have been studied in several
industries across B2B and B2C settings, in the form of company description,
socio-demographics and financial information, by Verbeke et al. (2012), Kumar & Ravi
(2008), Chye & Gerry (2002), Nie et al. (2011) and Larivière & Van den Poel (2005), such
variables related to the merchant acquiring industry have, to the knowledge of this paper,
never been studied before, which represents a significant discovery.
CHAPTER VI. Final overview
This paper had the objective of studying churn in a B2B context using the database of a
merchant acquirer. A merchant acquirer is a financial institution that offers e-commerce
businesses the opportunity to accept online payments. In order to do so, a dataset consisting
of 6458 observations was provided from the company's database, distributed across 25
variables (19 interval, 6 nominal and 2 binary ones), which included the traditional RFM
(recency, frequency, monetary) variables. The methods used for studying churn were a
logistic regression with main effects and a C5.0 decision tree. Before prediction could begin,
the data was cleaned and refined according to data mining principles in order to
accommodate the logistic regression. As a result, outlier removal and variable dropping took
place. The most important aspect of the data mining work was the use of the fully
conditional specification method with predictive mean matching in order to replace missing
data with multiple imputations similar to the values already present in the columns where
data was missing. Using this procedure, several complete samples were created, which
proved their superiority over a sample that would have relied on case deletion. The sample
that best reflected the means of the non-missing values was used. After all these steps, a
dataset consisting of 5814 observations represented the final sample. The sample was
divided into a training and a test set so that the models could be developed and then
evaluated. As an additional evaluation criterion, a customised loss function was created to
compensate for any bias introduced by the accuracy and misclassification measures. The
logistic regression managed to outperform the C5.0 decision tree. The variables that were
found to have the highest importance in predicting churn were: customer account currency,
customer business model, customer gateway, customer country, refund number and recency.
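The imputation step summarised above was carried out in SPSS/SAS. Purely as an illustration of the same idea, a fully conditional specification with predictive mean matching can be run in R with the mice package; merchant_data is a placeholder name for the raw dataset.

library(mice)

# Fully conditional specification with predictive mean matching (PMM):
# five completed datasets, each missing value replaced by an observed donor value
imp <- mice(merchant_data, method = "pmm", m = 5, seed = 2018)

# Extract one completed sample, e.g. the first imputation
completed <- complete(imp, 1)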
1. Managerial implications
The implications of this paper are relevant not only to Clearhaus, but to any B2B company
interested in studying the churn of its customers and the factors that can affect the rate of
churn.
First of all, the findings of this paper reconfirm the importance of being able to detect those
customers that are about to churn. By failing to identify which customers are actually at risk
of churning, companies can incur high costs, both from losing the revenues provided by
those customers and from marketing campaigns spent on customers which were not going to
churn at all. It was seen how a small percentage difference in misclassification can lead to
losses of thousands of euros. As a result, companies should implement a statistical model
capable of predicting churn which is both robust and easy to understand. A model with this
profile is the logistic regression.
By using a logistic regression model, marketing managers can study with a high degree of
precision the profiles of customers that are about to churn and act upon their inferences.
For example, knowing that customers from certain countries are more likely to churn allows
marketing managers to understand their customer base better and to send tailored offers,
such as e-mails with discounts written in the local language of the customer in order to
create more familiarity between the customer and the company.
Just as importantly, knowing that some customers tend to churn because they were brought
to the company via certain partners can help marketing managers update their strategy.
Ending business relationships with partners that bring in low-quality customers saves a
company time and money, which can instead be spent on partnering with organisations
which better reflect and serve its vision.
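In practice, both kinds of insight come from reading the fitted logistic regression's coefficients as odds ratios; the sketch below is hypothetical (placeholder variable and data names) and is not the exact model specification estimated in the thesis.

# Hypothetical example: odds ratios for a few churn drivers
fit <- glm(churn ~ merchant_currency + merchant_country + recency,
           data = merchant_data, family = binomial)

exp(coef(fit))   # values above 1 indicate a higher likelihood of churn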
2. Literature implications
Firstly, but not most importantly, this study contributed to the academic and research world
by analysing churn in a B2B setting and in an industry, merchant acquiring, where such a
task had not been undertaken before, as reflected by the amount of literature available.
Second, and more importantly, this paper used a collection of data which contained several
new variables for studying churn, besides the very popular Recency, Frequency and Monetary
(RFM) variables. This proved to be important taking into account that only Recency was
found to have a high statistical importance in predicting churn.
Thirdly, from a methodological point of view, this piece of academic research confirmed the
power and usefulness of two statistical techniques (one parametric and one non-parametric),
accounted for the models' assumptions, and described the adaptations and optimisations
performed in detail. Within the statistical learning and data mining literature, discussing
model assumptions tends to lack popularity, with researchers preferring to proceed very
rapidly to presenting and discussing the results of their models. Ottenbacher et al. (2004)
found, with a focus on logistic regression, that detailed reporting of assumptions took place
in less than 20% of the articles in the specialty literature.
Fourthly, on the subject of data processing, this paper employed a complex procedure for
imputing missing data which confirmed the superiority of multiple imputation over single
imputation.
Fifthly, this paper reconfirmed the power of a parametric model to produce robust results at
the expense of very little bias.
3. Limitations
This study suffers from a variety of limitations. Firstly, the sample size and the number of
variables used in developing and solving the research question were very limited compared
to other data mining cases, where the number of observations and variables tends to be in
the hundreds of thousands. As a result, the possibility of generalising the results of this paper
is rather low.
Second, the logistic regression used in this paper did not completely fulfil the required
assumption of linearity between the log-odds of the dependent variable and the independent
variables. The Fully Conditional Specification method used for imputing missing values
suffered violations as well, and these need to be addressed accordingly.
Third, some of the variables, for example monetary, could receive further transformations
such as logarithms in order to provide more diversified information about their effects.
4. Future research
Research in the future should first of all focus on managing the limitations mentioned in the
previous section. One clear recommendation is for researchers to experiment with models
which can better accommodate the non-linear relationship between the log-odds of the
dependent variable in this study and the independent variables. Suggestions include
polynomial regressions, splines and support vector machines. On the subject of data,
researchers should also make efforts to experiment with detailed fraud data, not just summary
figures, as well as with bankruptcy data, since bankruptcy is a common cause of churn
among companies.
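As a hedged illustration of the first suggestion, a natural-spline logistic regression relaxes the linearity assumption between a continuous predictor such as recency and the log-odds of churn; variable and dataset names below are placeholders.

library(splines)

# Logistic regression with a natural cubic spline on recency (3 degrees of freedom),
# allowing a non-linear relationship with the log-odds of churn
spline_fit <- glm(churn ~ ns(recency, df = 3) + monetary,
                  data = merchant_data, family = binomial)
summary(spline_fit)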
References
Nie, G, Rowe, W, Zhang, L, Tian & Shi, Y 2011, ‘Credit card churn forecasting by logistic
regression’, Expert Systems with Applications, vol. 38, no. 12, pp. 15273 -15285.
James, G, Witten, D, Hastie, T & Tibshirani, R 2013, An Introduction to Statistical Learning with Applications in R, Springer.
Berry, MJA & Linoff, GS 2004, Data Mining Techniques For Marketing, Sales, and Customer
Relationship Management , second edition, Wiley Publishing, Inc. Indianapolis, Indiana.
Hastie, T, Tibshirani, R & Friedman, J 2009, The Elements of Statistical Learning Data Mining,
Inference, and Prediction, Second Edition, Springer.
Buckinx, W & Van den Poel, D 2005, ’Customer base analysis: partial defection of
behaviourally loyal clients in a non -contractual FMCG retail setting’, European Journal of
Operational Research, vol. 1, no. 1, 1 July 2005, pp. 252 -268.
Lee, TS, Chiu, CC, Chuo, YC & Lu, CJ 2006, ‘Mining the customer credit using classification and regression tree and multivariate adaptive regression spline’, Computational Statistics & Data Analysis, vol. 50, no. 5, 24 February 2006, pp. 1113-1130.
Rutledge, DN & Barros AS 2002, ‘Durbin -Watson statistic as a morphological estimator of
information content’, Analytica Chimica Acta, vol. 454, no. 2, 11 March 2002, pp. 277 -295.
Al-Ahmadi, K, Al-Ahmadi, S & Al-Amri, A 2014, ’Exploring the association between the occurrence of earthquakes and the geologic-tectonic variables in the Red Sea using logistic regression and GIS’, Arabian Journal of Geosciences, vol. 7, no. 9, pp. 3871-3879.
Ottenbacher, KJ, Ottenbacher, HR, Tooth, L & Ostir, GV 2004, ‘A review of two journals found that articles using multivariable logistic regression frequently did not report commonly recommended assumptions’, Journal of Clinical Epidemiology, vol. 57, no. 11, pp. 1147-1152.
Field, AP 2000, Discovering Statistics Using SPSS for Windows: Advanced Techniques for
Beginners, Sage, London.
Sarkar, SK, Habshah & Rana, S 2011, ‘Detecting Outliers and Influential Observations in
Binary Logistic Regression: An Empirical Study’, Journal of Applied Sciences, vol. 11, no. 1,
pp. 26 -35.
Box, GEP & Tidwell, PW 1962, ‘Transformation of the Independent Variables’,
Technometrics, vol. 4, no. 4, pp. 531 -550.
Mitchell, TM 1997, Machine Learning, McGraw -Hill Science/Engineering/Math.
Coussement, K & Van den Poel, D 2008, ’Churn prediction in subscription services: An application of support vector machines while comparing two parameter-selection techniques’, Expert Systems with Applications, vol. 34, no. 1, January 2008, pp. 313-327.
Patil, N, Lathi, R & Chitre, V 2012, ‘Comparison of C5.0 & CART Classification algorithms
using pruning technique’, International Journal of Engineering Research & Technology
(IJERT), vol. 1, no. 4, June 2012, pp. 1 -5.
Tufféry, S 2001, Data Mining and Statistics for Decision Making, John Wiley & Sons, Ltd.
Lemmens, A & Croux, C 2006, ‘Bagging and Boosting Classification Trees to Predict Churn’, Journal of Marketing Research, May 2006, vol. 43, no. 2, pp. 276-286.
Chen, K, Hu, YH & Hsieh 2015, ‘Predicting customer churn from valuable B2B customers in the logistics industry: a case study’, Information Systems and e-Business Management, August 2015, vol. 13, no. 3, pp. 475-494.
Kumar, DA & Ravi V 2008, ‘Predicting credit card customer churn in banks using data
mining’, Int. J. Data Analysis Techniques and Strategies, vol. 1, no. 2008.
Miguéis, VL, Van den Poel, D, Camanho, AS & e Cunha, JF 2012, ‘Modeling partial customer churn: On the value of first product-category purchase sequences’, Expert Systems with Applications, vol. 39, no. 12, 15 September 2012, pp. 11250-11256.
Coussement, K, Benoit, DF & Van den Poel, D 2010, ’Improved marketing decision making in a customer churn prediction context using generalized additive models’, Expert Systems with Applications, vol. 37, no. 3, 15 March 2010, pp. 2132-2143.
Hanley, JA & McNeil, BJ 1982, ‘The meaning and use of the area under a receiver operating characteristic (ROC) curve’, Radiology, vol. 143, no. 1, pp. 29-36.
Hill, S, Provost, F & Volinsky, C 2006, ‘Network -Based Marketing: Identifying Likely Adopters
via Consumer Networks’, Statistical Science, vol. 21, no.2, pp. 256 -276.
Takahashi, K, Takamura, H & Okumura, M 2009, ’Direct estimation of class membership probabilities for multiclass classification using multiple scores’, Knowledge and Information Systems, May 2009, vol. 19, no. 2, pp. 185-210.
Neslin, SA, Gupta, S, Kamakura, W, Lu, J & Mason, CH 2006, ‘Defection Detection: Measuring and Understanding the Predictive Accuracy of Customer Churn Models’, American Marketing Association, vol. xliii, pp. 204-211.
Benoit, DF & Van den Poel, D 2012, ’Improving Customer Retention In Financial Services Using Kinship Network Information’, Expert Systems with Applications, vol. 39, no. 13, pp. 11435-11442.
Tamaddoni Jahromi, A, Stakhovych, S & Ewing, M 2014, ‘Managing B2B customer churn, retention and profitability’, Industrial Marketing Management, vol. 43, no. 7, pp. 1258-1268.
Gupta, S, Lehmann, DR & Stuart, JA 2004, ‘Valuing Customers’, Journal of Marketing Research, vol. 41, no. 1, pp. 7-18.
Drummond, C & Holte, RC 2006, ‘Cost curves: An improved method for visualizing classifier performance’, Machine Learning, vol. 65, no. 1, pp. 95-130.
Feng, DC, Wang, Z, Shi, JF & Dias Pereira, JM 2008, ‘Research on missing values estimation in
data mining’, 2008 7th World Congress on Intelligent Control and Automation, Chongqing,
pp. 2048 -2052.
Rubin, DB 1976, ‘Inference and missing data’, Biometrika, vol. 63, no. 3, pp. 581-592.
Dong, Y & Peng, CYJ 2013, ‘Principled missing data methods for researchers’, SpringerPlus
2013, vol. 2, no.1, pp.1 -17
Little, RJA & Rubin, DB 2002, Statistical Analysis with Missing Data, 2nd Edition, Wiley, New
York.
Allison, PD 2001, Missing Data, SAGE Publications, Inc.
Little, RJA & Schenker, N 1995, Missing Data. In: Arminger, G, Clogg, CC, Sobel, ME,
Handbook for Statistical Modeling for the Social and Behavioral Sciences, pp. 39 -75, Plenum
Press, New York.
Cohen, J & Cohen, P 1983, Applied multiple regression/correlation analysis for behavioural
sciences 2nd edition, Erlbaum, Hillsdale, NJ.
Cohen, J, Cohen, P, West, S & Aiken, L 2003, Applied multiple regression/correlation analysis for the behavioural sciences, 3rd edition, Erlbaum, Mahwah, NJ.
Rässler, S, Rubin, DB & Zell, ER 2013, ‘Imputation’, WIREs Comput Stat, vol. 5, pp. 20-29.
Hwang, H, Jung, T & Suh, E 2004, ‘An LTV model and customer segmentation based on customer value: a case study on the wireless telecommunication industry’, Expert Systems with Applications, vol. 26, no. 2, pp. 181-188.
Acock, AC 2005, ‘Working With Missing Values’, Journal of Marriage and Family, vol. 67, pp.
1012 -1028.
Azur, MJ, Stuart, EA, Frangakis, C & Leaf, P 2011, ‘Multiple Imputation by Chained Equations: What is it and how does it work?’, International Journal of Methods in Psychiatric Research, vol. 20, no. 1, pp. 40-49.
Schafer, JL 1997, Analysis of Incomplete Multivariate Data , Chapman & Hall/CRC, London.
Rubin, DB 1987, Multiple Imputation for Nonresponse in Surveys, John Wiley & Sons, Inc,
New York.
Vink, G, Frank, L, Pannekoek, J & van Buuren, S 2014, ’Predictive mean matching imputation of semicontinuous variables’, Statistica Neerlandica, vol. 68, no. 1, pp. 61-90.
Raghunathan, TE, Lepkowski, J, Van Hoewyk, JH & Solenberger, PW 2001, ‘A Multivariate
Technique for Multiply Imputing Missing Values Using a Sequence of Regression Model’,
Survey Methodology, vol. 27, no. 2, pp. 85 -96.
van Buuren, S, Boshuizen , HC & Knook DL 1999, ‘Multiple imputation of missing blood
pressure covariates in survival analysis’,
van Buuren, S, Brand, JPL, Groothuis-Oudshoorn, CGM & Rubin, DB 2006, ‘Fully conditional specification in multivariate imputation’, Journal of Statistical Computation and Simulation, vol. 76, no. 13, pp. 1049-1064.
Heitjan, DF & Little, RJA 1991, ‘Multiple Imputation for Fatal Accident Reporting System’,
Journal of the Royal Statistical Society. Series C (Applied Statistics), vol. 40, no. 1, pp. 13 -29.
SAS 2018, Predictive Mean Matching Method for Monotone Missing Data . Available from:
https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#s
tatug_mi_sect020.htm [01 August 2018]
Burez, J & Van den Poel, D 2009, ’Handling class imbalance in customer churn prediction’, Expert Systems with Applications, vol. 36, pp. 4626-4636.
Schenker, N & Taylor, JMG 1996, ‘Partially parametric techniques for multiple imputation’,
Computational Statistics & Data Analysis, vol. 22, no. 4, pp. 425 -446.
Di Zio, M & Guarnera, U 2009, ‘Semiparametric predictive mean matching’, AStA Advances in Statistical Analysis, vol. 93, no. 2, pp. 175-186.
Feng, C, Wang, H, Lu, N, Chen, T, He, H, Lu, Y & Tu, XM 2014, ‘Log-transformation and its implications for data analysis’, Shanghai Archives of Psychiatry, vol. 26, no. 2, pp. 105-109.
Japkowicz, N 2000, ‘Learning from Imbalanced Data Sets: A Comparison of Various Strategies’, AAAI Tech Report WS-00-05.
Huang, B, Kechadi, MT & Buckley 2012, ‘Customer churn prediction in telecommunications’, Expert Systems with Applications, vol. 39, no. 1, pp. 1414-1425.
Verbeke, W, Dejaeger, K, Martens, D, Hur, J & Baesens, B 2012, ’New insights into churn prediction in the telecommunication sector: A profit driven data mining approach’, European Journal of Operational Research, vol. 218, no. 1, pp. 211-229.
Gordini, N & Veglio, V 2017, ‘Customer churn prediction and marketing retention strategies. An application of support vector machines based on the AUC parameter-selection technique in B2B e-commerce industry’, Industrial Marketing Management, vol. 62, pp. 100-107.
Weiss, GM & Provost, F 2001, ‘The Effect of Class Distribution on Classifier Learning: An
Empirical Study’, Technical Report ML -TR-44, Department of Computer Science, Rutgers
University August 2, 2001.
Field, A 2009, Discovering Statistics Using SPSS, 3rd edition (and sex and drugs and rock ‘n’ roll), SAGE Publications, Los Angeles.
PennState Eberly College of Science 2018, More on Goodness-of-Fit and Likelihood ratio tests. Available from: https://onlinecourses.science.psu.edu/stat504/node/220/ [01 August 2018]
Tauscher, N 2013, ‘What is the -2LL or the Log-likelihood Ratio’, Certara, 28 October 2013. Available from: https://www.certara.com/2013/10/28/what-is-the-2ll-or-the-log-likelihood-ratio/? [01 August 2018]
Brennan, R, Canning, L & McDowell, R 2014, Business-to-Business Marketing, 3rd edition, Sage Publications Ltd, London.
Glady, N, Baesens, B & Croux, C 2009, ‘Modeling churn using customer lifetime value’, European Journal of Operational Research, vol. 197, no. 1, pp. 402-411.
Rauyruen, P & Miller, KE 2007, ‘Relationship quality as a predictor of B2B customer loyalty’,
Journal of Business Research, vol. 60, no.1, pp.21 -31.
Dumitrescu, AT 2018, The analysis of the chargeback rate in acquiring, Internship report, Aarhus University BSS, Department of Economics and Business Administration.
Athanassopoulos, AD 2000, ‘Customer Satisfaction Cues to Support Market Segmentation and Explain Switching Behavior’, Journal of Business Research, vol. 47, no. 3, pp. 191-207.
Risselada, H, Verhoef, PC & Bijmolt, T 2010, ’Staying Power of Churn Prediction Models’,
Journal of Interactive Marketing, vol. 24, no. 3, pp. 198 -208.
Colgate, MR & Danaher, PJ 2000, ‘Implementing a customer relationship strategy: The asymmetric impact of poor versus excellent execution’, Journal of the Academy of Marketing Sciences, vol. 28, no. 375.
Van den Poel, D & Larivière, B 2004, ‘Customer attrition analysis for financial services using proportional hazard models’, European Journal of Operational Research, vol. 157, no. 1, pp. 196-217.
Eriksson, K & Vaghult, A, ‘Customer retention, purchasing behaviour and relationship substance in professional services’, Industrial Marketing Management, vol. 29, no. 4, pp. 363-372.
Kalwani, MU & Narayandas, N 1995, ‘Long -term manufacturer -supplier relationships: Do
they pay off for supplier firms?’, Journal of Marketing, vol. 59, no. 1, pp.1.
Benoit, DF & Van den Poel, D 2009, ’Benefits of quantile regression for the analysis of customer lifetime value in a contractual setting: An application in financial services’, Expert Systems with Applications, vol. 36, no. 7, pp. 10475-10848.
Ganesh, J, Arnold, MJ & Reynolds, KE 2000, ‘Understanding the customer base of service
providers: An examination of the differences between switchers and stayers’, Journal of
Marketing, vol. 64, no. 3, pp. 65 -87.
Reichheld, FF 1996, ‘Learning from customer defections’, Harvard Business Review, vol. 74,
no. 2, pp. 56 -59.
Reichheld, FF & Sasser, WE 1990, ‘Zero defections – quality comes to services’, Harvard
Business Review , vol. 68, no. 5, pp. 105 -111.
Lam, SY, Shankar, MK & Erramilli, BM 2004, ‘Customer value, satisfaction, loyalty, and
switching costs: An illustration from a business -to-business service context’, Journal of the
Academy of Marketing Science, vol. 32, no. 3, pp. 293 -311.
Clearhaus 2018, About us. Available from: https://www.clearhaus.com/about/ [01 August 2018]
Martínez-López, FJ & Casillas, J 2012, ‘Artificial intelligence-based systems applied in industrial marketing: An historical overview, current and future insights’, Industrial Marketing Management, vol. 42, no. 4, pp. 489-495.
Wiersema, F 2013, ‘The B2B agenda, The current state of B2B marketing and a look ahead’,
Industrial Marketing Management, vol. 42, no. 4, pp. 470 -488.
Yu, X, Guo, S, Guo, J & Huang, X 2011, ‘An extended support vector machine forecasting
framework for customer churn in e -commerce’, Expert Systems with Applications, vol. 38,
no. 3, pp. 1425 -1430.
Blattberg, RC, Kim, BD & Neslin, SA 2008, Database Marketing: Analyzing and Managing
Customers , Springer.
Coussement, K & De Bock, KW 2013, ’Customer churn prediction in the online gambling industry: The beneficial effect of ensemble learning’, Journal of Business Research, vol. 66, no. 9.
Chye, KH & Gerry, CHL 2002, ‘Data Mining and Customer Relationship Marketing in the
Banking Industry’, Singapore Management Review , vol. 24, no.2, pp. 1 -27.
Breiman, L 2001, ‘Random forests’, Machine Learning, vol. 45, no. 1, pp. 5-32.
Saradhi, VV & Palshikar, GK 2011, ‘Employee churn prediction’, Expert Systems with
Applications, vol. 38, no. 3, pp. 1999 -2006.
Keramati, A, Ghaneei, H & Mirmohammadi, SM 2016, ‘Developing a prediction model for customer churn from electronic banking services using data mining’, Financial Innovation, vol. 2, no. 10.
IBM 2018, Method (Multiple Imputation). Available from: https://www.ibm.com/support/knowledgecenter/en/SSLVMB_24.0.0/spss/mva/idh_idd_mi_method.html [10 August 2018]
SAS Institute 2018, SAS Enterprise Miner Workstation 14.2 Reference Help, software program, SAS Institute Inc.
Wei, CP & Chiu, IT 2002, ‘Turning telecommunications call details to churn prediction: a data
mining approach’, Expert Systems with Applications, vol. 23, no. 2, pp. 103 -112.
Kleinke, K 2018, ‘Multiple imputation by predictive mean matching when sample size is small’, Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, vol. 14, no. 1, pp. 3-15.
Lee, K & Carlin, JB 2016, ‘Multiple imputation in the presence of non-normal data’, vol. 36, no. 4.
Larivière, B & Van den Poel, D 2005, ‘Predicting customer retention and profitability by using random forests and regression forests techniques’, Expert Systems with Applications, vol. 29, no. 2, pp. 472-484.
Appendices
Appendix 1. Variable description
Appendix 2. MCC names. https://www.dm.usda.gov/procurement/card/card_x/mcc.pdf
Appendix 3. Tabulated pattern
Appendix 4. Little's test
Appendix 5. Pattern of missing values
Appendix 6. MI data sets
physical_delivery
Data                              Imputation   Category   N      Percent
Original Data                                  0          189    4.1
                                               1          4409   95.9
Imputed Values                    1            0          343    25.7
                                  1            1          991    74.3
                                  2            0          423    31.7
                                  2            1          911    68.3
                                  3            0          353    26.5
                                  3            1          981    73.5
                                  4            0          358    26.8
                                  4            1          976    73.2
                                  5            0          386    28.9
                                  5            1          948    71.1
Complete Data After Imputation    1            0          532    9.0
                                  1            1          5400   91.0
                                  2            0          612    10.3
                                  2            1          5320   89.7
                                  3            0          542    9.1
                                  3            1          5390   90.9
                                  4            0          547    9.2
                                  4            1          5385   90.8
                                  5            0          575    9.7
                                  5            1          5357   90.3
log_ref_val
Data                              Imputation   N      Mean     Std. Deviation   Minimum   Maximum
Original Data                                  3236   9.8967   2.37309          1.7918    16.5302
Imputed Values                    1            2696   8.9837   2.16441          2.0794    15.8192
                                  2            2696   8.9678   2.19873          1.7918    15.6877
                                  3            2696   9.0288   2.13818          1.7918    16.4402
                                  4            2696   9.0206   2.17859          1.7918    15.8918
                                  5            2696   8.9021   2.21779          1.7918    16.5302
Complete Data After Imputation    1            5932   9.4818   2.32530          1.7918    16.5302
                                  2            5932   9.4745   2.34144          1.7918    16.5302
                                  3            5932   9.5022   2.30995          1.7918    16.5302
                                  4            5932   9.4985   2.32780          1.7918    16.5302
                                  5            5932   9.4447   2.35626          1.7918    16.5302
log_ref_num
Data                              Imputation   N      Mean     Std. Deviation   Minimum   Maximum
Original Data                                  3237   1.6921   1.54055          .0000     7.7213
Imputed Values                    1            2695   1.0269   1.12498          .0000     7.7129
                                  2            2695   1.0414   1.14984          .0000     6.4505
                                  3            2695   1.0463   1.12166          .0000     6.2344
                                  4            2695   1.0505   1.14075          .0000     6.6280
                                  5            2695   .9620    1.10862          .0000     6.6280
Complete Data After Imputation    1            5932   1.3899   1.40693          .0000     7.7213
                                  2            5932   1.3965   1.41436          .0000     7.7213
                                  3            5932   1.3987   1.40348          .0000     7.7213
                                  4            5932   1.4006   1.40997          .0000     7.7213
                                  5            5932   1.3604   1.40900          .0000     7.7213
Appendix 7. Groups and pooled data
Appendix 8. Group coefficients
Appendix 9. Variable distributions
2. Frequency (log transformed)
3. Cb_num (initial)
4. Cb_num (log transformed)
5. Monetary (initial)
6. Monetary (log transformed)
7. Cb_val (initial)
8. Cb_val (log transformed)
9. Rec_trans_num (initial)
10. Rec_trans_num (log transformed)
11. D3_trans_num (initial)
12. D3_trans_num (log transformed)
13. Visa_num (initial)
14. Visa_num (log transformed)
15. Mastercard_num (initial)
16. Mastercard_num (log transformed)
17. Credit_num (initial)
18. Credit_num (log transformed)
19. Debit_num (initial)
20. Debit_num (log transformed)
21. Other_debit_credit (initial)
22. Other_debit_credit (log transformed)
23. Recency (initial)
24. Recency (log transformed)
25. Length (initial)
26. Length (log transformed)
27. Total_fraud_cases (initial)
28. Total_fraud_cases (log transformed)
29. Pay_rel_num (initial)
30. Pay_rel_num (log transformed)
31. Pay_held_num (initial)
32. Pay_held_num (log transformed)
33. I_ref_num (initial)
34. I_ref_num (log transformed)
35. I_ref_val (initial)
36. I_ref_val (log transformed)
Appendix 10. VIF & Durbin-Watson

Durbin-Watson test (R code)

library(lmtest)  # provides dwtest()

# Logistic regression with all predictors
model <- glm(account_status_at_end_of_period ~ ., data = mydata2,
             family = binomial)

# Durbin-Watson test on the corresponding linear model
dwtest(lm(account_status_at_end_of_period ~ ., data = mydata2))

# Output:
# data:  lm(account_status_at_end_of_period ~ ., data = mydata2)
# DW = 2.0133, p-value = 0.7603
# alternative hypothesis: true autocorrelation is greater than 0
Appendix 11. Logit vs independent variables.
Appendix 12. Decision Tree C5.0
Appendix 13. Decision tree rules
*–––- ––––––––––––––––– *
Node = 2
*–––––––––––––––––––– *
if recency < 191.5
then
Tree Node Identifier = 2
Number of Observations = 1293
Predicted: account_status=1 = 0.09
Predicted: account_status=0 = 0.91
*–––––––––––––––––––– *
Node = 7
*–––––––––––––––––––– *
if recency >= 191.5 or MISSING
AND merchant_country IS ONE OF: IE or MISSING
then
Tree Node Identifier = 7
Number of Observations = 66
Predicted: account_status=1 = 0.15
Predicted: account_status=0 = 0.85
*–––––––––––––––––––– *
Node = 27
*–––––––––––––––––––– -*
if recency >= 484.5 or MISSING
AND pay_held_num >= 11.5 or MISSING
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
then
Tree Node Identifier = 27
Number of Observations = 471
Predicted: account_status=1 = 0.96
Predicted: account_status=0 = 0.04
*–––––––––––––––––––– *
Node = 36
*–––––––––––––––––––– *
if recency < 251.5 AND recency >= 191.5
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND frequency < 4.5
then
Tree Node Identifier = 36
Number of Observations = 15
Predicted: account_status=1 = 0.00
Predicted: account_status=0 = 1.00
*–––––––––––––––––––– *
Node = 37
*–––––––––––––––– ––––- *
if recency < 484.5 AND recency >= 251.5 or MISSING
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND frequency < 4.5
then
Tree Node Identifier = 37
Number of Observations = 138
Predicted: account_status=1 = 0.69
Predicted: account_status=0 = 0.31
*–––––––––––––––––––– *
Node = 38
*–––––––––––––––––––– *
if recency < 484.5 AND recency >= 191.5
AND merchant_mcc IS ONE OF: GROUP 5000, GROUP 6000, GROUP 4 000 or MISSING
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND frequency >= 4.5 or MISSING
then
Tree Node Identifier = 38
Number of Observations = 521
Predicted: account_status=1 = 0.86
Predicted: account_status=0 = 0.14
*–––––––––––––––––––– *
Node = 41
*–––––––––––––––––––– *
if recency >= 484.5 or MISSING
AND pay_held_num < 11.5
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND gateways_name IS ONE OF: PENSOPAY, QUICKPAY, CLEAR SETTLE, WEPAY or MISSING
then
Tree Node Identifier = 41
Number of Observations = 281
Predicted: account_status=1 = 0.91
Predicted: account_status=0 = 0.09
*––––––––––––––––– –––- *
Node = 52
*–––––––––––––––––––– *
if recency < 484.5 AND recency >= 191.5
AND merchant_mcc IS ONE OF: GROUP 7000, GROUP 8000
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND mastercard_num < 14
AND frequency >= 4.5 or MISSING
then
Tree Node Identifier = 52
Number of Observations = 45
Predicted: account_status=1 = 0.49
Predicted: account_status=0 = 0.51
*–––––––––––––––––––– *
Node = 53
*––––– ––––––––––––––– *
if recency < 484.5 AND recency >= 191.5
AND merchant_mcc IS ONE OF: GROUP 7000, GROUP 8000
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND mastercard_num >= 14 or MISSING
AND frequency >= 4.5 or MISSING
then
Tree Node Identifier = 53
Number of Observations = 50
Predicted: account_status=1 = 0.82
Predicted: account_status=0 = 0.18
*–––––––––––––––––––– *
Node = 54
*–––––––––––––– –––––– *
if recency < 655.5 AND recency >= 484.5
AND pay_held_num < 11.5
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND gateways_name IS ONE OF: EPAY
then
Tree Node Identifier = 54
Number of Observations = 12
Predicted: account_status=1 = 0.25
Predicted: account_status=0 = 0.75
*–––––––––––––––––––– *
Node = 55
*–––––––––––––––––––– *
if recency >= 655.5 or MISSING
AND pay_held_num < 11.5
AND merchant_country IS ONE OF: DK, NL, GB, ES, CY
AND gateways_name IS ONE OF: EPAY
then
Tree Node Identifier = 55
Number of Observations = 15
Predicted: account_status=1 = 0.93
Predicted: account_status=0 = 0.07
Depth of decision tree C5.0
Number of observations in leaves decision tree C5.0