
Churn Prediction in the Telecommunications Sector Using
Bayesian Networks

BRANDUSOIU Ionut, TODEREAN Gavril
Technical University of Cluj-Napoca, Romania
E-Mail: [anonimizat], gavril.[anonimizat]

Abstract – Customer retention represents one of the focal concerns of telecommunications companies. This sector is at the top of the list, with an annual churn rate of approximately 30%. Since the cost associated with subscriber acquisition is higher than the cost of subscriber retention, the need for predictive models that identify subscribers at risk of churning has become an important aspect for modern operators. In this paper, we present an advanced methodology that helps predict subscriber churn in the pre-paid mobile telecommunications industry by applying Bayesian networks to a dataset that contains call detail records. We propose an algorithm for learning the graph structure of a Bayesian network. To evaluate the model, we use the gain measure and the ROC curve.

Keywords: Bayesian networks, call detail records, churn prediction, constraint-based algorithms, directed acyclic graph, Markov property, parameter learning, structure learning, telecommunications.

I. INTRODUCTION

The contracting of new clients and the retention of existing ones represent the main concerns of companies active in the mobile telecommunications industry. Thus, identifying customers who are likely to churn is extremely important. Churn represents the movement from one service provider to another and usually happens because of better rates or services, or because of different benefits that a competitor offers when signing up. One significant way to improve customers' value is to retain them over a long period of time [1].
In the telecommunications industry, the segment with the fastest growth is the mobile market. Globally, over 75% of phone calls can be made using mobile phones and, as in any other competitive market, the competition has shifted from contracting to retention [2]. Losing a client is the main concern of many companies active in industries with low switching costs. Of all the industries that suffer from this issue, the telecommunications industry is at the top of the list, with an annual churn rate of approximately 30% [3].
In conclusion, a mobile telecommunications company must identify the customers who are likely to churn before they actually do so. Therefore, the need to implement a model that predicts customers likely to churn becomes vital. The model has to identify precisely the customers who are going to churn in the near future, so that the company does not contact customers who would have kept using the provided services anyway [4], [5]. In order to identify only the clients who are likely to churn, the predictive model has to present a high level of accuracy. Because pre-paid mobile telecommunications does not involve a contract, the identification of these customers is not an easy and well-defined task. Therefore, the implementation of a predictive model is highly complex.
The approach taken in this paper for predicting churn is based on a data mining process. Data mining is an interdisciplinary field that includes machine learning algorithms, statistics, pattern recognition and visualization techniques to extract knowledge from data [6]. By using data mining techniques, one can implement prediction models to discover trends and future behaviors, allowing companies to take smart decisions based on the knowledge extracted from data.
The machine learning algorithm used in this paper for building the predictive model is the Bayesian network. Bayesian networks appeared at the intersection of artificial intelligence, statistics and probability, and they represent a formalism for the data mining process [7-10]. They are part of the general class of probabilistic graphical models [11], [12], which arose from the combination of graph theory and probability theory. A probabilistic graphical model is defined by a graph in which the nodes represent variables and the edges represent dependences between these variables. The interactions between variables are modeled by probability distributions. A probabilistic graphical model is called a Bayesian network if the graph that connects the variables is a directed acyclic graph (DAG). This DAG represents the conditional independence assumptions used to factorize the joint probability distribution of the variables, thus making the learning process possible. A Bayesian network induced from data can be used to analyze the relationships between variables and to make predictions by calculating the conditional probability distributions of a variable, given the values of other variables.
Due to the development of new algorithms that transmit probabilistic information through the graph, Bayesian networks are positioned at the forefront of artificial intelligence research [13], [14]. The machine learning community realized that the probabilistic nature of Bayesian networks makes it possible to learn from data. The first notable approaches consisted in searching for conditional independence structures in data and translating them into a Bayesian network [7], [14]. Shortly thereafter, the authors of [15] introduced a new Bayesian method to learn from data, a method which was later refined in [16].
This research paper is organized as follows: the following section highlights the theoretical aspects related to Bayesian networks and describes the main existing algorithms for structure and parameter learning. Section 3 presents the Bayesian network architecture used for predicting churn. In Section 4 the predictive model is evaluated on a call detail records dataset [17]. Finally, Section 5 concludes the paper.

II. BAYESIAN NETWORKS

Bayesian networks are a class of graphical models that represent the probabilistic dependences between a given set of random variables X = {X1, X2, ..., Xp} as a DAG, G = (V, E), where each node vi ∈ V corresponds to a random variable Xi.
The correspondence between the graphical separation (⫫G) induced by the absence of a particular edge and probabilistic independence (⫫P) provides a convenient way to represent the dependences between variables. One such correspondence is known as an independency map [7] and is defined as follows.
Definition 1 (Maps). A graph G = (V, E) is an independency map (I-map) of the probabilistic dependence structure P of a set of random variables X if there exists a one-to-one correspondence between the random variables in X and the nodes in V of G, such that for all disjoint subsets A, B, and C:
A ⫫P B | C ⇐ A ⫫G B | C (1)
In a similar way, the graph G is a dependency map (D-map) of the probabilistic dependence structure P of X if:
A ⫫P B | C ⇒ A ⫫G B | C (2)
The graph G is a perfect map of the probabilistic dependence structure P if it is both a D-map and an I-map. In this case, P is said to be faithful to G.
A ⫫P B | C ⇔ A ⫫G B | C (3)
The correspondence between G and the conditional independence relationships is explained by the d-separation (directed separation) criterion [7].
Definition 2 (D-separation). If the subsets A, B, and C of the nodes V of G are disjoint, then C d-separates A from B, denoted A ⫫G B | C, if along every sequence of edges between a node in A and a node in B there exists a node v which satisfies one of the following two conditions:
• Node v has converging edges (Figure 1) and neither v nor its successors are in C.
• Node v is in C and does not have converging edges.
The Markov property of a Bayesian network is directly derivable from the d-separation criterion; thus the joint probability distribution of the random variables in X (the global distribution) can be represented as a product of conditional probability distributions (the local distributions associated with each random variable). This is a direct application of the chain rule [18].
The factorization of the joint probability distribution PX for discrete random variables is given by equation (4), and the factorization of the joint density function fX for continuous random variables is given by equation (5):
P_{\mathbf{X}}(\mathbf{X}) = \prod_{i=1}^{p} P_{X_i}(X_i \mid \pi_{X_i})   (4)

f_{\mathbf{X}}(\mathbf{X}) = \prod_{i=1}^{p} f_{X_i}(X_i \mid \pi_{X_i})   (5)
where πXi is the set of predecessors of Xi.
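To make the factorization in equation (4) concrete, the short sketch below evaluates the joint probability of a complete assignment as a product of local conditional probabilities. The network structure and the conditional probability tables are hypothetical illustrations, not taken from the paper.

```python
# Minimal sketch: evaluating equation (4) for a toy discrete Bayesian network.
# The structure and CPTs below are hypothetical examples.

parents = {"A": [], "B": [], "C": ["A", "B"]}  # pi_A = {}, pi_B = {}, pi_C = {A, B}

# CPTs indexed by (value of node, values of its parents in the order above).
cpt = {
    "A": {(1, ()): 0.3, (0, ()): 0.7},
    "B": {(1, ()): 0.6, (0, ()): 0.4},
    "C": {(1, (1, 1)): 0.9, (0, (1, 1)): 0.1,
          (1, (1, 0)): 0.5, (0, (1, 0)): 0.5,
          (1, (0, 1)): 0.4, (0, (0, 1)): 0.6,
          (1, (0, 0)): 0.1, (0, (0, 0)): 0.9},
}

def joint_probability(assignment):
    """P(X) = prod_i P(X_i | pi_{X_i}), equation (4)."""
    p = 1.0
    for node, pa in parents.items():
        pa_values = tuple(assignment[q] for q in pa)
        p *= cpt[node][(assignment[node], pa_values)]
    return p

print(joint_probability({"A": 1, "B": 0, "C": 1}))  # 0.3 * 0.4 * 0.5 = 0.06
```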
The three fundamental connections, each formed by three nodes and two edges, are represented in Figure 1 [19].

Figure 1. Fundamental connections (top to bottom): convergent, serial and divergent connection [19]

In the case of a convergent connection (v-structure), the edges from nodes A and B are oriented towards node C, thus violating both conditions in Definition 2. Consequently, node C does not d-separate nodes A and B, which implies that nodes A and B are not independent given C. Since πA = ∅, πB = ∅, and πC = {A, B}, from the Markov property (4) we have:
P(A, B, C) = P(C | A, B) P(A) P(B) (6)
It can be observed that C depends on the joint probability distribution of A and B; therefore, A and B are not conditionally independent given C.
In the case of a serial or divergent connection, A and B are conditionally independent given C because the conditions in Definition 2 are satisfied. For the serial connection πA = ∅, πB = {C}, and πC = {A}, and therefore:
P(A, B, C) = P(B | C) P(C | A) P(A) (7)
For the divergent connection πA = {C}, πB = {C}, and πC = ∅, and therefore:
P(A, B, C) = P(A | C) P(B | C) P(C) (8)
The Markov blanket is another fundamental concept related to Definition 1 and Definition 2; it represents the set of nodes that completely d-separates a node from the rest of the graph.
Definition 3 (Markov blanket). The Markov blanket of a node A ∈ V is the minimal subset S of V such that:
A ⫫P V \ (S ∪ {A}) | S (9)

In any Bayesian network, the Markov blanket of a node A is the minimal set consisting of A's predecessors, successors and its successors' predecessors (Figure 2).

Figure 2. The Markov blanket of node D is {A, B, C, F, G}
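The following sketch computes the Markov blanket of a node directly from a DAG edge list, following the characterization above (predecessors, successors, and the successors' other predecessors). The edge list is a hypothetical DAG chosen only so that the result matches the blanket stated in the caption of Figure 2; it is not reconstructed from the original figure.

```python
# Minimal sketch: Markov blanket of a node = its parents, its children,
# and the other parents of its children. The edge list below is a
# hypothetical DAG chosen so that MB(D) = {A, B, C, F, G}.

edges = [("A", "D"), ("B", "D"),      # A, B are predecessors (parents) of D
         ("D", "F"), ("D", "G"),      # F, G are successors (children) of D
         ("C", "F"),                  # C is another parent of child F
         ("E", "C"), ("G", "H"), ("H", "I")]  # remaining nodes

def markov_blanket(node, edges):
    parents = {u for u, v in edges if v == node}
    children = {v for u, v in edges if u == node}
    spouses = {u for u, v in edges if v in children and u != node}
    return parents | children | spouses

print(sorted(markov_blanket("D", edges)))  # ['A', 'B', 'C', 'F', 'G']
```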

The task of fitting a Bayesian network comes from expert systems theory and is known as learning. Learning a Bayesian network consists of two steps: structure learning and parameter learning.
The structure learning of a Bayesian network consists in identifying the graph structure. Ideally, the graph structure should be the minimal I-map of the dependence structure, or should at least approximate the probability distribution in the probability space. Different algorithms have been proposed to learn the structure of a Bayesian network, and they can be grouped into three main categories: constraint-based, score-based, and hybrid algorithms. In this paper, the focus is on the constraint-based algorithms, which are presented below.

A. Structure Learning

The constraint-based algorithms are based on Pearl's work concerning maps and their application to causal graphical models. His proposed algorithm, called Inductive Causation (IC) [20], can learn the structure of a Bayesian network by using conditional independence tests.
Two widely used conditional independence tests are mutual information [21] and Pearson's Chi-square for contingency tables [22]. Mutual information is a distance measure, proportional to the log-likelihood ratio test G2. Other possible options are Fisher's exact test and the shrinkage estimator for the mutual information defined in [23] and studied in [24].
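As an illustration of such tests, the sketch below runs a (conditional) Pearson Chi-square test of independence on discrete data by stratifying on the conditioning variable; the data and significance threshold are hypothetical, and real implementations pool the statistic and degrees of freedom across strata more carefully.

```python
# Minimal sketch of a conditional Chi-square independence test X _||_ Y | Z
# for discrete data, assuming a pandas DataFrame with columns "X", "Y", "Z".
import numpy as np
import pandas as pd
from scipy.stats import chi2, chi2_contingency

def conditional_chi2_test(df, x, y, z=None):
    """Return the p-value of the test X _||_ Y (| Z). Small p-values reject independence."""
    if z is None:
        stat, p, dof, _ = chi2_contingency(pd.crosstab(df[x], df[y]))
        return p
    stat_total, dof_total = 0.0, 0
    for _, stratum in df.groupby(z):            # one contingency table per value of Z
        table = pd.crosstab(stratum[x], stratum[y])
        if table.shape[0] < 2 or table.shape[1] < 2:
            continue                            # degenerate stratum carries no information
        stat, _, dof, _ = chi2_contingency(table)
        stat_total += stat
        dof_total += dof
    return chi2.sf(stat_total, dof_total) if dof_total > 0 else 1.0

# Hypothetical usage with synthetic data: X and Y both depend on Z,
# so they are dependent marginally but roughly independent given Z.
rng = np.random.default_rng(0)
z = rng.integers(0, 2, 1000)
df = pd.DataFrame({"Z": z,
                   "X": z ^ rng.binomial(1, 0.2, 1000),
                   "Y": z ^ rng.binomial(1, 0.2, 1000)})
print(conditional_chi2_test(df, "X", "Y"))        # small p-value: dependent
print(conditional_chi2_test(df, "X", "Y", "Z"))   # larger p-value: independent given Z
```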
The first step of the IC algorithm identifies the pairs of variables that are connected by an edge, regardless of direction, and which cannot be made conditionally independent by any subset of variables because they are not d-separable. The second step identifies the v-structures among all the pairs of nonadjacent nodes A and B that have a common neighbor C. The last two steps of this algorithm identify the constrained edges and recursively direct them to obtain a complete partially directed acyclic graph.
A major problem with the IC algorithm is the fact that the first two steps cannot be applied to any real-world problem, due to the exponential number of possible conditional independence relationships. This has led to the development of new algorithms that can solve this issue. These new algorithms first learn the Markov blanket of each node in the network. This preliminary step simplifies the identification of each node's neighbors because the search is limited to the Markov blanket of that node. Therefore, the number of conditional independence tests performed by the algorithm is significantly reduced, as well as the overall computational complexity.
Koller and Sahami proposed KS, the first algorithm based on the Markov blanket concept [25], which accepts as parameters the number of retained variables and the maximum number of variables allowed for conditioning. Even though both of these settings are useful to reduce the search space, KS is a heuristic and approximate algorithm and does not guarantee a correct result.
The Grow-Shrink (GS) algorithm [26] was proposed to induce the structure of a Bayesian network through the discovery of local neighbors, i.e. the Markov blanket of each node. GS contains two independent components: Grow-Shrink Markov Blanket (GSMB) and Grow-Shrink Bayesian Network (GSBN). The GSMB component is responsible for inducing the Markov blanket of a variable and the GSBN component for inducing the entire Bayesian network using the information provided by GSMB. GS resolves the two main shortcomings of previous research: exponential execution time and proneness to errors in the independence tests. Moreover, GS is the first algorithm that explicitly uses the Markov blanket to reduce the number of unnecessary computations. It is an efficient algorithm, but in order to reduce the number of conditional tests, the parameters need to be manually defined and therefore it cannot guarantee a correct result.
Tsamardinos et al. proposed a series of algorithms to induce the Markov blanket of a variable without having to learn the entire Bayesian network. The first algorithm proposed by Tsamardinos et al. is Incremental Association Markov Blanket (IAMB), and it consists of two steps similar to GS: a forward selection followed by a backward selection of the Markov blanket. IAMB is based on an independence test, which is considered valid if its value is smaller than or equal to a threshold and invalid otherwise. The independence test must be efficient in order to ensure that the number of candidate variables obtained after the first step is as small as possible. During the second step, all the features that do not belong to the Markov blanket are removed.
Tsamardinos et al. recognized and demonstrated that, in order to obtain a stable decision, a large dataset is needed. Although IAMB is theoretically sound, it is suitable only when the available dataset is sufficiently large to employ conditional independence tests over the entire Markov blanket of a variable. In order to avoid this shortcoming, Tsamardinos et al. proposed other variants of this algorithm. One of these variants is the Interleaved Incremental Association Markov Blanket and PC (Inter-IAMBnPC) [27], which uses two methods to reduce the dataset dimension: the first method interleaves the growing phase and the pruning phase to limit the Markov blanket during the entire execution of the algorithm, and the second replaces the shrinking phase of IAMB with the PC algorithm [28]. The PC algorithm is the first practical application of the IC algorithm [29] and learns the Bayesian network by determining the directed edges between nodes. Two other variants of the IAMB algorithm are Inter-IAMB [27] and IAMBnPC [30], which are similar to Inter-IAMBnPC, except that Inter-IAMB uses only the first method and IAMBnPC only the second, as their names suggest.
In 2005, Yaramakala and Margaritis proposed the Fast-IAMB algorithm [31], which, similarly to GS and IAMB, has a growing and a shrinking phase. During the growing phase, based on the conditional independence tests, Fast-IAMB sorts all the candidate attributes for admission to the Markov blanket of the target variable. Since each sorting step involves the calculation of a statistical test between the target variable and each candidate attribute, this can be computationally intensive. The key step of the Fast-IAMB algorithm is the reduction of these tests by introducing one or more attributes at a time, after each resorting of the remaining attributes. These attributes are introduced based on the significance of the conditional independence test, without resorting after each modification of the Markov blanket as in the case of IAMB. Although the authors have demonstrated that this algorithm is much faster than IAMB and Inter-IAMB [31], there is still a trace of uncertainty, because if more false positives are introduced in the growing phase, then more statistical steps are expected in the shrinking phase.
None of these algorithms is efficient from the data point of view [32-34], which reduces their practical value, especially when the cost of data collection is high. Therefore, other algorithms have been proposed that reduce the critical requirements regarding the training data and perform a local learning of the Markov blanket.
The first such algorithm is Max-Min Markov Blanket (MMPC/MB). This algorithm is composed of two steps: it first identifies the predecessors and successors of the target variable by calling the Max-Min Parents and Children (MMPC) subroutine and then identifies the successors' other predecessors of this variable by calling the Max-Min Markov Blanket (MMMB) subroutine [32]. To decide whether a statistical test is valid or not, MMMB uses the same criterion as IAMB and Fast-IAMB. MMMB is data efficient because the number of samples needed to identify the Markov blanket depends on the DAG's topology and not on the size of the Markov blanket. In [33], Pena et al. showed that MMPC does not guarantee a correct result, and even if MMPC produces a correct result, MMMB is not always correct. In [35], Tsamardinos et al. identified the imperfection in MMPC and proposed a corrected version, CMMPC. Still, as illustrated in [33], even if MMPC produces correct results, MMMB does not guarantee a correct result.
In order to induce the Markov blanket more data efficiently, the authors of the IAMB algorithm proposed a new algorithm called HITON-PC/MB. This algorithm is similar to MMPC/MB because it first identifies the predecessors and successors of the target variable by calling the HITON-PC subroutine and then identifies the rest of the successors' predecessors of this variable by calling the HITON-MB subroutine [32]. Aliferis et al. proved that HITON-PC and HITON-MB produce correct results. Despite this, in [36] the author indicates that these two steps do not always produce the correct outcome. The experiments in [34] show that HITON-PC/MB is scalable to problems with thousands of variables. Even if this algorithm does not always produce a correct result, it is still recognized as a remarkable contribution to learning the Markov blanket efficiently without having to learn the entire Bayesian network.
Pena et al., the first to highlight the insufficiencies of the MMPC/MB and HITON-PC/MB algorithms, proposed a similar but sound algorithm called Parents and Children based Markov Blanket (PCMB) [33]. In the same way as MMPC/MB and HITON-PC/MB, PCMB induces the Markov blanket by recognizing the directed edges, i.e. the predecessors and successors of the variable of interest. To decide whether a statistical test is acceptable or not, the PCMB algorithm uses the same criterion as the IAMB, MMMB and HITON-MB algorithms. This algorithm is data efficient because the number of samples needed to identify the Markov blanket depends on the DAG's topology and not on the size of the Markov blanket. In [33], it is shown that PCMB is scalable to datasets with thousands of variables.
In [36], a new algorithm was proposed to learn the Markov blanket, called Iterative Parent-Child based learning of Markov Blanket (IPC-MB). This algorithm minimizes the number of conditional independence tests performed during the search, providing better efficiency than all the previously described algorithms [36]. The IPC-MB algorithm induces the Markov blanket by iteratively recognizing the predecessors and successors and the rest of the successors' predecessors. It differs from the other algorithms through the way it searches the local neighbors of a variable [36]. Similarly to MMPC/MB, HITON-PC/MB, and PCMB, it consists of two phases: the first phase recognizes the predecessors and successors and filters out the false positives, while the second phase induces the rest of the successors' predecessors from the set obtained during the first phase. The author demonstrated that this algorithm is theoretically sound [36].

B. Parameter Learning

After learning the structure of the Bayesian network, the task of estimating and updating the parameters of the global distribution can be simplified by applying the Markov property. This can be solved efficiently by estimating the parameters of the local distributions given the structure obtained in the previous step.
The local distributions have a reduced number of variables and their dimensions do not scale with the total number of variables. This attenuates the curse of dimensionality [37], because each local distribution has a smaller number of parameters to estimate and these estimates are more accurate due to the better ratio between the size of the parameter space and the sample size. Two main approaches are widely used to estimate these parameters: Maximum Likelihood Estimation (MLE) and Bayesian estimation [38].

III. MODELING THE BAYESIAN NETWORKS

In this paper, Bayesian networks are used for classification, i.e. to predict the state of a two-class variable based on a set of independent variables. To avoid defining a full probabilistic model for the continuous data, the continuous independent variables are discretized into 5 equal-size intervals.
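A minimal sketch of this discretization step is shown below; it assumes equal-width binning (one plausible reading of "equal-size intervals") over a pandas DataFrame whose column names are hypothetical placeholders.

```python
# Minimal sketch: discretize continuous predictors into 5 equal-width bins.
# Column names are hypothetical placeholders for the principal components.
import pandas as pd

def discretize(df, columns, bins=5):
    out = df.copy()
    for col in columns:
        # pd.cut splits the observed range of the column into `bins` equal-width intervals
        out[col] = pd.cut(df[col], bins=bins, labels=range(bins))
    return out

# Example usage with synthetic data:
df = pd.DataFrame({"PC1": [0.1, 0.4, 2.3, 5.0, 3.3], "PC2": [1.0, -2.0, 0.5, 0.7, 4.2]})
print(discretize(df, ["PC1", "PC2"]))
```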
Regarding the prediction of a two-class variable, Pearl and Koller state in [7] and [25] that only the Markov blanket is needed for classification, meaning that the partial Bayesian network centered on the dependent variable and its corresponding Markov blanket is sufficient. The partial Bayesian network, called the Markov blanket classifier (MBC) or Bayesian network classifier (BNC), is a simple classification method that classifies a case d_j = (x_1^j, x_2^j, ..., x_n^j) by determining the probabilities of it belonging to a class Yi.
The probabilities are calculated using formula (10):

P(Y_i \mid X_1 = x_1^j, X_2 = x_2^j, \ldots, X_n = x_n^j) = \frac{P(Y_i)\, P(X_1 = x_1^j, X_2 = x_2^j, \ldots, X_n = x_n^j \mid Y_i)}{P(X_1 = x_1^j, X_2 = x_2^j, \ldots, X_n = x_n^j)} = \frac{P(Y_i) \prod_{k=1}^{n} P(X_k = x_k^j \mid \pi_k, Y_i)}{P(X_1 = x_1^j, \ldots, X_n = x_n^j)}   (10)

where πk is the set of predecessors of Xk apart from Y, and can be an empty set. P(Xk | πk, Y) is the conditional probability table associated with each node Xk. If there are n independent variables, then the probability is proportional to:

P(Y_i) \prod_{k=1}^{n} P(X_k = x_k^j \mid \pi_k, Y_i)   (11)
The independence test verifies whether two variables are conditionally independent given a set of variables. In this architecture, we select Pearson's Chi-square test of independence because it produced better predictive performance than the log-likelihood ratio.
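To illustrate formula (11), the sketch below scores a case with a simplified form of the Markov blanket classifier in which, for illustration only, each predictor is assumed to have no predecessors other than Y (π_k = ∅); the variable names and probability tables are hypothetical.

```python
# Minimal sketch of formula (11) with hypothetical CPTs and pi_k = {} for all k,
# i.e. each predictor conditioned only on the class Y (a naive Bayes special case).
prior = {"Yes": 0.5, "No": 0.5}                       # P(Y_i)
cpt = {                                               # P(X_k = x | Y_i)
    "IntlPlan": {"Yes": {1: 0.30, 0: 0.70}, "No": {1: 0.08, 0: 0.92}},
    "VMailPlan": {"Yes": {1: 0.15, 0: 0.85}, "No": {1: 0.30, 0: 0.70}},
}

def score(case):
    """Return class scores proportional to P(Y_i) * prod_k P(X_k = x_k | Y_i)."""
    scores = {}
    for y, p_y in prior.items():
        p = p_y
        for var, value in case.items():
            p *= cpt[var][y][value]
        scores[y] = p
    total = sum(scores.values())
    return {y: p / total for y, p in scores.items()}  # normalize to posterior probabilities

print(score({"IntlPlan": 1, "VMailPlan": 0}))  # higher probability for class "Yes" here
```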

A. Structure Learning

Learning a Bayesian network is known to be an NP-complete problem, and its complexity grows exponentially with the number of independent variables and their states [39]. Thus, in order to avoid this problem, the partial Bayesian network needs to be induced directly, without learning the entire Bayesian network.
In this paper, for the local learning of the Markov blanket, we use the IPC-MB algorithm [36]. In [36], the author presents the shortcomings of the other constraint-based algorithms described in the previous section and the advantages of using the IPC-MB algorithm.
IPC-MB first searches for the direct neighbors of the dependent variable Y based on independence tests. These variables are the predecessors and successors of Y, denoted PC(Y). For each X ∈ PC(Y), IPC-MB searches for the predecessors and successors of X, i.e. PC(X), and each Z ∈ PC(X) which is not independent of Y is added to the Markov blanket MBY.
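A heavily simplified sketch of this local search is given below; it approximates PC(·) with marginal and single-variable conditional Chi-square tests and omits the iterative filtering that makes the actual IPC-MB algorithm sound, so it should be read only as an outline of the two search steps. The `ci_test` function stands in for any conditional independence test returning a p-value (for example the conditional Chi-square test sketched earlier), and the threshold alpha is a hypothetical choice.

```python
# Simplified outline of the local Markov blanket search used by IPC-MB:
# (1) find candidate neighbors PC(Y); (2) add spouses Z in PC(X) not independent of Y.
# `ci_test(df, a, b, cond)` is assumed to return a p-value; ALPHA is a hypothetical threshold.

ALPHA = 0.05

def approx_pc(df, target, variables, ci_test):
    """Crude stand-in for PC(target): keep variables that remain dependent on target
    both marginally and given each single other variable."""
    pc = []
    for x in variables:
        if x == target:
            continue
        others = [v for v in variables if v not in (x, target)]
        dependent = ci_test(df, x, target, None) < ALPHA and all(
            ci_test(df, x, target, z) < ALPHA for z in others)
        if dependent:
            pc.append(x)
    return pc

def approx_markov_blanket(df, target, variables, ci_test):
    pc_y = approx_pc(df, target, variables, ci_test)
    mb = set(pc_y)
    for x in pc_y:                               # look for spouses through each neighbor X
        for z in approx_pc(df, x, variables, ci_test):
            if z != target and ci_test(df, z, target, x) < ALPHA:
                mb.add(z)                        # Z depends on Y given X: add to the blanket
    return mb
```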
After the local learning algorithm has determined the Markov blanket, the PC algorithm [29] is applied to finish learning the network structure [36]. The PC algorithm learns the Bayesian network structure by identifying the conditional independence relationships among variables. Using statistical tests, PC finds the conditional independence relationships between the nodes and uses them as constraints to build the Bayesian network structure.
It starts from a complete undirected graph G and, for each pair of variables Xi, Xj ∈ X in G, it calculates Pearson's Chi-square test of independence I(Xi, Xj); if I(Xi, Xj) is greater than a threshold α, it removes the edge between Xi and Xj. For each remaining edge Xi − Xj, it performs an exhaustive search over the set of variables adjacent to Xi in G, ignoring Xj, to find the smallest set S of conditioning variables such that I(Xi, Xj | S) > α. If such a set S exists, the edge Xi − Xj is removed. After the structure has been derived, the following rules are applied to direct the edges in G:
1. For all cases of the form Xi − Xj − Xk where Xi and Xk are nonadjacent, if Xj is not an element of the subset of X \ {Xi, Xk} with respect to which Xi and Xk are conditionally independent, the edges are directed as Xi → Xj ← Xk;
2. For all cases of the form Xi → Xj − Xk, the edge is directed as Xj → Xk;
3. For all remaining cases of the form Xi − Xj, the edge is directed as Xi → Xj;
4. For all cases of the form Xi − Xj − Xk, Xi → Xl, Xk → Xl, and Xj − Xl, the edge is directed as Xj → Xl;
where Xi − Xj represents an undirected edge between adjacent nodes Xi and Xj, and Xi → Xj represents a directed edge from node Xi to node Xj (Xi is the predecessor of node Xj, and Xj is the successor of node Xi).
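The sketch below outlines the first two of these steps (edge removal with conditioning sets of size zero and one, followed by v-structure orientation under rule 1); it is a simplified illustration, not the full PC algorithm, and again assumes a hypothetical `ci_test` function returning p-values and a hypothetical threshold α.

```python
# Simplified PC-style sketch: build a skeleton with CI tests (conditioning sets of
# size 0 and 1 only), record separating sets, then orient v-structures (rule 1).
from itertools import combinations

ALPHA = 0.05

def pc_sketch(df, variables, ci_test):
    adjacent = {v: set(variables) - {v} for v in variables}
    sepset = {}
    # Step 1: remove edges whose endpoints are (conditionally) independent.
    for xi, xj in combinations(variables, 2):
        if ci_test(df, xi, xj, None) > ALPHA:
            adjacent[xi].discard(xj); adjacent[xj].discard(xi)
            sepset[(xi, xj)] = set()
            continue
        for z in adjacent[xi] - {xj}:
            if ci_test(df, xi, xj, z) > ALPHA:
                adjacent[xi].discard(xj); adjacent[xj].discard(xi)
                sepset[(xi, xj)] = {z}
                break
    # Step 2: orient v-structures Xi -> Xj <- Xk when Xj is not in sepset(Xi, Xk).
    directed = set()
    for xi, xk in combinations(variables, 2):
        if xk in adjacent[xi]:
            continue  # only nonadjacent pairs can form a v-structure
        for xj in adjacent[xi] & adjacent[xk]:
            if xj not in sepset.get((xi, xk), sepset.get((xk, xi), set())):
                directed.add((xi, xj)); directed.add((xk, xj))
    return adjacent, directed
```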

B. Parameter Learning

As mentioned in the previous section, two algorithms for parameter learning are described in the literature: MLE and Bayesian estimation [38].
In this paper, the parameters are estimated using the Bayesian estimation algorithm. The MLE algorithm presents a few disadvantages compared to the Bayesian estimation algorithm [38]. For example, in the case of a sparse dataset with many zeros or with a large number of possible cases, which is frequent in large datasets, positive cases can exist even if the dataset does not contain any. In these situations, the MLE algorithm would return a null probability (impossible), whereas with Bayesian estimation, if the prior hypothesis is modeled so that positive cases could exist, the resulting probability is greater than zero [38].
For the Bayesian estimation algorithm we use the Dirichlet prior distribution, because the integral can be explicitly calculated to obtain a posterior density which is also a Dirichlet distribution. Thus, we assume that a Dirichlet prior distribution is specified for each set θijk (1 ≤ i ≤ n, 1 ≤ j ≤ qi, 1 ≤ k ≤ ri) [40].
After observing the dataset D, the posterior Dirichlet distributions are obtained with the following set of parameters:

\hat{\theta}_{ijk} = \frac{N_{ijk} + N_{ijk}^{0}}{N_{ij} + N_{ij}^{0}}   (12)

where N_{ijk}^{0} are the corresponding Dirichlet distribution parameters such that N_{ij}^{0} = \sum_{k} N_{ijk}^{0}. To overcome the problems caused by zero or very small cell counts, the network parameters can be estimated as posterior parameters θijk (1 ≤ i ≤ n, 1 ≤ j ≤ qi, 1 ≤ k ≤ ri) using the minimally informative Dirichlet prior N_{ijk}^{0} = 2 / (r_i q_i).
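A minimal sketch of this posterior update is given below; the counts are hypothetical, and the prior counts follow the minimally informative choice N0_ijk = 2 / (r_i * q_i) stated above.

```python
# Minimal sketch of equation (12): posterior (Bayesian) estimate of a CPT entry
# theta_ijk = P(X_i = k | parents in configuration j), with Dirichlet prior counts.
import numpy as np

def bayesian_cpt(counts, r_i, q_i):
    """counts: array of shape (q_i, r_i) with N_ijk; returns posterior theta_ijk."""
    n0_ijk = 2.0 / (r_i * q_i)                 # minimally informative prior count per cell
    n0_ij = n0_ijk * r_i                       # prior count per parent configuration
    n_ij = counts.sum(axis=1, keepdims=True)   # N_ij = sum_k N_ijk
    return (counts + n0_ijk) / (n_ij + n0_ij)  # equation (12)

# Hypothetical counts: 2 parent configurations (q_i = 2), 2 states (r_i = 2),
# including a zero cell that MLE would turn into a zero probability.
counts = np.array([[30, 0],
                   [12, 18]])
print(bayesian_cpt(counts, r_i=2, q_i=2))      # the zero cell becomes a small positive value
```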

IV. MODEL EVALUATION

In this section we apply the Bayesian network algorithm described in the previous section to a dataset that contains the call detail records of 3333 clients (one row per client) [17]. This dataset is described in greater detail in [41], where the authors applied the PCA algorithm to reduce the dimensionality of the dataset and remove the collinearity between the independent variables. The dataset used in this paper is formed from the 11 principal components obtained in [41], the flag variables International Plan and Voice Mail Plan, and the dependent variable Churn.
To empirically estimate the generalization error of the Bayesian network algorithm, this dataset is randomly partitioned into a training dataset (80%) and a testing dataset (20%). In the training dataset, the dependent variable Churn has an unbalanced distribution over its two classes: 388 (14%) records for class Yes and 2279 (86%) records for class No. For optimal training, the distribution of the dependent variable is balanced by oversampling [42]. This method randomly clones the records corresponding to class Yes until their number is approximately equal to the number of records for class No. At the end of this cloning process, the new training dataset has 2294 (50%) records for class Yes and 2279 (50%) records for class No, a total of 4573 records. In the testing dataset, the Churn variable has 95 (15%) records for class Yes and 571 (85%) records for class No.
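A minimal sketch of this random oversampling step is shown below, assuming a pandas DataFrame with a "Churn" column; sampling the minority class with replacement is one common way to implement the cloning described above.

```python
# Minimal sketch: random oversampling of the minority class (Churn = "Yes")
# until it roughly matches the majority class size.
import pandas as pd

def oversample(train, target="Churn", minority="Yes", majority="No", seed=0):
    minority_rows = train[train[target] == minority]
    majority_rows = train[train[target] == majority]
    n_extra = len(majority_rows) - len(minority_rows)
    # Clone minority records at random (with replacement) to balance the classes.
    clones = minority_rows.sample(n=n_extra, replace=True, random_state=seed)
    return pd.concat([train, clones], ignore_index=True)
```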
The predictive model performance can be evaluated using the confusion matrix in Table 1. The cells on the diagonal of the cross-classification of cases are correct predictions, whilst those off the diagonal are incorrect predictions. In the training dataset, of the 2294 clients that churned, 2262 are correctly classified, so the true positive rate is 98.61%. Of the 2279 clients who did not churn, 2246 clients are correctly classified, giving a specificity of 98.55%. Overall, 98.58% of the clients in the training dataset are correctly classified.

Table 1. Confusion matrix

Set        Observed    Predicted No   Predicted Yes   % correct
Training   No          2246           33              98.55%
           Yes         32             2262            98.61%
           Overall %   49.81%         50.19%          98.58%
Testing    No          566            5               99.12%
           Yes         1              94              98.95%
           Overall %   85.14%         14.86%          99.10%

In the testing dataset, of the 95 clients that churned, 94 are correctly classified, so the true positive rate is 98.95%. Of the 571 clients who did not churn, 566 clients are correctly classified, giving a specificity of 99.12%. Overall, 99.10% of the clients in the testing dataset are correctly classified.
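These figures follow directly from the counts in Table 1; the short sketch below recomputes sensitivity, specificity, and overall accuracy from the testing-set confusion matrix.

```python
# Recompute the testing-set metrics from the Table 1 counts.
tn, fp = 566, 5      # observed No:  predicted No / predicted Yes
fn, tp = 1, 94       # observed Yes: predicted No / predicted Yes

sensitivity = tp / (tp + fn)                   # true positive rate
specificity = tn / (tn + fp)
accuracy = (tp + tn) / (tp + tn + fp + fn)

print(f"sensitivity = {sensitivity:.2%}")      # 98.95%
print(f"specificity = {specificity:.2%}")      # 99.12%
print(f"accuracy    = {accuracy:.2%}")         # 99.10%
```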
This suggests that, overall, the model will correctly classify 99 out of 100 subscribers. Since our target is to identify customers who are at risk of churning, the model has a correct classification rate of 98.95%, meaning that approximately 99 out of 100 customers will be correctly classified as churners. The model summaries also show that, because the percentages of incorrect predictions are almost equal in the training and testing datasets, the model does not overtrain.
Another way of interpreting the results is the lift chart. The lift chart sorts the predicted pseudo-probabilities in descending order and displays the corresponding curve. There are two types of lift charts: incremental and cumulative. The incremental lift chart displayed in Figure 3 shows the lift in each percentile, without any accumulation, for class Yes of the dependent variable Churn.

Fig. 3 Incremental lift chart

The lift line corresponding to the Bayesian network model descends below the random line at about the 16th percentile, meaning that, compared to the random expectation, our model achieves its best performance in the first 16% of the records.

The cumulative lift chart shows how much better the prediction rate produced by the model is compared to the random expectation (red line). In Figure 4 we can see the curve corresponding to the Bayesian network model and, by reading the graph along the horizontal axis, we can see that at the 16th percentile the model has a lift index of approximately 7 on the vertical axis, meaning that it performs 7 times better than the random expectation.

Fig. 4 Cumulative lift chart
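The sketch below shows one common way to compute such a cumulative lift value at a given percentile from scored records; the score and label arrays are hypothetical placeholders.

```python
# Minimal sketch: cumulative lift at a given percentile of the scored list.
# `scores` are predicted pseudo-probabilities for class Yes, `labels` are 1/0 churn flags.
import numpy as np

def cumulative_lift(scores, labels, percentile=16):
    scores, labels = np.asarray(scores), np.asarray(labels)
    order = np.argsort(-scores)                      # sort records by score, descending
    n_top = max(1, int(len(scores) * percentile / 100))
    top_rate = labels[order][:n_top].mean()          # churner rate among the top records
    base_rate = labels.mean()                        # churner rate in the whole dataset
    return top_rate / base_rate

# Hypothetical usage with toy data:
rng = np.random.default_rng(1)
labels = rng.binomial(1, 0.15, 5000)
scores = labels * 0.8 + rng.random(5000) * 0.2      # toy scores correlated with the labels
print(cumulative_lift(scores, labels, percentile=16))
```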

The performance can also be evaluated using the gain measure. The gain chart shows on the vertical axis the percentage of correct answers and on the horizontal axis the percentage of contacted clients. The gain measure is defined as the proportion of respondents present in each percentile with respect to the total number of respondents. The cumulative gain chart illustrates the prediction rate compared to the random expectation. Figure 5 shows the curve for the Bayesian network model for class Yes of the dependent variable Churn. One can observe that at the 16th percentile the predictive model has a performance of 99%.

Fig. 5 Cumulative gain chart

Figure 6 illustrates the ROC curve corresponding to the Bayesian network model. The ROC curve is derived from the confusion matrix and uses only the true positive rate (sensitivity) and the false positive rate (1 − specificity). Following the curve, one can observe that it approaches the coordinate point (0, 1) in the top-left corner, which implies a perfect prediction. For the Bayesian network model, the sensitivity is approximately 99% and the specificity is 99%.
Fig. 6 ROC curve
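For completeness, the sketch below shows how such a ROC curve and the associated area under the curve can be computed from scores and labels using scikit-learn; the arrays reuse the hypothetical toy data from the lift example.

```python
# Minimal sketch: ROC curve and AUC from predicted scores and true labels.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

rng = np.random.default_rng(1)
labels = rng.binomial(1, 0.15, 5000)                 # hypothetical churn flags
scores = labels * 0.8 + rng.random(5000) * 0.2       # hypothetical model scores

fpr, tpr, thresholds = roc_curve(labels, scores)     # false/true positive rates per threshold
print(f"AUC = {roc_auc_score(labels, scores):.3f}")  # close to 1.0 for a near-perfect model
```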

If, for instance, a mobile telecommunications company wants to send offers containing incentives to its clients to stay with it, it can easily select the top 16% of subscribers in this sorted, scored list and expect to reach more than seven times (lift index of 7) as many subscribers categorized as churners as it would by contacting subscribers at random.

V. CONCLUSIONS

Bayesian networks represent a formalism that appeared at the intersection of artificial intelligence, statistics, and probability. A Bayesian network represents the conditional independence hypotheses used to factorize the joint probability distribution of the network variables, thus making it possible to learn from data. The learning process consists in learning the graph structure of a Bayesian network and estimating the parameters of the global distribution by applying the Markov property. For classification, only the Markov blanket is needed, meaning that the partial Bayesian network centered on the dependent variable and its corresponding Markov blanket is sufficient. For the local learning of the Markov blanket we use the IPC-MB algorithm, and the PC algorithm to finish learning the network structure. For parameter estimation, we use the Bayesian estimation algorithm because it presents several advantages compared to MLE.
In this paper we proposed an architecture for Bayesian networks and built a predictive model for subscriber churn in the pre-paid mobile telecommunications industry. Evaluating the results from a technical point of view, we observe that, for predicting both churners and non-churners, the model has an overall accuracy of 99.10%.
From a practical point of view, the model has a very good performance of approximately 99% in predicting churners in a mobile telecommunications company. Decision-making employees can build different marketing approaches to retain churners based on the predictors that have a higher importance in scoring the model performance. This churn prediction model can be used in other customer response models as well, such as cross-selling, up-selling, or customer acquisition.

REFERENCES

[1] M. Freeman, "The 2 customer lifecycles", Intelligent Enterprise, Vol. 2, No. 16, 1999.
[2] H. Kim and C. Yoon, "Determinants of subscriber churn and customer loyalty in the Korean mobile telephony market", Telecommunications Policy, Vol. 28, 2004.
[3] P. Kotler and L. Keller, Marketing Management 12th Edition, Pearson Prentice Hall, 2006.
[4] S. Neslin, S. Gupta, W. Kamakura, J. Lu, and C. Mason, "Defection Detection: Measuring and understanding the predictive accuracy of customer churn models", Journal of Marketing Research, Vol. 43, 2006.
[5] K. Coussement and D. Van den Poel, "Integrating the voice of customers through call center emails into a decision support system for churn prediction", Information and Management, Vol. 45, 2008.
[6] P. Cabena, P. Hadjinian, R. Stadler, J. Verhees and A. J. Zanasi, Discovering Data Mining: From Concept to Implementation, Prentice Hall, 1998.
[7] J. Pearl, Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann, California, 1988.
[8] D. Heckerman, "Bayesian networks for Data Mining", Data Mining and Knowledge Discovery, Vol. 1, 1997.
[9] D. Madigan and G. Ridgeway, "Bayesian data analysis for Data Mining", Handbook of Data Mining, The MIT Press, 2003.
[10] D. Madigan and J. York, "Bayesian graphical models for discrete data", International Statistical Review, Vol. 63, No. 2, pp. 215-232, 1995.
[11] J. Whittaker, Graphical Models in Applied Multivariate Statistics, Wiley, New York, 1990.
[12] S. L. Lauritzen, Graphical Models, Oxford University Press, UK, 1996.
[13] S. L. Lauritzen and D. J. Spiegelhalter, "Local computations with probabilities on graphical structures and their application to expert systems", Journal of the Royal Statistical Society, Series B, Vol. 50, No. 2, 1988.
[14] C. Glymour, R. Scheines, P. Spirtes and K. Kelly, Discovering Causal Structure: Artificial Intelligence, Philosophy of Science, and Statistical Modeling, Academic Press, CA, 1987.
[15] G. F. Cooper and E. Herskovitz, "A Bayesian method for the induction of probabilistic networks from data", Machine Learning, Vol. 9, 1992.
[16] D. Heckerman, D. Geiger and D. M. Chickering, "Learning Bayesian networks: The combination of knowledge and statistical data", Machine Learning, Vol. 20, 1995.
[17] C. L. Blake and C. J. Merz, Churn data set, UCI Repository of Machine Learning Databases, http://www.ics.uci.edu/∼mlearn/MLRepository.html, University of California, Department of Information and Computer Science, 1998.
[18] K. B. Korb and A. E. Nicholson, Bayesian Artificial Intelligence 2nd Edition, Chapman and Hall, Florida, 2011.
[19] F. V. Jensen, Bayesian Networks and Decision Graphs, Springer, New York, 2001.
[20] T. S. Verma and J. Pearl, "Equivalence and synthesis of causal models", Uncertainty in Artificial Intelligence, Vol. 6, 1991.
[21] T. M. Cover and J. A. Thomas, Elements of Information Theory 2nd Edition, Wiley, 2006.
[22] A. Agresti, Categorical Data Analysis 3rd Edition, John Wiley & Sons, 2013.
[23] J. Hausser and K. Strimmer, "Entropy inference and the James-Stein estimator, with application to nonlinear gene association networks", Journal of Machine Learning Research, Vol. 10, 2009.
[24] M. Scutari and A. Brogini, "Bayesian network structure learning with permutation tests", Communications in Statistics – Theory and Methods, Vol. 41, No. 16-17, 2012.
[25] D. Koller and M. Sahami, "Toward Optimal Feature Selection", Technical Report, Stanford InfoLab, 1996.
[26] D. Margaritis and S. Thrun, "Bayesian Network Induction via Local Neighborhoods", Advances in Neural Information Processing Systems, Vol. 12, 1999.
[27] I. Tsamardinos, C. F. Aliferis and A. R. Statnikov, "Algorithms for Large Scale Markov Blanket Discovery", Sixteenth International Florida Artificial Intelligence Research Society Conference, AAAI Press, Florida, 2003.
[28] P. Spirtes and C. Glymour, "An Algorithm for Fast Recovery of Sparse Causal Graphs", Social Science Computer Review, Vol. 9, No. 1, 1991.
[29] P. Spirtes, C. Glymour and R. Scheines, Causation, Prediction and Search 2nd Edition, The MIT Press, 2001.
[30] I. Tsamardinos, C. F. Aliferis and A. R. Statnikov, "Time and sample efficient discovery of Markov blankets and direct causal relations", Ninth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2003.
[31] S. Yaramakala and D. Margaritis, "Speculative Markov Blanket Discovery for Optimal Feature Selection", Fifth IEEE International Conference on Data Mining, 2005.
[32] I. Tsamardinos, L. E. Brown and C. F. Aliferis, "The Max-Min Hill-Climbing Bayesian network structure learning algorithm", Machine Learning, Vol. 65, No. 1, 2006.
[33] J. M. Peña, R. Nilsson, J. Björkegren and J. Tegnér, "Towards scalable and data efficient learning of Markov boundaries", International Journal of Approximate Reasoning, Vol. 45, No. 2, 2007.
[34] C. F. Aliferis, I. Tsamardinos and A. R. Statnikov, "HITON: A novel Markov blanket algorithm for optimal variable selection", American Medical Informatics Association Annual Symposium, 2003.
[35] I. A. Beinlich, H. J. Suermondt, R. M. Chavez and G. F. Cooper, "The ALARM Monitoring System: A Case Study with Two Probabilistic Inference Techniques for Belief Networks", Second European Conference on Artificial Intelligence in Medicine, Lecture Notes in Medical Informatics, Vol. 38, 1989.
[36] S. K. Fu, "Efficient Learning of Markov Blanket and Markov Blanket Classifier", PhD Thesis, University of Montreal, Canada, 2010.
[37] R. E. Bellman, "Dynamic programming and Lagrange multipliers", Proceedings of the National Academy of Sciences of the USA, Vol. 42, No. 10, 1956.
[38] F. V. Jensen and T. D. Nielsen, Bayesian Networks and Decision Graphs, Springer, Germany, 2007.
[39] D. M. Chickering, D. Heckerman and C. Meek, "Large-sample learning of Bayesian networks is NP-hard", Journal of Machine Learning Research, Vol. 5, 2004.
[40] D. Heckerman, "A tutorial on learning with Bayesian networks", In Learning in Graphical Models, 1999.
[41] I. B. Brândușoiu and G. Toderean, "Applying principal component analysis on call detail records", ACTA Technica Napocensis Electronics and Telecommunications, Vol. 55, No. 4, 2014.
[42] H. He and Y. Ma, Imbalanced Learning: Foundations, Algorithms, and Applications, John Wiley & Sons, 2013.
