
Performance Comparison of ADTree and Naive Bayes Algorithms for Spam Filtering
Thanh Nguyen¹, Andrei Doncescu² and Pierre Sigle³
¹Laboratoire d'Analyse et d'Architecture des Systèmes, LAAS-CNRS, TOULOUSE, FRANCE
²CRCT-INSERM/UPS-IUT, TOULOUSE, FRANCE
³Aix-Marseille University, MARSEILLE, FRANCE
¹Email: [anonimizat]  ²Email: [anonimizat]  ³Email: [anonimizat]-mrs.fr

Abstract—Classification is an important data mining technique and can be used as a pretreatment step in artificial intelligence. Its broad applicability to many kinds of data allows it to be used in nearly every field of our lives. Classification assigns an item to one of a predefined set of classes or groups according to the item's features.
In this paper we present tests based on the correctly and incorrectly classified instances obtained with the Naive Bayes and ADTree classification algorithms. The Naive Bayes algorithm is based on probability, while the ADTree algorithm is based on decision trees. The paper sets out a comparative evaluation of the Naive Bayes and ADTree classifiers on the Spambase dataset using the WEKA tool, aiming to maximize the true positive rate and minimize the false positive rate rather than achieving only a higher classification accuracy.
The experimental results presented in this paper concern classification accuracy and cost analysis, and show that the results obtained with ADTree are better than those of Naive Bayes.

Key words – Classification, Data Mining, Machine Learning, Spam Filtering, Naive Bayes, ADTree, Decision Tree, Artificial Intelligence.
I. INTRODUCTION
Data mining has allowed the development of a new research field, "Big Data". Big data is related to the great growth in the volume of data available on the internet, in digital libraries, news sources and company-wide intranets. This new field tries to answer how a huge number of databases and information repositories can be organized and analyzed, and how information can be retrieved from these data. These questions obviously generate a pressing need for methods that can help users efficiently navigate, summarize, and organize the data so that it can further be used for applications ranging from market analysis to fraud detection. Therefore, techniques that perform data analysis and may uncover important data patterns are needed.
Data mining is widely used in various applications such as the analysis of organic compounds, medical diagnosis, product design, targeted marketing, credit card fraud detection, financial forecasting, automatic abstraction and spam detection. Data mining refers to the analysis of the large quantities of data that are stored in computers or on the internet.
Data mining is not specific to one type of media or data; it should be applicable to any kind of information repository. It is being applied to and studied for databases, including relational databases, object-relational databases and object-oriented databases, data warehouses, transaction databases, unstructured and semi-structured repositories such as the World Wide Web, advanced databases such as spatial databases, multimedia databases, time-series databases and textual databases, and even flat files.
The different functions of data mining are mainly classified as classification, clustering, feature selection and association rule mining. In this paper, we focus on data classification and on measuring the performance of classifier algorithms through the true positive rate (TP Rate) and false positive rate (FP Rate) generated by the algorithms when applied to the dataset [1][2].
Classification analysis is the organization of data into given classes. Also known as supervised classification, it uses given class labels to order the objects in the data collection. Classification approaches normally use a training set where all objects are already associated with known class labels. The classification algorithm learns from the training set and builds a model, which is then used to classify new objects.

II. DATA CLASSIFIER

In this section, we present two types of algorithms for comparison: the Naive Bayes classifier and the ADTree decision tree algorithm. The comparison is made on accuracy, sensitivity and specificity, using the true positives and false positives in the confusion matrices generated by the respective algorithms, as well as the correctly and incorrectly classified instances, which help to identify the most efficient classification method [2].
A. Algorithm Naive Bayes
The Naive Bayes algorithm is a simple technique for constructing probabilistic models that assign class labels to problem instances. The input of this classifier is a vector containing the features as numerical data, while the output is a class label drawn from a finite set. The success of this technique in text retrieval applications explains its natural application to spam filtering. This classifier is well suited when the dimensionality of the input is high, and it requires only a small amount of training data to estimate its intrinsic parameters.
The Naive Bayes algorithm calculates a set of probabilities in a very traditional way: by counting frequencies. The algorithm is based on Bayes' theorem and assumes all attributes to be independent given the value of the class variable. This conditional independence assumption is rarely true in real-world applications; however, the algorithm tends to perform well and learn rapidly in various supervised classification problems [3].
Parameter estimation for Naive Bayes models uses the method of maximum likelihood. The probability that a document $d$ with feature vector $x_1, x_2, \ldots, x_n$ belongs to hypothesis $h_1$ is [1][4]

$P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i \mid h_1)\, P(h_1) + P(x_i \mid h_2)\, P(h_2)}$   (1)

Here, $P(h_1 \mid x_i)$ is the posterior probability, while $P(h_1)$ is the prior probability associated with hypothesis $h_1$. For $n$ different hypotheses, we have

$P(x_i) = \sum_{j=1}^{n} P(x_i \mid h_j)\, P(h_j)$   (2)

Thus, we have

$P(h_1 \mid x_i) = \dfrac{P(x_i \mid h_1)\, P(h_1)}{P(x_i)}$   (3)
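To make the frequency-counting view concrete, here is a minimal Python sketch that estimates the priors $P(h)$ and per-feature likelihoods by counting, and then applies equations (1)-(3) to binary word-occurrence features. This is an illustrative toy, not the WEKA implementation used in the experiments; the feature encoding and the Laplace smoothing constant are our own assumptions.

```python
def train_naive_bayes(X, y):
    """Estimate P(h) and P(x_j = 1 | h) by counting frequencies.
    X: list of binary feature vectors, y: list of class labels."""
    n = len(y)
    priors, likelihoods = {}, {}
    for h in set(y):
        rows = [x for x, label in zip(X, y) if label == h]
        priors[h] = len(rows) / n                          # P(h)
        # Laplace-smoothed frequency of feature j being 1 within class h
        likelihoods[h] = [(sum(r[j] for r in rows) + 1) / (len(rows) + 2)
                          for j in range(len(X[0]))]
    return priors, likelihoods

def posterior(x, priors, likelihoods):
    """Equations (1)-(3): P(h | x) = P(x | h) P(h) / P(x)."""
    joint = {}
    for h in priors:
        p = priors[h]
        for j, v in enumerate(x):      # conditional independence assumption
            p *= likelihoods[h][j] if v else 1 - likelihoods[h][j]
        joint[h] = p
    evidence = sum(joint.values())     # P(x), as in equation (2)
    return {h: joint[h] / evidence for h in joint}

# Toy usage: three word-occurrence features, four training e-mails
X = [[1, 1, 0], [1, 0, 1], [0, 0, 1], [0, 1, 0]]
y = [1, 1, 0, 0]                       # 1 = spam, 0 = legitimate
priors, likes = train_naive_bayes(X, y)
print(posterior([1, 0, 0], priors, likes))
```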
B. Algorithm ADTree
An Alternating Decision Tree (ADTree) [5] is a machine learning method for classification. It generalizes decision trees, and its popularity is related to the boosting technique.
An ADTree consists of an alternation of decision nodes, which specify a predicate condition, and prediction nodes, which contain a single number. An instance is classified by an ADTree by following all paths for which all decision nodes are true, and summing the values of the prediction nodes that are traversed.
Alternating decision trees introduce structure to the set of hypotheses by requiring that they build on a hypothesis produced in an earlier iteration. The resulting set of hypotheses can be visualized in a tree based on the relationship between a hypothesis and its "parent." Another important feature of boosted algorithms is that the data are given a different distribution at each iteration: instances that are misclassified are given a larger weight, while accurately classified instances are given a reduced weight [6].
An alternating decision tree consists of decision nodes and prediction nodes. Decision nodes specify a predicate condition; prediction nodes contain a single number. ADTrees always have prediction nodes as both root and leaves. An instance is classified by following all paths for which all decision nodes are true and summing the values of the prediction nodes that are traversed. This is different from binary classification trees such as CART (Classification and Regression Trees) or C4.5, in which an instance follows only one path through the tree [7].
An additional attractive feature of ADTrees, one that is not possible with conventional boosting procedures, is their ability to be merged together. This is particularly useful in the context of multiclass problems, as these are often reformulated in the two-class setting using one or more classes against the others. In such a setting, ADTrees can be combined into a single classifier. Classification with an ADTree results in a set of paths through the tree, one branch being taken at each decision node that is reached; an instance is given a score by summing the values of all prediction nodes traversed along these paths.

Figure 1. A tree constructed by the ADTree algorithm on the Spambase dataset.

The ADTree algorithm produces a set $P_i \in \mathcal{P}$ of preconditions and a set $R_i \in \mathcal{R}$ of rules. A single rule consists of a simple conditional involving a precondition, a test condition $c_i$, and a pair of signed numerical predictions $p_1$ and $p_2$. The set of all test conditions is labeled $\mathcal{C}$. Preconditions are conjunctions of conditions and negations of conditions.
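The classification rule described above (follow every path whose decision nodes evaluate to true, and sum the traversed prediction values) can be sketched as follows. This is a simplified illustration of the scoring mechanism with our own node representation, not the WEKA ADTree code.

```python
from dataclasses import dataclass, field
from typing import Callable, List

@dataclass
class PredictionNode:
    value: float                       # signed numerical prediction
    children: List["DecisionNode"] = field(default_factory=list)

@dataclass
class DecisionNode:
    predicate: Callable                # test condition c_i on an instance
    if_true: PredictionNode
    if_false: PredictionNode

def score(node: PredictionNode, x) -> float:
    """Sum prediction values along all paths consistent with instance x."""
    total = node.value
    for d in node.children:            # every attached decision node is visited
        branch = d.if_true if d.predicate(x) else d.if_false
        total += score(branch, x)
    return total

# Toy tree: root prediction node plus one decision node on feature 0
root = PredictionNode(0.2, [
    DecisionNode(lambda x: x[0] > 0.5,
                 PredictionNode(0.7), PredictionNode(-0.4)),
])
print(score(root, [0.9]))              # 0.2 + 0.7 = 0.9; the sign gives the class
```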

III. PERFORMANCE EVALUATION AND COMPARISON
A score function compares classification models by providing a quality measure for a classifier when solving a classification problem. The score is based either on the confusion matrix or on Receiver Operating Characteristics (ROC).
The measurements obtained from the confusion matrix are accuracy, recall, specificity, precision and F-score; in the case of ROC, it is the area under the ROC curve (AUC). Independently of the indices adopted, an important aspect to be considered is the asymmetry in the misclassification costs.
A Spam message incorrectly classified as legitimate is a relatively minor problem, as the user is simply required to remove it. On the other hand, a legitimate message mislabeled as Spam can be unacceptable, as it implies the loss of potentially important information, particularly in configurations in which Spam messages are automatically deleted. For this reason, describing the performance of an algorithm solely in terms of the classification accuracy (the relative number of messages correctly classified) is not adequate, as it assumes equal misclassification costs for both classes.
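A simple way to make this asymmetry explicit, a standard device from decision theory rather than something taken from this paper, is to weight the two error types differently, as in the sketch below, where the cost ratio lam is a hypothetical value:

```python
def weighted_cost(fp, fn, lam=9.0):
    """Misclassification cost when a false positive (legitimate mail
    flagged as Spam) is lam times as costly as a false negative
    (Spam let through). lam = 9.0 is purely illustrative."""
    return lam * fp + fn

# Two hypothetical filters with identical accuracy but a different error mix
print(weighted_cost(fp=100, fn=300))   # 1200.0
print(weighted_cost(fp=300, fn=100))   # 2800.0 -> far worse under this cost model
```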
We consider the application of a filter to a test dataset with $n_l$ legitimate and $n_s$ Spam messages, resulting in $n_{l,s}$ and $n_{s,l}$ messages being incorrectly classified, respectively. In this case, it clearly follows that the numbers of correctly classified legitimate and Spam messages are given by $n_{l,l} = n_l - n_{l,s}$ and $n_{s,s} = n_s - n_{s,l}$, respectively.

In decision theory, the two classes are labeled as positive (Spam) and negative (legitimate), with the performance measures being the true positive rate $TP = n_{s,s}/n_s$ and the true negative rate $TN = n_{l,l}/n_l$, corresponding to the relative number of instances of each class that have been correctly classified. From these, the false positive and false negative rates can be obtained as $FP = 1 - TN$ and $FN = 1 - TP$.
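As a numerical check of these definitions, the counts reported for ADTree later in the paper ($n_l = 2788$, $n_s = 1813$, $n_{l,s} = 178$, $n_{s,l} = 171$; see Table III) reproduce its TP and TN rates:

```python
n_l, n_s = 2788, 1813      # legitimate and Spam messages in Spambase
n_ls, n_sl = 178, 171      # ADTree misclassifications (Table III)

tp = (n_s - n_sl) / n_s    # true positive rate  -> 0.906
tn = (n_l - n_ls) / n_l    # true negative rate  -> 0.936
fp = 1 - tn                # false positive rate -> 0.064
fn = 1 - tp                # false negative rate -> 0.094
print(f"TP={tp:.3f} TN={tn:.3f} FP={fp:.3f} FN={fn:.3f}")
```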

A. Confusion matrix
A confusion matrix contains information about the actual and predicted classifications produced by a classification system; the performance of such systems is commonly evaluated using the data in the matrix. A confusion matrix illustrates the accuracy of the solution to a classification problem. Given $n$ classes, a confusion matrix is an $n \times n$ matrix in which entry $c_{i,j}$ indicates the number of tuples from $D$ that were assigned to class $j$ but whose correct class is $i$. Obviously, the best solution will have only zero values outside the diagonal [1].
The following table shows the confusion matrix for a two-class classifier. The entries in the confusion matrix have the following meaning in the context of our study [8]:
TP is the number of correct predictions that an instance is positive;
FP is the number of incorrect predictions that an instance is positive;
TN is the number of correct predictions that an instance is negative;
FN is the number of incorrect predictions that an instance is negative [9].
Table I
CONFUSION MATRIX FOR A TWO-CLASS CLASSIFIER

         0                       1                  Classe
 TP (True Positives)    FP (False Positives)        a = 0
 FN (False Negatives)   TN (True Negatives)         b = 1

B. Several standard terms have been defined
Precision:

$\text{Precision} = \dfrac{TP}{TP + FN}$   (4)

Recall:

$\text{Recall} = \dfrac{TP}{TP + FP}$   (5)

F-Measure:

$\text{F-Measure} = \dfrac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$   (6)

Kappa statistic:

$K = \dfrac{P_0 - P_e}{1 - P_e}$   (7)

with $P_0$ the proportion of the sample on which both judges agree, and

$P_e = \dfrac{\sum_i p_{i\cdot}\, p_{\cdot i}}{n^2}$   (8)

where $p_{i\cdot}$ is the sum of the elements of row $i$, $p_{\cdot i}$ is the sum of the elements of column $i$, and $n$ is the size of the sample.
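The sketch below computes equations (4)-(8) from a 2x2 confusion matrix laid out as in Table I (rows are actual classes, columns are predicted classes). Note that it follows the paper's own labeling, in which the off-diagonal entry of the first row is called FP and that of the second row FN:

```python
def metrics(cm):
    """cm = [[TP, FP], [FN, TN]], following the layout of Table I."""
    (tp, fp), (fn, tn) = cm
    n = tp + fp + fn + tn
    precision = tp / (tp + fn)                                  # equation (4)
    recall = tp / (tp + fp)                                     # equation (5)
    f_measure = 2 * precision * recall / (precision + recall)   # equation (6)
    p0 = (tp + tn) / n                 # observed agreement (accuracy)
    rows = [tp + fp, fn + tn]          # row sums p_i.
    cols = [tp + fn, fp + tn]          # column sums p_.i
    pe = sum(r * c for r, c in zip(rows, cols)) / n ** 2        # equation (8)
    kappa = (p0 - pe) / (1 - pe)                                # equation (7)
    return precision, recall, f_measure, p0, kappa
```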
IV. EXPERIMENTAL WORK AND RESULTS
In this section we compare the classification accuracy of the decision tree algorithm ADTree and the classification algorithm Naive Bayes. The simulations were conducted using a large spam e-mail dataset consisting of 4601 instances with 58 attributes. The UCI dataset has been modified accordingly and used. All simulations were performed in the WEKA machine learning environment, which provides a collection of popular learning schemes for practical data mining. We list below the steps taken to achieve the desired results. First, the Spambase dataset was taken from the UCI machine learning repository [10], http://www1.ics.uci.edu/mlearn/MLRepository.html.
The "spam" concept is diverse: advertisements for products/web sites, make-money-fast schemes, chain letters, pornography. The UCI Machine Learning collection of spam e-mails came from their postmaster and individuals who had filed spam. Their collection of non-spam e-mails came from filed work and personal e-mails; hence the word "george" and the area code "650" are indicators of non-spam. These are useful when constructing a personalized spam filter. One would either have to blind such non-spam indicators or get a very wide collection of non-spam to generate a general-purpose spam filter. In this dataset the number of instances is 4601, of which 1813 are Spam (39.4%), and the number of attributes is 58, of which 57 are continuous and 1 is a nominal class label.
Attribute information: the last column of "spambase.data" denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail. Most of the attributes indicate whether a particular word or character occurred frequently in the e-mail. The run-length attributes (55-57) measure the length of sequences of consecutive capital letters. The definitions of the attributes are as follows:
48 continuous real [0,100] attributes of type word_freq_WORD = percentage of words in the e-mail that match WORD, i.e. 100 * (number of times WORD appears in the e-mail) / total number of words in the e-mail. A "word" in this case is any string of alphanumeric characters bounded by non-alphanumeric characters or end-of-string.

6 continuous real [0,100] attributes of type char_freq_CHAR = percentage of characters in the e-mail that match CHAR, i.e. 100 * (number of CHAR occurrences) / total characters in the e-mail.

1 continuous real [1,...] attribute of type capital_run_length_average = average length of uninterrupted sequences of capital letters.

1 continuous integer [1,...] attribute of type capital_run_length_longest = length of the longest uninterrupted sequence of capital letters.

1 continuous integer [1,...] attribute of type capital_run_length_total = sum of the lengths of uninterrupted sequences of capital letters = total number of capital letters in the e-mail.

1 nominal {0,1} class attribute of type spam = denotes whether the e-mail was considered spam (1) or not (0), i.e. unsolicited commercial e-mail.
Class distribution: Spam 1813 (39.4%), non-Spam 2788 (60.6%).

Second, the dataset was converted into ARFF (Attribute-Relation File Format), the WEKA data file format. These data contain 4601 instances, of which 1813 are of the Spam category (39.4%), and 58 attributes, of which 57 are continuous and 1 is a nominal class label.
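As a hedged sketch of this preparation step (assuming the raw comma-separated spambase.data file from the UCI page, with the class label in the last column), the class distribution can be verified before the ARFF conversion:

```python
import csv

# spambase.data: one comma-separated row per e-mail,
# 57 numeric features followed by the 0/1 class label
with open("spambase.data") as f:
    rows = [r for r in csv.reader(f) if r]

labels = [int(float(r[-1])) for r in rows]
n_spam = sum(labels)
print(f"instances: {len(rows)}, spam: {n_spam} "
      f"({100 * n_spam / len(rows):.1f}%)")   # expected: 4601, 1813 (39.4%)
```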
A. Results of classification using the Naive Bayes algorithm
The Naive Bayes algorithm is applied to the Spambase dataset and the confusion matrix is generated for the class attribute, which has two possible values, 0 or 1.

Confusion Matrix:
Table II
CONFUSION MATRIX

    0       1     Classe
  1925     863    a = 0
    79    1734    b = 1

For the above confusion matrix, the true positives for class a = 0 number 1925 while the false positives number 863, whereas for class b = 1 the true positives number 1734 and the false negatives number 79. The diagonal elements of the matrix, 1925 + 1734 = 3659, represent the correctly classified instances, and the other elements, 79 + 863 = 942, represent the incorrect instances.

TP rate for class a = 1925/(1925+863) = 0.690
FP rate for class a = 79/(79+1734) = 0.044
TP rate for class b = 1734/(79+1734) = 0.956
FP rate for class b = 863/(863+1925) = 0.309
Precision for class a = 1925/(1925+79) = 0.961
Precision for class b = 1734/(1734+863) = 0.668
F-measure for class a = 2*0.961*0.690/(0.961+0.690) = 0.803
F-measure for class b = 2*0.668*0.956/(0.668+0.956) = 0.786

Figure 2. Cost/Benefit Analysis of Naive Bayes (class = 0).

B. Results of classification using the ADTree algorithm
The ADTree algorithm is applied to the Spambase dataset and the confusion matrix is generated for the class attribute, which has two possible values, 0 or 1.

Confusion Matrix:
Table III
CONFUSION MATRIX

    0       1     Classe
  2610     178    a = 0
   171    1642    b = 1

For the above confusion matrix, the true positives for class a = 0 number 2610 while the false positives number 178, whereas for class b = 1 the true positives number 1642 and the false negatives number 171.

The diagonal elements of the matrix, 2610 + 1642 = 4252, represent the correctly classified instances, and the other elements, 171 + 178 = 349, represent the incorrect instances.
TP rate for class a = 2610/(2610+178) = 0.936
FP rate for class a = 171/(171+1642) = 0.094
TP rate for class b = 1642/(1642+171) = 0.906
FP rate for class b = 178/(178+2610) = 0.064
Precision for class a = 2610/(2610+171) = 0.939
Precision for class b = 1642/(1642+178) = 0.902
F-measure for class a = 2*0.939*0.936/(0.939+0.936) = 0.937
F-measure for class b = 2*0.902*0.906/(0.902+0.906) = 0.904

Figure 4. Cost/Benefit Analysis of ADTree (class = 0).

Figure 5. Cost/Benefit Analysis of ADTree (class = 1).

C. Comparison of results
From the above experimental work we can conclude that the number of correctly classified instances is 3659 for Naive Bayes and 4252 for ADTree; the performance evaluation on the Spambase dataset is shown in Table IV.

Table IV
DETAILED PRECISION BY CLASS ON THE TRAINING SET

Algorithm      Precision   Recall   F-Measure   Correctly Classified   Kappa
Naive Bayes    0.961       0.690    0.803       0.795                  0.601
ADTree         0.939       0.936    0.937       0.924                  0.841
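As a consistency check, the metrics() helper sketched in Section III reproduces both rows of Table IV directly from the confusion matrices of Tables II and III:

```python
# Reuses metrics() from the Section III sketch
nb = metrics([[1925, 863], [79, 1734]])     # Table II (Naive Bayes)
adt = metrics([[2610, 178], [171, 1642]])   # Table III (ADTree)
for name, (p, r, f, acc, k) in (("Naive Bayes", nb), ("ADTree", adt)):
    print(f"{name}: P={p:.3f} R={r:.3f} F={f:.3f} Acc={acc:.3f} K={k:.3f}")
# Naive Bayes: P=0.961 R=0.690 F=0.803 Acc=0.795 K=0.601
# ADTree:      P=0.939 R=0.936 F=0.937 Acc=0.924 K=0.841
```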

V. CONCLUSION AND FUTURE WORK
From the above experimental results, both the ADTree and Naive Bayes algorithms were run on the given Spambase dataset with the WEKA tool. The experimental results presented concern classification accuracy and cost analysis. ADTree gives higher classification accuracy for the class in the Spambase dataset, which takes two values, 0 and 1. Although in this example the cost analysis gives the same value for both classifiers, the results show that ADTree is more cost-efficient than the Naive Bayes classifier. The results on this dataset also show that the efficiency and accuracy of ADTree are better than those of Naive Bayes.

Future work: we will combine the two methods, statistical and decision tree, to classify medical data for the diagnosis and detection of cancer.

REFERENCES

[1] Margaret H. Dunham, S. Sridhar, Data Mining: Introductory and Advanced Topics, Pearson Education, 1st ed., 2006.
[2] Wenke Lee, Salvatore J. Stolfo, Kui W. Mok, "A Data Mining Framework for Building Intrusion Detection Models".
[3] George Dimitoglou, James A. Adams, and Carol M. Jim, "Comparison of the C4.5 and a Naive Bayes Classifier for the Prediction of Lung Cancer Survivability".
[4] Seongwook Youn, Dennis McLeod, "A Comparative Study for Email Classification".
[5] Yoav Freund and Llew Mason, "The Alternating Decision Tree Learning Algorithm", Proceedings of the 16th International Conference on Machine Learning, pp. 124-133, 1999.
[6] Bernhard Pfahringer, Geoffrey Holmes and Richard Kirkby, "Optimizing the Induction of Alternating Decision Trees", Proceedings of the Fifth Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, pp. 477-487, 2001.
[7] http://en.wikipedia.org/wiki/Alternating_decision_tree
[8] Anshul Goyal, Rajni Mehta, "Performance Comparison of Naive Bayes and J48 Classification Algorithms", IJAER, Vol. 7, No. 11, 2012.
[9] Xiangyang Li, Nong Ye, "A Supervised Clustering and Classification Algorithm for Mining Data With Mixed Variables", IEEE Transactions on Systems, Man, and Cybernetics, Vol. 36, No. 2, pp. 396-406, 2006.
[10] http://www1.ics.uci.edu/mlearn/MLRepository.html
