Data Classification

Data
Classification
Algorithms and Applications

Chapman & Hall/CRC
Data Mining and Knowledge Discovery Series
SERIES EDITOR
Vipin Kumar
University of Minnesota
Department of Computer Science and Engineering
Minneapolis, Minnesota, U.S.A.
AIMS AND SCOPE
This series aims to capture new developments and applications in data mining and knowledge
discovery, while summarizing the computational tools and techniques useful in data analysis. This
series encourages the integration of mathematical, statistical, and computational methods and techniques through the publication of a broad range of textbooks, reference works, and handbooks. The inclusion of concrete examples and applications is highly encouraged. The scope of the
series includes, but is not limited to, titles in the areas of data mining and knowledge discovery
methods and applications, modeling, algorithms, theory and foundations, data and knowledge
visualization, data mining systems and tools, and privacy and security issues.
PUBLISHED TITLES

ADVANCES IN MACHINE LEARNING AND DATA MINING FOR ASTRONOMY
Michael J. Way, Jeffrey D. Scargle, Kamal M. Ali, and Ashok N. Srivastava
BIOLOGICAL DATA MINING
Jake Y. Chen and Stefano Lonardi
COMPUTATIONAL BUSINESS ANALYTICS
Subrata Das
COMPUTATIONAL INTELLIGENT DATA ANALYSIS FOR SUSTAINABLE DEVELOPMENT
Ting Yu, Nitesh V. Chawla, and Simeon Simoff
COMPUTATIONAL METHODS OF FEATURE SELECTION
Huan Liu and Hiroshi Motoda
CONSTRAINED CLUSTERING: ADVANCES IN ALGORITHMS, THEORY, AND APPLICATIONS
Sugato Basu, Ian Davidson, and Kiri L. Wagstaff
CONTRAST DATA MINING: CONCEPTS, ALGORITHMS, AND APPLICATIONS
Guozhu Dong and James Bailey
DATA CLASSIFICATION: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal
DATA CLUSTERING: ALGORITHMS AND APPLICATIONS
Charu C. Aggarwal and Chandan K. Reddy
DATA CLUSTERING IN C++: AN OBJECT-ORIENTED APPROACH
Guojun Gan
DATA MINING FOR DESIGN AND MARKETING
Yukio Ohsawa and Katsutoshi Yada
DATA MINING WITH R: LEARNING WITH CASE STUDIES
Luís Torgo
FOUNDATIONS OF PREDICTIVE ANALYTICS
James Wu and Stephen Coggeshall
GEOGRAPHIC DATA MINING AND KNOWLEDGE DISCOVERY, SECOND EDITION
Harvey J. Miller and Jiawei Han
HANDBOOK OF EDUCATIONAL DATA MINING
Cristóbal Romero, Sebastian Ventura, Mykola Pechenizkiy, and Ryan S.J.d. Baker
INFORMATION DISCOVERY ON ELECTRONIC HEALTH RECORDS
Vagelis Hristidis
INTELLIGENT TECHNOLOGIES FOR WEB APPLICATIONS
Priti Srinivas Sajja and Rajendra Akerkar
INTRODUCTION TO PRIVACY-PRESERVING DATA PUBLISHING: CONCEPTS AND TECHNIQUES
Benjamin C. M. Fung, Ke Wang, Ada Wai-Chee Fu, and Philip S. Yu
KNOWLEDGE DISCOVERY FOR COUNTERTERRORISM AND LAW ENFORCEMENT
David Skillicorn
KNOWLEDGE DISCOVERY FROM DATA STREAMS
João Gama
MACHINE LEARNING AND KNOWLEDGE DISCOVERY FOR ENGINEERING SYSTEMS HEALTH MANAGEMENT
Ashok N. Srivastava and Jiawei Han
MINING SOFTWARE SPECIFICATIONS: METHODOLOGIES AND APPLICATIONS
David Lo, Siau-Cheng Khoo, Jiawei Han, and Chao Liu
MULTIMEDIA DATA MINING: A SYSTEMATIC INTRODUCTION TO CONCEPTS AND THEORY
Zhongfei Zhang and Ruofei Zhang
MUSIC DATA MINING
Tao Li, Mitsunori Ogihara, and George Tzanetakis
NEXT GENERATION OF DATA MINING
Hillol Kargupta, Jiawei Han, Philip S. Yu, Rajeev Motwani, and Vipin Kumar
RAPIDMINER: DATA MINING USE CASES AND BUSINESS ANALYTICS APPLICATIONS
Markus Hofmann and Ralf Klinkenberg
RELATIONAL DATA CLUSTERING: MODELS, ALGORITHMS, AND APPLICATIONS
Bo Long, Zhongfei Zhang, and Philip S. Yu
SERVICE-ORIENTED DISTRIBUTED KNOWLEDGE DISCOVERY
Domenico Talia and Paolo Trunfio
SPECTRAL FEATURE SELECTION FOR DATA MINING
Zheng Alan Zhao and Huan Liu
STATISTICAL DATA MINING USING SAS APPLICATIONS, SECOND EDITION
George Fernandez
SUPPORT VECTOR MACHINES: OPTIMIZATION BASED THEORY, ALGORITHMS, AND EXTENSIONS
Naiyang Deng, Yingjie Tian, and Chunhua Zhang
TEMPORAL DATA MINING
Theophano Mitsa
TEXT MINING: CLASSIFICATION, CLUSTERING, AND APPLICATIONS
Ashok N. Srivastava and Mehran Sahami
THE TOP TEN ALGORITHMS IN DATA MINING
Xindong Wu and Vipin Kumar
UNDERSTANDING COMPLEX DATASETS: DATA MINING WITH MATRIX DECOMPOSITIONS
David Skillicorn

Data
Classification
Algorithms and Applications
Edited by
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, New York, USA

CRC Press
Taylor & Francis Group
6000 Broken Sound Parkway NW, Suite 300
Boca Raton, FL 33487-2742
© 2015 by Taylor & Francis Group, LLC
CRC Press is an imprint of Taylor & Francis Group, an Informa business
No claim to original U.S. Government works
Printed on acid-free paper
Version Date: 20140611
International Standard Book Number-13: 978-1-4665-8674-1 (Hardback)

This book contains information obtained from authentic and highly regarded sources. Reasonable efforts have been made to publish reliable data and information, but the author and publisher cannot assume responsibility for the validity of all materials or the consequences of their use. The authors and publishers have attempted to trace the copyright holders of all material reproduced in this publication and apologize to copyright holders if permission to publish in this form has not been obtained. If any copyright material has not been acknowledged please write and let us know so we may rectify in any future reprint.

Except as permitted under U.S. Copyright Law, no part of this book may be reprinted, reproduced, transmitted, or utilized in any form by any electronic, mechanical, or other means, now known or hereafter invented, including photocopying, microfilming, and recording, or in any information storage or retrieval system, without written permission from the publishers.

For permission to photocopy or use material electronically from this work, please access www.copyright.com (http://www.copyright.com/) or contact the Copyright Clearance Center, Inc. (CCC), 222 Rosewood Drive, Danvers, MA 01923, 978-750-8400. CCC is a not-for-profit organization that provides licenses and registration for a variety of users. For organizations that have been granted a photocopy license by the CCC, a separate system of payment has been arranged.
Trademark Notice: Product or corporate names may be trademarks or registered trademarks, and are used only for identification and explanation without intent to infringe.
Library of Congress Cataloging‑in‑Publication Data
Data classification : algorithms and applications / edited by Charu C. Aggarwal.
pages cm

(Chapman & Hall/CRC data mining and knowledge discovery series ; 35)
Summary: “This book homes in on three primary aspects of data classification: the core methods for data classification including probabilistic classification, decision trees, rule-based methods, and SVM methods; different problem domains and scenarios such as multimedia data, text data, biological data, categorical data, network data, data streams and uncertain data; and different variations of the classification problem such as ensemble methods, visual methods, transfer learning, semi-supervised methods and active learning. These advanced methods can be used to enhance the quality of the underlying classification results” -- Provided by publisher.
Includes bibliographical references and index.
ISBN 978-1-4665-8674-1 (hardback)
1. File organization (Computer science) 2. Categories (Mathematics) 3. Algorithms. I. Aggarwal, Charu C.
QA76.9.F5.D38 2014
005.74’1--dc23
2013050912
Visit the Taylor & Francis Web site at
http://www.taylorandfrancis.com
and the CRC Press Web site at
http://www.crcpress.com

To my wife Lata, and my daughter Sayani

Contents
Editor Biography xxiii
Contributors xxv
Preface xxvii
1 An Introduction to Data Classification 1
Charu C. Aggarwal
1.1 Introduction .......... 2
1.2 Common Techniques in Data Classification .......... 4
1.2.1 Feature Selection Methods .......... 4
1.2.2 Probabilistic Methods .......... 6
1.2.3 Decision Trees .......... 7
1.2.4 Rule-Based Methods .......... 9
1.2.5 Instance-Based Learning .......... 11
1.2.6 SVM Classifiers .......... 11
1.2.7 Neural Networks .......... 14
1.3 Handling Different Data Types .......... 16
1.3.1 Large Scale Data: Big Data and Data Streams .......... 16
1.3.1.1 Data Streams .......... 16
1.3.1.2 The Big Data Framework .......... 17
1.3.2 Text Classification .......... 18
1.3.3 Multimedia Classification .......... 20
1.3.4 Time Series and Sequence Data Classification .......... 20
1.3.5 Network Data Classification .......... 21
1.3.6 Uncertain Data Classification .......... 21
1.4 Variations on Data Classification .......... 22
1.4.1 Rare Class Learning .......... 22
1.4.2 Distance Function Learning .......... 22
1.4.3 Ensemble Learning for Data Classification .......... 23
1.4.4 Enhancing Classification Methods with Additional Data .......... 24
1.4.4.1 Semi-Supervised Learning .......... 24
1.4.4.2 Transfer Learning .......... 26
1.4.5 Incorporating Human Feedback .......... 27
1.4.5.1 Active Learning .......... 28
1.4.5.2 Visual Learning .......... 29
1.4.6 Evaluating Classification Algorithms .......... 30
1.5 Discussion and Conclusions .......... 31

2 Feature Selection for Classification: A Review 37
Jiliang Tang, Salem Alelyani, and Huan Liu
2.1 Introduction .......... 38
2.1.1 Data Classification .......... 39
2.1.2 Feature Selection .......... 40
2.1.3 Feature Selection for Classification .......... 42
2.2 Algorithms for Flat Features .......... 43
2.2.1 Filter Models .......... 44
2.2.2 Wrapper Models .......... 46
2.2.3 Embedded Models .......... 47
2.3 Algorithms for Structured Features .......... 49
2.3.1 Features with Group Structure .......... 50
2.3.2 Features with Tree Structure .......... 51
2.3.3 Features with Graph Structure .......... 53
2.4 Algorithms for Streaming Features .......... 55
2.4.1 The Grafting Algorithm .......... 56
2.4.2 The Alpha-Investing Algorithm .......... 56
2.4.3 The Online Streaming Feature Selection Algorithm .......... 57
2.5 Discussions and Challenges .......... 57
2.5.1 Scalability .......... 57
2.5.2 Stability .......... 58
2.5.3 Linked Data .......... 58
3 Probabilistic Models for Classification 65
Hongbo Deng, Yizhou Sun, Yi Chang, and Jiawei Han
3.1 Introduction .......... 66
3.2 Naive Bayes Classification .......... 67
3.2.1 Bayes’ Theorem and Preliminary .......... 67
3.2.2 Naive Bayes Classifier .......... 69
3.2.3 Maximum-Likelihood Estimates for Naive Bayes Models .......... 70
3.2.4 Applications .......... 71
3.3 Logistic Regression Classification .......... 72
3.3.1 Logistic Regression .......... 73
3.3.2 Parameters Estimation for Logistic Regression .......... 74
3.3.3 Regularization in Logistic Regression .......... 75
3.3.4 Applications .......... 76
3.4 Probabilistic Graphical Models for Classification .......... 76
3.4.1 Bayesian Networks .......... 76
3.4.1.1 Bayesian Network Construction .......... 77
3.4.1.2 Inference in a Bayesian Network .......... 78
3.4.1.3 Learning Bayesian Networks .......... 78
3.4.2 Hidden Markov Models .......... 78
3.4.2.1 The Inference and Learning Algorithms .......... 79
3.4.3 Markov Random Fields .......... 81
3.4.3.1 Conditional Independence .......... 81
3.4.3.2 Clique Factorization .......... 81
3.4.3.3 The Inference and Learning Algorithms .......... 82
3.4.4 Conditional Random Fields .......... 82
3.4.4.1 The Learning Algorithms .......... 83
3.5 Summary .......... 83

4 Decision Trees: Theory and Algorithms 87
Victor E. Lee, Lin Liu, and Ruoming Jin
4.1 Introduction .......... 87
4.2 Top-Down Decision Tree Induction .......... 91
4.2.1 Node Splitting .......... 92
4.2.2 Tree Pruning .......... 97
4.3 Case Studies with C4.5 and CART .......... 99
4.3.1 Splitting Criteria .......... 100
4.3.2 Stopping Conditions .......... 100
4.3.3 Pruning Strategy .......... 101
4.3.4 Handling Unknown Values: Induction and Prediction .......... 101
4.3.5 Other Issues: Windowing and Multivariate Criteria .......... 102
4.4 Scalable Decision Tree Construction .......... 103
4.4.1 RainForest-Based Approach .......... 104
4.4.2 SPIES Approach .......... 105
4.4.3 Parallel Decision Tree Construction .......... 107
4.5 Incremental Decision Tree Induction .......... 108
4.5.1 ID3 Family .......... 108
4.5.2 VFDT Family .......... 110
4.5.3 Ensemble Method for Streaming Data .......... 113
4.6 Summary .......... 114
5 Rule-Based Classification 121
Xiao-Li Li and Bing Liu
5.1 Introduction .......... 121
5.2 Rule Induction .......... 123
5.2.1 Two Algorithms for Rule Induction .......... 123
5.2.1.1 CN2 Induction Algorithm (Ordered Rules) .......... 124
5.2.1.2 RIPPER Algorithm and Its Variations (Ordered Classes) .......... 125
5.2.2 Learn One Rule in Rule Learning .......... 126
5.3 Classification Based on Association Rule Mining .......... 129
5.3.1 Association Rule Mining .......... 130
5.3.1.1 Definitions of Association Rules, Support, and Confidence .......... 131
5.3.1.2 The Introduction of Apriori Algorithm .......... 133
5.3.2 Mining Class Association Rules .......... 136
5.3.3 Classification Based on Associations .......... 139
5.3.3.1 Additional Discussion for CARs Mining .......... 139
5.3.3.2 Building a Classifier Using CARs .......... 140
5.3.4 Other Techniques for Association Rule-Based Classification .......... 142
5.4 Applications .......... 144
5.4.1 Text Categorization .......... 144
5.4.2 Intrusion Detection .......... 147
5.4.3 Using Class Association Rules for Diagnostic Data Mining .......... 148
5.4.4 Gene Expression Data Analysis .......... 149
5.5 Discussion and Conclusion .......... 150

6 Instance-Based Learning: A Survey 157
Charu C. Aggarwal
6.1 Introduction .......... 157
6.2 Instance-Based Learning Framework .......... 159
6.3 The Nearest Neighbor Classifier .......... 160
6.3.1 Handling Symbolic Attributes .......... 163
6.3.2 Distance-Weighted Nearest Neighbor Methods .......... 163
6.3.3 Local Distance Scaling .......... 164
6.3.4 Attribute-Weighted Nearest Neighbor Methods .......... 164
6.3.5 Locally Adaptive Nearest Neighbor Classifier .......... 167
6.3.6 Combining with Ensemble Methods .......... 169
6.3.7 Multi-Label Learning .......... 169
6.4 Lazy SVM Classification .......... 171
6.5 Locally Weighted Regression .......... 172
6.6 Lazy Naive Bayes .......... 173
6.7 Lazy Decision Trees .......... 173
6.8 Rule-Based Classification .......... 174
6.9 Radial Basis Function Networks: Leveraging Neural Networks for Instance-Based Learning .......... 175
6.10 Lazy Methods for Diagnostic and Visual Classification .......... 176
6.11 Conclusions and Summary .......... 180
7 Support Vector Machines 187
Po-Wei Wang and Chih-Jen Lin
7.1 Introduction .......... 187
7.2 The Maximum Margin Perspective .......... 188
7.3 The Regularization Perspective .......... 190
7.4 The Support Vector Perspective .......... 191
7.5 Kernel Tricks .......... 194
7.6 Solvers and Algorithms .......... 196
7.7 Multiclass Strategies .......... 198
7.8 Conclusion .......... 201
8 Neural Networks: A Review 205
Alain Biem
8.1 Introduction .......... 206
8.2 Fundamental Concepts .......... 208
8.2.1 Mathematical Model of a Neuron .......... 208
8.2.2 Types of Units .......... 209
8.2.2.1 McCullough Pitts Binary Threshold Unit .......... 209
8.2.2.2 Linear Unit .......... 210
8.2.2.3 Linear Threshold Unit .......... 211
8.2.2.4 Sigmoidal Unit .......... 211
8.2.2.5 Distance Unit .......... 211
8.2.2.6 Radial Basis Unit .......... 211
8.2.2.7 Polynomial Unit .......... 212
8.2.3 Network Topology .......... 212
8.2.3.1 Layered Network .......... 212
8.2.3.2 Networks with Feedback .......... 212
8.2.3.3 Modular Networks .......... 213
8.2.4 Computation and Knowledge Representation .......... 213
8.2.5 Learning .......... 213
8.2.5.1 Hebbian Rule .......... 213
8.2.5.2 The Delta Rule .......... 214
8.3 Single-Layer Neural Network .......... 214
8.3.1 The Single-Layer Perceptron .......... 214
8.3.1.1 Perceptron Criterion .......... 214
8.3.1.2 Multi-Class Perceptrons .......... 216
8.3.1.3 Perceptron Enhancements .......... 216
8.3.2 Adaline .......... 217
8.3.2.1 Two-Class Adaline .......... 217
8.3.2.2 Multi-Class Adaline .......... 218
8.3.3 Learning Vector Quantization (LVQ) .......... 219
8.3.3.1 LVQ1 Training .......... 219
8.3.3.2 LVQ2 Training .......... 219
8.3.3.3 Application and Limitations .......... 220
8.4 Kernel Neural Network .......... 220
8.4.1 Radial Basis Function Network .......... 220
8.4.2 RBFN Training .......... 222
8.4.2.1 Using Training Samples as Centers .......... 222
8.4.2.2 Random Selection of Centers .......... 222
8.4.2.3 Unsupervised Selection of Centers .......... 222
8.4.2.4 Supervised Estimation of Centers .......... 223
8.4.2.5 Linear Optimization of Weights .......... 223
8.4.2.6 Gradient Descent and Enhancements .......... 223
8.4.3 RBF Applications .......... 223
8.5 Multi-Layer Feedforward Network .......... 224
8.5.1 MLP Architecture for Classification .......... 224
8.5.1.1 Two-Class Problems .......... 225
8.5.1.2 Multi-Class Problems .......... 225
8.5.1.3 Forward Propagation .......... 226
8.5.2 Error Metrics .......... 227
8.5.2.1 Mean Square Error (MSE) .......... 227
8.5.2.2 Cross-Entropy (CE) .......... 227
8.5.2.3 Minimum Classification Error (MCE) .......... 228
8.5.3 Learning by Backpropagation .......... 228
8.5.4 Enhancing Backpropagation .......... 229
8.5.4.1 Backpropagation with Momentum .......... 230
8.5.4.2 Delta-Bar-Delta .......... 231
8.5.4.3 Rprop Algorithm .......... 231
8.5.4.4 Quick-Prop .......... 231
8.5.5 Generalization Issues .......... 232
8.5.6 Model Selection .......... 232
8.6 Deep Neural Networks .......... 232
8.6.1 Use of Prior Knowledge .......... 233
8.6.2 Layer-Wise Greedy Training .......... 234
8.6.2.1 Deep Belief Networks (DBNs) .......... 234
8.6.2.2 Stack Auto-Encoder .......... 235
8.6.3 Limits and Applications .......... 235
8.7 Summary .......... 235

9 A Survey of Stream Classification Algorithms 245
Charu C. Aggarwal
9.1 Introduction .......... 245
9.2 Generic Stream Classification Algorithms .......... 247
9.2.1 Decision Trees for Data Streams .......... 247
9.2.2 Rule-Based Methods for Data Streams .......... 249
9.2.3 Nearest Neighbor Methods for Data Streams .......... 250
9.2.4 SVM Methods for Data Streams .......... 251
9.2.5 Neural Network Classifiers for Data Streams .......... 252
9.2.6 Ensemble Methods for Data Streams .......... 253
9.3 Rare Class Stream Classification .......... 254
9.3.1 Detecting Rare Classes .......... 255
9.3.2 Detecting Novel Classes .......... 255
9.3.3 Detecting Infrequently Recurring Classes .......... 256
9.4 Discrete Attributes: The Massive Domain Scenario .......... 256
9.5 Other Data Domains .......... 262
9.5.1 Text Streams .......... 262
9.5.2 Graph Streams .......... 264
9.5.3 Uncertain Data Streams .......... 267
9.6 Conclusions and Summary .......... 267
10 Big Data Classification 275
Hanghang Tong
10.1 Introduction .......... 275
10.2 Scale-Up on a Single Machine .......... 276
10.2.1 Background .......... 276
10.2.2 SVMPerf .......... 276
10.2.3 Pegasos .......... 277
10.2.4 Bundle Methods .......... 279
10.3 Scale-Up by Parallelism .......... 280
10.3.1 Parallel Decision Trees .......... 280
10.3.2 Parallel SVMs .......... 281
10.3.3 MRM-ML .......... 281
10.3.4 SystemML .......... 282
10.4 Conclusion .......... 283
11 Text Classification 287
Charu C. Aggarwal and ChengXiang Zhai
11.1 Introduction .......... 288
11.2 Feature Selection for Text Classification .......... 290
11.2.1 Gini Index .......... 291
11.2.2 Information Gain .......... 292
11.2.3 Mutual Information .......... 292
11.2.4 χ2-Statistic .......... 292
11.2.5 Feature Transformation Methods: Unsupervised and Supervised LSI .......... 293
11.2.6 Supervised Clustering for Dimensionality Reduction .......... 294
11.2.7 Linear Discriminant Analysis .......... 294
11.2.8 Generalized Singular Value Decomposition .......... 295
11.2.9 Interaction of Feature Selection with Classification .......... 296
11.3 Decision Tree Classifiers .......... 296
11.4 Rule-Based Classifiers .......... 298
11.5 Probabilistic and Naive Bayes Classifiers .......... 300
11.5.1 Bernoulli Multivariate Model .......... 301
11.5.2 Multinomial Distribution .......... 304
11.5.3 Mixture Modeling for Text Classification .......... 305
11.6 Linear Classifiers .......... 308
11.6.1 SVM Classifiers .......... 308
11.6.2 Regression-Based Classifiers .......... 311
11.6.3 Neural Network Classifiers .......... 312
11.6.4 Some Observations about Linear Classifiers .......... 315
11.7 Proximity-Based Classifiers .......... 315
11.8 Classification of Linked and Web Data .......... 317
11.9 Meta-Algorithms for Text Classification .......... 321
11.9.1 Classifier Ensemble Learning .......... 321
11.9.2 Data Centered Methods: Boosting and Bagging .......... 322
11.9.3 Optimizing Specific Measures of Accuracy .......... 322
11.10 Leveraging Additional Training Data .......... 323
11.10.1 Semi-Supervised Learning .......... 324
11.10.2 Transfer Learning .......... 326
11.10.3 Active Learning .......... 327
11.11 Conclusions and Summary .......... 327
12 Multimedia Classification 337
Shiyu Chang, Wei Han, Xianming Liu, Ning Xu, Pooya Khorrami, and Thomas S. Huang
12.1 Introduction .......... 338
12.1.1 Overview .......... 338
12.2 Feature Extraction and Data Pre-Processing .......... 339
12.2.1 Text Features .......... 340
12.2.2 Image Features .......... 341
12.2.3 Audio Features .......... 344
12.2.4 Video Features .......... 345
12.3 Audio Visual Fusion .......... 345
12.3.1 Fusion Methods .......... 346
12.3.2 Audio Visual Speech Recognition .......... 346
12.3.2.1 Visual Front End .......... 347
12.3.2.2 Decision Fusion on HMM .......... 348
12.3.3 Other Applications .......... 349
12.4 Ontology-Based Classification and Inference .......... 349
12.4.1 Popular Applied Ontology .......... 350
12.4.2 Ontological Relations .......... 350
12.4.2.1 Definition .......... 351
12.4.2.2 Subclass Relation .......... 351
12.4.2.3 Co-Occurrence Relation .......... 352
12.4.2.4 Combination of the Two Relations .......... 352
12.4.2.5 Inherently Used Ontology .......... 353
12.5 Geographical Classification with Multimedia Data .......... 353
12.5.1 Data Modalities .......... 353
12.5.2 Challenges in Geographical Classification .......... 354
12.5.3 Geo-Classification for Images .......... 355
12.5.3.1 Classifiers .......... 356
12.5.4 Geo-Classification for Web Videos .......... 356
12.6 Conclusion .......... 356
13 Time Series Data Classification 365
Dimitrios Kotsakos and Dimitrios Gunopulos
13.1 Introduction .......... 365
13.2 Time Series Representation .......... 367
13.3 Distance Measures .......... 367
13.3.1 Lp-Norms .......... 367
13.3.2 Dynamic Time Warping (DTW) .......... 367
13.3.3 Edit Distance .......... 368
13.3.4 Longest Common Subsequence (LCSS) .......... 369
13.4 k-NN .......... 369
13.4.1 Speeding up the k-NN .......... 370
13.5 Support Vector Machines (SVMs) .......... 371
13.6 Classification Trees .......... 372
13.7 Model-Based Classification .......... 374
13.8 Distributed Time Series Classification .......... 375
13.9 Conclusion .......... 375
14 Discrete Sequence Classification 379
Mohammad Al Hasan
14.1 Introduction .......... 379
14.2 Background .......... 380
14.2.1 Sequence .......... 380
14.2.2 Sequence Classification .......... 381
14.2.3 Frequent Sequential Patterns .......... 381
14.2.4 n-Grams .......... 382
14.3 Sequence Classification Methods .......... 382
14.4 Feature-Based Classification .......... 382
14.4.1 Filtering Method for Sequential Feature Selection .......... 383
14.4.2 Pattern Mining Framework for Mining Sequential Features .......... 385
14.4.3 A Wrapper-Based Method for Mining Sequential Features .......... 386
14.5 Distance-Based Methods .......... 386
14.5.0.1 Alignment-Based Distance .......... 387
14.5.0.2 Keyword-Based Distance .......... 388
14.5.0.3 Kernel-Based Similarity .......... 388
14.5.0.4 Model-Based Similarity .......... 388
14.5.0.5 Time Series Distance Metrics .......... 388
14.6 Model-Based Method .......... 389
14.7 Hybrid Methods .......... 390
14.8 Non-Traditional Sequence Classification .......... 391
14.8.1 Semi-Supervised Sequence Classification .......... 391
14.8.2 Classification of Label Sequences .......... 392
14.8.3 Classification of Sequence of Vector Data .......... 392
14.9 Conclusions .......... 393
15 Collective Classification of Network Data 399
Ben London and Lise Getoor
15.1 Introduction .......... 399
15.2 Collective Classification Problem Definition .......... 400
15.2.1 Inductive vs. Transductive Learning .......... 401
15.2.2 Active Collective Classification .......... 402
15.3 Iterative Methods .......... 402
15.3.1 Label Propagation .......... 402
15.3.2 Iterative Classification Algorithms .......... 404
15.4 Graph-Based Regularization .......... 405
15.5 Probabilistic Graphical Models .......... 406
15.5.1 Directed Models .......... 406
15.5.2 Undirected Models .......... 408
15.5.3 Approximate Inference in Graphical Models .......... 409
15.5.3.1 Gibbs Sampling .......... 409
15.5.3.2 Loopy Belief Propagation (LBP) .......... 410
15.6 Feature Construction .......... 410
15.6.1 Data Graph .......... 411
15.6.2 Relational Features .......... 412
15.7 Applications of Collective Classification .......... 412
15.8 Conclusion .......... 413
16 Uncertain Data Classification 417
Reynold Cheng, Yixiang Fang, and Matthias Renz
16.1 Introduction .......... 417
16.2 Preliminaries .......... 419
16.2.1 Data Uncertainty Models .......... 419
16.2.2 Classification Framework .......... 419
16.3 Classification Algorithms .......... 420
16.3.1 Decision Trees .......... 420
16.3.2 Rule-Based Classification .......... 424
16.3.3 Associative Classification .......... 426
16.3.4 Density-Based Classification .......... 429
16.3.5 Nearest Neighbor-Based Classification .......... 432
16.3.6 Support Vector Classification .......... 436
16.3.7 Naive Bayes Classification .......... 438
16.4 Conclusions .......... 441
17 Rare Class Learning 445
Charu C. Aggarwal
17.1 Introduction .......... 445
17.2 Rare Class Detection .......... 448
17.2.1 Cost Sensitive Learning .......... 449
17.2.1.1 MetaCost: A Relabeling Approach .......... 449
17.2.1.2 Weighting Methods .......... 450
17.2.1.3 Bayes Classifiers .......... 450
17.2.1.4 Proximity-Based Classifiers .......... 451
17.2.1.5 Rule-Based Classifiers .......... 451
17.2.1.6 Decision Trees .......... 451
17.2.1.7 SVM Classifier .......... 452
17.2.2 Adaptive Re-Sampling .......... 452
17.2.2.1 Relation between Weighting and Sampling .......... 453
17.2.2.2 Synthetic Over-Sampling: SMOTE .......... 453
17.2.2.3 One Class Learning with Positive Class .......... 453
17.2.2.4 Ensemble Techniques .......... 454
17.2.3 Boosting Methods .......... 454
17.3 The Semi-Supervised Scenario: Positive and Unlabeled Data .......... 455
17.3.1 Difficult Cases and One-Class Learning .......... 456
17.4 The Semi-Supervised Scenario: Novel Class Detection .......... 456
17.4.1 One Class Novelty Detection .......... 457
17.4.2 Combining Novel Class Detection with Rare Class Detection .......... 458
17.4.3 Online Novelty Detection .......... 458
17.5 Human Supervision .......... 459
17.6 Other Work .......... 461
17.7 Conclusions and Summary .......... 462
18 Distance Metric Learning for Data Classification 469
Fei Wang
18.1 Introduction .......... 469
18.2 The Definition of Distance Metric Learning .......... 470
18.3 Supervised Distance Metric Learning Algorithms .......... 471
18.3.1 Linear Discriminant Analysis (LDA) .......... 472
18.3.2 Margin Maximizing Discriminant Analysis (MMDA) .......... 473
18.3.3 Learning with Side Information (LSI) .......... 474
18.3.4 Relevant Component Analysis (RCA) .......... 474
18.3.5 Information Theoretic Metric Learning (ITML) .......... 475
18.3.6 Neighborhood Component Analysis (NCA) .......... 475
18.3.7 Average Neighborhood Margin Maximization (ANMM) .......... 476
18.3.8 Large Margin Nearest Neighbor Classifier (LMNN) .......... 476
18.4 Advanced Topics .......... 477
18.4.1 Semi-Supervised Metric Learning .......... 477
18.4.1.1 Laplacian Regularized Metric Learning (LRML) .......... 477
18.4.1.2 Constraint Margin Maximization (CMM) .......... 478
18.4.2 Online Learning .......... 478
18.4.2.1 Pseudo-Metric Online Learning Algorithm (POLA) .......... 479
18.4.2.2 Online Information Theoretic Metric Learning (OITML) .......... 480
18.5 Conclusions and Discussions .......... 480
19 Ensemble Learning 483
Yaliang Li, Jing Gao, Qi Li, and Wei Fan
19.1 Introduction .......... 484
19.2 Bayesian Methods .......... 487
19.2.1 Bayes Optimal Classifier .......... 487
19.2.2 Bayesian Model Averaging .......... 488
19.2.3 Bayesian Model Combination .......... 490
19.3 Bagging .......... 491
19.3.1 General Idea .......... 491
19.3.2 Random Forest .......... 493
19.4 Boosting .......... 495
19.4.1 General Boosting Procedure .......... 495
19.4.2 AdaBoost .......... 496
19.5 Stacking .......... 498
19.5.1 General Stacking Procedure .......... 498
19.5.2 Stacking and Cross-Validation .......... 500
19.5.3 Discussions .......... 501
19.6 Recent Advances in Ensemble Learning .......... 502
19.7 Conclusions .......... 503

20 Semi-Supervised Learning 511
Kaushik Sinha
20.1 Introduction .......... 511
20.1.1 Transductive vs. Inductive Semi-Supervised Learning .......... 514
20.1.2 Semi-Supervised Learning Framework and Assumptions .......... 514
20.2 Generative Models .......... 515
20.2.1 Algorithms .......... 516
20.2.2 Description of a Representative Algorithm .......... 516
20.2.3 Theoretical Justification and Relevant Results .......... 517
20.3 Co-Training .......... 519
20.3.1 Algorithms .......... 520
20.3.2 Description of a Representative Algorithm .......... 520
20.3.3 Theoretical Justification and Relevant Results .......... 520
20.4 Graph-Based Methods .......... 522
20.4.1 Algorithms .......... 522
20.4.1.1 Graph Cut .......... 522
20.4.1.2 Graph Transduction .......... 523
20.4.1.3 Manifold Regularization .......... 524
20.4.1.4 Random Walk .......... 525
20.4.1.5 Large Scale Learning .......... 526
20.4.2 Description of a Representative Algorithm .......... 526
20.4.3 Theoretical Justification and Relevant Results .......... 527
20.5 Semi-Supervised Learning Methods Based on Cluster Assumption .......... 528
20.5.1 Algorithms .......... 528
20.5.2 Description of a Representative Algorithm .......... 529
20.5.3 Theoretical Justification and Relevant Results .......... 529
20.6 Related Areas .......... 531
20.7 Concluding Remarks .......... 531
21 Transfer Learning 537
Sinno Jialin Pan
21.1 Introduction .......... 538
21.2 Transfer Learning Overview .......... 541
21.2.1 Background .......... 541
21.2.2 Notations and Definitions .......... 541
21.3 Homogenous Transfer Learning .......... 542
21.3.1 Instance-Based Approach .......... 542
21.3.1.1 Case I: No Target Labeled Data .......... 543
21.3.1.2 Case II: A Few Target Labeled Data .......... 544
21.3.2 Feature-Representation-Based Approach .......... 545
21.3.2.1 Encoding Specific Knowledge for Feature Learning .......... 545
21.3.2.2 Learning Features by Minimizing Distance between Distributions .......... 548
21.3.2.3 Learning Features Inspired by Multi-Task Learning .......... 549
21.3.2.4 Learning Features Inspired by Self-Taught Learning .......... 550
21.3.2.5 Other Feature Learning Approaches .......... 550
21.3.3 Model-Parameter-Based Approach .......... 550
21.3.4 Relational-Information-Based Approaches .......... 552
21.4 Heterogeneous Transfer Learning .......... 553
21.4.1 Heterogeneous Feature Spaces .......... 553
21.4.2 Different Label Spaces .......... 554
21.5 Transfer Bounds and Negative Transfer .......... 554
21.6 Other Research Issues .......... 555
21.6.1 Binary Classification vs. Multi-Class Classification .......... 556
21.6.2 Knowledge Transfer from Multiple Source Domains .......... 556
21.6.3 Transfer Learning Meets Active Learning .......... 556
21.7 Applications of Transfer Learning .......... 557
21.7.1 NLP Applications .......... 557
21.7.2 Web-Based Applications .......... 557
21.7.3 Sensor-Based Applications .......... 557
21.7.4 Applications to Computer Vision .......... 557
21.7.5 Applications to Bioinformatics .......... 557
21.7.6 Other Applications .......... 558
21.8 Concluding Remarks .......... 558
22 Active Learning: A Survey 571
Charu C. Aggarwal, Xiangnan Kong, Quanquan Gu, Jiawei Han, and Philip S. Yu
22.1 Introduction .......... 572
22.2 Motivation and Comparisons to Other Strategies .......... 574
22.2.1 Comparison with Other Forms of Human Feedback .......... 575
22.2.2 Comparisons with Semi-Supervised and Transfer Learning .......... 576
22.3 Querying Strategies .......... 576
22.3.1 Heterogeneity-Based Models .......... 577
22.3.1.1 Uncertainty Sampling .......... 577
22.3.1.2 Query-by-Committee .......... 578
22.3.1.3 Expected Model Change .......... 578
22.3.2 Performance-Based Models .......... 579
22.3.2.1 Expected Error Reduction .......... 579
22.3.2.2 Expected Variance Reduction .......... 580
22.3.3 Representativeness-Based Models .......... 580
22.3.4 Hybrid Models .......... 580
22.4 Active Learning with Theoretical Guarantees .......... 581
22.4.1 A Simple Example .......... 581
22.4.2 Existing Works .......... 582
22.4.3 Preliminaries .......... 582
22.4.4 Importance Weighted Active Learning .......... 582
22.4.4.1 Algorithm .......... 583
22.4.4.2 Consistency .......... 583
22.4.4.3 Label Complexity .......... 584
22.5 Dependency-Oriented Data Types for Active Learning .......... 585
22.5.1 Active Learning in Sequences .......... 585
22.5.2 Active Learning in Graphs .......... 585
22.5.2.1 Classification of Many Small Graphs .......... 586
22.5.2.2 Node Classification in a Single Large Graph .......... 587
22.6 Advanced Methods .......... 589
22.6.1 Active Learning of Features .......... 589
22.6.2 Active Learning of Kernels .......... 590
22.6.3 Active Learning of Classes .......... 591
22.6.4 Streaming Active Learning .......... 591
22.6.5 Multi-Instance Active Learning .......... 592
22.6.6 Multi-Label Active Learning .......... 593
22.6.7 Multi-Task Active Learning .......... 593
22.6.8 Multi-View Active Learning .......... 594
22.6.9 Multi-Oracle Active Learning .......... 594
22.6.10 Multi-Objective Active Learning .......... 595
22.6.11 Variable Labeling Costs .......... 596
22.6.12 Active Transfer Learning .......... 596
22.6.13 Active Reinforcement Learning .......... 597
22.7 Conclusions .......... 597
23 Visual Classification 607
Giorgio Maria Di Nunzio
23.1 Introduction .......... 608
23.1.1 Requirements for Visual Classification .......... 609
23.1.2 Visualization Metaphors .......... 610
23.1.2.1 2D and 3D Spaces .......... 610
23.1.2.2 More Complex Metaphors .......... 610
23.1.3 Challenges in Visual Classification .......... 611
23.1.4 Related Works .......... 611
23.2 Approaches .......... 612
23.2.1 Nomograms .......... 612
23.2.1.1 Naïve Bayes Nomogram .......... 613
23.2.2 Parallel Coordinates .......... 613
23.2.2.1 Edge Cluttering .......... 614
23.2.3 Radial Visualizations .......... 614
23.2.3.1 Star Coordinates .......... 615
23.2.4 Scatter Plots .......... 616
23.2.4.1 Clustering .......... 617
23.2.4.2 Naïve Bayes Classification .......... 617
23.2.5 Topological Maps .......... 619
23.2.5.1 Self-Organizing Maps .......... 619
23.2.5.2 Generative Topographic Mapping .......... 619
23.2.6 Trees .......... 620
23.2.6.1 Decision Trees .......... 621
23.2.6.2 Treemap .......... 622
23.2.6.3 Hyperbolic Tree .......... 623
23.2.6.4 Phylogenetic Trees .......... 623
23.3 Systems .......... 623
23.3.1 EnsembleMatrix and ManiMatrix .......... 623
23.3.2 Systematic Mapping .......... 624
23.3.3 iVisClassifier .......... 624
23.3.4 ParallelTopics .......... 625
23.3.5 VisBricks .......... 625
23.3.6 WHIDE .......... 625
23.3.7 Text Document Retrieval .......... 625
23.4 Summary and Conclusions .......... 626
24 Evaluation of Classification Methods 633
Nele Verbiest, Karel Vermeulen, and Ankur Teredesai
24.1 Introduction .......... 633
24.2 Validation Schemes .......... 634
24.3 Evaluation Measures .......... 636
24.3.1 Accuracy Related Measures .......... 636
24.3.1.1 Discrete Classifiers .......... 636
24.3.1.2 Probabilistic Classifiers .......... 638
24.3.2 Additional Measures .......... 642
24.4 Comparing Classifiers .......... 643
24.4.1 Parametric Statistical Comparisons .......... 644
24.4.1.1 Pairwise Comparisons .......... 644
24.4.1.2 Multiple Comparisons .......... 644
24.4.2 Non-Parametric Statistical Comparisons .......... 646
24.4.2.1 Pairwise Comparisons .......... 646
24.4.2.2 Multiple Comparisons .......... 647
24.4.2.3 Permutation Tests .......... 651
24.5 Concluding Remarks .......... 652
25 Educational and Software Resources for Data Classification 657
Charu C. Aggarwal
25.1 Introduction .......... 657
25.2 Educational Resources .......... 658
25.2.1 Books on Data Classification .......... 658
25.2.2 Popular Survey Papers on Data Classification .......... 658
25.3 Software for Data Classification .......... 659
25.3.1 Data Benchmarks for Software and Research .......... 660
25.4 Summary .......... 661
Index 667

Editor Biography
Charu C. Aggarwal is a Research Scientist at the IBM T. J. Watson Research Center in Yorktown Heights, New York. He completed his B.S. from IIT Kanpur in 1993 and his Ph.D. from the Massachusetts Institute of Technology in 1996. His research interest during his Ph.D. years was in combinatorial optimization (network flow algorithms), and his thesis advisor was Professor James B. Orlin. He has since worked in the field of performance analysis, databases, and data mining. He has published over 200 papers in refereed conferences and journals, and has applied for or been granted over 80 patents. He is author or editor of ten books. Because of the commercial value of the aforementioned patents, he has received several invention achievement awards and has thrice been designated a Master Inventor at IBM. He is a recipient of an IBM Corporate Award (2003) for his work on bio-terrorist threat detection in data streams, a recipient of the IBM Outstanding Innovation Award (2008) for his scientific contributions to privacy technology, a recipient of the IBM Outstanding Technical Achievement Award (2009) for his work on data streams, and a recipient of an IBM Research Division Award (2008) for his contributions to System S. He also received the EDBT 2014 Test of Time Award for his work on condensation-based privacy-preserving data mining.

He served as an associate editor of the IEEE Transactions on Knowledge and Data Engineering from 2004 to 2008. He is an associate editor of the ACM Transactions on Knowledge Discovery and Data Mining, an action editor of the Data Mining and Knowledge Discovery Journal, editor-in-chief of the ACM SIGKDD Explorations, and an associate editor of the Knowledge and Information Systems Journal. He serves on the advisory board of the Lecture Notes on Social Networks, a publication by Springer. He serves as the vice-president of the SIAM Activity Group on Data Mining, which is responsible for all data mining activities organized by SIAM, including their main data mining conference. He is a fellow of the IEEE and the ACM, for “contributions to knowledge discovery and data mining algorithms.”

Contributors
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, New York
Salem Alelyani
Arizona State University
Tempe, Arizona
Mohammad Al Hasan
Indiana University – Purdue University
Indianapolis, Indiana
Alain Biem
IBM T. J. Watson Research Center
Yorktown Heights, New York
Shiyu Chang
University of Illinois at Urbana-Champaign
Urbana, Illinois
Yi Chang
Yahoo! Labs
Sunnyvale, California
Reynold Cheng
The University of Hong Kong
Hong Kong
Hongbo Deng
Yahoo! Research
Sunnyvale, California
Giorgio Maria Di Nunzio
University of Padua
Padova, Italy
Wei Fan
Huawei Noah’s Ark Lab
Hong Kong
Yixiang Fang
The University of Hong Kong
Hong Kong
Jing Gao
State University of New York at Buffalo
Buffalo, New York
Quanquan Gu
University of Illinois at Urbana-Champaign
Urbana, Illinois
Dimitrios Gunopulos
University of Athens
Athens, Greece
Jiawei Han
University of Illinois at Urbana-Champaign
Urbana, Illinois
Wei Han
University of Illinois at Urbana-Champaign
Urbana, Illinois
Thomas S. Huang
University of Illinois at Urbana-Champaign
Urbana, Illinois
Ruoming Jin
Kent State University
Kent, Ohio
Xiangnan Kong
University of Illinois at Chicago
Chicago, Illinois
Dimitrios Kotsakos
University of Athens
Athens, Greece
Victor E. Lee
John Carroll University
University Heights, Ohio
Qi Li
State University of New York at Buffalo
Buffalo, New York

Xiao-Li Li
Institute for Infocomm Research
Singapore
Yaliang Li
State University of New York at Buffalo
Buffalo, New York
Bing Liu
University of Illinois at Chicago
Chicago, Illinois
Huan Liu
Arizona State University
Tempe, Arizona
Lin Liu
Kent State University
Kent, Ohio
Xianming Liu
University of Illinois at Urbana-Champaign
Urbana, Illinois
Ben London
University of Maryland
College Park, Maryland
Sinno Jialin Pan
Institute for Infocomm Research
Singapore
Pooya Khorrami
University of Illinois at Urbana-Champaign
Urbana, Illinois
Chih-Jen Lin
National Taiwan University
Taipei, Taiwan
Matthias Renz
University of Munich
Munich, Germany
Kaushik Sinha
Wichita State University
Wichita, Kansas
Yizhou Sun
Northeastern University
Boston, Massachusetts
Jiliang Tang
Arizona State University
Tempe, Arizona
Ankur Teredesai
University of Washington
Tacoma, Washington
Hanghang Tong
City University of New York
New York, New York
Nele Verbiest
Ghent University
Belgium
Karel Vermeulen
Ghent University
Belgium
Fei Wang
IBM T. J. Watson Research Center
Yorktown Heights, New York
Po-Wei Wang
National Taiwan University
Taipei, Taiwan
Ning Xu
University of Illinois at Urbana-Champaign
Urbana, Illinois
Philip S. Yu
University of Illinois at Chicago
Chicago, Illinois
ChengXiang Zhai
University of Illinois at Urbana-Champaign
Urbana, Illinois

Preface
The problem of classification is perhaps one of the most widely studied in the data mining and machine learning communities. This problem has been studied by researchers from several disciplines over several decades. Applications of classification include a wide variety of problem domains such as text, multimedia, social networks, and biological data. Furthermore, the problem may be encountered in a number of different scenarios such as streaming or uncertain data. Classification is a rather diverse topic, and the underlying algorithms depend greatly on the data domain and problem scenario.

Therefore, this book will focus on three primary aspects of data classification. The first set of chapters will focus on the core methods for data classification. These include methods such as probabilistic classification, decision trees, rule-based methods, instance-based techniques, SVM methods, and neural networks. The second set of chapters will focus on different problem domains and scenarios such as multimedia data, text data, time-series data, network data, data streams, and uncertain data. The third set of chapters will focus on different variations of the classification problem such as ensemble methods, visual methods, transfer learning, semi-supervised methods, and active learning. These are advanced methods, which can be used to enhance the quality of the underlying classification results.

The classification problem has been addressed by a number of different communities such as pattern recognition, databases, data mining, and machine learning. In some cases, the work by the different communities tends to be fragmented, and has not been addressed in a unified way. This book will make a conscious effort to address the work of the different communities in a unified way. The book will start off with an overview of the basic methods in data classification, and then discuss progressively more refined and complex methods for data classification. Special attention will also be paid to more recent problem domains such as graphs and social networks.
The chapters in the book will be divided into three types:
•Method Chapters: These chapters discuss the key techniques that are commonly used for classification, such as probabilistic methods, decision trees, rule-based methods, instance-based methods, SVM techniques, and neural networks.
•Domain Chapters: These chapters discuss the specific methods used for different domains of data such as text data, multimedia data, time-series data, discrete sequence data, network data, and uncertain data. Many of these chapters can also be considered application chapters, because they explore the specific characteristics of the problem in a particular domain. Dedicated chapters are also devoted to large data sets and data streams, because of the recent importance of the big data paradigm.
•Variations and Insights: These chapters discuss the key variations on the classification process such as classification ensembles, rare-class learning, distance function learning, active learning, and visual learning. Many variations such as transfer learning and semi-supervised learning use side-information in order to enhance the classification results. A separate chapter is also devoted to evaluation aspects of classifiers.
This book is designed to be comprehensive in its coverage of the entire area of classification, and it is hoped that it will serve as a knowledgeable compendium to students and researchers.

Chapter 1
An Introduction to Data Classification
Charu C. Aggarwal
IBM T. J. Watson Research Center
Yorktown Heights, NY
charu@us.ibm.com
1.1 Introduction ……………………………………………………………. 2
1.2 Common Techniques in Data Classification …………………………………. 4
1.2.1 Feature Selection Methods ………………………………………… 4
1.2.2 Probabilistic Methods ……………………………………………. 6
1.2.3 Decision Trees ………………………………………………….. 7
1.2.4 Rule-Based Methods …………………………………………….. 9
1.2.5 Instance-Based Learning ………………………………………….. 11
1.2.6 SVM Classifiers …………………………………………………. 11
1.2.7 Neural Networks ………………………………………………… 14
1.3 Handling Different Data Types …………………………………………….. 16
1.3.1 Large Scale Data: Big Data and Data Streams ……………………….. 16
1.3.1.1 Data Streams ……………………………………….. 16
1.3.1.2 The Big Data Framework …………………………….. 17
1.3.2 Text Classification ……………………………………………….. 18
1.3.3 Multimedia Classification …………………………………………. 20
1.3.4 Time Series and Sequence Data Classification ……………………….. 20
1.3.5 Network Data Classification ………………………………………. 21
1.3.6 Uncertain Data Classification ……………………………………… 21
1.4 Variations on Data Classification ………………………………………….. 22
1.4.1 Rare Class Learning ……………………………………………… 22
1.4.2 Distance Function Learning ……………………………………….. 22
1.4.3 Ensemble Learning for Data Classification ………………………….. 23
1.4.4 Enhancing Classification Methods with Additional Data ………………. 24
1.4.4.1 Semi-Supervised Learning ……………………………. 24
1.4.4.2 Transfer Learning ……………………………………. 26
1.4.5 Incorporating Human Feedback ……………………………………. 27
1.4.5.1 Active Learning …………………………………….. 28
1.4.5.2 Visual Learning ……………………………………… 29
1.4.6 Evaluating Classification Algorithms ……………………………….. 30
1.5 Discussion and Conclusions ………………………………………………. 31
Bibliography …………………………………………………………… 31

1.1 Introduction
The problem of data classification arises in a wide variety of data mining applications. This is because the problem attempts to learn the relationship between a set of feature variables and a target variable of interest. Since many practical problems can be expressed as associations between feature and target variables, this provides a broad range of applicability of this model. The problem of classification may be stated as follows:
Given a set of training data points along with associated training labels, determine the class label for an unlabeled test instance.
Numerous variations of this problem can be defined over different settings. Excellent overviews
on data classification may be found in [39, 50, 63, 85]. Classification algorithms typically contain
two phases:
•Training Phase: In this phase, a model is constructed from the training instances.
•Testing Phase: In this phase, the model is used to assign a label to an unlabeled test instance.
In some cases, such as lazy learning, the training phase is omitted entirely, and the classification is performed directly from the relationship of the training instances to the test instance. Instance-based methods such as the nearest neighbor classifiers are examples of such a scenario. Even in such cases, a pre-processing phase such as a nearest neighbor index construction may be performed in order to ensure efficiency during the testing phase.
The output of a classification algorithm may be presented for a test instance in one of two ways:
1. Discrete Label: In this case, a label is returned for the test instance.
2. Numerical Score: In this case, a numerical score is returned for each class label and test instance combination. Note that the numerical score can be converted to a discrete label for a test instance, by picking the class with the highest score for that test instance. The advantage of a numerical score is that it now becomes possible to compare the relative propensity of different test instances to belong to a particular class of importance, and rank them if needed. Such methods are used often in rare class detection problems, where the original class distribution is highly imbalanced, and the discovery of some classes is more valuable than others.
The classification problem thus segments the unseen test instances into groups, as defined by the class label. While the segmentation of examples into groups is also done by clustering, there is a key difference between the two problems. In the case of clustering, the segmentation is done using similarities between the feature variables, with no prior understanding of the structure of the groups. In the case of classification, the segmentation is done on the basis of a training data set, which encodes knowledge about the structure of the groups in the form of a target variable. Thus, while the segmentations of the data are usually related to notions of similarity, as in clustering, significant deviations from the similarity-based segmentation may be achieved in practical settings. As a result, the classification problem is referred to as supervised learning, just as clustering is referred to as unsupervised learning. The supervision process often provides significant application-specific utility, because the class labels may represent important properties of interest.
Some common application domains in which the classification problem arises are as follows:
•Customer Target Marketing: Since the classification problem relates feature variables to
target classes, this method is extremely popular for the problem of customer target marketing.

In such cases, feature variables describing the customer may be used to predict their buying interests on the basis of previous training examples. The target variable may encode the buying interest of the customer.
•Medical Disease Diagnosis: In recent years, the use of data mining methods in medical technology has gained increasing traction. The features may be extracted from the medical records, and the class labels correspond to whether or not a patient may contract a disease in the future. In these cases, it is desirable to make disease predictions with the use of such information.
•Supervised Event Detection: In many temporal scenarios, class labels may be associated with time stamps corresponding to unusual events. For example, an intrusion activity may be represented as a class label. In such cases, time-series classification methods can be very useful.
•Multimedia Data Analysis: It is often desirable to perform classification of large volumes of multimedia data such as photos, videos, audio, or other more complex multimedia data. Multimedia data analysis can often be challenging, because of the complexity of the underlying feature space and the semantic gap between the feature values and corresponding inferences.
•Biological Data Analysis: Biological data is often represented as discrete sequences, in which it is desirable to predict the properties of particular sequences. In some cases, the biological data is also expressed in the form of networks. Therefore, classification methods can be applied in a variety of different ways in this scenario.
•Document Categorization and Filtering: Many applications, such as newswire services, require the classification of large numbers of documents in real time. This application is referred to as document categorization, and is an important area of research in its own right.
•Social Network Analysis: Many forms of social network analysis, such as collective classification, associate labels with the underlying nodes. These are then used in order to predict the labels of other nodes. Such applications are very useful for predicting important properties of actors in a social network.
The diversity of problems that can be addressed by classification algorithms is significant, and covers many domains. It is impossible to exhaustively discuss all such applications in either a single chapter or book. Therefore, this book will organize the area of classification into key topics of interest. The work in the data classification area typically falls into a number of broad categories:
•Technique-centered: The problem of data classification can be solved using numerous classes of techniques such as decision trees, rule-based methods, neural networks, SVM methods, nearest neighbor methods, and probabilistic methods. This book will cover the most popular classification methods in the literature comprehensively.
•Data-Type Centered: Many different data types are created by different applications. Some examples of different data types include text, multimedia, uncertain data, time series, discrete sequence, and network data. Each of these data types requires the design of dedicated techniques, which can differ considerably from one another.
•Variations on Classification Analysis: Numerous variations on the standard classification problem exist, which deal with more challenging scenarios such as rare class learning, transfer learning, semi-supervised learning, or active learning. Alternatively, different variations of classification, such as ensemble analysis, can be used in order to improve the effectiveness of classification algorithms. These issues are of course closely related to issues of model evaluation. All these issues will be discussed extensively in this book.

This chapter will discuss each of these issues in detail, and will also discuss how the organization of the book relates to these different areas of data classification. The chapter is organized as follows. The next section discusses the common techniques that are used for data classification. Section 1.3 explores the use of different data types in the classification process. Section 1.4 discusses the different variations of data classification. Section 1.5 presents the conclusions and summary.
1.2 Common Techniques in Data Classification
In this section, the different methods that are commonly used for data classification will be discussed. These methods will also be associated with the different chapters in this book. It should be pointed out that these methods represent the most common techniques used for data classification, and it is difficult to comprehensively discuss all the methods in a single book. The most common methods used in data classification are decision trees, rule-based methods, probabilistic methods, SVM methods, instance-based methods, and neural networks. Each of these methods will be discussed briefly in this chapter, and all of them will be covered comprehensively in the different chapters of this book.
1.2.1 Feature Selection Methods
The first phase of virtually all classification algorithms is that of feature selection. In most data mining scenarios, a wide variety of features are collected by individuals who are often not domain experts. Clearly, the irrelevant features may often result in poor modeling, since they are not well related to the class label. In fact, such features will typically worsen the classification accuracy because of overfitting, when the training data set is small and such features are allowed to be a part of the training model. For example, consider a medical example where the features from the blood work of different patients are used to predict a particular disease. Clearly, a feature such as the Cholesterol level is predictive of heart disease, whereas a feature¹ such as the PSA level is not predictive of heart disease. However, if a small training data set is used, the PSA level may have freak correlations with heart disease because of random variations. While the impact of a single variable may be small, the cumulative effect of many irrelevant features can be significant. This will result in a training model that generalizes poorly to unseen test instances. Therefore, it is critical to use the correct features during the training process.
There are two broad kinds of feature selection methods:
1. Filter Models: In these cases, a crisp criterion on a single feature, or a subset of features, is used to evaluate their suitability for classification. This method is independent of the specific algorithm being used.
2. Wrapper Models: In these cases, the feature selection process is embedded into a classification algorithm, in order to make the feature selection process sensitive to the classification algorithm. This approach recognizes the fact that different algorithms may work better with different features.
In order to perform feature selection with filter models, a number of different measures are used in order to quantify the relevance of a feature to the classification process. Typically, these measures compute the imbalance of the feature values over different ranges of the attribute, which may either be discrete or numerical. Some examples are as follows:
¹This feature is used to measure prostate cancer in men.

•Gini Index: Let p_1 ... p_k be the fractions of records belonging to the k different classes, among the records that take on a particular value of the discrete attribute. Then, the gini-index of that value of the discrete attribute is given by:

G = 1 - \sum_{i=1}^{k} p_i^2      (1.1)

The value of G ranges between 0 and 1 - 1/k. Smaller values are more indicative of class imbalance. This indicates that the feature value is more discriminative for classification. The overall gini-index for the attribute can be measured by weighted averaging over different values of the discrete attribute, or by using the maximum gini-index over any of the different discrete values. Different strategies may be more desirable for different scenarios, though the weighted average is more commonly used.
•Entropy: The entropy of a particular value of the discrete attribute is measured as follows:

E = -\sum_{i=1}^{k} p_i \cdot \log(p_i)      (1.2)

The same notations are used above as for the case of the gini-index. The value of the entropy lies between 0 and log(k), with smaller values being more indicative of class skew.
•Fisher’s Index: The Fisher’s index measures the ratio of the between-class scatter to the within-class scatter. Therefore, if p_j is the fraction of training examples belonging to class j, \mu_j is the mean of a particular feature for class j, \mu is the global mean for that feature, and \sigma_j is the standard deviation of that feature for class j, then the Fisher score F can be computed as follows:

F = \frac{\sum_{j=1}^{k} p_j \cdot (\mu_j - \mu)^2}{\sum_{j=1}^{k} p_j \cdot \sigma_j^2}      (1.3)
A wide variety of other measures, such as the χ²-statistic and mutual information, are also available in order to quantify the discriminative power of attributes. An approach known as the Fisher’s discriminant [61] is also used in order to combine the different features into directions in the data that are highly relevant to classification. Such methods are of course feature transformation methods, which are also closely related to feature selection methods, just as unsupervised dimensionality reduction methods are related to unsupervised feature selection methods.
The Fisher’s discriminant will be explained below for the two-class problem. Let \bar{\mu}_0 and \bar{\mu}_1 be the d-dimensional row vectors representing the means of the records in the two classes, and let \Sigma_0 and \Sigma_1 be the corresponding d \times d covariance matrices, in which the (i, j)th entry represents the covariance between dimensions i and j for that class. Then, the equivalent Fisher score FS(\bar{V}) for a d-dimensional row vector \bar{V} may be written as follows:

FS(\bar{V}) = \frac{(\bar{V} \cdot (\bar{\mu}_0 - \bar{\mu}_1))^2}{\bar{V} (p_0 \cdot \Sigma_0 + p_1 \cdot \Sigma_1) \bar{V}^T}      (1.4)
This is a generalization of the axis-parallel score in Equation 1.3 to an arbitrary direction \bar{V}. The goal is to determine a direction \bar{V} that maximizes the Fisher score. It can be shown that the optimal direction \bar{V}^* may be determined by solving a generalized eigenvalue problem, and is given by the following expression:

\bar{V}^* = (p_0 \cdot \Sigma_0 + p_1 \cdot \Sigma_1)^{-1} (\bar{\mu}_0 - \bar{\mu}_1)^T      (1.5)

If desired, successively orthogonal directions may be determined by iteratively projecting the data onto the residual subspace, after determining the optimal directions one by one.
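To make Equation 1.5 concrete, the following minimal NumPy sketch (not from the original text; the function and variable names are illustrative, and sample covariance is used as an approximation) computes the optimal Fisher direction for a two-class data set:

import numpy as np

def fisher_direction(X, y):
    """Compute the Fisher discriminant direction of Equation 1.5 for binary labels y in {0, 1}."""
    X0, X1 = X[y == 0], X[y == 1]
    p0, p1 = len(X0) / len(X), len(X1) / len(X)      # class priors
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)      # class means
    S0 = np.cov(X0, rowvar=False)                    # class covariance matrices
    S1 = np.cov(X1, rowvar=False)
    Sw = p0 * S0 + p1 * S1                           # weighted within-class scatter
    v = np.linalg.solve(Sw, mu0 - mu1)               # V* = Sw^{-1} (mu0 - mu1)
    return v / np.linalg.norm(v)                     # normalize the direction

# Example usage on synthetic two-class data
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(3, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
print(fisher_direction(X, y))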

More generally, it should be pointed out that many features are often closely correlated with one another, and the additional utility of an attribute, once a certain set of features has already been selected, is different from its standalone utility. In order to address this issue, the Minimum Redundancy Maximum Relevance approach was proposed in [69], in which features are incrementally selected on the basis of their incremental gain on adding them to the feature set. Note that this method is also a filter model, since the evaluation is on a subset of features, and a crisp criterion is used to evaluate the subset.
In wrapper models, the feature selection phase is embedded into an iterative approach with a classification algorithm. In each iteration, the classification algorithm evaluates a particular set of features. This set of features is then augmented using a particular (e.g., greedy) strategy, and tested to see if the quality of the classification improves. Since the classification algorithm is used for evaluation, this approach will generally create a feature set that is sensitive to the classification algorithm. This approach has been found to be useful in practice, because of the wide diversity of models in data classification. For example, an SVM would tend to prefer features in which the two classes separate out using a linear model, whereas a nearest neighbor classifier would prefer features in which the different classes are clustered into spherical regions. A good survey on feature selection methods may be found in [59]. Feature selection methods are discussed in detail in Chapter 2.
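As a hedged illustration of the wrapper strategy described above, the following sketch greedily adds the feature that most improves cross-validated accuracy. It uses scikit-learn and a decision tree as the wrapped classifier purely as an example; the function name and stopping rule are assumptions, not part of any specific method from the chapter.

import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

def greedy_wrapper_selection(X, y, max_features=5):
    """Greedy forward feature selection; a classifier evaluates each candidate subset."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = []
        for f in remaining:
            cols = selected + [f]
            score = cross_val_score(DecisionTreeClassifier(random_state=0),
                                    X[:, cols], y, cv=3).mean()
            scores.append((score, f))
        score, f = max(scores)
        if score <= best_score:        # stop when no feature improves accuracy
            break
        best_score = score
        selected.append(f)
        remaining.remove(f)
    return selected

# Example usage on synthetic data with a few informative features
X, y = make_classification(n_samples=200, n_features=8, n_informative=3, random_state=0)
print(greedy_wrapper_selection(X, y))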
1.2.2 Probabilistic Methods
Probabilistic methods are the most fundamental among all data classification methods. Probabilistic classification algorithms use statistical inference to find the best class for a given example. In addition to simply assigning the best class like other classification algorithms, probabilistic classification algorithms will output a corresponding posterior probability of the test instance being a member of each of the possible classes. The posterior probability is defined as the probability after observing the specific characteristics of the test instance. On the other hand, the prior probability is simply the fraction of training records belonging to each particular class, with no knowledge of the test instance. After obtaining the posterior probabilities, we use decision theory to determine class membership for each new instance. Basically, there are two ways in which we can estimate the posterior probabilities.
In the first case, the posterior probability of a particular class is estimated by determining the class-conditional probability and the prior class probability separately, and then applying Bayes’ theorem to compute the posterior. The most well known among these is the Bayes classifier, which is known as a generative model. For ease in discussion, we will assume discrete feature values, though the approach can easily be applied to numerical attributes with the use of discretization methods. Consider a test instance with d different features, which have values X = ⟨x_1 ... x_d⟩ respectively. It is desirable to determine the posterior probability that the class Y(T) of the test instance T is i. In other words, we wish to determine the posterior probability P(Y(T) = i | x_1 ... x_d). Then, the Bayes rule can be used in order to derive the following:

P(Y(T) = i | x_1 \ldots x_d) = \frac{P(Y(T) = i) \cdot P(x_1 \ldots x_d | Y(T) = i)}{P(x_1 \ldots x_d)}      (1.6)
Since the denominator is constant across all classes, and one only needs to determine the class with the maximum posterior probability, one can approximate the aforementioned expression as follows:

P(Y(T) = i | x_1 \ldots x_d) \propto P(Y(T) = i) \cdot P(x_1 \ldots x_d | Y(T) = i)      (1.7)
The key here is that the expression on the right can be evaluated more easily in a data-driven way, as long as the naive Bayes assumption is used for simplification. Specifically, in Equation 1.7, the expression P(x_1 ... x_d | Y(T) = i) can be expressed as the product of the feature-wise conditional probabilities:

P(x_1 \ldots x_d | Y(T) = i) = \prod_{j=1}^{d} P(x_j | Y(T) = i)      (1.8)
This is referred to as conditional independence, and therefore the Bayes method is referred to as “naive.” This simplification is crucial, because these individual probabilities can be estimated from the training data in a more robust way. The naive Bayes assumption is crucial in providing the ability to perform the product-wise simplification. The term P(x_j | Y(T) = i) is computed as the fraction of the records in the portion of the training data corresponding to the ith class, which contains feature value x_j for the jth attribute. If desired, Laplacian smoothing can be used in cases when enough data is not available to estimate these values robustly. This is quite often the case, when a small amount of training data may contain few or no training records containing a particular feature value. The Bayes rule has been used quite successfully in the context of a wide variety of applications, and is particularly popular in the context of text classification. In spite of the naive independence assumption, the Bayes model seems to be quite effective in practice. A detailed discussion of the naive assumption in the context of the effectiveness of the Bayes classifier may be found in [38].
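A minimal sketch of the generative approach in Equations 1.7 and 1.8, assuming categorical features and Laplacian smoothing; the function names, toy data, and helper structures are illustrative assumptions rather than material from the chapter:

from collections import Counter, defaultdict

def train_naive_bayes(records, labels):
    """Estimate class counts and per-attribute value counts for P(Y=i) and P(x_j | Y=i)."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # cond[(class, attribute j)][value] = count
    values = defaultdict(set)            # distinct values observed per attribute
    for x, y in zip(records, labels):
        for j, v in enumerate(x):
            cond[(y, j)][v] += 1
            values[j].add(v)
    return prior, cond, values, len(labels)

def predict_naive_bayes(x, model):
    """Pick the class maximizing P(Y=i) * prod_j P(x_j | Y=i), as in Equation 1.7."""
    prior, cond, values, n = model
    best_class, best_score = None, -1.0
    for y, count in prior.items():
        score = count / n
        for j, v in enumerate(x):
            # Laplacian smoothing to handle unseen feature values
            score *= (cond[(y, j)][v] + 1) / (count + len(values[j]))
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Example usage on toy categorical data
records = [("high", "yes"), ("high", "no"), ("low", "yes"), ("low", "no")]
labels = ["risk", "risk", "safe", "safe"]
model = train_naive_bayes(records, labels)
print(predict_naive_bayes(("high", "yes"), model))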
Another probabilistic approach is to directly model the posterior probability, by learning a discriminative function that maps an input feature vector directly onto a class label. This approach is often referred to as a discriminative model. Logistic regression is a popular discriminative classifier, and its goal is to directly estimate the posterior probability P(Y(T) = i | X) from the training data. Formally, the logistic regression model is defined as

P(Y(T) = i | X) = \frac{1}{1 + e^{-\theta^T X}},      (1.9)

where θ is the vector of parameters to be estimated. In general, maximum likelihood is used to determine the parameters of the logistic regression. To handle overfitting problems in logistic regression, regularization is introduced to penalize the log likelihood function for large values of θ. The logistic regression model has been extensively used in numerous disciplines, including the Web, and the medical and social science fields.
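For concreteness, the following sketch fits a regularized logistic regression model of the form in Equation 1.9 and reports posterior probabilities. It leans on scikit-learn rather than on any implementation discussed in the chapter, and the synthetic data and parameter values are assumptions:

import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data with two features
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# C is the inverse regularization strength; smaller C penalizes large theta more heavily
clf = LogisticRegression(C=1.0).fit(X, y)

test = np.array([[0.5, 0.0]])
print(clf.predict_proba(test))   # posterior probabilities for each class
print(clf.predict(test))         # discrete label obtained from the posteriors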
A variety of other probabilistic models are known in the literature, such as probabilistic graphical models and conditional random fields. An overview of probabilistic methods for data classification may be found in [20, 64]. Probabilistic methods for data classification are discussed in Chapter 3.
1.2.3 Decision Trees
Decision trees create a hierarchical partitioning of the data, which relates the different partitions at the leaf level to the different classes. The hierarchical partitioning at each level is created with the use of a split criterion. The split criterion may either use a condition (or predicate) on a single attribute, or it may contain a condition on multiple attributes. The former is referred to as a univariate split, whereas the latter is referred to as a multivariate split. The overall approach is to try to recursively split the training data so as to maximize the discrimination among the different classes over different nodes. The discrimination among the different classes is maximized when the level of skew among the different classes in a given node is maximized. A measure such as the gini-index or entropy is used in order to quantify this skew. For example, if p_1 ... p_k is the fraction of the records belonging to the k different classes in a node N, then the gini-index G(N) of the node N is defined as follows:

G(N) = 1 - \sum_{i=1}^{k} p_i^2      (1.10)
The value of G(N) lies between 0 and 1 - 1/k. The smaller the value of G(N), the greater the skew. In the cases where the classes are evenly balanced, the value is 1 - 1/k.

TABLE 1.1 : Training Data Snapshot Relating Cardiovascular Risk Based on Previous Events to Different Blood Parameters

Patient Name    CRP Level    Cholesterol    High Risk? (Class Label)
Mary            3.2          170            Y
Joe             0.9          273            N
Jack            2.5          213            Y
Jane            1.7          229            N
Tom             1.1          160            N
Peter           1.9          205            N
Elizabeth       8.1          160            Y
Lata            1.3          171            N
Daniela         4.5          133            Y
Eric            11.4         122            N
Michael         1.8          280            Y
An alternative measure is the entropy E(N):

E(N) = -\sum_{i=1}^{k} p_i \cdot \log(p_i)      (1.11)
The value of the entropy lies² between 0 and log(k). The value is log(k) when the records are perfectly balanced among the different classes. This corresponds to the scenario with maximum entropy. The smaller the entropy, the greater the skew in the data. Thus, the gini-index and entropy provide an effective way to evaluate the quality of a node in terms of its level of discrimination between the different classes.
While constructing the training model, the split is performed so as to minimize the weighted sum of the gini-index or entropy of the two nodes. This step is performed recursively, until a termination criterion is satisfied. The most obvious termination criterion is one where all data records in the node belong to the same class. More generally, the termination criterion requires either a minimum level of skew or purity, or a minimum number of records in the node in order to avoid overfitting. One problem in decision tree construction is that there is no way to predict the best time to stop decision tree growth, in order to prevent overfitting. Therefore, in many variations, the decision tree is pruned in order to remove nodes that may correspond to overfitting. There are different ways of pruning the decision tree. One way of pruning is to use a minimum description length principle in deciding when to prune a node from the tree. Another approach is to hold out a small portion of the training data during the decision tree growth phase. It is then tested to see whether replacing a subtree with a single node improves the classification accuracy on the hold-out set. If this is the case, then the pruning is performed. In the testing phase, a test instance is assigned to an appropriate path in the decision tree, based on the evaluation of the split criteria in a hierarchical decision process. The class label of the corresponding leaf node is reported as the relevant one.
Figure 1.1 provides an example of how the decision tree is constructed. Here, we have illustrated a case where the two measures (features) of the blood parameters of patients are used in order to assess the level of cardiovascular risk in the patient. The two measures are the C-Reactive Protein (CRP) level and the Cholesterol level, which are well known parameters related to cardiovascular risk. It is assumed that a training data set is available, which is already labeled into high risk and low risk patients, based on previous cardiovascular events such as myocardial infarctions or strokes. At the same time, it is assumed that the feature values of the blood parameters for these patients are available. A snapshot of this data is illustrated in Table 1.1. It is evident from the training data that higher CRP and Cholesterol levels correspond to greater risk, though it is possible to reach more definitive conclusions by combining the two.
²The value of the expression at p_i = 0 needs to be evaluated at the limit.
[Figure 1.1 (figure omitted): panel (a), Univariate Splits, shows a root split on CRP < 2 vs. CRP > 2, followed by Cholesterol splits (< 250 / > 250 and < 200 / > 200) leading to Normal and High Risk leaves; panel (b), Multivariate Splits, shows a single split on CRP + Chol/100 < 4 vs. > 4 leading to Normal and High Risk leaves.]
FIGURE 1.1 : Illustration of univariate and multivariate splits for decision tree construction.
An example of a decision tree that constructs the classification model on the basis of the two features is illustrated in Figure 1.1(a). This decision tree uses univariate splits, by first partitioning on the CRP level, and then using a split criterion on the Cholesterol level. Note that the Cholesterol split criteria in the two CRP branches of the tree are different. In principle, different features can be used to split different nodes at the same level of the tree. It is also sometimes possible to use conditions on multiple attributes in order to create more powerful splits at a particular level of the tree. An example is illustrated in Figure 1.1(b), where a linear combination of the two attributes provides a much more powerful split than a single attribute. The split condition is as follows:

CRP + Cholesterol/100 ≤ 4

Note that a single condition such as this is able to partition the training data very well into the two classes (with a few exceptions). Therefore, the split is more powerful in discriminating between the two classes in a smaller number of levels of the decision tree. Where possible, it is desirable to construct more compact decision trees in order to obtain the most accurate results. Such splits are referred to as multivariate splits. Some of the earliest methods for decision tree construction include C4.5 [72], ID3 [73], and CART [22]. A detailed discussion of decision trees may be found in [22, 65, 72, 73]. Decision trees are discussed in Chapter 4.
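As a hedged illustration of univariate splits in code, the sketch below trains a small CART-style tree (via scikit-learn, not any specific algorithm discussed above) on the snapshot from Table 1.1; the learned splits need not match those shown in Figure 1.1, and the depth limit is an arbitrary choice to keep the tree compact:

from sklearn.tree import DecisionTreeClassifier, export_text

# (CRP level, Cholesterol) pairs and high-risk labels from Table 1.1
X = [[3.2, 170], [0.9, 273], [2.5, 213], [1.7, 229], [1.1, 160], [1.9, 205],
     [8.1, 160], [1.3, 171], [4.5, 133], [11.4, 122], [1.8, 280]]
y = ["Y", "N", "Y", "N", "N", "N", "Y", "N", "Y", "N", "Y"]

# Limit the depth so that the tree stays compact and does not overfit the tiny sample
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree, feature_names=["CRP", "Cholesterol"]))

# Classify a new patient with CRP = 2.2 and Cholesterol = 240
print(tree.predict([[2.2, 240]]))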
1.2.4 Rule-Based Methods
Rule-based methods are closely related to decision trees, except that they do not create a strict hierarchical partitioning of the training data. Rather, overlaps are allowed in order to create greater robustness for the training model. Any path in a decision tree may be interpreted as a rule, which
assigns a test instance to a particular label. For example, for the case of the decision tree illustrated
in Figure 1.1(a), the rightmost path corresponds to the following rule:
CRP > 2 & Cholesterol > 200 ⇒ High Risk

It is possible to create a set of disjoint rules from the different paths in the decision tree. In fact, a number of methods such as C4.5 create related models for both decision tree construction and rule construction. The corresponding rule-based classifier is referred to as C4.5Rules.
Rule-based classifiers can be viewed as more general models than decision tree models. While decision trees require the induced rule sets to be non-overlapping, this is not the case for rule-based classifiers. For example, consider the following rule:

CRP > 3 ⇒ High Risk
Clearly, this rule overlaps with the previous rule, and is also quite relevant to the prediction of a given test instance. In rule-based methods, a set of rules is mined from the training data in the first phase (or training phase). During the testing phase, it is determined which rules are relevant to the test instance, and the final result is based on a combination of the class values predicted by the different rules.
In many cases, it may be possible to create rules that conflict with one another on the right-hand side for a particular test instance. Therefore, it is important to design methods that can effectively resolve these conflicts. The method of resolution depends upon whether the rule sets are ordered or unordered. If the rule sets are ordered, then the top matching rules can be used to make the prediction. If the rule sets are unordered, then the rules can be used to vote on the test instance. Numerous methods such as Classification based on Associations (CBA) [58], CN2 [31], and RIPPER [26] have been proposed in the literature, which use a variety of rule induction methods, based on different ways of mining and prioritizing the rules.
Methods such as CN2 and RIPPER use the sequential covering paradigm, where rules with high accuracy and coverage are sequentially mined from the training data. The idea is that a rule is grown corresponding to a specific target class, and then all training instances matching (or covering) the antecedent of that rule are removed. This approach is applied repeatedly, until only training instances of a particular class remain in the data. This constitutes the default class, which is selected for a test instance when no rule is fired. The process of mining a rule from the training data is referred to as rule growth. The growth of a rule involves the successive addition of conjuncts to the left-hand side of the rule, after the selection of a particular consequent class. This can be viewed as growing a single “best” path in a decision tree, by adding conditions (split criteria) to the left-hand side of the rule. After the rule growth phase, a rule-pruning phase is used, which is analogous to decision tree construction. In this sense, the rule growth of rule-based classifiers shares a number of conceptual similarities with decision tree classifiers. These rules are ranked in the same order as they are mined from the training data. For a given test instance, the class variable in the consequent of the first matching rule is reported. If no matching rule is found, then the default class is reported as the relevant one.
Methods such as CBA [58] use the traditional association rule framework, in which rules are determined with the use of specific support and confidence measures. Therefore, these methods are referred to as associative classifiers. It is also relatively easy to prioritize these rules with the use of these parameters. The final classification can be performed by either using the majority vote from the matching rules, or by picking the top-ranked rule(s) for classification. Typically, the confidence of the rule is used to prioritize the rules, and the support is used to prune for statistical significance. A single catch-all rule is also created for test instances that are not covered by any rule. Typically, this catch-all rule might correspond to the majority class among training instances not covered by any rule. Rule-based methods tend to be more robust than decision trees, because they are not
restricted to a strict hierarchical partitioning of the data. This is most evident from the relative performance of these methods in some sparse high-dimensional domains such as text. For example, while many rule-based methods such as RIPPER are frequently used for the text domain, decision trees are used rarely for text. Another advantage of these methods is that they are relatively easy to generalize to different data types such as sequences, XML, or graph data [14, 93]. In such cases, the left-hand side of the rule needs to be defined in a way that is specific to that data domain. For example, for a sequence classification problem [14], the left-hand side of the rule corresponds to a sequence of symbols. For a graph classification problem, the left-hand side of the rule corresponds to a frequent structure [93]. Therefore, while rule-based methods are related to decision trees, they have significantly greater expressive power. Rule-based methods are discussed in detail in Chapter 5.
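A minimal sketch of prediction with an ordered rule set and a default class, as described above; the rule list, thresholds, and helper names are illustrative assumptions and are not taken from any of the cited systems:

# Each rule is (condition, predicted class); rules are checked in priority order
rules = [
    (lambda r: r["CRP"] > 2 and r["Cholesterol"] > 200, "High Risk"),
    (lambda r: r["CRP"] > 3, "High Risk"),
]
DEFAULT_CLASS = "Normal"   # assigned when no rule fires

def classify(record):
    """Return the consequent of the first matching rule, or the default class."""
    for condition, label in rules:
        if condition(record):
            return label
    return DEFAULT_CLASS

print(classify({"CRP": 2.5, "Cholesterol": 213}))   # High Risk
print(classify({"CRP": 1.1, "Cholesterol": 160}))   # Normal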
1.2.5 Instance-Based Learning
In instance-based learning, the first phase of constructing the training model is often dispensed with. The test instance is directly related to the training instances in order to create a classification model. Such methods are referred to as lazy learning methods, because they wait for knowledge of the test instance in order to create a locally optimized model, which is specific to the test instance. The advantage of such methods is that they can be directly tailored to the particular test instance, and can avoid the information loss associated with the incompleteness of any training model. An overview of instance-based methods may be found in [15, 16, 89].
An example of a very simple instance-based method is the nearest neighbor classifier. In the nearest neighbor classifier, the top k nearest neighbors of the given test instance are found in the training data. The class label with the largest presence among the k nearest neighbors is reported as the relevant class label. If desired, the approach can be made faster with the use of nearest neighbor index construction. Many variations of the basic instance-based learning algorithm are possible, wherein aggregates of the training instances may be used for classification. For example, small clusters can be created from the instances of each class, and the centroid of each cluster may be used as a new instance. Such an approach is much more efficient and also more robust, because the clustering phase aggregates the noisy records into more robust aggregates and thereby reduces noise. Other variations of instance-based learning use different variations of the distance function used for classification. For example, methods that are based on the Mahalanobis distance or Fisher’s discriminant may be used for more accurate results. The problem of distance function design is intimately related to the problem of instance-based learning. Therefore, separate chapters have been devoted in this book to these topics.
A particular form of instance-based learning is one where the nearest neighbor classifier is not explicitly used. This is because the distribution of the class labels may not match the notion of proximity defined by a particular distance function. Rather, a locally optimized classifier is constructed using the examples in the neighborhood of a test instance. Thus, the neighborhood is used only to define the locality in which the classification model is constructed in a lazy way. Local classifiers are generally more accurate, because of the simplification of the class distribution within the locality of the test instance. This approach is more generally referred to as lazy learning, and it is a more general notion of instance-based learning than traditional nearest neighbor classifiers. Methods for instance-based classification are discussed in Chapter 6. Methods for distance-function learning are discussed in Chapter 18.
1.2.6 SVM Classifiers
SVM methods use linear conditions in order to separate out the classes from one another. The idea is to use a linear condition that separates the two classes from each other as well as possible. Consider the medical example discussed earlier, where the risk of cardiovascular disease is related to diagnostic features from patients.

[Figure 1.2 (figure omitted): panel (a), hard separation, shows two candidate separating hyperplanes with their support vectors and margins; panel (b), soft separation, shows margin violations handled with penalty-based slack variables.]
FIGURE 1.2 : Hard and soft support vector machines.
CRP + Cholesterol/100 ≤ 4

In such a case, the split condition in the multivariate case may also be used as a stand-alone condition for classification. Thus, an SVM classifier may be considered a single-level decision tree with a very carefully chosen multivariate split condition. Clearly, since the effectiveness of the approach depends only on a single separating hyperplane, it is critical to define this separation carefully.
Support vector machines are generally defined for binary classification problems. Therefore, the class variable y_i for the ith training instance \bar{X}_i is assumed to be drawn from {-1, +1}. The most important criterion, which is commonly used for SVM classification, is that of the maximum margin hyperplane. In order to understand this point, consider the case of linearly separable data illustrated in Figure 1.2(a). Two possible separating hyperplanes, with their corresponding support vectors and margins, have been illustrated in the figure. It is evident that one of the separating hyperplanes has a much larger margin than the other, and is therefore more desirable because of its greater generality for unseen test examples. Therefore, one of the important criteria for support vector machines is to achieve maximum margin separation of the hyperplanes.
In general, it is assumed for d-dimensional data that the separating hyperplane is of the form \bar{W} \cdot \bar{X} + b = 0. Here, \bar{W} is a d-dimensional vector representing the coefficients of the hyperplane of separation, and b is a constant. Without loss of generality, it may be assumed (because of appropriate coefficient scaling) that the two symmetric hyperplanes through the support vectors have the form \bar{W} \cdot \bar{X} + b = 1 and \bar{W} \cdot \bar{X} + b = -1. The coefficients \bar{W} and b need to be learned from the training data D in order to maximize the margin of separation between these two parallel hyperplanes. It can be shown from elementary linear algebra that the distance between these two hyperplanes is 2/||\bar{W}||. Maximizing this objective function is equivalent to minimizing ||\bar{W}||^2/2. The problem constraints are defined by the fact that the training data points for each class are on one side of the corresponding support vector hyperplane. Therefore, these constraints are as follows:

\bar{W} \cdot \bar{X}_i + b \geq +1 \quad \forall i: y_i = +1      (1.12)
\bar{W} \cdot \bar{X}_i + b \leq -1 \quad \forall i: y_i = -1      (1.13)
This is a constrained convex quadratic optimization problem, which can be solved using Lagrangian
methods. In practice, an off-the-shelf optimization solver may be used to achieve the same goal.

In practice, the data may not be linearly separable. In such cases, soft-margin methods may be used. A slack ξ_i ≥ 0 is introduced for each training instance, and a training instance is allowed to violate the support vector constraint, for a penalty, which is dependent on the slack. This situation is illustrated in Figure 1.2(b). Therefore, the new set of constraints is as follows:

\bar{W} \cdot \bar{X}_i + b \geq +1 - \xi_i \quad \forall i: y_i = +1      (1.14)
\bar{W} \cdot \bar{X}_i + b \leq -1 + \xi_i \quad \forall i: y_i = -1      (1.15)
\xi_i \geq 0      (1.16)
Note that non-negativity constraints also need to be imposed on the slack variables. The objective function is now ||\bar{W}||^2/2 + C \cdot \sum_{i=1}^{n} \xi_i. The constant C regulates the trade-off between the margin and the slack penalties. In other words, small values of C permit more margin violations and produce a softer margin, whereas large values of C penalize violations heavily and make the approach behave more like the hard-margin SVM. It is also possible to solve this problem using off-the-shelf optimization solvers.
It is also possible to use transformations on the feature variables in order to design non-linear SVM methods. In practice, non-linear SVM methods are learned using kernel methods. The key idea here is that SVM formulations can be solved using only pairwise dot products (similarity values) between objects. In other words, the optimal decision about the class label of a test instance, from the solution to the quadratic optimization problem in this section, can be expressed in terms of the following:
1. Pairwise dot products of different training instances.
2. Pairwise dot products of the test instance and different training instances.
The reader is advised to refer to [84] for the specific details of the solution to the optimization formulation. The dot product between a pair of instances can be viewed as a notion of similarity between them. Therefore, the aforementioned observations imply that it is possible to perform SVM classification with only pairwise similarity information between training data pairs and training-test data pairs. The actual feature values are not required.
This opens the door for using transformations, which are represented by their similarity values. These similarities can be viewed as kernel functions K(\bar{X}, \bar{Y}), which measure similarities between the points \bar{X} and \bar{Y}. Conceptually, the kernel function may be viewed as a dot product between the pair of points in a newly transformed space (denoted by the mapping function \Phi(\cdot)). However, this transformation does not need to be explicitly computed, as long as the kernel function (dot product) K(\bar{X}, \bar{Y}) is already available:

K(\bar{X}, \bar{Y}) = \Phi(\bar{X}) \cdot \Phi(\bar{Y})      (1.17)

Therefore, all computations can be performed in the original space using the dot products implied by the kernel function. Some interesting examples of kernel functions include the Gaussian radial basis function, the polynomial kernel, and the hyperbolic tangent kernel, which are listed below in the same order.
K(\bar{X}_i, \bar{X}_j) = e^{-||\bar{X}_i - \bar{X}_j||^2 / 2\sigma^2}      (1.18)
K(\bar{X}_i, \bar{X}_j) = (\bar{X}_i \cdot \bar{X}_j + 1)^h      (1.19)
K(\bar{X}_i, \bar{X}_j) = \tanh(\kappa \bar{X}_i \cdot \bar{X}_j - \delta)      (1.20)
These different functions result in different kinds of nonlinear decision boundaries in the original space, but they correspond to a linear separator in the transformed space. The performance of a classifier can be sensitive to the choice of the kernel used for the transformation. One advantage of kernel methods is that they can also be extended to arbitrary data types, as long as appropriate pairwise similarities can be defined.

The major downside of SVM methods is that they are slow. However, they are very popular and tend to have high accuracy in many practical domains such as text. An introduction to SVM methods may be found in [30, 46, 75, 76, 85]. Kernel methods for support vector machines are discussed in [75]. SVM methods are discussed in detail in Chapter 7.
1.2.7 Neural Networks
Neural networks attempt to simulate biological systems, corresponding to the human brain. In the human brain, neurons are connected to one another via points, which are referred to as synapses. In biological systems, learning is performed by changing the strength of the synaptic connections, in response to impulses.
This biological analogy is retained in an artificial neural network. The basic computation unit in an artificial neural network is a neuron or unit. These units can be arranged in different kinds of architectures by connections between them. The most basic architecture of the neural network is a perceptron, which contains a set of input nodes and an output node. The output unit receives a set of inputs from the input units. There are d different input units, which is exactly equal to the dimensionality of the underlying data. The data is assumed to be numerical. Categorical data may need to be transformed to binary representations, and therefore the number of inputs may be larger. The output node is associated with a set of weights \bar{W}, which are used in order to compute a function f(\cdot) of its inputs. Each component of the weight vector is associated with a connection from the input unit to the output unit. The weights can be viewed as the analogue of the synaptic strengths in biological systems. In the case of a perceptron architecture, the input nodes do not perform any computations. They simply transmit the input attribute values forward. Computations are performed only at the output node in the basic perceptron architecture. The output node uses its weight vector along with the input attribute values in order to compute a function of the inputs. A typical function, which is computed at the output node, is the signed linear function:
z_i = \mathrm{sign}\{\bar{W} \cdot \bar{X}_i + b\}      (1.21)
The output is a predicted value of the binary class variable, which is assumed to be drawn from {-1, +1}. The notation b denotes the bias. Thus, for a vector \bar{X}_i drawn from a dimensionality of d, the weight vector \bar{W} should also contain d elements. Now consider a binary classification problem, in which all labels are drawn from {+1, -1}. We assume that the class label of \bar{X}_i is denoted by y_i. In that case, the sign of the predicted function z_i yields the class label. An example of the perceptron architecture is illustrated in Figure 1.3(a). Thus, the goal of the approach is to learn the set of weights \bar{W} with the use of the training data, so as to minimize the least squares error (y_i - z_i)^2. The idea is that we start off with random weights and gradually update them when a mistake is made by applying the current function on a training example. The magnitude of the update is regulated by a learning rate λ. This update is similar to the updates in gradient descent, which are made for
least-squares optimization. In the case of neural networks, the update function is as follows.
\bar{W}^{t+1} = \bar{W}^t + \lambda (y_i - z_i) \bar{X}_i      (1.22)
Here, \bar{W}^t is the value of the weight vector in the tth iteration. It is not difficult to show that the incremental update vector is related to the negative gradient of (y_i - z_i)^2 with respect to \bar{W}. It is also easy to see that updates are made to the weights only when mistakes are made in classification. When the outputs are correct, the incremental change to the weights is zero.
The similarity to support vector machines is quite striking, in the sense that a linear function is also learned in this case, and the sign of the linear function predicts the class label. In fact, the perceptron model and support vector machines are closely related, in that both are linear function approximators. In the case of support vector machines, this is achieved with the use of maximum margin optimization. In the case of neural networks, this is achieved with the use of an incremental learning algorithm, which is approximately equivalent to least squares error optimization of the prediction.

[Figure 1.3 (figure omitted): panel (a), Perceptron, shows input nodes X_{i1} ... X_{i4} connected through weights w_1 ... w_4 to a single output node producing z_i; panel (b), Multilayer, shows an input layer, a hidden layer, and an output layer.]
FIGURE 1.3 : Single and multilayer neural networks.
The constant λ regulates the learning rate. The choice of learning rate is sometimes important, because learning rates that are too small will result in very slow training. On the other hand, if the learning rates are too large, this will result in oscillation between suboptimal solutions. In practice, the learning rates are large initially, and are then allowed to gradually decrease over time. The idea here is that large steps are likely to be helpful initially, but are then reduced in size to prevent oscillation between suboptimal solutions. For example, after t iterations, the learning rate may be chosen to be proportional to 1/t.
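A compact sketch of the perceptron update in Equation 1.22, with a learning rate decaying in proportion to 1/t; the initialization, epoch count, and toy data are illustrative assumptions rather than prescriptions from the text:

import numpy as np

def train_perceptron(X, y, epochs=50, base_rate=0.1):
    """Mistake-driven perceptron training: W <- W + lr * (y - z) * X, as in Equation 1.22."""
    n, d = X.shape
    W, b = np.zeros(d), 0.0
    t = 1
    for _ in range(epochs):
        for Xi, yi in zip(X, y):
            zi = 1 if np.dot(W, Xi) + b >= 0 else -1   # signed linear prediction (Equation 1.21)
            if zi != yi:                               # update only when a mistake is made
                lr = base_rate / t                     # learning rate proportional to 1/t
                W += lr * (yi - zi) * Xi
                b += lr * (yi - zi)
            t += 1
    return W, b

# Example usage on a linearly separable toy problem with labels in {-1, +1}
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -2.0], [-2.0, -1.5]])
y = np.array([1, 1, -1, -1])
W, b = train_perceptron(X, y)
print(np.sign(X @ W + b))   # should reproduce the training labels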
The aforementioned discussion was based on the simple perceptron architecture, which can model only linear relationships. In practice, the neural network is arranged in three layers, referred to as the input layer, the hidden layer, and the output layer. The input layer only transmits the inputs forward, and therefore there are really only two layers in the neural network that can perform computations. Within the hidden layer, there can be any number of layers of neurons, and in such cases there can be an arbitrary number of layers in the neural network. In practice, there is often only one hidden layer, which leads to a two-layer network. An example of a multilayer network is illustrated in Figure 1.3(b). The perceptron can be viewed as a very special kind of neural network, which contains only a single layer of neurons (corresponding to the output node). Multilayer neural networks allow the approximation of nonlinear functions, and complex decision boundaries, by an appropriate choice of the network topology, and non-linear functions at the nodes. In these cases, a logistic or sigmoid function known as a squashing function is also applied to the inputs of neurons in order to model non-linear characteristics. It is possible to use different non-linear functions at different nodes. Such general architectures are very powerful in approximating arbitrary functions in a neural network, given enough training data and training time. This is the reason that neural networks are sometimes referred to as universal function approximators.
In the case of single-layer perceptron algorithms, the training process is easy to perform by using a gradient descent approach. The major challenge in training multilayer networks is that it is no longer known for intermediate (hidden layer) nodes what their “expected” output should be. This is only known for the final output node. Therefore, some kind of “error feedback” is required, in order to determine the changes in the weights at the intermediate nodes. The training process proceeds in two phases, one of which is in the forward direction, and the other is in the backward direction.
1. Forward Phase: In the forward phase, the activation function is repeatedly applied to propagate the inputs through the neural network in the forward direction. Since the final output is supposed to match the class label, the final output at the output layer provides an error value, depending on the training label value. This error is then used to update the weights of the output layer, and propagate the weight updates backwards in the next phase.

2. Backpropagation Phase: In the backward phase, the errors are propagated backwards through the neural network layers. This leads to the updating of the weights in the neurons of the different layers. The gradients at the previous layers are learned as a function of the errors and weights in the layer ahead of it. The learning rate λ plays an important role in regulating the rate of learning.
In practice, any arbitrary function can be approximated well by a neural network. The price of this generality is that neural networks are often quite slow in practice. They are also sensitive to noise, and can sometimes overfit the training data.
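To illustrate the multilayer architecture and the forward/backpropagation training loop described above, here is a hedged sketch using scikit-learn's MLPClassifier with one hidden layer and a logistic squashing function; the hyperparameter values and synthetic data are arbitrary assumptions:

import numpy as np
from sklearn.neural_network import MLPClassifier

# Synthetic XOR-like data, which a single perceptron cannot separate linearly
rng = np.random.default_rng(3)
X = rng.uniform(-1, 1, (300, 2))
y = ((X[:, 0] > 0) ^ (X[:, 1] > 0)).astype(int)

# One hidden layer with a sigmoid (logistic) activation; training uses backpropagation
net = MLPClassifier(hidden_layer_sizes=(8,), activation="logistic",
                    learning_rate_init=0.1, max_iter=2000, random_state=0).fit(X, y)
print(net.score(X, y))   # training accuracy; should be close to 1.0 on this toy problem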
The previous discussion assumed only binary labels. It is possible to create a k-label neural network, by either using a multiclass “one-versus-all” meta-algorithm, or by creating a neural network architecture in which the number of output nodes is equal to the number of class labels. Each output represents a prediction for a particular label value. A number of implementations of neural network methods have been studied in [35, 57, 66, 77, 88], and many of these implementations are designed in the context of text data. It should be pointed out that both neural networks and SVM classifiers use a linear model that is quite similar. The main difference between the two is in how the optimal linear hyperplane is determined. Rather than using a direct optimization methodology, neural networks use a mistake-driven approach to data classification [35]. Neural networks are described in detail in [19, 51]. This topic is addressed in detail in Chapter 8.
1.3 Handling Different Data Types
Different data types require the use of different techniques for data classification. This is because the choice of data type often qualifies the kind of problem that is solved by the classification approach. In this section, we will discuss the different data types commonly studied in classification problems, which may require a certain level of special handling.
1.3.1 Large Scale Data: Big Data and Data Streams
With the increasing ability to collect different types of large scale data, the problems of scale have become a challenge to the classification process. Clearly, larger data sets allow the creation of more accurate and sophisticated models. However, this is not necessarily helpful, if one is computationally constrained by problems of scale. Data streams and big data analysis have different challenges. In the former case, real time processing creates challenges, whereas in the latter case, the problem is created by the fact that computation and data access over extremely large amounts of data are inefficient. It is often difficult to compute summary statistics from large volumes, because the access needs to be done in a distributed way, and it is too expensive to shuffle large amounts of data around. Each of these challenges will be discussed in this subsection.
1.3.1.1 Data Streams
The ability to continuously collect and process large volumes of data has led to the popularity of data streams [4]. In the streaming scenario, the following primary problems arise in the construction of training models.
•One-pass Constraint: Since data streams have very large volume, all processing algorithms need to perform their computations in a single pass over the data. This is a significant challenge, because it excludes the use of many iterative algorithms that work robustly over static data sets. Therefore, it is crucial to design the training models in an efficient way.

•Concept Drift: The data streams are typically created by a generating process, which may change over time. This results in concept drift, which corresponds to changes in the underlying stream patterns over time. The presence of concept drift can be detrimental to classification algorithms, because models become stale over time. Therefore, it is crucial to adjust the model in an incremental way, so that it achieves high accuracy over current test instances.
•Massive Domain Constraint: The streaming scenario often contains discrete attributes that
take on millions of possible values. This is because streaming items are often associated with discrete identifiers. Examples could be email addresses in an email stream, IP addresses in a network packet stream, and URLs in a clickstream extracted from proxy Web logs. The massive domain problem is ubiquitous in streaming applications. In fact, many synopsis data structures, such as the count-min sketch [33] and the Flajolet-Martin data structure [41], have been designed with this issue in mind. While this issue has not been addressed very extensively in the stream mining literature (beyond basic synopsis methods for counting), recent work has made a number of advances in this direction [9]. A minimal sketch of such a synopsis structure is given after this list.
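The following is a minimal sketch of a count-min style synopsis, assuming a single-process Python setting; the width, depth, and salted-hash scheme are illustrative simplifications rather than the exact construction of [33].

import random

class CountMinSketch:
    def __init__(self, width=1024, depth=4, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        # One random salt per row; hashing (salt, item) stands in for a
        # family of independent hash functions.
        self.salts = [rng.getrandbits(32) for _ in range(depth)]
        self.table = [[0] * width for _ in range(depth)]

    def _index(self, row, item):
        return hash((self.salts[row], item)) % self.width

    def add(self, item, count=1):
        for r in range(self.depth):
            self.table[r][self._index(r, item)] += count

    def estimate(self, item):
        # Collisions can only inflate a cell, so the minimum across rows
        # never underestimates the true count.
        return min(self.table[r][self._index(r, item)] for r in range(self.depth))

sketch = CountMinSketch()
for ip in ["10.0.0.1", "10.0.0.2", "10.0.0.1"]:
    sketch.add(ip)
print(sketch.estimate("10.0.0.1"))  # 2 (exact here; approximate in general)

In a streaming classifier, one such sketch could be maintained per class label, so that the relative frequency of an incoming identifier across classes can be estimated in constant space.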
Conventional classification algorithms need to be appropriately modified in order to address the
aforementioned challenges. Special scenarios, such as those in which the domain of the stream data is large, or the classes are rare, pose additional challenges. Most of the well-known techniques for streaming classification use space-efficient data structures for easily updatable models [13, 86]. Furthermore, these methods are explicitly designed to handle concept drift by making the models temporally adaptive, or by using different models over different regions of the data stream. Special scenarios or data types need dedicated methods in the streaming setting. For example, the massive-domain scenario can be addressed [9] by incorporating the count-min data structure [33] as a synopsis structure within the training model. An especially difficult case is that of rare class learning, in which rare class instances may be mixed with occurrences of completely new classes. This problem can be considered a hybrid between classification and outlier detection. Nevertheless, it is
the most common case in the streaming domain, in applications such as intrusion detection. In these
cases, some kinds of rare classes (intrusions) may already be known, whereas other rare classes may
correspond to previously unseen threats. A book on data streams, containing extensive discussions
on key topics in the area, may be found in [4]. The different variations of the streaming classification
problem are addressed in detail in Chapter 9.
1.3.1.2 The Big Data Framework
While streaming algorithms work under the assumption that the data is too large to be stored
explicitly, the big data framework leverages advances in storage technology in order to actually store
the data and process it. However, as the subsequent discussion will show, even if the data can be
explicitly stored, it is often not easy to process and extract insights from it.
In the simplest case, the data is stored on disk on a single machine, and it is desirable to scale up the approach with disk-efficient algorithms. While many methods such as the nearest neighbor classifier and associative classifiers can be scaled up with more efficient subroutines, other methods
such as decision trees and SVMs require dedicated methods for scaling up. Some examples of scal-
able decision tree methods include SLIQ [48], BOAT [42], and RainForest [43]. Some early parallel
implementations of decision trees include the SPRINT method [82]. Typically, scalable decision tree
methods can be performed in one of two ways. Methods such as RainForest increase scalability by
storing attribute-wise summaries of the training data. These summaries are sufficient for performing
single-attribute splits efficiently. Methods such as BOAT use a combination of bootstrapped samples,
in order to yield a decision tree whose accuracy is very close to the accuracy that one would have obtained
by using the complete data.
An example of a scalable SVM method is SVMLight [53]. This approach focusses on the fact that the quadratic optimization problem in SVMs is computationally intensive. The idea is to always
optimize only a small working set of variables while keeping the others fixed. This working set is
selected by using a steepest descent criterion. This optimizes the advantage gained from using a particular subset of attributes. Another strategy is to discard training examples that do not
have any impact on the margin of the classifiers. Training examples that are away from the decision
boundary, and on its “correct” side, have no impact on the margin of the classifier, even if they are
removed. Other methods such as SVMPerf [54] reformulate the SVM optimization to reduce the
number of slack variables, and increase the number of constraints. A cutting plane approach, which
works with a small subset of constraints at a time, is used in order to solve the resulting optimization
problem effectively.
Further challenges arise for extremely large data sets. This is because an increasing size of the
data implies that a distributed file system must be used in order to store it, and distributed processing techniques are required in order to ensure sufficient scalability. The challenge here is that if large
segments of the data are available on different machines, it is often too expensive to shuffle the data
across different machines in order to extract integrated insights from it. Thus, as in all distributed
infrastructures, it is desirable to exchange intermediate insights, so as to minimize communication costs. For an application programmer, this can sometimes create challenges in terms of keeping track of where different parts of the data are stored, and the precise ordering of communications in order to minimize the costs.
In this context, Google’s MapReduce framework [37] provides an effective method for analysis
of large amounts of data, especially when the computations involve linearly computable statistical functions over the elements of the data streams. One desirable aspect of this framework is that it abstracts away from the application programmer the precise details of where different parts of the data are stored. As stated in [37]: “The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.” Many algorithms, such as k-means, are naturally linear in terms of their scala-
bility with the size of the data. A primer on the MapReduce framework implementation on Apache
Hadoop may be found in [87]. The key idea here is to use a Map function in order to distribute the
work across the different machines, and then provide an automated way to shuffle out much smaller
data in (key,value) pairs containing intermediate results. The Reduce function is then applied to the
aggregated results from the Map step in order to obtain the final results.
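As a concrete illustration of this pattern, the following is a minimal single-process emulation of the Map and Reduce phases for a linearly computable statistic (counting class labels in log records); the in-memory sort stands in for the framework's distributed shuffle, and the record format is hypothetical.

from itertools import groupby
from operator import itemgetter

records = [("doc1", "sports"), ("doc2", "politics"), ("doc3", "sports")]

def map_fn(record):
    # Emit (key, value) pairs; here, one (label, 1) pair per record.
    _, label = record
    yield (label, 1)

def reduce_fn(key, values):
    # Aggregate all values that share a key.
    return (key, sum(values))

# "Map" phase: apply map_fn independently to every record.
pairs = [pair for record in records for pair in map_fn(record)]
# "Shuffle" phase: group the intermediate pairs by key.
pairs.sort(key=itemgetter(0))
# "Reduce" phase: one reduce call per distinct key.
results = [reduce_fn(key, [v for _, v in group])
           for key, group in groupby(pairs, key=itemgetter(0))]
print(results)  # [('politics', 1), ('sports', 2)]

Because each record is mapped independently and each key is reduced independently, the same logic parallelizes naturally across machines.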
Google’s original MapReduce framework was designed for analyzing large amounts of Web
logs, and more specifically deriving linearly computable statistics from the logs. It has been shown [44] that a declarative framework is particularly useful in many MapReduce applications, and that
many existing classification algorithms can be generalized to the MapReduce framework. A proper
choice of the algorithm to adapt to the MapReduce framework is crucial, since the framework is
particularly effective for linear computations. It should be pointed out that the major attraction of the MapReduce framework is its ability to provide application programmers with a cleaner abstrac-
tion, which is independent of very specific run-time details of the distributed system. It should not,
however, be assumed that such a system is somehow inherently superior to existing methods for distributed parallelization from an effectiveness or flexibility perspective, especially if an application programmer is willing to design such details from scratch. A detailed discussion of classification algorithms for big data is provided in Chapter 10.
1.3.2 Text Classification
One of the most common data types used in the context of classification is that of text data. Text
data is ubiquitous, both on the Web and in social networks. While a text document can be treated as a string of words, it is more commonly represented as a bag-of-words, in which the ordering information between words is not used. This representation of
text is much closer to multidimensional data. However, the standard methods for multidimensional
classification often need to be modified for text.
The main challenge with text classification is that the data is extremely high dimensional and
sparse. A typical text lexicon may contain a hundred thousand words, but a document may
typically contain far fewer words. Thus, most of the attribute values are zero, and the frequencies are
relatively small. Many common words may be very noisy and not very discriminative for the clas-
sification process. Therefore, the problems of feature selection and representation are particularly
important in text classification.
Not all classification methods are equally popular for text data. For example, rule-based meth-
ods, the Bayes method, and SVM classifiers tend to be more popular than other classifiers. Some rule-based classifiers such as RIPPER [26] were originally designed for text classification. Neural methods and instance-based methods are also sometimes used. A popular instance-based method used for text classification is Rocchio’s method [56, 74]. Instance-based methods are also some-
times used with centroid-based classification, where frequency-truncated centroids of class-specific
clusters are used, instead of the original documents for the k-nearest neighbor approach. This gen-
erally provides better accuracy, because the centroid of a small, closely related set of documents is often a more stable representation of that data locality than any single document. This is especially true because of the sparse nature of text data, in which two related documents may often have only
a small number of words in common.
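The following is a minimal sketch of a centroid-based classifier of this kind, assuming documents are already represented as (e.g., tf-idf) vectors; the frequency-truncation step mentioned above is omitted for brevity, so this should be read as an illustration rather than the exact method of [56, 74].

import numpy as np

def fit_centroids(X, y):
    # One L2-normalized centroid per class.
    centroids = {}
    for label in np.unique(y):
        c = X[y == label].mean(axis=0)
        centroids[label] = c / (np.linalg.norm(c) + 1e-12)
    return centroids

def predict(centroids, x):
    x = x / (np.linalg.norm(x) + 1e-12)
    # Assign to the class whose centroid has the highest cosine similarity.
    return max(centroids, key=lambda label: float(x @ centroids[label]))

# Tiny illustrative term-weight vectors (rows = documents).
X = np.array([[2.0, 0.0, 1.0], [3.0, 1.0, 0.0], [0.0, 2.0, 3.0], [0.0, 3.0, 2.0]])
y = np.array(["sports", "sports", "politics", "politics"])
centroids = fit_centroids(X, y)
print(predict(centroids, np.array([1.0, 0.0, 0.5])))  # expected: "sports"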
Many classifiers such as decision trees, which are popularly used in other data domains, are
not quite as popular for text data. The reason for this is that decision trees use a strict hierarchical
partitioning of the data. Therefore, the features at the higher levels of the tree are implicitly given
greater importance than other features. In a text collection containing hundreds of thousands of features (words), a single word usually tells us very little about the class label. Furthermore, a decision tree will typically partition the data space with a very small number of splits. This is a problem when this value is orders of magnitude less than the underlying data dimensionality. Of course, decision trees on text data are not very balanced either, because a given word is contained in only a small subset of the documents. Consider the case where a split corresponds to the presence or absence of a word. Because of the imbalanced nature of the tree, most paths from the root to the leaves will correspond to word-absence decisions, and a very small number (less than 5 to 10) of word-presence decisions. Clearly, this will lead to poor classification, especially in cases where word absence does not convey much information, and a modest number of word-presence decisions is required. Univariate decision trees do not work very well for very high dimensional data sets, because of the disproportionate importance given to some features, and a corresponding inability to effectively leverage all the available features. It is possible to improve the effectiveness of decision trees for text classification by using multivariate splits, though this can be rather expensive.
The standard classification methods, which are used for the text domain, also need to be suitably
modified. This is because of the high dimensional and sparse nature of the text domain. For example,
text has a dedicated model, known as the multinomial Bayes model, which is different from the
standard Bernoulli model [12]. The Bernoulli model treats the presence and absence of a word in
a text document in a symmetric way. However, a given text document contains only a small fraction of the lexicon. The absence of a word is usually far less informative than the presence of a word. The symmetric treatment of word presence and word absence can sometimes be detrimental to the effectiveness of a Bayes classifier in the text domain. To address this issue, the multinomial Bayes model is used, which uses the frequency of word presence in a document,
but ignores non-occurrence.
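The distinction can be illustrated with scikit-learn (assumed available here), which provides both variants: MultinomialNB consumes word frequencies and ignores non-occurrence, while BernoulliNB binarizes the counts and also models word absence. The tiny term-count matrix below is purely illustrative.

import numpy as np
from sklearn.naive_bayes import MultinomialNB, BernoulliNB

# Rows are documents, columns are term counts over a tiny lexicon.
X = np.array([[3, 0, 1, 0],
              [2, 1, 0, 0],
              [0, 0, 2, 3],
              [0, 1, 1, 4]])
y = np.array(["sports", "sports", "politics", "politics"])

multinomial = MultinomialNB().fit(X, y)          # uses frequencies, ignores absences
bernoulli = BernoulliNB(binarize=0.5).fit(X, y)  # treats presence/absence symmetrically

test = np.array([[1, 0, 0, 2]])
print(multinomial.predict(test), bernoulli.predict(test))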
In the context of SVM classifiers, scalability is important, because such classifiers scale poorly
both with the number of training documents and the data dimensionality (lexicon size). Furthermore, the sparsity of text (i.e., few non-zero feature values) should be used to improve the training efficiency. This is because the training model in an SVM classifier is constructed using a constrained quadratic optimization problem, which has as many constraints as the number of data points. This is rather
large, and it directly results in an increased size of the corresponding Lagrangian relaxation. In the
case of kernel SVMs, the space requirements for the kernel matrix could also scale quadratically with the number of data points. A few methods such as SVMLight [53] address this issue by carefully breaking down the problem into smaller subproblems, and optimizing only a few variables at a time. Other methods such as SVMPerf [54] also leverage the sparsity of the text domain. The SVMPerf method scales as O(n · s), where s is proportional to the average number of non-zero feature values
per training document.
Text classification often needs to be performed in scenarios, where it is accompanied by linked
data. The links between documents are typically inherited from domains such as the Web and social networks. In such cases, the links contain useful information, which should be leveraged in the classification process. A number of techniques have recently been designed to utilize such side
information in the classification process. Detailed surveys on text classification may be found in
[12, 78]. The problem of text classification is discussed in detail in Chapter 11 of this book.
1.3.3 Multimedia Classification
With the increasing popularity of social media sites, multimedia data has also become increasingly popular. In particular, sites such as Flickr or YouTube allow users to upload their photos or videos. In such cases, it is desirable to perform classification of either portions or all of a photograph or a video. In these cases, rich meta-data may also be available, which can facilitate more effective data classification. The issue of data representation is a particularly important one for multimedia data, because poor representations have a large semantic gap, which creates challenges for the classification process. The combination of text with multimedia data in order to create more effective classification models has been discussed in [8]. Many methods such as semi-supervised learning and transfer learning can also be used in order to improve the effectiveness of the data classification process. Multimedia data poses unique challenges, both in terms of data representation and information fusion. Methods for multimedia data classification are discussed in [60].
A detailed discussion of methods for multimedia data classification is provided in Chapter 12.
1.3.4 Time Series and Sequence Data Classification
Both of these are temporal data types, in which the attributes are of two kinds. The first kind is the contextual attribute (time), and the second attribute, which corresponds to the time series value, is the behavioral attribute. The main difference between time series and sequence data is that
time series data is continuous, whereas sequence data is discrete. Nevertheless, this difference is
quite significant, because it changes the nature of the commonly used models in the two scenarios.
Time series data is popular in many applications such as sensor networks, and medical informat-
ics, in which it is desirable to use large volumes of streaming time series data in order to perform the classification. Two kinds of classification are possible with time-series data:
•Classifying specific time-instants: These correspond to specific events that can be inferred at
particular instants of the data stream. In these cases, the labels are associated with instants in
time, and the behavior of one or more time series is used in order to classify these instants. For example, the detection of significant events in real-time applications can be an important
application in this scenario.
•Classifying part or whole series: In these cases, the class labels are associated with portions or all of the series, and these are used for classification. For example, an ECG time-series will show characteristic shapes for specific diagnostic criteria for diseases. A minimal sketch of a whole-series classifier appears after this list.
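As a simple point of reference for the whole-series case, the following sketch uses the common one-nearest-neighbor baseline with Euclidean distance between equal-length series; the series and labels are made up for illustration, and the methods discussed in Chapter 13 are considerably more sophisticated.

import numpy as np

def classify_series(train_series, train_labels, query):
    # Distance from the query series to every training series.
    dists = [np.linalg.norm(query - s) for s in train_series]
    return train_labels[int(np.argmin(dists))]

train_series = [np.array([0.0, 1.0, 0.0, -1.0]),   # oscillating shape
                np.array([0.0, 0.5, 1.0, 1.5])]    # upward trend
train_labels = ["abnormal", "normal"]
print(classify_series(train_series, train_labels,
                      np.array([0.1, 0.9, -0.1, -0.9])))  # expected: "abnormal"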

Both of these scenarios are equally important from the perspective of analytical inferences in a wide
variety of scenarios. Furthermore, these scenarios are also relevant to the case of sequence data.
Sequence data arises frequently in biological, Web log mining, and system analysis applications.
The discrete nature of the underlying data necessitates the use of methods that are quite different from those for continuous time series data. For example, in the case of discrete sequences, the distance functions and modeling methodologies are quite different from those used for time-
series data.
A brief survey of time-series and sequence classification methods may be found in [91]. A
detailed discussion on time-series data classification is provided in Chapter 13, and that of sequence data classification methods is provided in Chapter 14. While the two areas are clearly connected, there are significant enough differences between these two topics to merit separate topical treatment.
1.3.5 Network Data Classification
Network data is quite popular in Web and social network applications, in which a variety of different scenarios for node classification arise. In most of these scenarios, the class labels are associated with nodes in the underlying network. In many cases, the labels are known only for a subset of the nodes. The goal is to use the known subset of labels in order to make predictions about nodes for which the labels are unknown. This problem is also referred to as collective classification. In this problem, the key assumption is that of homophily, which means that edges imply similarity relation-
ships between nodes. It is assumed that the labels vary smoothly over neighboring nodes. A variety
of methods such as Bayes methods and spectral methods have been generalized to the problem of
collective classification. In cases where content information is available at the nodes, the effective-
ness of classification can be improved even further. A detailed survey on collective classification methods may be found in [6].
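The following is a minimal iterative label-propagation sketch in the spirit of collective classification under homophily: each unlabeled node repeatedly adopts the average label distribution of its neighbors, while labeled nodes stay clamped. The small graph and labels are hypothetical, and published collective classification methods are more elaborate.

import numpy as np

# Adjacency list of a small undirected graph (node ids 0..4).
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}
labels = {0: "A", 1: "A", 4: "B"}            # known labels
classes = ["A", "B"]

# Class-probability vectors; labeled nodes are clamped to their class.
probs = {n: np.full(len(classes), 1.0 / len(classes)) for n in neighbors}
for n, lab in labels.items():
    probs[n] = np.eye(len(classes))[classes.index(lab)]

for _ in range(20):                          # a few sweeps suffice on this toy graph
    for n in neighbors:
        if n in labels:                      # keep labeled nodes fixed
            continue
        avg = np.mean([probs[m] for m in neighbors[n]], axis=0)
        probs[n] = avg / avg.sum()

for n in sorted(neighbors):
    print(n, classes[int(np.argmax(probs[n]))])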
A different form of graph classification is one in which many small graphs exist, and labels are
associated with individual graphs. Such cases arise commonly in the case of chemical and biolog-
ical data, and are discussed in detail in [7]. The focus of the chapter in this book is on very large
graphs and social networks because of their recent popularity.
classification methods is provided in Chapter 15 of this book.
1.3.6 Uncertain Data Classification
Many forms of data collection are uncertain in nature. For example, data collected with the use
of sensors is often uncertain. Furthermore, when data perturbation techniques are used, the data becomes uncertain. In some cases, statistical methods are used in order to infer parts of the
data. An example is the case of link inference in network data. Uncertainty can play an important
role in the classification of uncertain data. For example, if an attribute is known to be uncertain, its contribution to the training model can be de-emphasized, relative to an attribute whose values are deterministic.
The problem of uncertain data classification was first studied in [5]. In these methods, the un-
certainty in the attributes is used as a first-class variable in order to improve the effectiveness of the classification process. This is because the relative importance of different features depends not only on their correlation with the class variable, but also on the uncertainty inherent in them. Clearly, when the values of an attribute are more uncertain, it is less desirable to use them for the classification process. This is achieved in [5] with the use of a density-based transform that accounts for the varying level of uncertainty of attributes. Subsequently, many other methods have been proposed to
account for the uncertainty in the attributes during the classification process. A detailed description
of uncertain data classification methods is provided in Chapter 16.

1.4 Variations on Data Classification
Many natural variations of the data classification problem either correspond to small variations of the standard classification problem, or are enhancements of classification with the use of additional
data. The key variations of the classification problem are those of rare-class learning and distance function learning. Enhancements of the data classification problem make use of meta-algorithms, of additional data in methods such as transfer learning and co-training, of active learning, and of human intervention in visual learning. In addition, the topic of model evaluation is an important one in the context of data classification. This is because the issue of model evaluation is important for the design of effective classification meta-algorithms. In the following subsections, we will discuss the different variations of the classification problem.
1.4.1 Rare Class Learning
Rare class learning is an important variation of the classification problem, and is closely related
to outlier analysis [1]. In fact, it can be considered a supervised variation of the outlier detection problem. In rare class learning, the distribution of the classes is highly imbalanced in the data, and it is typically more important to correctly identify the positive class. For example, consider the case where it is desirable to classify patients into malignant and normal categories. In such cases, the majority of patients may be normal, though it is typically much more costly to misclassify a truly malignant patient (false negative). Thus, false negatives are more costly than false positives. The problem is closely related to cost-sensitive learning, since the misclassification of different classes has different costs. The major difference from the standard classification problem is that the objective function of the problem needs to be modified with costs. This provides several avenues that can be used in order to effectively solve this problem:
•Example Weighting: In this case, the examples are weighted differently, depending upon their cost of misclassification. This leads to minor changes in most classification algorithms, which are relatively simple to implement. For example, in an SVM classifier, the objective function needs to be appropriately weighted with costs, whereas in a decision tree, the quantification of the split criterion needs to weight the examples with costs. In a nearest neighbor classifier, the k nearest neighbors are appropriately weighted while determining the class with the largest presence. A minimal sketch of this idea appears after this list.
•Example Re-sampling: In this case, the examples are appropriately re-sampled, so that rare classes are over-sampled, whereas the normal classes are under-sampled. A standard classifier is applied to the re-sampled data without any modification. From a technical perspective, this approach is equivalent to example weighting. However, from a computational perspective, such an approach has the advantage that the newly re-sampled data has a much smaller size. This is because most of the examples in the data correspond to the normal class, which is drastically under-sampled, whereas the rare class is typically only mildly over-sampled.
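A minimal sketch of the example-weighting idea, assuming scikit-learn is available: the class_weight argument of an SVM classifier scales the misclassification penalty per class, so rare-class errors cost more in the objective. The synthetic data and the 20:1 cost ratio are illustrative choices, with the ratio often chosen near the inverse of the class frequencies.

import numpy as np
from sklearn.svm import SVC

rng = np.random.RandomState(0)
X_majority = rng.normal(loc=0.0, scale=1.0, size=(95, 2))   # "normal" examples
X_rare = rng.normal(loc=3.0, scale=1.0, size=(5, 2))        # "rare" (e.g., malignant) examples
X = np.vstack([X_majority, X_rare])
y = np.array([0] * 95 + [1] * 5)

# Penalize errors on the rare class 20 times more than on the normal class.
clf = SVC(kernel="linear", class_weight={0: 1.0, 1: 20.0}).fit(X, y)
print(clf.predict([[2.5, 2.5]]))  # more likely to flag the rare class than an unweighted SVM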
Many variations of the rare class detection problem are possible, in which either examples of a
single class are available, or the normal class is contaminated with rare class examples. A survey
of algorithms for rare class learning may be found in [25]. This topic is discussed in detail in
Chapter 17.
1.4.2 Distance Function Learning
Distance function learning is an important problem that is closely related to data classification.
In this problem, it is desirable to relate pairs of data instances to a distance value with the use of either supervised or unsupervised methods [3]. For example, consider the case of an image collection,
in which the similarity is defined on the basis of a user-centered semantic criterion. In such a case, the use of standard distance functions such as the Euclidean metric may not reflect the semantic similarities between two images well, because such similarities are based on human perception, and may even vary from collection to collection. Thus, the best way to address this issue is to explicitly incorporate human feedback into the learning process. Typically, this feedback is incorporated either in terms of pairs of images with explicit distance values, or in terms of rankings of different images with respect to a given target image. This feedback constitutes the training data that is used for learning purposes, and such an approach can be used for a variety of different data domains. A detailed survey of distance function learning methods is provided in [92]. The topic of distance function learning is discussed in detail in Chapter 18.
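The following is a minimal sketch of learning from such feedback, assuming it is given as pairs of objects with target distances: per-feature weights of a weighted Euclidean metric are fitted by least squares on the squared pairwise differences. This toy formulation is only meant to make the idea concrete; the methods surveyed in [92] are far more general.

import numpy as np

def learn_feature_weights(pairs, targets):
    # Each row of D holds the squared per-feature differences of one pair.
    D = np.array([(a - b) ** 2 for a, b in pairs])
    t = np.array(targets) ** 2                # match squared target distances
    w, *_ = np.linalg.lstsq(D, t, rcond=None)
    return np.clip(w, 0.0, None)              # keep the metric non-negative

def weighted_distance(a, b, w):
    return float(np.sqrt(np.sum(w * (a - b) ** 2)))

# Hypothetical feedback: the second dimension matters little to the user.
pairs = [(np.array([0.0, 0.0]), np.array([1.0, 0.0])),
         (np.array([0.0, 0.0]), np.array([0.0, 1.0]))]
targets = [1.0, 0.1]
w = learn_feature_weights(pairs, targets)
print(w, weighted_distance(np.array([0.0, 0.0]), np.array([1.0, 1.0]), w))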
1.4.3 Ensemble Learning for Data Classification
A meta-algorithm is a classification method that re-uses one or more existing classification algorithms, either by applying multiple models for robustness, or by combining the results of the same algorithm applied to different parts of the data. The general goal is to obtain more
robust results by combining the results from multiple training models either sequentially or indepen-
dently. The overall error of a classification model depends upon the bias and variance, in addition to
the intrinsic noise present in the data. The bias of a classifier depends upon the fact that the decision
boundary of a particular model may not correspond to the true decision boundary. For example, the
training data may not have a linear decision boundary, but an SVM classifier will assume a linear decision boundary. The variance is based on the random variations in the particular training data set.
Smaller training data sets will have larger variance. Different forms of ensemble analysis attempt to
reduce this bias and variance. The reader is referred to [84] for an excellent discussion on bias and
variance.
Meta-algorithms are commonly used in many data mining problems, such as clustering and outlier analysis [1, 2], in order to obtain more accurate results. The area of classification is the richest one from the perspective of meta-algorithms, because of its crisp evaluation criteria and the relative ease of combining the results of different algorithms. Some examples of popular meta-algorithms are as follows:
•Boosting: Boosting [40] is a common technique used in classification. The idea is to focus
on successively difficult portions of the data set in order to create models that can classify the data points in these portions more accurately, and then use the ensemble scores over all the components. A hold-out approach is used in order to determine the incorrectly classified instances for each portion of the data set. Thus, the idea is to sequentially determine better
classifiers for more difficult portions of the data, and then combine the results in order to
obtain a meta-classifier, which works well on all parts of the data.
•Bagging: Bagging [24] is an approach that works with random data samples, and combines
the results from the models constructed using different samples. The training examples for each classifier are selected by sampling with replacement. These are referred to as bootstrap samples. This approach has often been shown to provide superior results in certain scenarios, though this is not always the case. This approach is not effective for reducing the bias, but it can reduce the variance, because it smooths out the random variations specific to a particular training sample. A minimal sketch of this approach appears after this list.
•Random Forests: Random forests [23] are a method that uses sets of decision trees built on either splits with randomly generated vectors, or random subsets of the training data, and computes the score as a function of these different components. Typically, the random vectors are generated from a fixed probability distribution. Therefore, random forests can be created by
either random split selection, or random input selection. Random forests are closely related to bagging, and in fact bagging with decision trees can be considered a special case of ran-
dom forests, in terms of how the sample is selected (bootstrapping). In the case of random
forests, it is also possible to create the trees in a lazy way, which is tailored to the particular
test instance at hand.
•Model Averaging and Combination: This is one of the most common models used in ensemble
analysis. In fact, the random forest method discussed above is a special case of this idea. In the context of the classification problem, many Bayesian methods [34] exist for the model combination process. The use of different models ensures that the error caused by the bias of
a particular classifier does not dominate the classification results.
•Stacking: Methods such as stacking [90] also combine different models in a variety of ways,
such as using a second-level classifier in order to perform the combination. The output of different first-level classifiers is used to create a new feature representation for the second
level classifier. These first level classifiers may be chosen in a variety of ways, such as using
different bagged classifiers, or by using different training models. In order to avoid overfitting,
the training data needs to be divided into two subsets for the first and second level classifiers.
•Bucket of Models: In this approach [94] a “hold-out” portion of the data set is used in order to
decide the most appropriate model. The most appropriate model is one in which the highest
accuracy is achieved in the held out data set. In essence, this approach can be viewed as a
competition or bake-off contest between the different models.
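The following is a minimal sketch of the bagging scheme referenced above, assuming scikit-learn decision trees as base models: each model is trained on a bootstrap sample, and test predictions are combined by majority vote. The synthetic two-class data is illustrative only.

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=25, seed=0):
    rng = np.random.RandomState(seed)
    models = []
    n = len(X)
    for _ in range(n_models):
        idx = rng.randint(0, n, size=n)        # bootstrap: sample with replacement
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    votes = np.array([m.predict(X) for m in models])
    # Majority vote across the ensemble for each test point.
    return np.array([np.bincount(votes[:, i]).argmax() for i in range(X.shape[0])])

rng = np.random.RandomState(1)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)
models = bagging_fit(X, y)
print(bagging_predict(models, np.array([[0.0, 0.0], [2.0, 2.0]])))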
The area of meta-algorithms in classification is very rich, and different variations may work better in different scenarios. An overview of different meta-algorithms in classification is provided in Chapter 19.
1.4.4 Enhancing Classification Methods with Additional Data
In this class of methods, additional labeled or unlabeled data is used to enhance classification.
Both of these methods are used when there is a paucity of the underlying training data. In the case
of transfer learning, additional training (labeled) data from a different domain or problem is used
to supervise the classification process. On the other hand, in the case of semi-supervised learning,
unlabeled data is used to enhance the classification process. These methods are briefly described in
this section.
1.4.4.1 Semi-Supervised Learning
Semi-supervised learning methods improve the effectiveness of learning methods with the use
of unlabeled data, when only a small amount of labeled data is available. The main difference between semi-supervised learning and transfer learning methods is that unlabeled data with the same features is used in the former, whereas external labeled data (possibly from a different source)
is used in the latter. A key question arises as to why unlabeled data should improve the effectiveness
of classification in any way, when it does not provide any additional labeling knowledge. The reason
for this is that unlabeled data provides a good idea of the manifolds in which the data is embedded,
as well as the density structure of the data in terms of the clusters and sparse regions. The key assumption is that the classification labels exhibit a smooth variation over different parts of the
manifold structure of the underlying data. This manifold structure can be used to determine feature
correlations, and joint feature distributions, which are very helpful for classification. The semi-supervised setting is also sometimes referred to as the transductive setting, when the test instances
must be specified together with the training instances. Some problem settings such as collective
classification of network data are naturally transductive.

FIGURE 1.4: Impact of unsupervised examples on classification process. (a) Only labeled examples; (b) labeled and unlabeled examples (the old decision boundary from (a) is marked for reference).
The motivation of semi-supervised learning is that knowledge of the dense regions in the space
and correlated regions of the space is helpful for classification. Consider the two-class example
illustrated in Figure 1.4(a), in which only a single training example is available for each class.
In such a case, the decision boundary between the two classes is the straight line perpendicular
to the one joining the two classes. However, suppose that some additional unsupervised examples are available, as illustrated in Figure 1.4(b). These unsupervised examples are denoted by ‘x’. In such a case, the decision boundary changes from that in Figure 1.4(a). The major assumption here is that the classes vary less in dense regions of the training data, because of the smoothness assumption. As a result, even though the added examples do not have labels, they contribute significantly to improvements in classification accuracy.
In this example, the correlations between feature values were estimated with unlabeled training
data. This has an intuitive interpretation in the context of text data, where joint feature distributions
can be estimated with unlabeled data. For example, consider a scenario where training data is available for predicting whether a document belongs to the “politics” category. It may be possible that the
word “Obama” (or some of the less common words) may not occur in any of the (small number of) training documents. However, the word “Obama” may often co-occur with many features of the
“politics” category in the unlabeled instances. Thus, the unlabeled instances can be used to learn the relevance of these less common features to the classification process, especially when the amount of available training data is small.
Similarly, when the data is clustered, each cluster in the data is likely to predominantly contain data records of one class or the other. The identification of these clusters only requires unsupervised data rather than labeled data. Once the clusters have been identified from unlabeled data, only a small number of labeled examples are required in order to determine confidently which label corresponds to which cluster. Therefore, when a test example is classified, its clustering structure provides critical information for its classification process, even when only a small number of labeled
examples are available. It has been argued in [67] that the accuracy of the approach may increase ex-
ponentially with the number of labeled examples, as long as the assumption of smoothness in label
structure variation holds true. Of course, in real life, this may not be true. Nevertheless, it has been
shown repeatedly in many domains that the addition of unlabeled data provides significant advan-tages for the classification process. An argument for the effectiveness of semi-supervised learning
that uses the spectral clustering structure of the data may be found in [18]. In some domains such as graph data, semi-supervised learning is the only way in which classification may be performed.
This is because a given node may have very few neighbors of a specific class.
Semi-supervised methods are implemented in a wide variety of ways. Some of these methods
directly try to label the unlabeled data in order to increase the size of the training set. The idea is to incrementally add the examples with the most confidently predicted labels to the training data. This is referred to as self-training. Such methods have the downside that they run the risk of overfitting. For example,
when an unlabeled example is added to the training data with a specific label, the label might be
incorrect because of the specific characteristics of the feature space, or the classifier. This might
result in further propagation of the errors. The results can be quite severe in many scenarios.
Therefore, semi-supervised methods need to be carefully designed in order to avoid overfitting.
An example of such a method is co-training [21], which partitions the attribute set into two subsets,
on which classifier models are independently constructed. The top label predictions of one classifier are used to augment the training data of the other, and vice versa. Specifically, the steps of co-
training are as follows:
1. Divide the feature space into two disjoint subsets f1 and f2.
2. Train two independent classifier models M1 and M2, which use the disjoint feature sets f1
and f2, respectively.
3. Add the unlabeled instance with the most confidently predicted label from M1 to the training data for M2, and vice versa.
4. Repeat all the above steps.
Since the two classifiers are independently constructed on different feature sets, such an approach avoids overfitting. The partitioning of the feature set into f1 and f2 can be performed in a variety of ways. While it is possible to perform random partitioning of features, it is generally advisable to leverage redundancy in the feature set to construct f1 and f2. Specifically, each feature set fi should be picked so that the features in fj (for j ≠ i) are redundant with respect to it. Therefore, each feature set represents a different view of the data, which is sufficient for classification. This ensures that the “confident” labels assigned to the other classifier are of high quality. At the same time, overfitting is avoided to at least some degree, because of the disjoint nature of the feature set used by the two classifiers. Typically, an erroneously assigned class label will be more easily
detected by the disjoint feature set of the other classifier, which was not used to assign the erroneous
label. For a test instance, each of the classifiers is used to make a prediction, and a combination of the two scores is used. For example, if the naive Bayes method is used as the base classifier, then the product of the two classifier scores may be used.
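The following is a minimal co-training sketch along the lines of the steps above, assuming scikit-learn Gaussian naive Bayes models and a caller-supplied feature split; for brevity it uses a single shared pool of labeled examples (rather than separate training sets per view) and transfers one confidently labeled example per model per round. Unlabeled examples are marked with -1.

import numpy as np
from sklearn.naive_bayes import GaussianNB

def co_train(X, y_partial, f1, f2, rounds=10):
    # y_partial uses -1 for unlabeled examples; labeled entries are class ids.
    y = y_partial.copy()
    m1 = m2 = None
    for _ in range(rounds):
        labeled = np.where(y != -1)[0]
        unlabeled = list(np.where(y == -1)[0])
        if not unlabeled:
            break
        m1 = GaussianNB().fit(X[labeled][:, f1], y[labeled])
        m2 = GaussianNB().fit(X[labeled][:, f2], y[labeled])
        for model, feats in ((m1, f1), (m2, f2)):
            if not unlabeled:
                break
            probs = model.predict_proba(X[unlabeled][:, feats])
            # The example this view is most confident about joins the pool.
            best = unlabeled[int(np.argmax(probs.max(axis=1)))]
            y[best] = model.predict(X[[best]][:, feats])[0]
            unlabeled.remove(best)
    return m1, m2, y

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 4)), rng.normal(3, 1, (20, 4))])
y_true = np.array([0] * 20 + [1] * 20)
y_partial = np.full(40, -1)
y_partial[[0, 1, 20, 21]] = y_true[[0, 1, 20, 21]]   # two seed labels per class
m1, m2, y_aug = co_train(X, y_partial, f1=[0, 1], f2=[2, 3])

# Combine the two views at prediction time, e.g., by multiplying probabilities.
test = np.array([[3.0, 3.0, 3.0, 3.0]])
combined = m1.predict_proba(test[:, [0, 1]]) * m2.predict_proba(test[:, [2, 3]])
print(combined.argmax(axis=1))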
The aforementioned methods are generic meta-algorithms for semi-supervised learning. It is also possible to design variations of existing classification algorithms, such as the EM-method or transductive SVM classifiers. EM-based methods [67] are very popular for text data. These methods attempt to model the joint probability distributions of the features and the labels with the use of partially supervised clustering methods. This allows the estimation of the conditional probabilities in the Bayes classifier to be treated as a missing-data problem, for which the EM-algorithm is very effective. This approach shows a connection between the partially supervised clustering and partially supervised classification problems. The results show that partially supervised classification is most effective when the clusters in the data correspond to the different classes. In transductive SVMs, the labels of the unlabeled examples are also treated as integer decision variables. The SVM formulation is modified in order to determine the maximum margin SVM, with the best possible label assignment of unlabeled examples. Surveys on semi-supervised methods may be found in [29, 96].
Semi-supervised methods are discussed in Chapter 20.
1.4.4.2 Transfer Learning
As in the case of semi-supervised learning, transfer learning methods are used when there is a
paucity of the underlying training data. However, the difference from semi-supervised learning is that, instead of using unlabeled data, labeled data from a different domain is used to enhance
the learning process. For example, consider the case of learning the class label of Chinese docu-
ments, for which enough training data is not available. However, similar English documents may be available that contain training labels. In such cases, the knowledge in the training
data for the English documents can be transferred to the Chinese document scenario for more ef-
fective classification. Typically, this process requires some kind of “bridge” in order to relate the
Chinese documents to the English documents. An example of such a “bridge” could be pairs of
similar Chinese and English documents, though many other models are possible. In many cases, a small amount of auxiliary training data in the form of labeled Chinese training documents may
also be available in order to further enhance the effectiveness of the transfer process. This general
principle can also be applied to cross-category or cross-domain scenarios, where knowledge from one classification category is used to enhance the learning of another category [71], or the knowledge from one data domain (e.g., text) is used to enhance the learning of another data domain (e.g., images) [36, 70, 71, 95]. Broadly speaking, transfer learning methods fall into one of the following four categories:
1. Instance-Based Transfer: In this case, the feature spaces of the two domains are highly over-
lapping; even the class labels may be the same. Therefore, it is possible to transfer knowledge
from one domain to the other by simply re-weighting the training instances.
2. Feature-Based Transfer: In this case, there may be some overlaps among the features, but a significant portion of the feature space may be different. Often, the goal is to perform a transformation of each feature set into a new low-dimensional space, which can be shared
across related tasks.
3. Parameter-Based Transfer: In this case, the motivation is that a good training model has typically learned a lot of structure. Therefore, if two tasks are related, then the structure can be transferred to learn the target task.
4. Relational-Transfer Learning: The idea here is that if two domains are related, they may share
some similarity relations among objects. These similarity relations can be used for transfer
learning across domains.
The major challenge in such transfer learning methods is that negative transfer can occur in some cases, when the side information used is very noisy or irrelevant to the learning process. Therefore, it is critical to apply the transfer learning process in a careful and judicious way in order to truly improve the quality of the results. A survey on transfer learning methods may be found in [68], and a detailed discussion on this topic may be found in Chapter 21.
1.4.5 Incorporating Human Feedback
A different way of enhancing the classification process is to use some form of human supervision
in order to improve the effectiveness of the classification process. Two forms of human feedback are quite popular, and they correspond to active learning and visual learning, respectively. These forms of feedback are different in that the former is typically focussed on label acquisition with human feedback, so as to enhance the training data. The latter is focussed on either visually creating a training model, or visually performing the classification in a diagnostic way. Nevertheless, both forms of incorporating human feedback work with the assumption that the active input of a user can provide better knowledge for the classification process. It should be pointed out that the feedback in active learning may not always come from a user. Rather, a generic concept of an oracle (such as
Amazon Mechanical Turk) may be available for the feedback.

FIGURE 1.5: Motivation of active learning. (a) Class separation; (b) random sample with SVM classifier; (c) active sample with SVM classifier.
1.4.5.1 Active Learning
Most classification algorithms assume that the learner is a passive recipient of the data set, which is then used to create the training model. Thus, the data collection phase is cleanly separated out from modeling, and is generally not addressed in the context of model construction. However,
data collection is costly, and is often the (cost) bottleneck for many classification algorithms. In
active learning, the goal is to collect more labels during the learning process in order to improve
the effectiveness of the classification process at a low cost. Therefore, the learning process and data
collection process are tightly integrated with one another and enhance each other. Typically, the classification is performed in an interactive way, with the learner providing well-chosen examples to the user, for which the user may then provide labels.
For example, consider the two-class example of Figure 1.5. Here, we have a very simple division
of the data into two classes, which is shown by a vertical dotted line, as illustrated in Figure 1.5(a). The two classes here are labeled A and B. Consider the case where it is possible to query only seven examples for the two different classes. In this case, it is quite possible that the small number of allowed samples may result in training data that is unrepresentative of the true separation
between the two classes. Consider the case when an SVM classifier is used in order to construct
a model. In Figure 1.5(b), we have shown a total of seven samples randomly chosen from the
underlying data. Because of the inherent noisiness in the process of picking a small number of
samples, an SVM classifier will be unable to accurately divide the data space. This is shown in
Figure 1.5(b), where a portion of the data space is incorrectly classified, because of the error of
modeling the SVM classifier. In Figure 1.5(c), we have shown an example of a well chosen set of
seven instances along the decision boundary of the two classes. In this case, the SVM classifier is
able to accurately model the decision regions between the two classes. This is because of the careful choice of instances by the active learning process. An important point to note is that it is particularly useful to sample instances that can clearly demarcate the decision boundary between
the two classes.
In general, the examples chosen are typically those for which the learner has the greatest level of uncertainty based on the current training knowledge and labels. Such a choice evidently provides the greatest additional information to the learner, because the greatest uncertainty exists about their current labels. As in the case of semi-supervised learning, the assumption is that unlabeled data
is copious, but acquiring labels for it is expensive. Therefore, by using the help of the learner in
choosing the appropriate examples to label, it is possible to greatly reduce the effort involved in the
classification process. Active learning algorithms often use support vector machines, because the
latter are particularly good at determining the boundaries between the different classes. Examples
that lie on these boundaries are good candidates to query the user, because the greatest level of
uncertainty exists for these examples. Numerous criteria exist for training example choice in active learning algorithms, most of which try to either reduce the uncertainty in classification or reduce the
error associated with the classification process. Some examples of criteria that are commonly used
in order to query the learner are as follows:
•Uncertainty Sampling: In this case, the learner queries the user for labels of the examples for which the greatest level of uncertainty exists about their correct output [45]. A minimal sketch of this criterion appears after this list.
•Query by Committee (QBC): In this case, the learner queries the user for labels of examples
on which a committee of classifiers has the greatest disagreement. Clearly, this is another indirect way to ensure that the examples with the greatest uncertainty are queried [81].
•Greatest Model Change: In this case, the learner queries the user for labels of examples,
which cause the greatest level of change from the current model. The goal here is to learn
new knowledge that is not currently incorporated in the model [27].
•Greatest Error Reduction: In this case, the learner queries the user for labels of examples that cause the greatest reduction of the classification error [28].
•Greatest Variance Reduction: In this case, the learner queries the user for examples that result in the greatest reduction in output variance [28]. This is actually similar to the previous case, since the variance is a component of the total error.
•Representativeness: In this case, the learner queries the user for labels of examples that are most representative of the underlying data. Typically, this approach combines one of the aforementioned
criteria (such as uncertainty sampling or QBC) with a representativeness model such as a
density-based method in order to perform the classification [80].
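The following is a minimal uncertainty-sampling loop, assuming a scikit-learn logistic regression model (a probability-calibrated SVM would serve equally well); in each round, the example whose predicted class probability is closest to 0.5 is sent to the oracle, here simulated by an array of true labels. The data and the query budget are illustrative.

import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(-1, 1, (100, 2)), rng.normal(1, 1, (100, 2))])
y_true = np.array([0] * 100 + [1] * 100)      # plays the role of the oracle

labeled = [0, 100]                            # one seed example per class
for _ in range(10):                           # query budget of 10 labels
    model = LogisticRegression().fit(X[labeled], y_true[labeled])
    unlabeled = [i for i in range(len(X)) if i not in labeled]
    probs = model.predict_proba(X[unlabeled])[:, 1]
    # Pick the instance with maximum uncertainty (probability nearest 0.5).
    query = unlabeled[int(np.argmin(np.abs(probs - 0.5)))]
    labeled.append(query)                     # the oracle reveals y_true[query]

model = LogisticRegression().fit(X[labeled], y_true[labeled])
print("labels queried:", len(labeled), "accuracy:", model.score(X, y_true))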
These different kinds of models may work well in different kinds of scenarios. Another form of
active learning queries the data vertically. In other words, instead of examples, the learner determines which
attributes to collect, so as to minimize the error at a given cost level [62]. A survey on active learning
methods may be found in [79]. The topic of active learning is discussed in detail in Chapter 22.
1.4.5.2 Visual Learning
The goal of visual learning is typically related to, but different from, active learning. While
active learning collects examples from the user, visual learning takes the help of the user in the
classification process in either creating the training model or using the model for classification of a particular test instance. This help can be received by the learner in two ways:
•Visual feedback in construction of training models: In this case, the feedback of the user may
be utilized in constructing the best training model. Since the user may often have important domain knowledge, this visual feedback may often result in more effective models. For example, while constructing a decision tree classifier, a user may provide important feedback
about the split points at various levels of the tree. At the same time, a visual representation
of the current decision tree may be provided to the user in order to facilitate more intuitive
choices. An example of a decision tree that is constructed with the use of visual methods is
discussed in [17].
•Diagnostic classification of individual test instances: In this case, the feedback is provided by
the user during classification of test instances, rather than during the process of construction of the model. The goal of this method is different, in that it enables a better understanding of the causality of a test instance belonging to a particular class. An example of a visual method for diagnostic classification, which uses exploratory and visual analysis of test instances, is provided in [11]. Such a method is not suitable for classifying large numbers of test instances in batch. It is typically suitable for understanding the classification behavior of a small number
of carefully selected test instances.

A general discussion on visual data mining methods is found in [10, 47, 49, 55, 83]. A detailed
discussion of methods for visual classification is provided in Chapter 23.
1.4.6 Evaluating Classification Algorithms
An important issue in data classification is that of evaluation of classification algorithms. How
do we know how well a classification algorithm is performing? There are two primary issues that arise in the evaluation process:
•Methodology used for evaluation: Classification algorithms require a training phase and a
testing phase, in which the test examples are cleanly separated from the training data. How-
ever, in order to evaluate an algorithm, some of the labeled examples must be removed from
the training data, and the model is constructed on the remaining examples. The problem here is that the removal of labeled examples implicitly underestimates the power of the classifier, as it relates to the set of labels already available. Therefore, how should this removal of labeled examples be performed so as not to impact the accuracy of the learner too much?
Various strategies are possible, such as hold-out, bootstrapping, and cross-validation, of which the first is the simplest to implement, and the last provides the most accurate estimates. In the hold-out approach, a fixed percentage of the training examples are “held out,” and not used in the training. These examples are then used for evaluation. Since only a subset of the training data is used, the evaluation of this approach tends to be pessimistic. Some variations use stratified sampling, in which each class is sampled independently in proportion. This ensures that random variations of class frequency between training and test examples are removed.
In bootstrapping, sampling with replacement is used for creating the training examples. The
most typical scenario is that n examples are sampled with replacement, as a result of which the fraction of examples not sampled is equal to (1 − 1/n)^n ≈ 1/e, where e is the base of the natural logarithm. The class accuracy is then evaluated as a weighted combination of the accuracy a1 on the unsampled (test) examples, and the accuracy a2 on the full labeled data. The full accuracy A is given by:
A = (1 − 1/e) · a1 + (1/e) · a2    (1.23)
This procedure is repeated over multiple bootstrap samples and the final accuracy is reported. Note that the component a2 tends to be highly optimistic, as a result of which the bootstrapping approach produces highly optimistic estimates. It is most appropriate for smaller data sets.
In cross-validation, the training data is divided into a set of k disjoint subsets. One of the k subsets is used for testing, whereas the other (k − 1) subsets are used for training. This process is repeated by using each of the k subsets as the test set, and the error is averaged over all possibilities. This has the advantage that all examples in the labeled data have an opportunity to be treated as test examples. Furthermore, when k is large, the training data size approaches the full labeled data. Therefore, such an approach approximates the accuracy of the model using the entire labeled data well. A special case is “leave-one-out” cross-validation, where k is chosen to be equal to the number of training examples, and therefore each test segment contains exactly one example. This is, however, expensive to implement. A minimal sketch of k-fold cross-validation appears after this list.
•Quantification of accuracy: This issue deals with the problem of quantifying the error of
a classification algorithm. At first sight, it would seem that it is most beneficial to use a
measure such as the absolute classification accuracy, which directly computes the fraction
of examples that are correctly classified. However, this may not always be appropriate in all cases. For example, some algorithms may have much lower variance across different data
sets, and may therefore be more desirable. In this context, an important issue that arises is that
of the statistical significance of the results, when a particular classifier performs better than
another on a data set. Another issue is that the output of a classification algorithm may either
be presented as a discrete label for the test instance, or a numerical score, which represents the
propensity of the test instance to belong to a specific class. For the case where it is presented
as a discrete label, the accuracy is the most appropriate score.
In some cases, the output is presented as a numerical score, especially when the class is rare. In such cases, the Precision-Recall or ROC curves may need to be used for the purposes of classification evaluation. This is particularly important in imbalanced and rare-class scenarios. Even when the output is presented as a binary label, the evaluation methodology is different for the rare class scenario. In that scenario, the misclassification of the rare class is typically much more costly than that of the normal class. In such cases, cost-sensitive variations of evaluation models may need to be used for greater robustness. For example, the cost-sensitive accuracy weights the rare class and normal class examples differently in the
evaluation.
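The resampling procedures described above are straightforward to implement. The sketch below is a minimal illustration in Python/NumPy, not code drawn from this chapter; the train_fn and predict_fn callables are hypothetical placeholders standing in for any classifier's training and prediction routines.

import numpy as np

def bootstrap_accuracy(X, y, train_fn, predict_fn, n_rounds=10, seed=0):
    # Bootstrap estimate of accuracy following Equation 1.23:
    # A = (1 - 1/e) * a1 + (1/e) * a2, averaged over several bootstrap samples.
    rng = np.random.default_rng(seed)
    n = len(y)
    estimates = []
    for _ in range(n_rounds):
        sampled = rng.integers(0, n, size=n)              # n indices drawn with replacement
        out_of_bag = np.setdiff1d(np.arange(n), sampled)  # the roughly 1/e unsampled (test) examples
        if len(out_of_bag) == 0:
            continue
        model = train_fn(X[sampled], y[sampled])
        a1 = np.mean(predict_fn(model, X[out_of_bag]) == y[out_of_bag])  # accuracy on unsampled examples
        a2 = np.mean(predict_fn(model, X) == y)                          # optimistic accuracy on full data
        estimates.append((1 - 1 / np.e) * a1 + (1 / np.e) * a2)
    return float(np.mean(estimates))

def cross_validation_accuracy(X, y, train_fn, predict_fn, k=10, seed=0):
    # k-fold cross-validation: every labeled example is used exactly once as a test example.
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), k)
    accuracies = []
    for i in range(k):
        test_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        model = train_fn(X[train_idx], y[train_idx])
        accuracies.append(np.mean(predict_fn(model, X[test_idx]) == y[test_idx]))
    return float(np.mean(accuracies))

Setting k equal to the number of labeled examples yields the leave-one-out special case described above, at a correspondingly higher computational cost.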
An excellent review of the evaluation of classification algorithms may be found in [52], and a detailed discussion is also provided in Chapter 24.
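As a concrete illustration of the cost-sensitive accuracy mentioned above, the following minimal sketch (again in Python/NumPy, and not prescribed by this chapter) weights each example by a user-supplied cost for its true class, so that errors on the rare class are penalized more heavily; the specific costs in the example are arbitrary and purely illustrative.

import numpy as np

def cost_sensitive_accuracy(y_true, y_pred, class_costs):
    # Weighted accuracy in which each example contributes in proportion to the
    # misclassification cost of its true class, so rare-class errors matter more.
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    weights = np.array([class_costs[label] for label in y_true], dtype=float)
    correct = (y_true == y_pred).astype(float)
    return float(np.sum(weights * correct) / np.sum(weights))

# Hypothetical example: the rare class 1 is weighted 10x relative to the normal class 0.
y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 1]
print(cost_sensitive_accuracy(y_true, y_pred, {0: 1.0, 1: 10.0}))
# Prints approximately 0.58, whereas the unweighted accuracy would be about 0.83,
# reflecting the heavier penalty on the single misclassified rare-class example.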
1.5 Discussion and Conclusions
The problem of data classification has been widely studied in the data mining and machine learning literature. A wide variety of methods are available for data classification, such as decision trees, nearest-neighbor methods, rule-based methods, neural networks, and SVM classifiers. Different classifiers may work more effectively with different kinds of data sets and application scenarios.
The data classification problem is relevant in the context of a variety of data types, such as text, multimedia, network data, time-series, and sequence data. A new form of data is probabilistic data, in which the underlying data is uncertain and may require a different type of processing in order to use the uncertainty as a first-class variable. Different kinds of data may have different representations and contextual dependencies. This requires the design of methods that are well tailored to the different data types.
The classification problem has numerous variations that allow the use of either additional training data or human intervention in order to improve the underlying results. In many cases, meta-algorithms may be used to significantly improve the quality of the results.
The issue of scalability is an important one in the context of data classification, because data sets continue to increase in size as data collection technologies improve over time. Many data sets are collected continuously, which has led to large volumes of data streams. In cases where very large volumes of data are collected, big data technologies need to be designed for the classification process. This area of research is still in its infancy, and is rapidly evolving.
Bibliography
[1] C. Aggarwal. Outlier Analysis, Springer, 2013.
[2] C. Aggarwal and C. Reddy. Data Clustering: Algorithms and Applications, CRC Press, 2013.

[3] C. Aggarwal. Towards Systematic Design of Distance Functions in Data Mining Applications, ACM KDD Conference, 2003.
[4] C. Aggarwal. Data Streams: Models and Algorithms, Springer, 2007.
[5] C. Aggarwal. On Density-based Transforms for Uncertain Data Mining, ICDE Conference, 2007.
[6] C. Aggarwal. Social Network Data Analytics, Springer, Chapter 5, 2011.
[7] C. Aggarwal and H. Wang. Managing and Mining Graph Data, Springer, 2010.
[8] C. Aggarwal and C. Zhai. Mining Text Data, Chapter 11, Springer, 2012.
[9] C. Aggarwal and P. Yu. On Classification of High Cardinality Data Streams. SDM Conference, 2010.
[10] C. Aggarwal. Towards Effective and Interpretable Data Mining by Visual Interaction, ACM SIGKDD Explorations, 2002.
[11] C. Aggarwal. Toward exploratory test-instance-centered diagnosis in high-dimensional classification, IEEE Transactions on Knowledge and Data Engineering, 19(8):1001–1015, 2007.
[12] C. Aggarwal and C. Zhai. A survey of text classification algorithms, Mining Text Data, Springer, 2012.
[13] C. Aggarwal, J. Han, J. Wang, and P. Yu. A framework for classification of evolving data streams. IEEE TKDE Journal, 2006.
[14] C. Aggarwal. On Effective Classification of Strings with Wavelets, ACM KDD Conference, 2002.
[15] D. Aha, D. Kibler, and M. Albert. Instance-based learning algorithms, Machine Learning, 6(1):37–66, 1991.
[16] D. Aha. Lazy learning: Special issue editorial. Artificial Intelligence Review, 11:7–10, 1997.
[17] M. Ankerst, M. Ester, and H.-P. Kriegel. Towards an Effective Cooperation of the User and the Computer for Classification, ACM KDD Conference, 2000.
[18] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds, Machine Learning, 56:209–239, 2004.
[19] C. Bishop. Neural Networks for Pattern Recognition, Oxford University Press, 1996.
[20] C. Bishop. Pattern Recognition and Machine Learning, Springer, 2007.
[21] A. Blum and T. Mitchell. Combining labeled and unlabeled data with co-training. Proceedings of the Eleventh Annual Conference on Computational Learning Theory, pages 92–100, 1998.
[22] L. Breiman. Classification and regression trees. CRC Press, 1993.
[23] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[24] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[25] N. V. Chawla, N. Japkowicz, and A. Kotcz. Editorial: Special Issue on Learning from Imbalanced Data Sets, ACM SIGKDD Explorations Newsletter, 6(1):1–6, 2004.

[26] W. Cohen and Y. Singer. Context-sensitive learning methods for text categorization, ACM Transactions on Information Systems, 17(2):141–173, 1999.
[27] D. Cohn, L. Atlas, and R. Ladner. Improving generalization with active learning, Machine Learning, 5(2):201–221, 1994.
[28] D. Cohn, Z. Ghahramani, and M. Jordan. Active learning with statistical models, Journal of Artificial Intelligence Research, 4:129–145, 1996.
[29] O. Chapelle, B. Scholkopf, and A. Zien. Semi-Supervised Learning, Vol. 2, Cambridge: MIT Press, 2006.
[30] N. Cristianini and J. Shawe-Taylor. An Introduction to Support Vector Machines and Other Kernel-based Learning Methods, Cambridge University Press, 2000.
[31] P. Clark and T. Niblett. The CN2 induction algorithm, Machine Learning, 3(4):261–283, 1989.
[32] B. Clarke. Bayes model averaging and stacking when model approximation error cannot be ignored, Journal of Machine Learning Research, pages 683–712, 2003.
[33] G. Cormode and S. Muthukrishnan. An improved data-stream summary: The count-min sketch and its applications, Journal of Algorithms, 55(1):58–75, 2005.
[34] P. Domingos. Bayesian Averaging of Classifiers and the Overfitting Problem. ICML Conference, 2000.
[35] I. Dagan, Y. Karov, and D. Roth. Mistake-driven Learning in Text Categorization, Proceedings of EMNLP, 1997.
[36] W. Dai, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Translated learning: Transfer learning across different feature spaces. Proceedings of Advances in Neural Information Processing Systems, 2008.
[37] J. Dean and S. Ghemawat. MapReduce: A flexible data processing tool, Communications of the ACM, 53:72–77, 2010.
[38] P. Domingos and M. J. Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29(2–3):103–130, 1997.
[39] R. Duda, P. Hart, and D. Stork. Pattern Classification, Wiley, 2001.
[40] Y. Freund and R. Schapire. A decision-theoretic generalization of online learning and application to boosting, Lecture Notes in Computer Science, 904:23–37, 1995.
[41] P. Flajolet and G. N. Martin. Probabilistic counting algorithms for data base applications. Journal of Computer and System Sciences, 31(2):182–209, 1985.
[42] J. Gehrke, V. Ganti, R. Ramakrishnan, and W.-Y. Loh. BOAT: Optimistic Decision Tree Construction, ACM SIGMOD Conference, 1999.
[43] J. Gehrke, R. Ramakrishnan, and V. Ganti. Rainforest—a framework for fast decision tree construction of large datasets, VLDB Conference, pages 416–427, 1998.
[44] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, T. Yuanyuan, and S. Vaithyanathan. SystemML: Declarative Machine Learning with MapReduce, ICDE Conference, 2011.

[45] D. Lewis and J. Catlett. Heterogeneous Uncertainty Sampling for Supervised Learning, ICML Conference, 1994.
[46] L. Hamel. Knowledge Discovery with Support Vector Machines, Wiley, 2009.
[47] C. Hansen and C. Johnson. Visualization Handbook, Academic Press, 2004.
[48] M. Mehta, R. Agrawal, and J. Rissanen. SLIQ: A Fast Scalable Classifier for Data Mining, EDBT Conference, 1996.
[49] M. C. F. de Oliveira and H. Levkowitz. Visual Data Mining: A Survey, IEEE Transactions on Visualization and Computer Graphics, 9(3):378–394, 2003.
[50] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, Springer, 2013.
[51] S. Haykin. Neural Networks and Learning Machines, Prentice Hall, 2008.
[52] N. Japkowicz and M. Shah. Evaluating Learning Algorithms: A Classification Perspective, Cambridge University Press, 2011.
[53] T. Joachims. Making Large-Scale SVMs Practical, Advances in Kernel Methods, Support Vector Learning, pages 169–184, Cambridge: MIT Press, 1998.
[54] T. Joachims. Training Linear SVMs in Linear Time, KDD, pages 217–226, 2006.
[55] D. Keim. Information and visual data mining, IEEE Transactions on Visualization and Computer Graphics, 8(1):1–8, 2002.
[56] W. Lam and C. Y. Ho. Using a Generalized Instance Set for Automatic Text Categorization. ACM SIGIR Conference, 1998.
[57] N. Littlestone. Learning quickly when irrelevant attributes abound: A new linear-threshold algorithm. Machine Learning, 2:285–318, 1988.
[58] B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining, ACM KDD Conference, 1998.
[59] H. Liu and H. Motoda. Feature Selection for Knowledge Discovery and Data Mining, Springer, 1998.
[60] R. Mayer. Multimedia Learning, Cambridge University Press, 2009.
[61] G. J. McLachlan. Discriminant analysis and statistical pattern recognition, Wiley-Interscience, Vol. 544, 2004.
[62] P. Melville, M. Saar-Tsechansky, F. Provost, and R. Mooney. An Expected Utility Approach to Active Feature-Value Acquisition. IEEE ICDM Conference, 2005.
[63] T. Mitchell. Machine Learning, McGraw Hill, 1997.
[64] K. Murphy. Machine Learning: A Probabilistic Perspective, MIT Press, 2012.
[65] S. K. Murthy. Automatic construction of decision trees from data: A multi-disciplinary survey, Data Mining and Knowledge Discovery, 2(4):345–389, 1998.
[66] H. T. Ng, W. Goh, and K. Low. Feature Selection, Perceptron Learning, and a Usability Case Study for Text Categorization. ACM SIGIR Conference, 1997.

[67] K. Nigam, A. McCallum, S. Thrun, and T. Mitchell. Text classification from labeled and unlabeled documents using EM, Machine Learning, 39(2–3):103–134, 2000.
[68] S. J. Pan and Q. Yang. A survey on transfer learning, IEEE Transactions on Knowledge and Data Engineering, 22(10):1345–1359, 2010.
[69] H. Peng, F. Long, and C. Ding. Feature selection based on mutual information: Criteria of max-dependency, max-relevance, and min-redundancy, IEEE Transactions on Pattern Analysis and Machine Intelligence, 27(8):1226–1238, 2005.
[70] G. Qi, C. Aggarwal, and T. Huang. Towards Semantic Knowledge Propagation from Text Corpus to Web Images, WWW Conference, 2011.
[71] G. Qi, C. Aggarwal, Y. Rui, Q. Tian, S. Chang, and T. Huang. Towards Cross-Category Knowledge Propagation for Learning Visual Concepts, CVPR Conference, 2011.
[72] J. Quinlan. C4.5: Programs for Machine Learning, Morgan-Kaufmann Publishers, 1993.
[73] J. R. Quinlan. Induction of decision trees, Machine Learning, 1(1):81–106, 1986.
[74] J. Rocchio. Relevance feedback information retrieval. The Smart Retrieval System – Experiments in Automatic Document Processing, G. Salton, Ed., Prentice Hall, Englewood Cliffs, NJ, pages 313–323, 1971.
[75] B. Scholkopf and A. J. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond, Cambridge University Press, 2001.
[76] I. Steinwart and A. Christmann. Support Vector Machines, Springer, 2008.
[77] H. Schutze, D. Hull, and J. Pedersen. A Comparison of Classifiers and Document Representations for the Routing Problem. ACM SIGIR Conference, 1995.
[78] F. Sebastiani. Machine learning in automated text categorization, ACM Computing Surveys, 34(1):1–47, 2002.
[79] B. Settles. Active Learning, Morgan and Claypool, 2012.
[80] B. Settles and M. Craven. An analysis of active learning strategies for sequence labeling tasks, Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1069–1078, 2008.
[81] H. Seung, M. Opper, and H. Sompolinsky. Query by Committee. Fifth Annual Workshop on Computational Learning Theory, 1992.
[82] J. Shafer, R. Agrawal, and M. Mehta. SPRINT: A scalable parallel classifier for data mining, VLDB Conference, pages 544–555, 1996.
[83] T. Soukop and I. Davidson. Visual Data Mining: Techniques and Tools for Data Visualization, Wiley, 2002.
[84] P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Pearson, 2005.
[85] V. Vapnik. The Nature of Statistical Learning Theory, Springer, New York, 1995.
[86] H. Wang, W. Fan, P. Yu, and J. Han. Mining Concept-Drifting Data Streams with Ensemble Classifiers, KDD Conference, 2003.
[87] T. White. Hadoop: The Definitive Guide. Yahoo! Press, 2011.

[88] E. Wiener, J. O. Pedersen, and A. S. Weigend. A neural network approach to topic spotting. SDAIR, pages 317–332, 1995.
[89] D. Wettschereck, D. Aha, and T. Mohri. A review and empirical evaluation of feature weighting methods for a class of lazy learning algorithms, Artificial Intelligence Review, 11(1–5):273–314, 1997.
[90] D. Wolpert. Stacked generalization, Neural Networks, 5(2):241–259, 1992.
[91] Z. Xing, J. Pei, and E. Keogh. A brief survey on sequence classification. SIGKDD Explorations, 12(1):40–48, 2010.
[92] L. Yang. Distance Metric Learning: A Comprehensive Survey, 2006. http://www.cs.cmu.edu/~liuy/frame_survey_v2.pdf
[93] M. J. Zaki and C. Aggarwal. XRules: A Structural Classifier for XML Data, ACM KDD Conference, 2003.
[94] B. Zenko. Is combining classifiers better than selecting the best one? Machine Learning, 54(3):255–273, 2004.
[95] Y. Zhu, S. J. Pan, Y. Chen, G.-R. Xue, Q. Yang, and Y. Yu. Heterogeneous Transfer Learning for Image Classification. Special Track on AI and the Web, associated with The Twenty-Fourth AAAI Conference on Artificial Intelligence, 2010.
[96] X. Zhu and A. Goldberg. Introduction to Semi-Supervised Learning, Morgan and Claypool, 2009.
