
A DISSERTATION REPORT
ON
An Efficient Technique for Document Clustering
SUBMITTED TO
SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE
IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE
AWARD OF DEGREE
Master of Engineering (Computer Engineering)
BY
Priti Balu Kudal Exam Seat No:
Under the guidance of
Prof. M. M. Naoghare
DEPARTMENT OF COMPUTER ENGINEERING
Pravara Rural Education Society's
SIR VISVESVARAYA INSTITUTE OF
TECHNOLOGY, CHINCHOLI
NASIK-422102

CERTIFICATE
This is to certify that the dissertation report entitled
An Efficient Technique for Document Clustering
Submitted by
Priti Balu Kudal Exam Seat No:
is a bonafide work carried out by her under the supervision of Prof. M. M. Naoghare and it
is submitted towards the partial fulfillment of the requirement of Savitribai Phule Pune
University, Pune for the award of the degree of Master of Engineering (Computer
Engineering)
(Prof. M. M. Naoghare) (Prof. S. M. Rokade)
Internal Guide Head
Department of Comp. Engg Department of Comp. Engg
SVIT, Chincholi, Nasik SVIT, Chincholi, Nasik
Seal/Stamp of the College (Dr. G. B. Shinde)
Principal
SVIT, Chincholi, Nasik
Place: (Prof./Dr. )
Date: External Examiner

Certificate By Guide
This is to certify that Priti Balu Kudal has completed the dissertation work
under my guidance and supervision, and that I have verified the work for its
originality in documentation, problem statement, implementation and results
presented in the dissertation. Any reproduction of other relevant work is with
prior permission, has been given due ownership, and is included in the references.
Place: Chincholi Prof. M. M. Naoghare
Date:

ACKNOWLEDGEMENT
I take this opportunity to express my hearty thanks to all those who helped me in the
completion of the dissertation on this topic.
I express my deep sense of gratitude to my dissertation guide Prof. M. M. Naoghare,
Computer Engineering, Sir Visvesvaraya Institute of Technology, Chincholi, for her guidance
and continuous motivation. I gratefully acknowledge the help provided by her on many
occasions for the improvement of this dissertation with great interest.
I would be failing in my duties if I did not express my deep sense of gratitude to
Prof. S. M. Rokade, Head, Computer Engineering Department, for permitting me to avail
the facility and for constant encouragement.
I would likewise be failing in my duties if I did not express my deep sense of gratitude to
Dr. G. B. Shinde, Principal, Sir Visvesvaraya Institute of Technology, for permitting me to
avail the facility and for constant encouragement.
I would also like to thank Prof. M. M. Naoghare, P.G. Co-ordinator, for her great
support and excellent guidance.
I would like to extend my sincere thanks to my family members. It is my privilege to
acknowledge their cooperation during the course of this dissertation. I express my heartiest
thanks to my known and unknown well-wishers for their unreserved cooperation,
encouragement and suggestions during the course of this dissertation.
Last but not the least, I would like to thank all my teachers and all my friends who
helped me with the ever-daunting task of gathering information for the dissertation.
Priti Balu Kudal

ABSTRACT
Clustering is a division of data into groups of similar objects. Each group, called a
cluster, consists of objects that are similar to one another and dissimilar to objects of
other groups. Clustering of documents is important for the purposes of document
organization, summarization, topic extraction and efficient information retrieval. Initially,
clustering was applied to enhance information retrieval techniques. Of late, clustering
techniques have been applied in areas which involve browsing gathered data or
categorizing the results returned by search engines in reply to user queries.
In document clustering, hundreds of thousands of files are usually examined.
Much of the data in those files consists of unstructured text, which is difficult for computer
examiners to analyze. In particular, algorithms for clustering documents
can facilitate the discovery of new and useful knowledge from the documents under analysis.
I have proposed a more efficient document clustering algorithm.
Keywords: Data Mining, Clustering, Similarity Measure, Cosine Similarity, Term Frequency.

Contents
Acknowledgement
Abstract
Index
List of Figures
List of Tables
List of Publications
1 SYNOPSIS 1
1.1 OBJECTIVES . . . . . . . . . . 1
1.2 MOTIVATION . . . . . . . . . . 1
1.3 HYPOTHESIS . . . . . . . . . . 2
1.4 STRATEGY PLANNED ASSOCIATED WITH THE DISSERTATION . . . . . . . . . . 3
1.5 REVIEW OF CONFERENCE/JOURNAL PAPERS OF SUPPORTING DISSERTATION IDEA . . . . . . . . . . 3
1.6 PLAN OF DISSERTATION EXECUTION . . . . . . . . . . 5
1.7 PROBLEM STATEMENT . . . . . . . . . . 6
1.8 PROBLEM SOLVING APPROACH . . . . . . . . . . 6
1.9 EFFICIENCY ISSUES . . . . . . . . . . 7
1.10 OUTCOMES . . . . . . . . . . 7
2 TECHNICAL KEYWORDS 8
2.1 CATEGORIES AND SUBJECT DESCRIPTOR . . . . . . . . . . 8
2.2 GENERAL TERMS . . . . . . . . . . 8
2.3 TECHNICAL KEYWORDS . . . . . . . . . . 8
3 INTRODUCTION 10
3.1 DISSERTATION IDEA . . . . . . . . . . 10
3.2 MOTIVATION OF DISSERTATION . . . . . . . . . . 11
3.3 LITERATURE SURVEY . . . . . . . . . . 11
4 PROBLEM DEFINITION AND SCOPE 16
4.1 PROBLEM STATEMENT . . . . . . . . . . 16
4.2 GOALS AND OBJECTIVES . . . . . . . . . . 16
4.3 STATEMENT OF SCOPE . . . . . . . . . . 16
4.4 SOFTWARE CONTEXT . . . . . . . . . . 17
4.5 MAJOR CONSTRAINT . . . . . . . . . . 17
4.6 APPROACHES FOR SOLVING THE PROBLEMS AND EFFICIENCY ISSUES . . . . . . . . . . 17
4.7 OUTCOMES . . . . . . . . . . 18
4.8 HARDWARE RESOURCES REQUIRED . . . . . . . . . . 18
4.9 SOFTWARE RESOURCES REQUIRED . . . . . . . . . . 18
4.10 AREA OF DISSERTATION . . . . . . . . . . 18
5 DISSERTATION PLAN 19
5.1 PURPOSE . . . . . . . . . . 19
5.2 DISSERTATION PLAN . . . . . . . . . . 19
5.3 GANTT CHART . . . . . . . . . . 20
6 SOFTWARE REQUIREMENT SPECIFICATION 21
6.1 INTRODUCTION . . . . . . . . . . 21
6.1.1 Purpose And Scope Of Document . . . . . . . . . . 21
6.1.2 Overview Of Responsibilities Of Developer . . . . . . . . . . 21
6.2 PRODUCT OVERVIEW . . . . . . . . . . 22
6.3 USAGE SCENARIO . . . . . . . . . . 22
6.3.1 User Profile . . . . . . . . . . 22
6.3.2 Use Cases . . . . . . . . . . 22
6.3.3 Use Case View . . . . . . . . . . 22
6.4 DATA MODEL AND DESCRIPTION . . . . . . . . . . 23
6.4.1 Data Description . . . . . . . . . . 23
6.5 FUNCTIONAL MODEL AND DESCRIPTION . . . . . . . . . . 24
6.5.1 Flow Diagram . . . . . . . . . . 24
6.5.2 Non Functional Requirement . . . . . . . . . . 26
6.5.3 Functional Flow Diagram . . . . . . . . . . 26
6.5.4 Software Quality Attributes . . . . . . . . . . 26
6.5.5 Design Constraints . . . . . . . . . . 28
6.5.6 Software Interface Description . . . . . . . . . . 28
6.5.7 RESTRICTIONS, LIMITATIONS, AND CONSTRAINTS . . . . . . . . . . 28
6.6 VALIDATION CRITERIA . . . . . . . . . . 28
6.6.1 Classes of tests . . . . . . . . . . 28
6.6.2 Estimated Software response . . . . . . . . . . 29
7 DETAILED DESIGN DOCUMENT 30
7.1 Introduction . . . . . . . . . . 30
7.1.1 Overview of products . . . . . . . . . . 30
7.2 ARCHITECTURAL DESIGN . . . . . . . . . . 30
7.2.1 System Architecture . . . . . . . . . . 30
7.3 SYSTEM MODEL . . . . . . . . . . 31
7.4 COMPONENT DESIGN . . . . . . . . . . 33
7.4.1 Class diagrams . . . . . . . . . . 33
7.4.2 Interaction Diagrams . . . . . . . . . . 34
7.4.3 Component Diagram . . . . . . . . . . 37
7.4.4 Deployment Diagram . . . . . . . . . . 39
7.4.5 Algorithms . . . . . . . . . . 40
7.5 PROGRAM STRUCTURE . . . . . . . . . . 42
7.5.1 Architecture diagram . . . . . . . . . . 42
7.5.2 DESCRIPTION FOR COMPONENT . . . . . . . . . . 42
8 IMPLEMENTATION 43
8.1 IMPLEMENTATION . . . . . . . . . . 43
8.2 SNAPSHOTS . . . . . . . . . . 43
8.2.1 Home-Document Clustering . . . . . . . . . . 43
8.2.2 Preprocess Document . . . . . . . . . . 45
8.2.3 Preprocess Document (SELECT) . . . . . . . . . . 46
8.2.4 Preprocess Document (REMOVE) . . . . . . . . . . 48
8.2.5 Preprocess Document (STEMMING) . . . . . . . . . . 49
8.2.6 Cluster Process . . . . . . . . . . 50
8.2.7 K-Means Process (Term Frequency) . . . . . . . . . . 51
8.2.8 K-Means Process (Centroid & Distance) . . . . . . . . . . 52
8.2.9 K-Means Process (View Clusters) . . . . . . . . . . 53
8.2.10 Home-Document Clustering . . . . . . . . . . 54
8.2.11 Similarity Vectors Measurement . . . . . . . . . . 55
8.2.12 Finding Dissimilarity . . . . . . . . . . 59
8.2.13 Outlier Document . . . . . . . . . . 61
8.2.14 Clustered Document (Get Document) . . . . . . . . . . 62
8.2.15 Clustered Document (Obtain Outlier Document) . . . . . . . . . . 63
8.2.16 Outlier Document Movement . . . . . . . . . . 64
8.2.17 Home-Document Clustering . . . . . . . . . . 65
8.2.18 Checking Purity . . . . . . . . . . 66
8.2.19 Home-Document Clustering . . . . . . . . . . 67
8.2.20 Clustering Accuracy . . . . . . . . . . 68
8.2.21 Home-Document Clustering . . . . . . . . . . 69
8.2.22 Executing Query . . . . . . . . . . 70
9 TEST SPECIFICATION 71
9.1 INTRODUCTION . . . . . . . . . . 71
9.1.1 Goals and Objectives . . . . . . . . . . 71
9.2 FORMAL TECHNICAL REVIEWS . . . . . . . . . . 71
9.3 TEST PLAN AND SCHEDULE . . . . . . . . . . 72
9.3.1 Test Plan . . . . . . . . . . 72
9.3.2 Testing Strategy . . . . . . . . . . 74
9.3.3 Test Schedule . . . . . . . . . . 74
9.4 TESTING TOOLS AND ENVIRONMENT . . . . . . . . . . 74
9.4.1 Test work products . . . . . . . . . . 75
9.4.2 Software to be tested . . . . . . . . . . 75
9.5 TEST CASES AND RESULT . . . . . . . . . . 76
9.6 RESULT ANALYSIS . . . . . . . . . . 78
9.7 ANALYSIS . . . . . . . . . . 78
Conclusion 79
Appendix 80
A LIST OF PUBLICATIONS/CONFERENCES 80
B MATHEMATICAL MODEL 81
C c-PGCON-2015 CERTIFICATES 83
References 84

List of Figures
5.1 Gantt Chart . . . . . . . . . . 20
6.1 Use case diagram . . . . . . . . . . 23
6.2 DFD 0 diagram . . . . . . . . . . 24
6.3 DFD 1 diagram . . . . . . . . . . 25
6.4 DFD 2 diagram . . . . . . . . . . 25
6.5 Functional Flow Diagram . . . . . . . . . . 27
7.1 System Architecture . . . . . . . . . . 31
7.2 Class Diagram . . . . . . . . . . 33
7.3 Sequence Diagram . . . . . . . . . . 34
7.4 Collaboration Diagram . . . . . . . . . . 36
7.5 Component Diagram . . . . . . . . . . 37
7.6 Deployment Diagram . . . . . . . . . . 39
8.1 Home-Document Clustering . . . . . . . . . . 44
8.2 Preprocess Document . . . . . . . . . . 45
8.3 Preprocess Document (SELECT) . . . . . . . . . . 46
8.4 Dialog Box . . . . . . . . . . 47
8.5 Preprocess Document (REMOVE) . . . . . . . . . . 48
8.6 Preprocess Document (STEMMING) . . . . . . . . . . 49
8.7 Cluster Process . . . . . . . . . . 50
8.8 K-Means Process (Term Frequency) . . . . . . . . . . 51
8.9 K-Means Process (Centroid & Distance) . . . . . . . . . . 52
8.10 K-Means Process (View Clusters) . . . . . . . . . . 53
8.11 Home-Document Clustering . . . . . . . . . . 54
8.12 Similarity Vectors Measurement for cluster 1 . . . . . . . . . . 55
8.13 Similarity Vectors Measurement for cluster 2 . . . . . . . . . . 56
8.14 Similarity Vectors Measurement for cluster 3 . . . . . . . . . . 57
8.15 Similarity Vectors Measurement for cluster 4 . . . . . . . . . . 58
8.16 Finding Dissimilarity . . . . . . . . . . 59
8.17 Outlier Document . . . . . . . . . . 61
8.18 Clustered Document (Get Document) . . . . . . . . . . 62
8.19 Clustered Document (Obtain Outlier Document) . . . . . . . . . . 63
8.20 Outlier Document Movement . . . . . . . . . . 64
8.21 Home-Document Clustering . . . . . . . . . . 65
8.22 Checking Purity . . . . . . . . . . 66
8.23 Home-Document Clustering . . . . . . . . . . 67
8.24 Clustering Accuracy . . . . . . . . . . 68
8.25 Home-Document Clustering . . . . . . . . . . 69
8.26 Executing Query . . . . . . . . . . 70
9.1 Mathematical Model . . . . . . . . . . 82

List of Tables
5.1 Dissertation Plan . . . . . . . . . . 19
9.1 Test schedule . . . . . . . . . . 75
9.2 Test Cases . . . . . . . . . . 76

LIST OF PUBLICATIONS/CONFERENCES
1. Priti B. Kudal, Prof. M. M. Naoghare, "A Review of Modern Document Clustering Techniques", International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 3, Issue 10, October 2014.
2. Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Hierarchical Technique for Document Clustering", International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 4, Issue 4, April 2015.
3. Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Technique for Document Clustering", International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Volume 3, Issue 3 (May-June 2015), PP. 169-172.
4. Priti B. Kudal, Prof. Manisha Naoghare, "An Efficient Technique for Document Clustering", International Conference on Technological Innovations through Modern Engineering Sciences, May 16, 2015, at Shatabdi Institute of Technology, Agaskhind, Nashik.

Chapter 1
SYNOPSIS
DISSERTATION TITLE
An Efficient Technique for Document Clustering.
INTERNAL GUIDE
Prof. M. M. Naoghare
TECHNICAL KEYWORDS
Data Mining, Clustering, Similarity Measure, Cosine Similarity, Term Frequency.
1.1 OBJECTIVES
The objective is to develop a document clustering algorithm that is more efficient than
the existing method. The clustering accuracy of the proposed method is expected to be
better than that of the existing method.
1.2 MOTIVATION
As the web grows larger day by day, and with the growth of social networking, the
documents collected from the web are becoming more and more condensed. Owing to its
sparseness, such data imposes new challenges in applying clustering techniques. Clustering
such sparse data could lead to new trends in web search and other such applications.
Clustering of documents is important for the purposes of document organization,
summarization, topic extraction and efficient information retrieval. Initially, clustering
was applied to enhance information retrieval techniques. Of late, clustering techniques
have been applied in areas which involve browsing gathered data or categorizing
the results returned by search engines in reply to user queries.
The main motivation of this work has been to investigate possibilities for improving
the effectiveness of document clustering by finding out the main reasons for the
ineffectiveness of existing algorithms and devising solutions to them.
1.3 HYPOTHESIS
Document clustering is becoming more and more important with the abundance of
text documents available through the World Wide Web and corporate document management
systems. But there are still some major drawbacks in the existing text clustering techniques
that greatly affect their practical applicability.
The drawbacks in the existing clustering approaches are listed below:
• Text clustering that yields a clear-cut output has got to be the most favorable. However,
documents can be regarded differently by people with different needs vis-à-vis the
clustering of texts. For example, a businessman does not look at business documents
the same way a technologist sees them (Macskassy et al. 1998). So clustering tasks
depend on intrinsic parameters that make way for a diversity of views.
• Text clustering is a clustering task in a high-dimensional space, where each word is
seen as an important attribute of a text. Empirical and mathematical analyses have
revealed that clustering in high-dimensional spaces is very complex, as every data point
is likely to have the same distance from all the other data points (Beyer et al. 1999).
Desirable Properties of a Clustering Algorithm:
• Scalability (in terms of both time and space)
• Ability to deal with different data types
• Ability to deal with noise and outliers
• Insensitivity to the order of input records
• Incorporation of user-specified constraints
• Interpretability and usability
SVIT Chincholi, Department of Computer Engineering 2015-16

There are only a few studies reporting the use of clustering algorithms in the Computer
Forensics field. Essentially, most of the studies describe the use of classic algorithms for
clustering data, e.g., Expectation-Maximization (EM) for unsupervised learning of Gaussian
Mixture Models, K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM).
These algorithms have well-known properties and are widely used in practice.
For instance, K-means and FCM can be seen as particular cases of EM [8]. Algorithms
like SOM [9], in their turn, generally have inductive biases similar to K-means, but are
usually less computationally efficient.
1.4 STRATEGY PLANNED ASSOCIATED WITH THE DISSERTATION
Generally, data mining (sometimes called data or knowledge discovery) is the process
of analyzing data from different perspectives and summarizing it into useful information
that can be used to increase revenue, cut costs, or both. Information Retrieval
(IR) (Baeza 1992) is the field of computer science that focuses on processing documents
in such a way that a document can be quickly retrieved based on keywords specified in a
user's query. IR technology is the foundation of web-based search engines and plays a key
role in biomedical research, as it is the basis of software that aids literature search. A
number of clustering algorithms are available today, and the most commonly used is K-Means.
But it is sensitive to outliers. So Incremental K-Means is designed to overcome this and can
be used in different applications, e.g., a newsgroup application, so that only related
documents are viewed for a fired query, which reduces response time.
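The incremental assignment idea can be illustrated with a minimal sketch. This is not necessarily the dissertation's exact algorithm; it assumes documents are represented as term-frequency dictionaries, compares each new document to the cluster centroids by cosine similarity, updates the winning centroid as a running mean, and flags documents below an illustrative similarity threshold as outliers.

```python
import math

def cosine_similarity(a, b):
    # a, b: term-frequency vectors as {term: count} dictionaries
    common = set(a) & set(b)
    dot = sum(a[t] * b[t] for t in common)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def incremental_assign(doc, centroids, threshold=0.1):
    """Assign one new document to its most similar centroid and update
    that centroid's running term averages; a document whose best
    similarity falls below the threshold is flagged as an outlier."""
    best, best_sim = None, -1.0
    for i, (center, count) in enumerate(centroids):
        sim = cosine_similarity(doc, center)
        if sim > best_sim:
            best, best_sim = i, sim
    if best_sim < threshold:
        return None  # outlier: too dissimilar to every cluster
    center, count = centroids[best]
    # incremental mean update of the centroid's term weights
    for term, tf in doc.items():
        center[term] = (center.get(term, 0.0) * count + tf) / (count + 1)
    centroids[best] = (center, count + 1)
    return best
```

Because each arriving document touches only the centroid list, the collection never has to be re-clustered from scratch, which is the response-time advantage claimed above.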
1.5 REVIEW OF CONFERENCE/JOURNAL PAPERS OF SUPPORTING DISSERTATION IDEA
There are only a few studies reporting the use of clustering algorithms in the Computer
Forensics field. Essentially, most of the studies describe the use of classic algorithms for
clustering data, e.g., Expectation-Maximization (EM), K-means, Fuzzy C-means (FCM),
and Self-Organizing Maps (SOM). These algorithms have well-known properties and are
widely used in practice. For instance, K-means and FCM can be seen as particular cases of
EM. The underlying assumption is that clustered results can increase information retrieval
efficiency, because the user would no longer need to review all the documents found.

• Clustering Methods
Common clustering techniques come under two broad categories:
1. Partitional
2. Hierarchical
1. Partitional
• K-Means
K-means is the most important flat clustering algorithm. The objective
function of K-means is to minimize the average squared distance of objects
from their cluster centers, where a cluster center is defined as the mean or
centroid µ of the objects in a cluster C.
The ideal cluster in K-means is a sphere with the centroid as its center
of gravity. Ideally, the clusters should not overlap. A measure of how well
the centroids represent the members of their clusters is the Residual Sum of
Squares (RSS), the squared distance of each vector from its centroid summed
over all vectors.
K-means constructs various partitions and then evaluates them by some
criterion. It is non-hierarchical: each instance is placed in exactly one of K
non-overlapping clusters. Since only one set of clusters is output, the user
normally has to input the desired number of clusters K.
• Comments on K-Means:
Strengths:
– Simple, easy to implement and debug
– Intuitive objective function: optimizes intra-cluster similarity
– Relatively efficient: O(tkn), where n is the number of objects, k the number
of clusters, and t the number of iterations. Normally, k, t << n.
Weaknesses:
– Applicable only when the mean is defined; what about categorical data?
– Often terminates at a local optimum. Initialization is important.
– Need to specify K, the number of clusters, in advance.
– Unable to handle noisy data and outliers.
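The flat K-means loop described above, assignment of each object to the nearest centroid followed by recomputation of the centroids as means, together with the RSS measure, can be sketched as follows. The random initialization and fixed iteration count are simplifying assumptions for illustration.

```python
import random

def kmeans(vectors, k, iters=20, seed=0):
    """Flat K-means on dense vectors (lists of floats): assign each
    object to its nearest centroid, then recompute each centroid as
    the mean of its members, for a fixed number of iterations."""
    rng = random.Random(seed)
    centroids = [list(v) for v in rng.sample(vectors, k)]
    assign = [0] * len(vectors)
    for _ in range(iters):
        # assignment step: nearest centroid by squared Euclidean distance
        for i, v in enumerate(vectors):
            assign[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2
                                              for a, b in zip(v, centroids[c])))
        # update step: centroid = mean of its members
        for c in range(k):
            members = [vectors[i] for i in range(len(vectors)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign, centroids

def rss(vectors, assign, centroids):
    # Residual Sum of Squares: squared distance of each vector from
    # its centroid, summed over all vectors
    return sum(sum((a - b) ** 2 for a, b in zip(v, centroids[c]))
               for v, c in zip(vectors, assign))
```

A low RSS indicates that the centroids represent their members well; rerunning with different seeds is a common guard against the local-optimum weakness noted above.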

• Expectation Maximization
The EM algorithm falls within a subcategory of the flat clustering algorithms
called model-based clustering. Model-based clustering assumes that the data
were generated by a model and then tries to recover the original model from
the data. This model then defines the clusters and the cluster membership of
the data. The EM algorithm is a generalization of the K-Means algorithm, in
which the set of K centroids is the model that generates the data. It
alternates between an expectation step, corresponding to reassignment, and a
maximization step, corresponding to recomputation of the parameters of the
model.
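The alternation between the expectation (soft reassignment) and maximization (parameter recomputation) steps can be made concrete with a small sketch. This fits a two-component one-dimensional Gaussian mixture; the initialization at the data extremes and the fixed iteration count are illustrative choices, not part of the description above.

```python
import math

def em_gmm_1d(xs, iters=50):
    """EM for a two-component 1-D Gaussian mixture: the E-step computes
    soft memberships (responsibilities), the M-step re-estimates the
    means, variances, and mixing weights from them."""
    mu = [min(xs), max(xs)]   # crude initialization at the extremes
    var = [1.0, 1.0]
    mix = [0.5, 0.5]
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        resp = []
        for x in xs:
            p = [mix[k]
                 * math.exp(-(x - mu[k]) ** 2 / (2 * var[k]))
                 / math.sqrt(2 * math.pi * var[k]) for k in range(2)]
            s = p[0] + p[1]
            resp.append([p[0] / s, p[1] / s])
        # M-step: re-estimate parameters from the responsibilities
        for k in range(2):
            nk = sum(r[k] for r in resp)
            mu[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var[k] = max(sum(r[k] * (x - mu[k]) ** 2
                             for r, x in zip(resp, xs)) / nk, 1e-6)
            mix[k] = nk / len(xs)
    return mu, var, mix
```

Hard-assigning each point to the component with the larger responsibility recovers a K-means-like partition, which is the sense in which K-means is a special case of EM.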
2. Hierarchical:
Hierarchical clustering approaches attempt to create a hierarchical decomposition
of the given document collection, thus achieving a hierarchical structure.
Hierarchical methods are usually classified into Agglomerative and Divisive
methods depending on how the hierarchy is constructed. Agglomerative methods
start with an initial clustering of the term space, where all documents are
considered to represent separate clusters.
• Bottom-Up (agglomerative): Starting with each item in its own cluster, find
the best pair to merge into a new cluster. Repeat until all clusters are fused
together.
• Top-Down (divisive): Starting with all the data in a single cluster, consider
every possible way to divide the cluster into two. Choose the best division
and recursively operate on both sides.
• Summary of Hierarchical Clustering Methods:
– No need to specify the number of clusters in advance.
– Hierarchical structure maps nicely onto human intuition for some domains.
– They do not scale well: time complexity of at least O(n^2), where n is the
total number of objects.
– Like any heuristic search algorithm, local optima are a problem.
– Interpretation of results is (very) subjective.
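The bottom-up (agglomerative) procedure can be sketched as follows for one-dimensional points; single-link distance is an assumption here, since the text does not fix a linkage criterion. The nested pairwise search is exactly what produces the poor scalability noted above.

```python
def agglomerative(points, target_k):
    """Bottom-up clustering: start with every item in its own cluster
    and repeatedly merge the closest pair of clusters (single-link
    distance) until only target_k clusters remain."""
    clusters = [[p] for p in points]

    def dist(a, b):
        # single-link: distance between the closest members of a and b
        return min(abs(x - y) for x in a for y in b)

    while len(clusters) > target_k:
        best = None
        # naive O(n^2) scan over all cluster pairs per merge step
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = dist(clusters[i], clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters
```

Stopping at target_k clusters is one way to cut the hierarchy; recording each merge instead would yield the full dendrogram.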
1.6 PLAN OF DISSERTATION EXECUTION
The entire work is divided into the following tasks to facilitate dissertation execution:

• Phase-1 Choose Research Area
• Phase-2 Preliminary Research
• Phase-3 Decide Research Topic
• Phase-4 Decide Methodology
• Phase-5 Submit or Present Proposal
• Phase-6 Finalize Method
• Phase-7 Conduct Research
• Phase-8 Analyze Data
• Phase-9 Report Writing
• Phase-10 Submit Report
1.7 PROBLEM STATEMENT
The problem statement of my work is that the quality of document clustering can be
improved by applying some new clustering techniques. First, the effectiveness of
pre-processing of the data representation will be studied, and then the classical clustering
methods will be applied. An effective new clustering algorithm will be developed in which
noise is reduced by first clustering the features of the data and then clustering the data on
the basis of their feature clusters.
1.8 PROBLEM SOLVING APPROACH
An improved technique for document clustering is developed to overcome the drawbacks
of K-Means. Both the K-Means and Incremental K-Means algorithms are implemented and
applied to the available dataset, and a comparison of the two is shown. Hereby, I summarize
the basic applications in which the proposed concept can be used:
1. Finding Similar Documents:
This feature is often used when the user has spotted one "good" document in a
search result and wants more like this. The interesting property here is that clustering
is able to discover documents that are conceptually alike, in contrast to search-based
approaches, which can only discover whether the documents share many of the
same words.

2. Organizing Large Document Collections:
Document retrieval focuses on finding documents relevant to a particular query,
but it fails to solve the problem of making sense of a large number of uncategorized
documents. The challenge here is to organize these documents in a taxonomy identical
to the one humans would create given enough time, and to use it as a browsing interface
to the original collection of documents.
3. Duplicate Content Detection:
In many applications there is a need to find duplicates or near-duplicates in a large
number of documents. Clustering is employed for plagiarism detection, grouping of
related news stories, and reordering of search result rankings (to assure higher diversity
among the topmost documents). Note that in such applications the description of
clusters is rarely needed.
4. Search Optimization:
Clustering helps a lot in improving the quality and efficiency of search engines, as
the user query can first be compared to the clusters instead of being compared to every
document directly.
1.9 EFFICIENCY ISSUES
Though K-means clustering has some strong advantages, there are some efficiency issues
with it. K-means clustering does not work well with clusters (in the original data) of
different sizes and different densities. The algorithm is also sensitive to outliers, since an
object with an extremely large value may substantially distort the distribution of the data.
To overcome the drawbacks of K-Means, the Incremental algorithm is developed, and its
efficiency is better than that of K-Means.
1.10 OUTCOMES
• The K-Means algorithm is developed.
• The Incremental algorithm is developed.
• A comparison of both algorithms is shown.
• A query is fired and the results are shown.

Chapter 2
TECHNICAL KEYW ORDS
2.1 CA TEGORIES AND SUBJECT DESCRIPTOR
CA TEGORIES
 Do cumen t Clustering
 Similarit y Measure
 K-Means Algorithm
 Incremen tal Algorithm
SUBJECT DESCRIPTOR
 Data Mining
2.2 GENERAL TERMS
Data Mining, Information Retriev al, T ext Mining,F eatures,Query .
2.3 TECHNICAL KEYWORDS
Data Mining
Data mining is an interdisciplinary subfield of computer science. It is the computational
process of discovering patterns in large data sets (big data) involving methods at the in-
tersection of artificial intelligence, machine learning, statistics, and database systems.
Clustering
Clustering is the task of discovering groups and structures in the data that are in some way
or another similar, without using known structures in the data.
An Efficient Technique for Document Clustering
Classification
It is the task of generalizing known structure to apply to new data. For example, an e-mail
program might attempt to classify an e-mail as legitimate or as spam.
Similarity Measure
It can represent the similarity between two documents, two queries, or one document and
one query. It is a function which computes the degree of similarity between a pair of text
objects.
Term Frequency
The number of times a term occurs in a document is called its term frequency.
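The two notions above can be made concrete with a small sketch: term frequencies are counted per document, and the cosine of the angle between two term-frequency vectors serves as the similarity measure. The sample documents and terms below are invented purely for illustration.

```java
import java.util.HashMap;
import java.util.Map;

// Term frequency and cosine similarity between two tiny documents.
// The sample documents are invented for illustration only.
public class CosineSimilarity {

    // Count how often each term occurs in a document (its term frequency).
    static Map<String, Integer> termFrequency(String doc) {
        Map<String, Integer> tf = new HashMap<String, Integer>();
        for (String term : doc.toLowerCase().split("\\W+")) {
            if (term.isEmpty()) continue;
            Integer prev = tf.get(term);
            tf.put(term, prev == null ? 1 : prev + 1);
        }
        return tf;
    }

    // cos(a, b) = (a . b) / (|a| * |b|) over the terms of both documents.
    static double cosine(Map<String, Integer> a, Map<String, Integer> b) {
        double dot = 0, na = 0, nb = 0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            na += e.getValue() * e.getValue();
            Integer other = b.get(e.getKey());
            if (other != null) dot += e.getValue() * other;
        }
        for (int v : b.values()) nb += v * v;
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    public static void main(String[] args) {
        Map<String, Integer> d1 = termFrequency("data mining finds patterns in data");
        Map<String, Integer> d2 = termFrequency("text mining finds patterns");
        System.out.println(d1.get("data"));           // 2
        System.out.printf("%.3f%n", cosine(d1, d2));  // similarity in [0, 1]
    }
}
```

A cosine of 1 means the two frequency vectors point in the same direction (identical term proportions); a cosine of 0 means the documents share no terms.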
Chapter 3
INTRODUCTION
3.1 DISSERTATION IDEA
Data mining is the process of non-trivial discovery of implicit, previously unknown,
and potentially useful information from data in large databases. Hence it is a core element in
knowledge discovery, and the two terms are often used synonymously. Clustering, one of the
techniques of data mining, is used for grouping similar terms together. Earlier statistical analysis
used in text mining depended on term frequency. Later, a new concept-based text mining model
was introduced which analyses terms. Clustering of documents is useful for the purpose of document
organization, summarization, and efficient information retrieval. Initially, clustering was applied
to enhance information retrieval techniques. Of late, clustering techniques have been
applied in areas which involve browsing the gathered data or categorizing the outcomes
provided by search engines in reply to the queries raised by users.
Clustering is sometimes erroneously referred to as automatic classification; however, this
is inaccurate, since the clusters found are not known prior to processing, whereas in
classification the classes are pre-defined.
In clustering, it is the distribution and the nature of the data that determine cluster
membership, as opposed to classification, where the classifier learns the association
between objects and classes from a so-called training set, i.e., a set of data correctly labeled
by hand, and then replicates the learnt behavior on unlabeled data.
Document clustering is the process of categorizing text documents into systematic clusters
or groups, such that the documents in the same cluster are similar whereas the documents
in other clusters are dissimilar. It is one of the vital processes in text mining. Liping
(2010) emphasized that the expansion of the internet and of computational processes has paved
the way for various clustering techniques. Text mining especially has gained a lot of importance,
and it demands various tasks such as production of granular taxonomies, document
summarization, etc., for developing higher-quality information from text.
This system provides both algorithms: K-Means and Incremental K-Means. A comparison
of the two algorithms is carried out, and the use of these algorithms is shown.
3.2 MOTIVATION OF DISSERTATION
An application can be developed for recommending news articles to the readers of
a news portal. The following challenges gave me the motivation to use clustering of news
articles:
1. The number of available articles is large.
2. A large number of articles are added each day.
3. Articles corresponding to the same news are added from different sources.
4. The recommendations have to be generated and updated in real time.
By clustering the articles, I could reduce the domain of search for recommendations,
as most of the users are interested in the news corresponding to a few clusters.
This improved time efficiency to a great extent. I could also identify articles on the
same news from different sources. The main motivation of this work has been to investigate
possibilities for improving the effectiveness of document clustering by finding out
the main reasons for the ineffectiveness of the existing algorithms and devising their solutions.
3.3 LITERATURE SURVEY
It is important to emphasize that getting from a collection of documents to a clustering
of the collection is not merely a single operation, but rather a process in multiple stages.
These stages include more traditional information retrieval operations such as crawling,
indexing, weighting, filtering, etc. Some of these processes are central to the quality and
performance of most clustering algorithms, and it is thus necessary to consider these stages
together with a given clustering algorithm to harness its true potential. I will give a brief
overview of the clustering process before beginning the literature study and analysis.
There are only a few studies reporting the use of clustering algorithms in the Computer
Forensics field. Essentially, most of the studies describe the use of classic algorithms for
clustering data, e.g., Expectation-Maximization (EM) for unsupervised learning of Gaussian
Mixture Models, K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM).
These algorithms have well-known properties and are widely used in practice. For
instance, K-means and FCM can be seen as particular cases of EM. Algorithms like SOM, in
their turn, generally have inductive biases similar to K-means, but are usually less compu-
tationally efficient.
SOM-based algorithms were used for clustering files with the aim of making the decision-
making process performed by the examiners more efficient. The files were clustered by taking
into account their creation dates/times and their extensions. The underlying assumption is
that the clustered results can increase information retrieval efficiency, because it would
no longer be necessary to review all the documents found by the user.
K-Means is the most important flat clustering algorithm. The objective function of
K-means is to minimize the average squared distance of objects from their cluster centers,
where a cluster center is defined as the mean, or centroid, of the objects in a cluster C.
The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally,
the clusters should not overlap. A measure of how well the centroids represent the members
of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector
from its centroid summed over all vectors: RSS = Σ_k Σ_{x ∈ C_k} ||x − μ(C_k)||².
K-means can start by selecting K randomly chosen objects as the initial cluster centers,
namely the seeds. It then moves the cluster centers around in space in order to minimize
RSS. This is done iteratively by repeating two steps until a stopping criterion is met:
1. Reassigning objects to the cluster with the closest centroid.
2. Recomputing each centroid based on the current members of its cluster.
One of the following termination conditions can be used as the stopping criterion:
• A fixed number of iterations I has been completed.
• Centroids do not change between iterations.
• RSS falls below a pre-established threshold.
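The two-step loop and the RSS-based stopping criterion described above can be sketched as follows; the data points and seed centroids are illustrative only.

```java
import java.util.Arrays;

// Batch K-means: repeat (1) reassign objects to the nearest centroid and
// (2) recompute each centroid as the mean of its members, until RSS stops
// improving by more than a threshold or a fixed iteration count is reached.
public class KMeans {

    static double dist2(double[] a, double[] b) {
        double d = 0;
        for (int j = 0; j < a.length; j++) d += (a[j] - b[j]) * (a[j] - b[j]);
        return d;
    }

    // Returns the final cluster assignment of each object.
    static int[] cluster(double[][] data, double[][] centroids,
                         int maxIter, double tol) {
        int[] assign = new int[data.length];
        double prevRss = Double.MAX_VALUE;
        for (int iter = 0; iter < maxIter; iter++) {
            // Step 1: reassign each object to its closest centroid; track RSS.
            double rss = 0;
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                double bestD = dist2(data[i], centroids[0]);
                for (int k = 1; k < centroids.length; k++) {
                    double d = dist2(data[i], centroids[k]);
                    if (d < bestD) { bestD = d; best = k; }
                }
                assign[i] = best;
                rss += bestD;
            }
            // Stopping criterion: RSS no longer decreasing meaningfully.
            if (prevRss - rss < tol) break;
            prevRss = rss;
            // Step 2: recompute centroids from the current cluster members.
            double[][] sums = new double[centroids.length][data[0].length];
            int[] counts = new int[centroids.length];
            for (int i = 0; i < data.length; i++) {
                counts[assign[i]]++;
                for (int j = 0; j < data[i].length; j++) sums[assign[i]][j] += data[i][j];
            }
            for (int k = 0; k < centroids.length; k++)
                if (counts[k] > 0)
                    for (int j = 0; j < sums[k].length; j++)
                        centroids[k][j] = sums[k][j] / counts[k];
        }
        return assign;
    }

    public static void main(String[] args) {
        double[][] data = {{1, 1}, {1.5, 2}, {8, 8}, {9, 9.5}};
        double[][] seeds = {{0, 0}, {10, 10}};
        System.out.println(Arrays.toString(cluster(data, seeds, 100, 1e-6)));
        // prints [0, 0, 1, 1]
    }
}
```

Note that the `counts[k] > 0` guard skips the centroid update for an empty cluster, which is exactly the empty-cluster weakness of K-means discussed below.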
Comparison of Various Systems:
• K-means has problems when clusters are of differing sizes, densities, or non-globular
shapes.
• It has problems with outliers.
• It can produce empty clusters.
The K-Nearest Neighbour (k-NN) algorithm suffers from the following drawbacks:
• Because all the work is done at run-time, k-NN can have poor run-time performance
if the training set is large.
• k-NN is very sensitive to irrelevant or redundant features, because all features contribute
to the similarity and thus to the classification. This can be ameliorated by careful
feature selection or feature weighting.
• On very difficult classification tasks, k-NN may be outperformed by more exotic tech-
niques such as Support Vector Machines or Neural Networks.
Likas et al. (2003) proposed the global K-means clustering technique, which creates initial
centers by recursively dividing the data space into disjoint subspaces using the K-dimensional
tree approach. The cutting hyperplane used in this approach is the plane that is perpen-
dicular to the maximum-variance axis derived by Principal Component Analysis (PCA).
Partitioning is carried out until each of the leaf nodes possesses less than a predefined
number of data instances, or the predefined number of buckets has been generated. The
initial centers for K-means are the centroids of the data present in the final buckets. Shehroz
Khan and Amir Ahmad (2004) stipulated iterative clustering techniques to calculate initial
cluster centers for K-means. This process is feasible for clustering techniques for continuous
data.
Agrawal et al. (2005) described data mining applications and their various requirements
on clustering techniques. The main requirements considered are the ability to identify
clusters embedded in subspaces of high-dimensional data, scalability, comprehensibility of
the results by end-users, and robustness to unpredictable data distributions.
The main limitation of the K-means approach is that it may generate empty clusters depending
on the initial center vectors. However, this drawback does not cause any significant problem for
static execution of the K-means algorithm, and the problem can be overcome by executing the
K-means algorithm a number of times. However, in a few applications, the cluster issue
poses problems of erratic behavior of the system and affects the overall performance. Malay
Pakhira et al. (2009) proposed a modified version of the K-means algorithm that effectively
eliminates this empty-cluster problem. In fact, in the experiments done in this regard, this
algorithm showed better performance than traditional methods.
Uncertain heterogeneous data streams (Charu Aggarwal et al. 2003) are seen in many
applications, but the clustering quality of the existing approaches for clustering hetero-
geneous data streams with uncertainty is not satisfactory. Guo-Yan Huang et al. (2010)
posited an approach for clustering heterogeneous data streams with uncertainty. A frequency
histogram using H-UCF helps to trace characteristic categorical statistics. Initially creating
'n' clusters by a K-prototype algorithm, the new approach proves to be more useful than
UMicro with regard to clustering quality.
Alam et al. (2010) designed a novel clustering algorithm called HPSO by blending partitional
and hierarchical clustering. It utilizes the swarm intelligence of ants in a decentralized
environment. This algorithm proved to be very effective, as it performs clustering in a
hierarchical manner.
Shin-Jye Lee et al. (2010) suggested a clustering-based method to identify the fuzzy sys-
tem. To initiate the task, they presented a modular approach based on a hybrid clustering
technique. Next, finding the number and location of clusters is the primary concern
in evolving such a model. So, taking input, output, generalization and specialization into
account, an HCA has been designed. This three-part input-output clustering algorithm adopts
several clustering characteristics simultaneously to identify the problem.
Only a few researchers have focused attention on partitioning categorical data in an
incremental mode. Designing incremental clustering for categorical data is a vital issue.
Li Taoying et al. (2010) lent support to an incremental clustering for categorical data using
a clustering ensemble. They initially reduced redundant attributes if required, and then made
use of the true values of different attributes to form clustering memberships.
Crescenzi et al. (2004) cited an approach that automatically extracts data from large
data-intensive web sites. The "data grabber" investigates a large web site and infers a
scheme for it, describing it as a directed graph whose nodes describe classes of struc-
turally similar pages and whose arcs represent links between these pages. After locating the
classes of interest, a library of wrappers can be created, one per class, with the help of an
external wrapper generator, and in this way suitable data can be extracted.
Miha Grcar et al. (2008) mulled over the lack of software mining techniques, software
mining being the process of extracting knowledge out of source code. They presented a
software mining mission with an integration of text mining and link analysis techniques. This
technique is concerned with the inter-links between instances. Retrieval- and knowledge-
based approaches are the two main tasks used in constructing a tool for software compo-
nents. An ontology-learning framework named LATINO was developed by Grcar et al. (2006).
LATINO, an open-source general-purpose data mining platform, offers text mining, link analysis,
machine learning, etc.
Similarity-based approaches and model-based approaches (Meila and Heckerman 2001) are
the two major categories of clustering approaches, and these have been described by Pallav
Roxy and Durga Toshniwal (2009). The former, capable of maximizing average similarities
within clusters and minimizing the same among clusters, is a pairwise-similarity clustering
approach. The latter tries to generate models from the documents, each model rep-
resenting one document group in particular.
Document clustering is becoming more and more important with the abundance of text
documents available through the World Wide Web and corporate document management sys-
tems. But there are still some major drawbacks in the existing text clustering techniques
that greatly affect their practical applicability.
Chapter 4
PROBLEM DEFINITION AND SCOPE
4.1 PROBLEM STATEMENT
This dissertation presents clustering techniques and a comparison of the existing clustering
technique, i.e., K-Means, with the new Incremental K-Means. The quality of document clustering
can be improved by applying some new clustering techniques. The work studies the effectiveness
of a new clustering algorithm in which the noise is reduced by first clustering the features of
the data and then clustering the data on the basis of their features' clusters.
4.2 GOALS AND OBJECTIVES
Goals:
• To develop an efficient technique for document clustering.
Objectives:
• To minimize intra-cluster distances between documents, while maximizing inter-cluster
distances.
• To develop a more efficient document clustering algorithm in comparison to the existing
method. The clustering accuracy of the proposed method should be better than that
of the existing method.
• To compare existing document clustering techniques with the new one.
• To execute queries and show the results.
4.3 STATEMENT OF SCOPE
Develop an application for recommending news articles to the readers of a news
portal. The following challenges gave us the motivation to use clustering of news articles:
1. The number of available articles was large.
2. A large number of articles were added each day.
3. Articles corresponding to the same news were added from different sources.
4. The recommendations had to be generated and updated in real time.
By clustering the articles, I could reduce the domain of search for recommendations,
as most of the users are interested in the news corresponding to a few clusters.
This improved time efficiency to a great extent. I could also identify articles on the
same news from different sources. The main motivation of this work has been to investigate
possibilities for improving the effectiveness of document clustering by finding out
the main reasons for the ineffectiveness of the existing algorithms and devising their solutions.
4.4 SOFTWARE CONTEXT
• Language: Java 1.5.0
• Database: MySQL 5.1.36
• Web Server: Apache 2.2.11
• Tool: NetBeans IDE 7.0.1
• Hardware: Personal computer with minimum configuration
• Documentation Tool: LaTeX
4.5 MAJOR CONSTRAINTS
This system provides a new clustering algorithm for document clustering.
• A dataset is needed for document clustering.
• The number of clusters has to be defined in advance.
• Preprocessing of the documents needs to be done.
4.6 APPROACHES FOR SOLVING THE PROBLEMS AND EFFICIENCY
ISSUES
Refer to Chapter 1, Sections 1.9 and 1.10.
4.7 OUTCOMES
Refer to Chapter 1, Section 1.10 for details.
4.8 HARDWARE RESOURCES REQUIRED
• Processor: Pentium IV
• RAM: 256 MB (min)
• Hard Disk: 20 GB
• Monitor: SVGA
4.9 SOFTWARE RESOURCES REQUIRED
• Operating System: Windows XP
• Programming Language: Java
• Java Version: JDK 1.6 and above
• NetBeans IDE: 7.0.1
• Database: MySQL 5.1.36
• Web Server: Apache 2.2.11
• Documentation Tool: LaTeX
4.10 AREA OF DISSERTATION
Data Mining
Chapter 5
DISSERTATION PLAN
5.1 PURPOSE
The list of tasks along with the schedule is shown in Table 5.1 and the corresponding Gantt
chart in Fig. 5.1. This chart makes it simple to get the status of the project.
5.2 DISSERTATION PLAN
The project was divided into 10 phases:
Phase     Task                   Proposed Duration
Phase 1   Choose Research Area   2 Weeks
Phase 2   Preliminary Research   8 Weeks
Phase 3   Decide Research Topic  4 Weeks
Phase 4   Decide Methodology     4 Weeks
Phase 5   Submit Proposal        2 Weeks
Phase 6   Finalize Method        4 Weeks
Phase 7   Conduct Research       10 Weeks
Phase 8   Analyze Data           10 Weeks
Phase 9   Report Writing         8 Weeks
Phase 10  Submitting Report      4 Weeks
Table 5.1: Dissertation Plan
5.3 GANTT CHART
Figure 5.1: Gantt Chart
Chapter 6
SOFTWARE REQUIREMENT SPECIFICATION
6.1 INTRODUCTION
6.1.1 Purpose And Scope Of Document
The software requirement specification deals with all software perspectives of the disserta-
tion. It discusses the user classes, use cases and usage considerations for them. It also focuses
on data modeling and functional modeling, along with software performance requirements.
The purpose of this activity is to identify the functional and non-functional requirements
for the software system to be built. The objective of this document is to fully understand the
user requirements and determine how well the system is performing and whether it will meet
the demands.
6.1.2 Overview Of Responsibilities Of Developer
• Understand the problem definition of the system.
• Understand the users' expectations from the system.
• Gather requirements for the system.
• Analyze requirements and design the model.
• Analyze which processes are involved.
• Write efficient code with use of appropriate data structures and algorithms.
• Analyze what data are used or produced during the processes.
• Complete the project successfully within the given time.
The developer's responsibilities with respect to this system are:
• To handle the large number of available datasets.
• To develop and apply the K-Means algorithm on the available dataset.
• To develop and apply the Incremental K-Means algorithm on the available dataset.
• To compare both algorithms considering the purity of clustering.
• To execute queries and display the results.
6.2 PRODUCT OVERVIEW
Clustering algorithms have been studied for decades, and the literature on the subject
is huge. Here, the basic preprocessing of the documents is done, the existing clustering algorithm
is applied, and a new clustering algorithm is developed to overcome the drawbacks of the existing
one. A comparison of both algorithms is shown on the basis of the purity of the formed clusters.
A query is executed and the results are evaluated.
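Purity, the comparison criterion mentioned above, can be computed as follows: each cluster is credited with its most frequent true class, and purity is the fraction of documents so credited. The cluster and class labels below are illustrative.

```java
// Purity of a clustering: for each cluster, count the documents of its
// majority class; purity = (sum of majority counts) / (total documents).
public class Purity {

    // clusters[i] = cluster id of document i, classes[i] = true class id;
    // k = number of clusters, c = number of classes.
    static double purity(int[] clusters, int[] classes, int k, int c) {
        int[][] counts = new int[k][c];
        for (int i = 0; i < clusters.length; i++) counts[clusters[i]][classes[i]]++;
        int majoritySum = 0;
        for (int[] row : counts) {
            int max = 0;
            for (int v : row) max = Math.max(max, v);
            majoritySum += max;
        }
        return (double) majoritySum / clusters.length;
    }

    public static void main(String[] args) {
        // 6 documents, 2 clusters; cluster 0 is pure, cluster 1 has one stray.
        int[] clusters = {0, 0, 0, 1, 1, 1};
        int[] classes  = {0, 0, 0, 1, 1, 0};
        System.out.println(purity(clusters, classes, 2, 2)); // (3 + 2) / 6 ≈ 0.833
    }
}
```

A purity of 1.0 means every cluster contains documents of a single class; values near 1/c indicate clustering no better than chance.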
6.3 USAGE SCENARIO
6.3.1 User Profile
6.3.1.1 General User:
Any authorized user of the system can apply the clustering techniques on the available
dataset.
6.3.2 Use Cases
The following modules are available in the system:
1. Pre-Processing Module
2. Calculating the number of clusters
3. Clustering techniques
4. Removing Outliers
6.3.3 Use Case View
A use case diagram is a type of behavioral diagram defined by the UML, created from a
use case analysis. Its purpose is to present a graphical overview of the functionality provided
by a system in terms of actors, their goals represented as use cases, and any dependencies
between those use cases.
Figure 6.1: Use case diagram
Figure 6.1 shows the use case diagram of the Document Clustering System. It
shows the tasks performed by the user and the system.
6.4 DATA MODEL AND DESCRIPTION
Different data structures are used, which are described in detail in Chapter 7, Section 7.3.
6.4.1 Data Description
This system proposes a new efficient technique for document clustering which can be
used in many applications.
The following is the sequence of data flow across the different modules:
1. Data flows from the dataset to the preprocessing module.
2. After preprocessing is done, terms are calculated.
3. Centroid and distance calculation is done.
4. Cluster formation is done.
5. Similarity calculation is done.
6. The query is fired and the results are evaluated.
6.5 FUNCTIONAL MODEL AND DESCRIPTION
6.5.1 Flow Diagram
A data flow diagram is a graphical representation of all major steps and how the data
flows through the system. The data flow diagrams for the system are shown in Figure 6.2,
Figure 6.3 and Figure 6.4.
Level 0 Data Flow Diagram:
Figure 6.2: DFD 0 diagram
The DFD 0 diagram, as shown in Figure 6.2, gives an overview of the Document
Clustering System. This diagram shows the 3 basic steps involved in clustering documents,
which are:
1. Selecting the dataset and performing clustering on it.
2. Executing a query on the clustered documents.
3. Processing the result.
Level 1 Data Flow Diagram:
Figure 6.3: DFD 1 diagram
The DFD 1 diagram, as shown in Figure 6.3, shows the detailed steps of DFD 0.
Level 2 Data Flow Diagram:
Figure 6.4: DFD 2 diagram
The DFD 2 diagram, as shown in Figure 6.4, shows the complete step-wise procedure
of the Document Clustering System. The steps are as follows:
1. The dataset is selected for clustering.
2. The K-Means clustering algorithm is applied on the selected dataset.
3. The improved algorithm is applied on the selected dataset.
4. A query is executed on the clustered documents.
5. The result is processed.
Description of Function
A processing narrative for the functions is presented in the activity diagram shown in Figure
6.5. Each function is described in detail in Chapter 7, Section 7.2.
6.5.2 Non Functional Requirements
A non-functional requirement is a requirement that specifies criteria that can be used
to judge the operation of a system, rather than specific behaviors. This should be contrasted
with functional requirements, which specify behavior or functions.
• Accessibility:
This system is user friendly.
• Accuracy:
This system provides clusters with good accuracy as compared to the existing one.
• Efficiency:
The new algorithm is highly efficient, as it returns highly relevant documents for the
specified query.
6.5.3 Functional Flow Diagram
6.5.4 Software Quality Attributes
Adaptability:
This system is adaptable to any kind of text mining environment.
Maintainability and Flexibility:
This system supports the maintainability and flexibility properties:
• The system is maintainable by keeping track of the dataset.
• Only text files have to be present in the dataset.
• The system is capable of handling additions to the dataset.
Figure 6.5: Functional Flow Diagram
Reusability:
Software reuse is the process of implementing or updating software systems using existing
software components. A good software reuse process facilitates the increase of productivity,
quality, and reliability, and the decrease of costs and implementation time. An initial in-
vestment is required to start a software reuse process, but that investment pays for itself in
a few reuses. In short, the development of a reuse process and repository produces a base
of knowledge that improves in quality after every reuse, minimizing the amount of develop-
ment work required for future projects, and ultimately reducing the risk of new projects that
are based on repository knowledge. This system can be reused in a number of text mining
applications.
6.5.5 Design Constraints
This system has the following design constraints:
1. The dataset has to be specified.
2. The number of clusters has to be defined.
6.5.6 Software Interface Description
The system provides a GUI for selecting datasets, applying the K-Means and Incremen-
tal K-Means algorithms, comparing these two algorithms, and executing queries. It provides
a user-friendly environment, so that users can use it easily.
6.5.7 RESTRICTIONS, LIMITATIONS, AND CONSTRAINTS
There are a few constraints on the system:
• The design should be user friendly.
• Data will be grouped into 4 clusters only.
6.6 VALIDATION CRITERIA
6.6.1 Classes of tests
Refer to Chapter 8 for the classes of tests.
6.6.2 Estimated Software Response
Estimation of KLOC:
The number of lines required for the implementation of the various modules:
• Effort: E = 3.2 × (KLOC)^1.02
E = 3.2 × (4.10)^1.02 ≈ 13.5 person-months
• Requirement analysis required: 2 months
• Development time for implementation and testing required (in months):
D = E / N
D = 13.504 / 1
D = 13.5
Total = 13.5 + 2 ≈ 16 months
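The effort figures above follow the basic COCOMO organic-mode form E = a·(KLOC)^b with a = 3.2 and b = 1.02; assuming that reading, the arithmetic can be checked quickly:

```java
// Check of the effort estimate: E = 3.2 * (KLOC)^1.02 for KLOC = 4.10,
// then development time D = E / N (N = 1 developer) plus 2 months of
// requirement analysis. This assumes the basic COCOMO interpretation
// of the formulas given in the text.
public class EffortEstimate {
    public static void main(String[] args) {
        double kloc = 4.10;
        double effort = 3.2 * Math.pow(kloc, 1.02);   // person-months, ~13.5
        double devTime = effort / 1;                   // one developer
        double total = devTime + 2;                    // + requirement analysis
        System.out.printf("E = %.1f person-months%n", effort);
        System.out.printf("Total = %.1f months%n", total);
    }
}
```

The computed effort comes out just under 13.5 person-months, so the rounded 16-month total in the text corresponds to 13.5 + 2 = 15.5, rounded up.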
Chapter 7
DETAILED DESIGN DOCUMENT
7.1 Introduction
7.1.1 Overview of Product
Clustering algorithms have been studied for decades, and the literature on the subject
is huge. Here, the basic preprocessing of the documents is done, the existing clustering algorithm
is applied, and a new clustering algorithm is developed to overcome the drawbacks of the existing
one. A comparison of both algorithms is shown on the basis of the purity of the formed clusters.
A query is executed and the results are evaluated.
7.2 ARCHITECTURAL DESIGN
7.2.1 System Architecture:
Clustering is sometimes erroneously referred to as automatic classification; however, this
is inaccurate, since the clusters found are not known prior to processing, whereas in
classification the classes are pre-defined. In clustering, it is the distribution and the nature
of the data that determine cluster membership, as opposed to classification, where the
classifier learns the association between objects and classes from a so-called training set, i.e.,
a set of data correctly labeled by hand, and then replicates the learnt behavior on unlabeled
data.
Document clustering is the process of categorizing text documents into systematic clusters
or groups, such that the documents in the same cluster are similar whereas the documents
in other clusters are dissimilar. It is one of the vital processes in text mining. Text mining
especially has gained a lot of importance, and it demands various tasks such as production
of granular taxonomies, document summarization, etc., for developing higher-quality infor-
mation from text. This system provides both algorithms: K-Means and Incremental K-Means.
A comparison of both algorithms is done. Finally, a query is executed to show the result.
Figure 7.1: System Architecture
Figure 7.1 shows the complete flow of the Document Clustering System. The system
works as follows:
1. The selected dataset is given as input to the Preprocessing Module. In this step, stop
words are removed and stemming is performed.
2. The term frequency of each term is calculated and stored in Document
Vector Matrix format.
3. Similarity is calculated among the documents.
4. Clusters are formed. Here the existing K-Means algorithm and the new improved algo-
rithm are both applied to form the clusters. A comparison of both algorithms is shown
on the basis of purity.
5. A query is executed. One term is selected for search, and the result of the query,
i.e., the documents containing that term, is displayed.
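Step 1 above (stop-word removal and stemming) can be sketched as below. The stop-word list is deliberately tiny and the suffix-stripping rule is a crude stand-in for a real stemmer such as Snowball; both are illustrative assumptions, not the system's actual resources.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Preprocessing sketch: lowercase, drop stop words, then apply a very crude
// suffix-stripping "stemmer". A real system would use a full stop-word list
// and a proper stemmer (e.g., Snowball); this only illustrates the step.
public class Preprocess {
    static final Set<String> STOP_WORDS =
            new HashSet<String>(Arrays.asList("the", "is", "a", "an", "of", "in", "and"));

    static List<String> preprocess(String text) {
        List<String> terms = new ArrayList<String>();
        for (String w : text.toLowerCase().split("\\W+")) {
            if (w.isEmpty() || STOP_WORDS.contains(w)) continue;
            terms.add(stem(w));
        }
        return terms;
    }

    // Crude stemmer: strip a few common English suffixes.
    static String stem(String w) {
        for (String suffix : new String[] {"ing", "ed", "es", "s"}) {
            if (w.length() > suffix.length() + 2 && w.endsWith(suffix)) {
                return w.substring(0, w.length() - suffix.length());
            }
        }
        return w;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The clustering of documents is performed"));
        // prints [cluster, document, perform]
    }
}
```

The resulting term list is what step 2 turns into term-frequency entries of the Document Vector Matrix.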
7.3 SYSTEM MODEL
1. Preprocessing Module
Before running the clustering algorithms on the text datasets, I performed some pre-
processing steps. In particular, stop words (prepositions, pronouns, articles, and ir-
relevant document metadata) have been removed. Also, the Snowball stemming
algorithm for Portuguese words has been used. Then, I adopted a traditional statis-
tical approach for text mining, in which documents are represented in a vector space
model. In this model, each document is represented by a vector containing the fre-
quencies of occurrences of words, which are defined as delimited alphabetic strings
whose n um b er of c haracters is b et w een 4 and 25.
I have also used a dimensionality reduction technique known as Term Variance (TV), which can increase both the effectiveness and the efficiency of clustering algorithms. TV selects the attributes (in our case, 100 words) that have the greatest variances over the documents. To compute distances between documents, two measures have been used, namely cosine-based distance and Levenshtein-based distance. The latter has been used to calculate distances between file (document) names only.
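The TV selection described above can be sketched as follows; this is a minimal Python illustration in which a small n_keep stands in for the 100 words kept in the actual experiments.

```python
def term_variance(vectors):
    # Variance of each term's frequency over all document vectors.
    n = len(vectors)
    variances = []
    for j in range(len(vectors[0])):
        col = [v[j] for v in vectors]
        mean = sum(col) / n
        variances.append(sum((x - mean) ** 2 for x in col) / n)
    return variances

def select_top_terms(vectors, n_keep):
    # Keep the n_keep attributes with the greatest variance (TV selection).
    tv = term_variance(vectors)
    keep = sorted(range(len(tv)), key=lambda j: tv[j], reverse=True)[:n_keep]
    keep.sort()  # preserve the original attribute order
    return [[v[j] for j in keep] for v in vectors], keep

# Column 0 varies across documents; columns 1 and 2 are constant.
docs = [[3, 0, 1], [0, 0, 1], [5, 0, 1]]
reduced, kept = select_top_terms(docs, 1)
print(kept)  # → [0]
```

Constant columns carry no information for separating documents, which is why the highest-variance attributes are the ones worth keeping.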
2. Calculating the number of clusters
To estimate the number of clusters, a widely used approach consists of generating a set of data partitions with different numbers of clusters and then selecting the particular partition that provides the best result according to a specific quality criterion (e.g., a relative validity index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or, alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different numbers and initial positions of the cluster prototypes.
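The model-selection idea above (score several candidate partitions with a relative validity index and keep the best) can be sketched with a simplified silhouette; the candidate partitions here are hand-made for illustration rather than produced by a dendrogram or repeated K-means runs.

```python
def silhouette(points, labels):
    # Mean simplified silhouette: a is the distance to the own cluster's
    # mean, b is the distance to the nearest other cluster's mean.
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    means = {l: sum(ps) / len(ps) for l, ps in clusters.items()}
    scores = []
    for p, l in zip(points, labels):
        a = abs(p - means[l])
        b = min(abs(p - m) for lab, m in means.items() if lab != l)
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)

points = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
candidates = {
    2: [0, 0, 0, 1, 1, 1],   # the "natural" 2-cluster partition
    3: [0, 0, 1, 2, 2, 2],   # an over-split partition
}
best_k = max(candidates, key=lambda k: silhouette(points, candidates[k]))
print(best_k)  # → 2
```

The over-split partition scores lower because splitting a tight group shrinks the between-cluster margin b without reducing a by much.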
3. Clustering techniques
The clustering algorithms adopted in our study (the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster-ensemble-based algorithm known as CSPA) are popular in the machine learning and data mining fields, and therefore they have been used in our study. Nevertheless, some of my choices regarding their use deserve further comment. For instance, K-medoids is similar to K-means; however, instead of computing centroids, it uses medoids, which are the representative objects of the clusters. This property makes it particularly interesting for applications in which (i) centroids cannot be computed, and (ii) distances between pairs of objects are available, as when computing dissimilarities between names of documents with the Levenshtein distance.
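Since K-medoids needs only pairwise distances, the Levenshtein distance used for document (file) names is enough to drive it. A standard dynamic-programming sketch of that distance:

```python
def levenshtein(s, t):
    # Classic edit distance: minimum number of single-character
    # insertions, deletions, and substitutions turning s into t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

print(levenshtein("report_v1.txt", "report_v2.txt"))  # → 1
```

Two file names that differ by one character are at distance 1, so near-duplicate names naturally cluster together.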
4. Removing Outliers
I assess a simple approach to removing outliers that makes recursive use of the silhouette. Fundamentally, if the best partition chosen by the silhouette has singletons (i.e., clusters formed by a single object only), these are removed. Then the clustering process is repeated over and over again until a partition without singletons is found. At the end of the process, all singletons are incorporated into the resulting data partition (for evaluation purposes) as single clusters.
7.4 COMPONENT DESIGN
7.4.1 Class diagrams
Figure 7.2: Class Diagram
Figure 7.2 shows the class diagram of the Document Clustering Technique. A number of classes with different methods are used here. For example, the Stopwords Removal class has a stopwordlist array containing 50 stop words; the same class has a stopwordRemoval() method that replaces each stop word with a blank space.
7.4.2 Interaction Diagrams
Sequence Diagram
A diagram showing the flow of information through the system and the transformations it undergoes is presented in Fig. 7.3. A sequence diagram is a graphical view of a scenario that shows object interactions in a time-based sequence: what happens first, what happens next. Sequence diagrams establish the roles of objects and help provide the essential information needed to determine class responsibilities and interfaces.
Figure 7.3: Sequence Diagram
Figure 7.3 shows the sequence diagram of the Document Clustering System. This diagram shows the way the different modules communicate with each other.
1. Files are given as input for preprocessing.
2. The stop words dictionary is given as input to the preprocessing module. Here all stop words are replaced by a blank space, and stemming is performed.
3. The output file of the previous step, containing only terms, is given as input for calculating the count of each term.
4. Terms with their counts are given as input for forming the clusters. Before the clusters are formed, the features on which clustering is to be based are selected.
5. Clusters with terms and counts are given to the Query module. A query is fired and the result is processed: one term is selected to search, and the relevant documents are returned.
Collaboration Diagram:
Figure 7.4 shows the collaboration diagram of the Document Clustering System.
Figure 7.4: Collaboration Diagram
It shows the collaboration between components as follows:
1. Selection of the dataset.
2. A dictionary containing stop words is used to remove the stop words.
3. The output after stop word removal is given as input for preprocessing, where stemming and tokenization are done.
4. Tokens are passed on for calculating the term frequency, i.e., the number of occurrences of each term.
5. Clusters are formed on the basis of the decided feature sets.
6. After the clusters are formed, a query is fired. A term is selected to find the documents containing that word, along with its rank and weight.
7.4.3 Component Diagram
Figure 7.5: Component Diagram
Figure 7.5 shows the component diagram of the Document Clustering System.
The following components are used in this system:
1. Preprocessing
(a) Stop Word Removal
(b) Stemming
(c) Tokenization
2. Calculating Term Frequency
(a) Calculate Term Frequency
(b) Store in a vector matrix
3. Clustering Techniques
(a) K-Means
(b) Improved Algorithm
(c) Similarity Calculation
(d) Finding Maximum Dissimilarity
4. Removing Outlier
(a) Obtain Outlier Document
(b) Movement of Outlier Document
5. Application
(a) Executing Query
7.4.4 Deployment Diagram
Figure 7.6: Deployment Diagram
Figure 7.6 shows the deployment diagram of the Document Clustering System.
All the components mentioned in the component diagram are deployed and implemented using:
1. JDK 1.5.0
2. MySQL 5.1.36
7.4.5 Algorithms
1. Preprocessing Module:
It includes the algorithms for:
(a) Stop Word Removal
(b) Stemming
(a) Stop Word Removal: Stop words (prepositions, pronouns, articles, and irrelevant document metadata) are removed. A typical method for removing stop words is to compare each term with a compilation of known stop words.
Input: A document database D and a list of stop words L
D = {d1, d2, d3, ..., dk}, where 1 <= k <= i
tij is the jth term in the ith document
Output: All valid stemmed text terms in D
for (all di in D) do
    for (1 to j) do
        Extract tij from di
        if (tij in list L)
            Remove tij from di
    end for
end for
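The loop above can be made concrete in a few lines; this is a minimal runnable sketch assuming a toy stop word list L.

```python
STOP_WORDS = {"a", "an", "the", "is", "of", "in", "and", "to"}  # toy list L

def remove_stop_words(document, stop_words=STOP_WORDS):
    # For each term t_ij of the document, drop it if it appears in list L.
    return [t for t in document.lower().split() if t not in stop_words]

doc = "The clustering of documents is done in an efficient way"
print(remove_stop_words(doc))  # → ['clustering', 'documents', 'done', 'efficient', 'way']
```

The real system replaces each stop word with a blank space in the file; returning the surviving terms, as here, is equivalent for the later term-counting step.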
(b) Stemming: Word stemming is the process of suffix removal that produces general word stems. A stem is a natural group of words with similar meaning, obtained by reducing words to their base form. For example, the words connected, connection, and connections are all reduced to the stem connect. Porter's algorithm is the de facto standard stemming algorithm:
Step 1: Gets rid of plurals and -ed or -ing suffixes
Step 2: Turns terminal y to i when there is another vowel in the stem
Step 3: Maps double suffixes to single ones: -ization, -ational, etc.
Step 4: Deals with suffixes: -full, -ness, etc.
Step 5: Takes off -ant, -ence, etc.
Step 6: Removes a final -e
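The following is a drastically simplified suffix stripper in the spirit of Porter's Step 1, for illustration only; the full Porter algorithm adds measure conditions and the remaining steps, which is why it can also reduce connection to connect while this sketch cannot.

```python
def crude_stem(word):
    # Step-1-style stripping only: plurals, -ed, -ing. The guard on the
    # remaining stem length crudely stands in for Porter's measure rules.
    for suffix, repl in (("sses", "ss"), ("ies", "i"),
                         ("ing", ""), ("ed", ""), ("s", "")):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: len(word) - len(suffix)] + repl
    return word

for w in ("connected", "connecting", "connects"):
    print(crude_stem(w))  # → connect, three times
```

Even this tiny rule set collapses the common inflected forms onto one stem, which is what the term-frequency step needs.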
2. Similarity Calculation
The similarity between two documents i and j is computed as follows:
for i := 0 to N (total documents)
    for j := 0 to N (total documents)
        simvalue := (doc[i] . doc[j]) / Math.sqrt((doc[i] . doc[i]) * (doc[j] . doc[j]))
        Add simvalue to the list; build the matrix
    next
next
where N is the total number of documents and doc[i], for i = 1, 2, ..., N, are the document vectors.
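The formula above is the standard cosine similarity, sim(i, j) = (di . dj) / sqrt((di . di) * (dj . dj)). A minimal sketch that builds the full similarity matrix:

```python
import math

def cosine_similarity(u, v):
    # sim(i, j) = (u . v) / sqrt((u . u) * (v . v))
    dot = sum(a * b for a, b in zip(u, v))
    denom = math.sqrt(sum(a * a for a in u) * sum(b * b for b in v))
    return dot / denom if denom else 0.0

def similarity_matrix(doc_vectors):
    n = len(doc_vectors)
    return [[cosine_similarity(doc_vectors[i], doc_vectors[j])
             for j in range(n)] for i in range(n)]

docs = [[1, 1, 0], [1, 1, 1], [0, 0, 1]]
m = similarity_matrix(docs)
print(round(m[0][1], 3))  # → 0.816
```

The matrix is symmetric with ones on the diagonal, and documents sharing no terms score 0, which is what the clustering step relies on.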
3. Clustering Techniques:
K-Means and the improved method are used.
Algorithm for the K-Means method:
This algorithm consists of the following steps:
(a) Initialization: In this first step, the data set, the number of clusters, and a centroid for each cluster are defined.
(b) Classification: The distance from the centroids is calculated for each data point, and the data point having the minimum distance from the centroid of a cluster is assigned to that particular cluster.
(c) Centroid Recalculation: For the clusters generated previously, the centroids are repeatedly recalculated.
(d) Convergence Condition: Some convergence conditions are given below:
i. Stopping when a given or defined number of iterations is reached.
ii. Stopping when there is no exchange of data points between the clusters.
iii. Stopping when a threshold value is achieved.
(e) Repeat: If none of the above conditions is satisfied, go to step (b) and repeat the whole process until one of the given conditions is satisfied.
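The steps above can be sketched as follows; this illustration works on 2-D points rather than document vectors and uses "no point changed cluster" as the convergence condition.

```python
import math

def kmeans(points, centroids, max_iter=100):
    # Plain K-Means: assign each point to the nearest centroid (b),
    # recompute centroids (c), stop when assignments are stable (d).
    labels = None
    for _ in range(max_iter):
        new_labels = [min(range(len(centroids)),
                          key=lambda c: math.dist(p, centroids[c]))
                      for p in points]
        if new_labels == labels:   # convergence: no point changed cluster
            break
        labels = new_labels
        for c in range(len(centroids)):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:
                centroids[c] = tuple(sum(x) / len(members)
                                     for x in zip(*members))
    return labels, centroids

pts = [(0, 0), (0, 1), (10, 10), (10, 11)]
labels, cents = kmeans(pts, [(0, 0), (10, 10)])
print(labels)  # → [0, 0, 1, 1]
```

On document data, the points would be the term-frequency vectors and the distance would be cosine-based rather than Euclidean.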
Algorithm for the improved method:
Input: D = {d1, d2, d3, ..., di, ..., dn} // set of documents, where di = (x1, x2, x3, ..., xi, ..., xm), and k // the number of desired clusters.
Output: A set of k clusters.
Working Procedure:
1. Calculate the distance of each document (data point) from the origin.
2. Arrange the distances (obtained in step 1) in ascending order.
3. Split the sorted list into k equal-sized subsets. The middle point of each subset is taken as the centroid of that set.
4. For all data points, the distance between each data point and all the centroids is calculated, and each data point is assigned to the closest cluster.
5. In this step, the centroids of all the clusters are recalculated.
6. Again, for all data points, the distance between each data point and all the centroids is calculated. If this distance is less than or equal to the present nearest distance, the data point stays in the same cluster; else it is shifted to the nearest new cluster.
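Steps 1-3, the deterministic centroid initialization that distinguishes the improved method from random initialization, can be sketched as follows; only the initialization is shown, since steps 4-6 proceed essentially as in standard K-Means.

```python
import math

def improved_initial_centroids(points, k):
    # Steps 1-3: sort points by distance from the origin, split the
    # sorted list into k equal-sized subsets, and take each subset's
    # middle point as the initial centroid of that cluster.
    ordered = sorted(points, key=lambda p: math.dist(p, (0,) * len(p)))
    size = len(ordered) // k
    return [ordered[i * size + size // 2] for i in range(k)]

pts = [(0, 1), (1, 0), (1, 1), (9, 9), (10, 10), (9, 10)]
print(improved_initial_centroids(pts, 2))
```

Because the sort order is deterministic, every run starts from the same centroids, avoiding the run-to-run variability of randomly initialized K-Means.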
7.5 PROGRAM STRUCTURE
7.5.1 Architecture diagram
Refer to Section 7.2 for the architecture diagram.
7.5.2 DESCRIPTION OF COMPONENTS
Refer to the Detailed Design section of this chapter for the description of components.
Chapter 8
IMPLEMENTATION
8.1 IMPLEMENTATION
This system implements both the existing and the new algorithm. The two algorithms are compared on the basis of purity. Finally, a query is executed and the results are displayed.
8.2 SNAPSHOTS
8.2.1 Home-Document Clustering
Figure 8.1 shows the Home screen of the Document Clustering System. It is the first screen of the clustering system. This window consists of the following:
1. Selecting Dataset
2. Cluster
3. Clustering Techniques:
(a) K-Means
(b) Incremental Algorithm
4. Comparison of both algorithms
5. Finally, the application, i.e., how and where it can be used.
After clicking the "Select Dataset" button, the next window appears.
Figure 8.1: Home-Document Clustering
8.2.2 Preprocess Document
Figure 8.2: Preprocess Document
Figure 8.2 shows the Preprocess Document window.
8.2.3 Preprocess Document (SELECT)
Figure 8.3: Preprocess Document (SELECT)
Figure 8.3 shows the Preprocess Document window. After clicking the SELECT button, an Open dialog box (Figure 8.4) is displayed to select the dataset, i.e., the documents that are to be clustered.
Figure 8.4: Dialog Box
8.2.4 Preprocess Document (REMOVE)
Figure 8.5: Preprocess Document (REMOVE)
Figure 8.5 shows the window displayed after clicking the REMOVE button. Here stop words are removed from the selected dataset files.
8.2.5 Preprocess Document (STEMMING)
Figure 8.6: Preprocess Document (STEMMING)
Figure 8.6 shows the window displayed after clicking the STEMMING button. Here all words are stemmed to their base form; e.g., the words working, works, and worked are all stemmed to work.
8.2.6 Cluster Process
Figure 8.7: Cluster Process
Figure 8.7 shows the window for the clustering process. After clicking the Cluster button, the files from the selected dataset are displayed with the name and Cluster_ID assigned to each. Initially, Cluster_IDs are assigned randomly. After clicking the K-Mean button, the next window appears.
8.2.7 K-Means Process (Term Frequency)
Figure 8.8: K-Means Process (Term Frequency)
Figure 8.8 shows the K-Means Process window. After clicking the Compute Term Freq button, it shows all documents with their attribute counts and their respective clusters. After clicking the Next button, the next window appears.
8.2.8 K-Means Process (Centroid & Distance)
Figure 8.9: K-Means Process (Centroid & Distance)
Figure 8.9 shows the K-Means Process window. Here the centroid of each cluster is calculated, along with the distance between each cluster centroid and each data point. After clicking Next, the next window appears.
8.2.9 K-Means Process (View Clusters)
Figure 8.10: K-Means Process (View Clusters)
Figure 8.10 shows the window that lists the clusters (C1-C4). After selecting a particular cluster from the drop-down list and clicking the VIEW button, all the documents in that cluster are displayed. Click the back arrow to return to the Home-Document Clustering window.
8.2.10 Home-Document Clustering
Figure 8.11: Home-Document Clustering
Figure 8.11 shows the Home-Document Clustering window after K-Means clustering has been applied. After clicking the Inc button, the incremental clustering algorithm is applied. The new improved algorithm overcomes a drawback of the K-Means algorithm: it removes outliers. For that, similarity measurement between documents is done, and the following windows appear.
8.2.11 Similarity Vectors Measurement
Figure 8.12: Similarity Vectors Measurement for cluster 1
Figures 8.12 and 8.13 show the Similarity Vector Measurement windows for Cluster 1 and Cluster 2.
Figures 8.14 and 8.15 show the Similarity Vector Measurement windows for Cluster 3 and Cluster 4.
Figure 8.13: Similarity Vectors Measurement for cluster 2
Figure 8.14: Similarity Vectors Measurement for cluster 3
Figure 8.15: Similarity Vectors Measurement for cluster 4
8.2.12 Finding Dissimilarity
Figure 8.16: Finding Dissimilarity
Figure 8.16 shows the option for selecting a cluster.
• After selecting a particular cluster (1/2/3/4) and clicking the Cosine Similarity button, the similarity values between all documents in that cluster are displayed in the text area. These values are calculated using the cosine similarity formula.
• After clicking the Maximum Dissimilarity button, the dissimilarity value is displayed with its cluster id. This distance is calculated using the Levenshtein distance formula.
• After clicking the forward arrow button, the next window appears.
8.2.13 Outlier Document
Figure 8.17: Outlier Document
Figure 8.17 shows the window for finding the outlier document.
• Clicking the Get Document button displays the details of the document that is not relevant to the dataset documents.
• After clicking the forward arrow button, the next window appears.
8.2.14 Clustered Document (Get Document)
Figure 8.18: Clustered Document (Get Document)
Figure 8.18 shows the Clustered Document window. It shows the outlier document's contents and the names of the other documents in the same cluster.
• After clicking the forward arrow button, the next window appears.
8.2.15 Clustered Document (Obtain Outlier Document)
Figure 8.19: Clustered Document (Obtain Outlier Document)
Figure 8.19 shows the Clustered Document window. It shows the outlier document's name, its cluster id, the available clusters, and the related cluster to which it can be moved.
• After clicking the forward arrow button, the next window appears.
8.2.16 Outlier Document Movement
Figure 8.20: Outlier Document Movement
Figure 8.20 shows the movement of the outlier document to the related cluster.
• After closing the window, the Home-Document Clustering window appears.
8.2.17 Home-Document Clustering
Figure 8.21: Home-Document Clustering
Figure 8.21 shows the Home-Document Clustering window.
Click the Purity button to see the comparison between the existing K-Means algorithm and the new improved incremental algorithm. The following window appears after clicking the Purity button.
8.2.18 Checking Purity
Figure 8.22: Checking Purity
Figure 8.22 shows the purity values of the K-Means algorithm and the new improved algorithm.
On closing this window, the Home-Document Clustering window appears.
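The purity values being compared can be sketched as the fraction of documents that belong to the majority class of their cluster; the class labels below are illustrative, not taken from the actual dataset.

```python
from collections import Counter

def purity(clusters):
    # Purity = (1/N) * sum over clusters of the size of the majority
    # true class inside each cluster; higher is better, 1.0 is perfect.
    total = sum(len(c) for c in clusters)
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

# Each inner list holds the true class labels of the documents
# that ended up in one cluster.
good = [["sport", "sport", "sport"], ["news", "news"]]
mixed = [["sport", "news", "sport"], ["news", "sport"]]
print(purity(good), purity(mixed))  # → 1.0 0.6
```

A partition whose clusters each contain a single class scores 1.0, so the algorithm producing the higher purity has separated the classes better.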
8.2.19 Home-Document Clustering
Figure 8.23: Home-Document Clustering
Figure 8.23 shows the Home-Document Clustering window.
Clicking the Graph button opens the next window.
8.2.20 Clustering Accuracy
Figure 8.24: Clustering Accuracy
Figure 8.24 shows the graphical comparison between the K-Means and improved algorithms on the basis of accuracy.
On closing this window, the Home-Document Clustering window appears.
8.2.21 Home-Document Clustering
Figure 8.25: Home-Document Clustering
Figure 8.25 shows the Home-Document Clustering window.
Clicking the Application button opens the next window.
8.2.22 Executing Query
Figure 8.26: Executing Query
Figure 8.26 shows the Query Search window. This window shows how the clustering algorithm can be used in different applications. Here, select a term from the drop-down list and click the "Search" button. It shows the names of the documents containing that term, along with each document's cluster id, weight, and rank.
On closing this window, the Home-Document Clustering window appears.
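The query step can be sketched as a lookup that returns, for each document containing the term, its cluster id and a term-frequency weight used for ranking; the document names and data structure below are illustrative, not the actual system's API.

```python
def search(term, clustered_docs):
    # Return (document, cluster_id, weight) for every document containing
    # the term, ranked by the term's frequency (its weight) in the document.
    hits = []
    for cluster_id, docs in clustered_docs.items():
        for name, terms in docs.items():
            weight = terms.count(term)
            if weight:
                hits.append((name, cluster_id, weight))
    return sorted(hits, key=lambda h: h[2], reverse=True)

clustered = {
    1: {"d1.txt": ["cluster", "mine", "cluster"], "d2.txt": ["query"]},
    2: {"d3.txt": ["cluster", "text"]},
}
print(search("cluster", clustered))  # → [('d1.txt', 1, 2), ('d3.txt', 2, 1)]
```

The rank shown in the window corresponds to a document's position in this weight-sorted list.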
Chapter 9
TEST SPECIFICATION
9.1 INTRODUCTION
This document describes both the test plan and the test procedure, along with the goals and objectives of the test specifications.
9.1.1 Goals and Objectives
Tests are the individual tests specified in a test plan document. Each test is typically described by:
• An initial system state.
• A set of actions to be performed.
• The expected results of the test.
9.2 FORMAL TECHNICAL REVIEWS
Formal technical reviews and inspections of documents or software are performed to identify and remove defects. The formal technical review of my project was carried out at regular intervals in the form of stand-up meetings and brainstorming sessions conducted in the presence of the guide. The process included verification of the checklist developed for the review process; the code review checklist template is as follows:
• Does the code conform to Hungarian notation?
• Is the code well structured, consistent in style, and consistently formatted?
• Are all variables properly defined with meaningful, consistent, and clear names?
• Are there any redundant or unused variables?
• Does the code contain comments?
• Is the code error-free?
• Does the code handle exceptions?
• Does the code verify class names according to the standard format?
• Does the code avoid nested loop variables?
• Code refactoring (maximum 40 lines of code in a function).
9.3 TEST PLAN AND SCHEDULE
The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies, and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of tests; each test type addresses a specific testing requirement.
1. Test Plan
2. Testing Strategy
3. Test Schedule
9.3.1 Test Plan
Test cases are planned in accordance with the test process and documented with detailed test descriptions. These test cases use cases based on projected operational mission scenarios. The testing process also includes stress/load testing for stability purposes (i.e., at 95 percent CPU use, system stability is still guaranteed). The test process thoroughly tests the interfaces and modules. Software testing includes traceable white-box testing, black-box testing, and other test processes that verify the implemented software against the design documentation and the specified requirements.
• Unit testing:
Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application and is done after the completion of an individual unit, before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at the component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.
• Integration testing:
Integration tests are designed to test integrated software components to determine whether they actually run as one program. Testing is event driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that, although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.
• Functional testing:
Functional tests provide systematic demonstrations that the functions tested are available as specified by the business and technical requirements, system documentation, and user manuals. Functional testing is centered on the following items:
• Valid Input: Identified classes of valid input must be accepted.
• Invalid Input: Identified classes of invalid input must be rejected.
• Functions: Identified functions must be exercised.
• Output: Identified classes of application outputs must be exercised.
• Systems/Procedures: Interfacing systems or procedures must be invoked.
Organization and preparation of functional tests is focused on requirements, key functions, or special test cases. In addition, systematic coverage pertaining to identified business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of the current tests is determined.
• System testing:
System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration-oriented system integration test. System testing is based on process descriptions and flows, emphasizing pre-driven process links and integration points.
• White-box testing:
White-box testing is testing in which the software tester has knowledge of the inner workings, structure, and language of the software, or at least its purpose. It is used to test areas that cannot be reached at a black-box level.
• Black-box testing:
Black-box testing is testing the software without any knowledge of the inner workings, structure, or language of the module being tested. Black-box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box that cannot be seen into; the tester provides inputs and responds to outputs without considering how the software works.
9.3.2 Testing Strategy
Field testing will be performed manually, and functional tests are written in detail.
Test objectives:
• The entry screen, messages, and responses must not be delayed.
Features to be tested:
• All links should take the user to the correct form.
• Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.
9.3.3 Test Schedule
The test schedule is given in Table 9.1.
9.4 TESTING TOOLS AND ENVIRONMENT
• Tools for test data derivation (e.g., synthesizing data from path conditions)
• Tools for test evaluation (e.g., various coverage metrics)
• Tools for testing other software qualities
Test Phase | Time | Owner
Test Plan Creation | 1 wk | Test Manager
Test Specification Creation | 2 wk | Test Leads
Test Specification Team Review | 1 wk | Project Team
Component Testing | 4 wk | Component Testers
Integration Testing | 4 wk | Component and System Testers
System Testing | 3 wk | System Testers
Performance Testing | 1 wk | System Testers
Use Case Validation | 2 wk | System Testers
Alpha Testing | 1 wk | Product Managers / Analysts
Beta Testing Pilot Program | 4 wk | Pilot Customers
Table 9.1: Test schedule
9.4.1 Test work products
• To support new technology
• To support new software processes
• To support a particular method or methodology
9.4.2 Software to be tested
The system may give incorrect output if the software it relies on is faulty. The following software used in my system is therefore tested:
• Test the WampServer
• Test the NetBeans IDE
9.5 TEST CASES AND RESUL T
T able 9.2: T est Cases
Begin of T able
T est Case
IDOb jectiv es Steps Exp ected
O/PObserv ed O/P Result
TC1 Selection
of DatasetSelect
DatasetSelected
Files
should b e
tak en as
dataset.Selected Files
are tak en as
dataset.Selected le con-
ten ts are sho wn
in T ext Area.
TC2 Prepro cessing
of selected
datasetStop w ord
Remo v alStop w ords
should b e
remo v ed.Stop w ord are
remo v ed.Selected le
con ten ts are
sho wn in T ext
Area without
stop w ords.
TC3 Prepro cessing
of selected
datasetStemming All w ords
should b e
stemmed
to a v erb.All w ords are
stemmed to a
v erb.All w ords are
stemmed to
a v erb and
displa y ed in
T extArea.
TC4 T erm F re-
quencyCalculate
T erm
F requencyT erm fre-
quency
should b e
calculated
and dis-
pla y in
matrix
format.T erm frequency
is calculated and
displa y ed in ma-
trix format.T erm frequency
is calculated and
displa y ed in ma-
trix format.
TC5 Cen troid &
DistanceCalculate
Cen troid
& DistanceCen troid
& Distance
should b e
calculated.Cen troid & Dis-
tance is calcu-
lated.Cen troid & Dis-
tance is calcu-
lated.
76
SVIT Chinc holi, Departmen t of Computer Engineering 2015-16

A n Ecient T e chnique for Do cument Clustering
Con tin uation of T able 9.2
T est Case
IDOb jectiv es Steps Exp ected
O/PObserv ed O/P Result
TC6: Cluster Formation
    Steps: Form clusters.
    Expected O/P: Clusters should be formed according to the distance calculated.
    Observed O/P: Clusters are formed according to the distance calculated.
    Result: Clusters are formed according to the distance calculated.

TC7: Viewing Clusters
    Steps: Select a cluster to view the documents in it.
    Expected O/P: Only documents of the selected cluster should be shown.
    Observed O/P: Only documents of the selected cluster are shown.
    Result: Only documents of the selected cluster are shown.

TC8: Similarity Calculation
    Steps: Calculate the similarity measure between documents.
    Expected O/P: Similarity should be calculated.
    Observed O/P: Similarity is calculated.
    Result: Similarity is calculated.

TC9: Remove Outlier
    Steps: Find the irrelevant document.
    Expected O/P: The irrelevant document should be found and assigned to the related cluster.
    Observed O/P: The irrelevant document is found and assigned to the related cluster.
    Result: The irrelevant document is found and assigned to the related cluster.

TC10: Comparison
    Steps: Compare both the existing and the new algorithm.
    Expected O/P: Purity and a graph should be shown.
    Observed O/P: Purity and a graph are shown.
    Result: Purity and a graph are shown.
TC11: Application
    Steps: Execute a query.
    Expected O/P: All documents containing the specified term should be listed.
    Observed O/P: All documents containing the specified term are listed.
    Result: All documents containing the specified term are listed.
9.6 RESULT ANALYSIS
The results show that the purity of the clusters formed by the proposed algorithm is better than that of the existing K-Means algorithm. The system can therefore be used in a number of text mining applications.
9.7 ANALYSIS
At this initial stage, I have implemented the algorithm for a predefined number of clusters only, and the results obtained are positive.
CONCLUSION AND FUTURE SCOPE

• Conclusion
As clustering plays a vital role in various applications, much research is still being carried out. Upcoming innovations build mainly on the properties and characteristics of existing methods, and these existing approaches form the basis for further innovation in the field of clustering. Document clustering is useful for document organization, summarization, and efficient information retrieval. Initially, clustering was applied to enhance information retrieval techniques. More recently, clustering techniques have been applied in areas that involve browsing gathered data or categorizing the results returned by search engines in reply to users' queries.

In this system, I have implemented the existing document clustering algorithm, K-Means, and the new Incremental algorithm. The purity of the clusters formed by the Incremental algorithm is better than that of the existing one. The new algorithm can be used in many text mining applications.

• Future Scope:
• The system can be extended to cluster into a user-chosen number of clusters.
• The system can be used in a newsgroup application to find the news relevant to a selected sector.
• The system can be used in many text clustering applications.
Appendix A
LIST OF PUBLICATIONS

1. Priti B. Kudal, Prof. M. M. Naoghare, "A Review of Modern Document Clustering Techniques," International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 3, Issue 10, October 2014.

2. Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Hierarchical Technique for Document Clustering," International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 4, Issue 4, April 2015.

3. Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Technique for Document Clustering," International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Volume 3, Issue 3 (May-June 2015), pp. 169-172.

4. Priti B. Kudal, Prof. Manisha Naoghare, "An Efficient Technique for Document Clustering," International Conference on Technological Innovations through Modern Engineering Sciences, May 16, 2015, Shatabdi Institute of Technology, Agaskhind, Nashik.
Appendix B
MATHEMATICAL MODEL

Set Theory
Let S = (I, F, O), where:
I: the input dataset
F: the set of functions used for the implementation
O: the set of outputs

F = {F1, F2, ..., F10}
F1: Document Preprocessing
F2: Assigning Cluster IDs
F3: Computing Term Frequency
F4: Centroid & Distance Calculation
F5: Cluster Formation
F6: Incremental Algorithm
F7: Similarity Calculation & Finding Dissimilarity
F8: Moving Document
F9: Calculating Purity & Analysis by Graph
F10: Query Execution
F1: Document Preprocessing
X: All documents in the selected dataset are preprocessed.
F(X): Preprocessing, including stopword removal and stemming, of all documents in the selected dataset is done.
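The report does not name the exact stopword list or stemmer used in F1; a minimal sketch, with an illustrative stopword set and a naive suffix-stripping stemmer standing in for the real one:

```python
import re

# Illustrative stopword list; the dissertation's actual list is not specified.
STOPWORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "for"}

def stem(word):
    # Naive suffix stripping, a stand-in for a real stemmer such as Porter's.
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    """F1: tokenize, drop stopwords, and stem the remaining words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [stem(t) for t in tokens if t not in STOPWORDS]
```

In the actual system the stemmer would be replaced by whichever stemming library the implementation uses.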
F2: Assigning Cluster IDs
X: Initially, random cluster IDs are assigned to the files in the selected dataset.
F(X): Initially, random cluster IDs are assigned to the files in the selected dataset.
F3: Computing Term Frequency
X: The occurrence of every term (word) in a document is counted.
F(X): The number of times each word has occurred is calculated.
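F3 can be sketched as building a document-by-term matrix of raw counts over the shared vocabulary (the function and variable names here are illustrative, not taken from the implementation):

```python
from collections import Counter

def term_frequency(docs):
    """F3: count how many times each term occurs in each document.

    `docs` maps a document name to its list of preprocessed tokens; the
    result is a document-by-term matrix stored as nested dictionaries.
    """
    vocab = sorted({t for tokens in docs.values() for t in tokens})
    matrix = {}
    for name, tokens in docs.items():
        counts = Counter(tokens)
        # Every document row covers the full vocabulary, with 0 for absent terms.
        matrix[name] = {term: counts.get(term, 0) for term in vocab}
    return matrix
```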
F4: Centroid & Distance Calculation
X: The initial cluster centroid and the distance of all data points from the centroid are calculated.
F(X): The initial cluster centroid and the distance of all data points from the centroid are calculated.
F5: Cluster Formation
X: According to the distance calculated, clusters are formed.
F(X): According to the distance calculated, clusters are formed.
F6: Incremental Algorithm
X: The incremental algorithm is applied.
F(X): Distances are arranged in ascending order; center calculation and cluster formation are done.
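F4, F5, and F6 together describe a K-Means-style assignment followed by an incremental refinement. The report does not name the distance metric or pin down the order of operations, so the following sketch assumes Euclidean distance and reads F6 as visiting documents in ascending order of distance while updating centers after each assignment; this is one interpretation, not the definitive implementation:

```python
import math

def centroid(vectors):
    """F4: component-wise mean of the vectors in one cluster."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]

def distance(a, b):
    # Euclidean distance; the metric is an assumption, not stated in the report.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def form_clusters(vectors, centers):
    """F5: assign each document vector to its nearest center."""
    clusters = [[] for _ in centers]
    for i, v in enumerate(vectors):
        k = min(range(len(centers)), key=lambda c: distance(v, centers[c]))
        clusters[k].append(i)
    return clusters

def incremental_pass(vectors, centers):
    """F6 (one interpretation): visit documents in ascending order of their
    distance to the nearest center, updating the receiving cluster's center
    after every assignment."""
    order = sorted(range(len(vectors)),
                   key=lambda i: min(distance(vectors[i], c) for c in centers))
    clusters = [[] for _ in centers]
    centers = [list(c) for c in centers]
    for i in order:
        k = min(range(len(centers)), key=lambda c: distance(vectors[i], centers[c]))
        clusters[k].append(i)
        centers[k] = centroid([vectors[j] for j in clusters[k]])
    return clusters, centers
```

Updating the center after every assignment is what distinguishes this incremental pass from a plain K-Means iteration, which recomputes centers only after all assignments.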
F7: Similarity Calculation & Finding Dissimilarity
X: The similarity between documents and the dissimilarity value are calculated.
F(X): According to the selected cluster, the similarity between documents is calculated, and the dissimilarity value is calculated.
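The report does not specify which similarity measure F7 uses; cosine similarity over the term-frequency vectors is a common choice and is sketched here as an assumption, with dissimilarity taken as one minus the similarity:

```python
import math

def cosine_similarity(a, b):
    """F7: similarity between two term-frequency vectors (assumed cosine)."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def dissimilarity(a, b):
    # One simple dissimilarity value derived from the similarity (assumption).
    return 1.0 - cosine_similarity(a, b)
```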
F8: Moving Document
X: The document is moved to the related cluster.
F(X): The document is moved to the related cluster.
F9: Calculating Purity & Analysis by Graph
X: Purity is calculated and a graph is displayed.
F(X): Purity is calculated and a graph is displayed.
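The purity of F9 can be computed with the standard definition: the fraction of documents that fall in the majority class of their cluster. A sketch, assuming the true class label of each document is known:

```python
from collections import Counter

def purity(clusters, labels):
    """F9: fraction of documents belonging to the majority class of their
    cluster. `clusters` holds lists of document indices and `labels` maps
    each index to its true class."""
    total = sum(len(c) for c in clusters)
    # For each non-empty cluster, count its most common true class.
    majority = sum(Counter(labels[i] for i in c).most_common(1)[0][1]
                   for c in clusters if c)
    return majority / total
```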
F10: Query Execution
X: A query is executed.
F(X): The query is executed: a word is given for search, and the documents containing that word are listed with their weights.
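F10 can be sketched as a simple lookup over the term-frequency matrix of F3; the weighting scheme is not defined in the report, so raw term frequency is used here as an assumption:

```python
def execute_query(term, tf_matrix):
    """F10: list the documents containing `term`, ranked by weight.

    The weight is the term's raw frequency in the document (an assumption);
    returns (document, weight) pairs sorted by descending weight.
    """
    hits = [(doc, row.get(term, 0)) for doc, row in tf_matrix.items()
            if row.get(term, 0) > 0]
    return sorted(hits, key=lambda h: (-h[1], h[0]))
```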
Figure 9.1: Mathematical Model
Appendix C
c-PGCON-2015 CERTIFICATES
References

[1] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A Similarity Measure for Text Classification and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, July 2014.

[2] K. M. Hammouda and M. S. Kamel, "Efficient Phrase-based Document Indexing for Web Document Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, October 2004.

[3] H. Chim and X. Deng, "Efficient Phrase-based Document Similarity for Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, September 2008.

[4] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Olivier, "Exploring forensic data with self-organizing maps," in Proc. IFIP Int. Conf. Digital Forensics, 2005, pp. 113-123.

[5] N. L. Beebe and J. G. Clark, "Digital forensic text string searching: Improving information retrieval effectiveness by thematically clustering search results," Digital Investigation, Elsevier, vol. 4, no. 1, pp. 49-54, 2007.

[6] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem, "Towards an integrated e-mail forensic analysis framework," Digital Investigation, Elsevier, vol. 5, no. 3-4, pp. 124-137, 2009.

[7] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, "Mining writeprints from anonymous e-mails for forensic investigation," Digital Investigation, Elsevier, vol. 7, no. 1-2, pp. 56-64, 2010.

[8] S. Decherchi, S. Tacconi, J. Redi, A. Leoncini, F. Sangiacomo, and R. Zunino, "Text Clustering for Digital Forensics Analysis."

[9] K. Stoffel, P. Cotofrei, and D. Han, "Fuzzy methods for forensic data analysis," in Proc. IEEE Int. Conf. Soft Computing and Pattern Recognition, 2010, pp. 23-28.

[10] N. L. Beebe and J. G. Clark, "Digital forensic text string searching: Improving information retrieval effectiveness by thematically clustering search results."

[11] Y. Zhao and G. Karypis, "Hierarchical Clustering Algorithms for Document Datasets," Department of Computer Science, University of Minnesota, Minneapolis, MN 55455.

[12] L. Vendramin, R. J. G. B. Campello, and E. R. Hruschka, "Relative clustering validity criteria: A comparative overview," Statistical Analysis and Data Mining, vol. 3, pp. 209-235, 2010.

[13] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Olivier, "Exploring forensic data with self-organizing maps," in Proc. IFIP Int. Conf. Digital Forensics, 2005, pp. 113-123.

[14] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. London, U.K.: Chapman & Hall, 2005.

[15] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, June 2005.

[16] L. Hubert and P. Arabie, "Comparing partitions," Journal of Classification, vol. 2, pp. 193-218, 1985.

[17] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.

[18] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1998.