An Efficient Technique For Document Clustering

A DISSERTATION REPORT

ON

An Efficient Technique for Document Clustering

SUBMITTED TO

SAVITRIBAI PHULE PUNE UNIVERSITY, PUNE

IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE

AWARD OF DEGREE

Master Of Engineering (Computer Engineering)

BY

Priti Balu Kudal Exam Seat No:

Under the guidance of

Prof. M. M. Naoghare

DEPARTMENT OF COMPUTER ENGINEERING

Pravara Rural Education Society’s

SIR VISVESVARAYA INSTITUTE OF

TECHNOLOGY, CHINCHOLI

NASHIK-422102


2015-16

CERTIFICATE

This is to certify that the dissertation report entitled

“An Efficient Technique for Document Clustering”

Submitted by

Priti Balu Kudal Exam Seat No:

is a bonafide work carried out by her under the supervision of Prof. M. M. Naoghare, and it is submitted towards the partial fulfillment of the requirements of Savitribai Phule Pune University, Pune, for the award of the degree of Master of Engineering (Computer Engineering).

(Prof. M. M. Naoghare) (Prof. S. M. Rokade)

Internal Guide Head

Department of Comp. Engg. Department of Comp. Engg.

SVIT, Chincholi, Nashik SVIT, Chincholi, Nashik

Seal/Stamp of the College (Dr. G. B. Shinde)

Principal

SVIT, Chincholi, Nashik

Place: (Prof./Dr. )

Date: External Examiner

Certificate by Guide

This is to certify that Priti Balu Kudal has completed the dissertation work under my guidance and supervision, and that I have verified the work for originality in its documentation, problem statement, implementation, and results. Any reproduction of other relevant work has been made with prior permission, with due credit given and included in the references.

Prof. M. M. Naoghare

Place: Chincholi

Date:

ACKNOWLEDGEMENT

I take this opportunity to express my hearty thanks to all those who helped me in the completion of the dissertation on this topic.

I express my deep sense of gratitude to my dissertation guide, Prof. M. M. Naoghare, Department of Computer Engineering, Sir Visvesvaraya Institute of Technology, Chincholi, for her guidance and continuous motivation. I gratefully acknowledge the help she provided on many occasions, with great interest, for the improvement of this dissertation.

I would be failing in my duties if I did not express my deep sense of gratitude to Prof. S. M. Rokade, Head, Department of Computer Engineering, for permitting me to avail of the facilities and for his constant encouragement.

I would be failing in my duties if I did not express my deep sense of gratitude to Dr. G. B. Shinde, Principal, Sir Visvesvaraya Institute of Technology, for permitting me to avail of the facilities and for his constant encouragement.

I would also like to thank Prof. M. M. Naoghare, P.G. Coordinator, for her great support and excellent guidance.

I would like to extend my sincere thanks to my family members. It is my privilege to acknowledge their cooperation during the course of this dissertation. I express my heartiest thanks to my known and unknown well-wishers for their unreserved cooperation, encouragement and suggestions during the course of this dissertation.

Last but not least, I would like to thank all my teachers and all my friends who helped me with the ever-daunting task of gathering information for the dissertation.

Priti Balu Kudal

ABSTRACT

Clustering is the division of data into groups of similar objects. Each group, called a cluster, consists of objects that are similar to one another and dissimilar to the objects of other groups. Document clustering is important for document organization, summarization, topic extraction, and efficient information retrieval. Initially, clustering was applied to enhance information retrieval techniques. More recently, clustering techniques have been applied in areas that involve browsing collected data or categorizing the results returned by search engines in reply to a user's query. In document clustering, hundreds of thousands of files are usually examined. Much of the data in those files consists of unstructured text, which is difficult for computer examiners to analyze. In particular, algorithms for clustering documents can facilitate the discovery of new and useful knowledge from the documents under analysis. This dissertation proposes a more efficient document clustering algorithm.

Keywords: Data Mining, Clustering, Similarity Measure, Cosine Similarity, Term Frequency.

Contents

Acknowledgement

Abstract

Index

List of Figures

List of Tables

List of Publications

1 SYNOPSIS

1.1 OBJECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.2 MOTIVATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.3 HYPOTHESIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.4 STRATEGY PLANNED ASSOCIATED WITH THE DISSERTATION . . .

1.5 REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING DISSERTATION

IDEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.6 PLAN OF DISSERTATION EXECUTION . . . . . . . . . . . . . . . . . .

1.7 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.8 SOLVING APPROACH . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.9 EFFICIENCY ISSUES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

1.10 OUTCOMES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2 TECHNICAL KEYWORDS

2.1 CATEGORIES AND SUBJECT DESCRIPTOR . . . . . . . . . . . . . . . .

2.2 GENERAL TERMS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

2.3 TECHNICAL KEYWORDS . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 INTRODUCTION

3.1 DISSERTATION IDEA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.2 MOTIVATION OF THE DISSERTATION . . . . . . . . . . . . . . . . . .

3.3 LITERATURE SURVEY . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4 PROBLEM DEFINITIONS AND SCOPE

4.1 PROBLEM STATEMENT . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.2 GOALS AND OBJECTIVES . . . . . . . . . . . . . . . . . . . . . . . . . .

4.3 STATEMENT OF SCOPE . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.4 SOFTWARE CONTEXT . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.5 MAJOR CONSTRAINT . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.6 APPROACHES FOR SOLVING THE PROBLEM AND EFFICIENCY ISSUES

4.7 OUTCOMES . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

4.8 HARDWARE RESOURCE REQUIRED . . . . . . . . . . . . . . . . . . . .

4.9 SOFTWARE RESOURCE REQUIRED . . . . . . . . . . . . . . . . . . . .

4.10 AREA OF DISSERTATION . . . . . . . . . . . . . . . . . . . . . . . . . . .

5 DISSERTATION PLAN

5.1 PURPOSE . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.2 DISSERTATION PLAN . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

5.3 GANTT CHART . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6 SOFTWARE REQUIREMENT SPECIFICATION

6.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.1.1 Purpose and scope of document . . . . . . . . . . . . . . . . . . . . .

6.1.2 Overview of responsibilities of developer . . . . . . . . . . . . . . . .

6.2 PRODUCT OVERVIEW . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3 USAGE SCENARIO . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3.1 User Profile . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3.2 Use-cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3.3 Use-case view . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.4 DATA MODEL AND DESCRIPTION . . . . . . . . . . . . . . . . . . . . .

6.4.1 Data Description . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.5 FUNCTIONAL MODEL AND DESCRIPTION . . . . . . . . . . . . . . . .

6.5.1 Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.5.2 Non Functional Requirement . . . . . . . . . . . . . . . . . . . . . . .

6.5.3 Functional Flow Diagram . . . . . . . . . . . . . . . . . . . . . . . .

6.5.4 Software Quality Attributes . . . . . . . . . . . . . . . . . . . . . .

6.5.5 Design Constraint . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.5.6 Software Interface Description . . . . . . . . . . . . . . . . . . .

6.6 RESTRICTIONS, LIMITATIONS, AND CONSTRAINTS . . . . . . . . . . .

6.7 VALIDATION CRITERIA . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.7.1 Classes of tests . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.7.2 Estimated Software response . . . . . . . . . . . . . . . . . . . . . . .

7 DETAILED DESIGN DOCUMENT

7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.1.1 Overview of products . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.2 ARCHITECTURAL DESIGN . . . . . . . . . . . . . . . . . . . . . . . . .

7.2.1 System Architecture: . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.3 System Model: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.4 COMPONENT DESIGN . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.4.1 Class diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.4.2 Interaction Diagrams . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.4.3 Algorithms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.5 PROGRAM STRUCTURE . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.5.1 Architecture diagram . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.5.2 DESCRIPTION FOR COMPONENT . . . . . . . . . . . . . . . . .

8 IMPLEMENTATION

8.1 IMPLEMENTATION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8.2 SNAP SHOTS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

8.2.1 First Screen

8.2.2 Selecting Dataset

8.2.3 Defining number of clusters and assigning documents to clusters.(Randomly)

8.2.4 K-Means: Calculating Term Frequency

8.2.5 Calculating centroid and distance from centroid to all documents

8.2.6 K-Means: Assigning documents to the nearest cluster until no document moves between clusters

8.2.7 Similarity Measurement

8.2.8 Finding Dissimilarity

8.2.9 Finding Outlier Document

8.2.10 Moving Outlier Document

8.2.11 Comparing purity of both Algorithms

8.2.12 Executing Query

9 TEST SPECIFICATION

9.1 INTRODUCTION . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.1.1 Goals and Objectives . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.2 FORMAL TECHNICAL REVIEWS . . . . . . . . . . . . . . . . . . . . . .

9.3 TEST PLAN AND SCHEDULE . . . . . . . . . . . . . . . . . . . . . . . . .

9.3.1 Test Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.3.2 Testing Strategy . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.3.3 Test Schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.4 TESTING TOOLS AND ENVIRONMENT . . . . . . . . . . . . . . . . . .

9.4.1 Test work products . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.4.2 Software to be tested . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.5 TEST CASES AND RESULT . . . . . . . . . . . . . . . . . . . . . . . . . .

9.6 RESULT ANALYSIS . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.7 ANALYSIS: . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

CONCLUSION AND FUTURE SCOPE

APPENDICES

A LIST OF PUBLICATION

B MATHEMATICAL MODEL

C PLAGIARISM REPORT

REFERENCES

List of Figures

5.1 Gantt Chart . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.1 Use case diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.2 DFD 0 diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.3 DFD 1 diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.4 DFD 2 diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

6.5 Functional flow diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.1 System architecture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.2 Class Diagram . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

7.3 Sequence Diagram

List of Tables

5.1 Dissertation Plan . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.1 Test schedule . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

9.2 Test cases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

LIST OF PUBLICATION

[1] Priti B. Kudal, Prof. M. M. Naoghare, "A Review of Modern Document Clustering Techniques", International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 3, Issue 10, October 2014.

[2] Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Hierarchical Technique for Document Clustering", International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 4, Issue 4, April 2015.

[3] Priti B. Kudal, Prof. Manisha Naoghare, "An Improved Technique for Document Clustering", International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Volume 3, Issue 3 (May-June 2015), pp. 169-172.

Chapter 1

SYNOPSIS

DISSERTATION TITLE

An Efficient Technique for Document Clustering

INTERNAL GUIDE

Prof. M. M. Naoghare

TECHNICAL KEYWORDS

Data Mining, Clustering, Similarity Measure, Cosine Similarity, Term Frequency.

OBJECTIVES

The objective is to develop a document clustering algorithm that is more efficient than the existing method. The clustering accuracy of the proposed method will be better than that of the existing method.

MOTIVATION

As the web grows larger day by day, and with the growth of social networking, the documents collected from the web are becoming more and more sparse. Such sparseness imposes new challenges in applying clustering techniques. Clustering such sparse data could lead to new trends in web search and similar applications.

Clustering of documents is important for document organization, summarization, topic extraction, and efficient information retrieval. Initially, clustering was applied to enhance information retrieval techniques. More recently, clustering techniques have been applied in areas that involve browsing collected data or categorizing the results returned by search engines in reply to a user's query.

The main motivation of this work has been to investigate possibilities for improving the effectiveness of document clustering by identifying the main causes of ineffectiveness in existing algorithms and finding their solutions.

HYPOTHESIS

Document clustering is becoming more and more important with the abundance of text documents available through the World Wide Web and corporate document management systems. However, the existing text clustering techniques still have some major drawbacks that greatly affect their practical applicability.

The drawbacks in the existing clustering approaches are listed below:

Text clustering that yields a clear-cut output is the most desirable. However, people with different needs regard documents differently with respect to the clustering of texts. For example, a businessman does not look at business documents the same way a technologist does (Macskassy et al. 1998). Clustering tasks therefore depend on intrinsic parameters that allow for a diversity of views.

Text clustering is a clustering task in a high-dimensional space, where each word is treated as an important attribute of a text. Empirical and mathematical analyses have revealed that clustering in high-dimensional spaces is very difficult, as every data point tends to have nearly the same distance from all other data points (Beyer et al. 1999).
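The distance-concentration effect cited above (Beyer et al. 1999) is easy to observe empirically. The following sketch (illustrative, not from the dissertation) measures the relative spread of pairwise distances among uniformly random points as the dimension grows:

```python
import math
import random

def pairwise_distance_spread(dim, n_points=100, seed=0):
    """Ratio of (max - min) pairwise distance to the min distance for
    uniformly random points; it collapses as dimensionality grows."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [math.dist(a, b) for i, a in enumerate(pts) for b in pts[i + 1:]]
    return (max(dists) - min(dists)) / min(dists)

# The relative spread shrinks sharply as the dimension increases:
for d in (2, 10, 100, 1000):
    print(d, round(pairwise_distance_spread(d), 3))
```

In two dimensions the nearest and farthest neighbors differ by orders of magnitude; in a thousand dimensions all pairwise distances are nearly equal, which is exactly what makes distance-based clustering of text vectors hard.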

Desirable Properties of a Clustering Algorithm:

Scalability (in terms of both time and space)

Ability to deal with different data types

Ability to deal with noise and outliers

Insensitivity to the order of input records

Incorporation of user-specified constraints

Interpretability and usability

There are only a few studies reporting the use of clustering algorithms in the Computer Forensics field. Essentially, most of the studies describe the use of classic algorithms for clustering data—e.g., Expectation-Maximization (EM) for unsupervised learning of Gaussian Mixture Models, K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM).

These algorithms have well-known properties and are widely used in practice.

For instance, K-means and FCM can be seen as particular cases of EM [8]. Algorithms like SOM [9], in their turn, generally have inductive biases similar to K-means, but are usually less computationally efficient.

STRATEGY PLANNED ASSOCIATED WITH THE DISSERTATION

Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information – information that can be used to increase revenue, cut costs, or both. Information Retrieval (IR) (Baeza 1992) is the field of computer science that focuses on processing documents in such a way that they can be quickly retrieved based on keywords specified in a user's query. IR technology is the foundation of web-based search engines and plays a key role in biomedical research, as it is the basis of software that aids literature search. A number of clustering algorithms are available today, and the most commonly used is K-Means. However, it is sensitive to outliers, so Incremental K-Means is designed to overcome this drawback. It can be used in different applications, e.g. a newsgroup application, so that only related documents are retrieved for a fired query, which reduces response time.

REVIEW OF CONFERENCE/JOURNAL PAPERS SUPPORTING THE

DISSERTATION IDEA

There are only a few studies reporting the use of clustering algorithms in the Computer Forensics field. Essentially, most of the studies describe the use of classic algorithms for clustering data—e.g., Expectation-Maximization (EM), K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM). These algorithms have well-known properties and are widely used in practice. For instance, K-means and FCM can be seen as particular cases of EM. The underlying assumption is that the clustered results can increase the information retrieval efficiency, because it would not be necessary to review all the documents found by the user anymore.

Clustering Methods

Common clustering techniques come under two broad categories:

Partitional

Hierarchical

Partitional

K-Means

K-means is the most important flat clustering algorithm. The objective function of K-means is to minimize the average squared distance of objects from their cluster centers, where a cluster center is defined as the mean or centroid µ of the objects in a cluster C.

The ideal cluster in K-means is a sphere with the centroid as its center of gravity. Ideally, the clusters should not overlap. A measure of how well the centroids represent the members of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector from its centroid summed over all vectors.

Partitional methods construct various partitions and then evaluate them by some criterion.

They are nonhierarchical: each instance is placed in exactly one of K non-overlapping clusters. Since only one set of clusters is output, the user normally has to specify the desired number of clusters, K.

Comments on K-Means:

Strength:

Simple, easy to implement and debug

Intuitive objective function: optimizes intra-cluster similarity

Relatively efficient: O(tkn), where n is # objects, k is # clusters, and t is # iterations. Normally, k, t << n.

Weakness:

Applicable only when a mean is defined; what about categorical data?

Often terminates at a local optimum. Initialization is important.

Need to specify K, the number of clusters, in advance.

Unable to handle noisy data and outliers.

Expectation Maximization

The EM algorithm falls within a subcategory of flat clustering algorithms called model-based clustering. Model-based clustering assumes that the data were generated by a model and tries to recover the original model from the data. This model then defines the clusters and the cluster membership of the data. The EM algorithm is a generalization of the K-Means algorithm in which the set of K centroids is the model that generates the data. It alternates between an expectation step, corresponding to reassignment, and a maximization step, corresponding to recomputation of the model's parameters.
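The alternation described above can be sketched as follows. This is a minimal, illustrative EM in the spirit of soft K-means (spherical unit-variance Gaussians, first k points as seeds), not the dissertation's implementation:

```python
import math

def em_soft_kmeans(points, k, iters=20):
    """EM view of K-means: the E-step softly reassigns points to centroids
    (expectation), the M-step recomputes each centroid as a
    responsibility-weighted mean (maximization). Spherical unit-variance
    Gaussians; the first k points seed the model (a simplification)."""
    centroids = [list(p) for p in points[:k]]
    dim = len(points[0])
    for _ in range(iters):
        # E-step: responsibility of each centroid for each point.
        resp = []
        for p in points:
            w = [math.exp(-0.5 * math.dist(p, c) ** 2) for c in centroids]
            s = sum(w) or 1.0
            resp.append([wi / s for wi in w])
        # M-step: responsibility-weighted mean per centroid.
        for j in range(k):
            total = sum(r[j] for r in resp) or 1.0
            centroids[j] = [sum(r[j] * p[d] for r, p in zip(resp, points)) / total
                            for d in range(dim)]
    return centroids
```

With hard (0/1) responsibilities this reduces exactly to the K-means reassignment/recomputation loop, which is the sense in which K-means is a particular case of EM.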

Hierarchical:

Hierarchical clustering approaches attempt to create a hierarchical decomposition of the given document collection. Hierarchical methods are usually classified into agglomerative and divisive methods, depending on how the hierarchy is constructed. Agglomerative methods start with an initial clustering of the term space in which every document is considered to represent a separate cluster.

• Bottom-Up (agglomerative): Starting with each item in its own cluster, find the best pair to merge into a new cluster. Repeat until all clusters are fused together.

• Top-Down (divisive): Starting with all the data in a single cluster, consider every

possible way to divide the cluster into two. Choose the best division and recursively

operate on both sides.
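The bottom-up procedure above can be sketched as a minimal single-linkage agglomerative loop. This is illustrative only; the dissertation does not specify a linkage criterion, so single linkage here is an assumption:

```python
import math

def agglomerative(points, target_k):
    """Bottom-up clustering: start with singleton clusters and repeatedly
    merge the closest pair (single linkage: distance between clusters is
    the smallest point-to-point distance) until target_k clusters remain.
    O(n^3) as written; fine for a sketch, not for large collections."""
    clusters = [[p] for p in points]
    while len(clusters) > target_k:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(math.dist(a, b)
                        for a in clusters[i] for b in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # merge the best pair
    return clusters
```

Stopping at `target_k` clusters cuts the dendrogram at one level; letting the loop run to a single cluster would produce the full hierarchy.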

Summary of Hierarchical Clustering Methods:

No need to specify the number of clusters in advance.

Hierarchical structure maps nicely onto human intuition for some domains.

They do not scale well: time complexity of at least O(n²), where n is the total number of objects.

Like any heuristic search algorithms, local optima are a problem.

Interpretation of results is (very) subjective.

PLAN OF DISSERTATION EXECUTION

The entire work is divided into the following tasks to facilitate dissertation execution:

• Phase-1 Choose Research Area

• Phase-2 Preliminary Research

• Phase-3 Decide Research Topic

• Phase-4 Decide Methodology

• Phase-5 Submit or present proposal

• Phase-6 Finalize method

• Phase-7 Conduct Research

• Phase-8 Analyze Data

• Phase-9 Report Writing

• Phase-10 Submitting Report

PROBLEM STATEMENT

The problem statement of this work is that the quality of document clustering can be improved by applying new clustering techniques. First, the effectiveness of pre-processing in the data representation will be studied, and then the classical clustering methods will be applied. An effective new clustering algorithm will be developed in which noise is reduced by first clustering the features of the data and then clustering the data on the basis of their feature clusters.

PROBLEM SOLVING APPROACH

An improved technique for document clustering is developed to overcome the drawbacks of K-Means. Both the K-Means and Incremental K-Means algorithms are implemented and applied to the available dataset, and a comparison of the two algorithms is shown. The basic applications in which the proposed concept can be used are summarized below:

1. Finding Similar Documents:

This feature is often used when the user has spotted one “good” document in a search result and wants more-like-this. The interesting property here is that clustering is able to discover documents that are conceptually alike in contrast to search-based approaches that are only able to discover whether the documents share many of the same words.

2. Organizing Large Document Collections:

Document retrieval focuses on finding documents relevant to a particular query, but it fails to solve the problem of making sense of a large number of uncategorized documents. The challenge here is to organize these documents in a taxonomy identical to the one humans would create given enough time and use it as a browsing interface to the original collection of documents.

3. Duplicate Content Detection:

In many applications there is a need to find duplicates or near-duplicates in a large number of documents. Clustering is employed for plagiarism detection, grouping of related news stories and to reorder search results rankings (to assure higher diversity among the topmost documents). Note that in such applications the description of clusters is rarely needed.

4. Search Optimization:

Clustering helps greatly in improving the quality and efficiency of search engines, as the user query can first be compared with the clusters instead of being compared directly with the documents.
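A hedged sketch of this idea: route the query to the nearest centroid first, then rank only that cluster's documents. All names here are illustrative, not the system's actual API:

```python
import math

def route_query(query_vec, centroids, docs_by_cluster):
    """Search optimization via clustering: compare the query with the k
    cluster centroids first, then search only the nearest cluster's
    documents instead of the whole collection."""
    nearest = min(range(len(centroids)),
                  key=lambda j: math.dist(query_vec, centroids[j]))
    # Rank only the documents of the nearest cluster by distance to the query.
    return sorted(docs_by_cluster[nearest],
                  key=lambda doc: math.dist(query_vec, doc["vec"]))
```

For k clusters of roughly n/k documents each, the query is compared against k + n/k vectors instead of all n, which is the response-time saving the text describes.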

EFFICIENCY ISSUES

Though K-Means clustering has some strong advantages, it also has efficiency issues. K-Means does not work well when the clusters in the original data are of different sizes or different densities. The algorithm is also sensitive to outliers, since an object with an extremely large value may substantially distort the distribution of the data. To overcome these drawbacks, an Incremental algorithm is developed whose efficiency is better than that of K-Means.
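The synopsis does not spell out the Incremental K-Means rule here, but the snapshots listed in Chapter 8 (finding the outlier document, moving the outlier document) suggest a step like the following. This is a hedged sketch of that idea, not the dissertation's exact algorithm:

```python
import math

def reassign_outlier(clusters):
    """Hedged sketch of an outlier-handling step: find the document
    farthest from its own cluster's centroid (the outlier) and move it
    to the cluster whose centroid is nearest. Each cluster is a list of
    equal-length numeric vectors; the real rule may differ."""
    def centroid(c):
        return [sum(v[d] for v in c) / len(c) for d in range(len(c[0]))]

    cents = [centroid(c) for c in clusters]
    # The outlier: the document with the largest distance to its own centroid
    # (singleton clusters are skipped so a cluster is never emptied).
    _, src, doc = max(((math.dist(v, cents[i]), i, v)
                       for i, c in enumerate(clusters) for v in c
                       if len(c) > 1),
                      key=lambda t: t[0])
    # Move it to the cluster with the nearest centroid, if that differs.
    dst = min(range(len(clusters)), key=lambda j: math.dist(doc, cents[j]))
    if dst != src:
        clusters[src].remove(doc)
        clusters[dst].append(doc)
    return clusters
```

Repeating this step until no document moves mirrors the "moving outlier document" stage shown in the implementation snapshots.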

OUTCOMES

K-Means algorithm is developed.

Incremental algorithm is developed.

Comparison of both algorithms is shown.

Query is fired and results are shown.

Chapter 2

TECHNICAL KEYWORDS

CATEGORIES AND SUBJECT DESCRIPTOR

CATEGORIES

Document Clustering

Similarity Measure

K-Means Algorithm

Incremental Algorithm

SUBJECT DESCRIPTOR

Data Mining

GENERAL TERMS

Data Mining, Information Retrieval, Text Mining, Features, Query

TECHNICAL KEYWORDS

Data Mining

Data mining is an interdisciplinary subfield of computer science. It is the computational process of discovering patterns in large data sets ("big data") involving methods at the intersection of artificial intelligence, machine learning, statistics, and database systems.

Clustering

Clustering is the task of discovering groups and structures in the data that are in some way or another "similar", without using known structures in the data.

Classification

It is the task of generalizing known structure to apply to new data. For example, an e-mail program might attempt to classify an e-mail as "legitimate" or as "spam".

Similarity Measure

It can represent the similarity between two documents, two queries, or one document and one query. It is a function which computes the degree of similarity between a pair of text objects.
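For term-weight vectors, the cosine similarity named in the keywords can be computed as below. This is the standard formulation, not code taken from the system:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two term-weight vectors: the cosine of
    the angle between them. 1.0 means identical direction (same relative
    term weights); 0.0 means orthogonal (no shared terms)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = (math.sqrt(sum(x * x for x in a))
            * math.sqrt(sum(y * y for y in b)))
    return dot / norm if norm else 0.0
```

Because it depends only on the angle, not the vector lengths, a long document and its short summary can still score as highly similar, which is why cosine is preferred over Euclidean distance for comparing documents and queries.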

Term Frequency

The number of times a term occurs in a document is called its term frequency.
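A minimal illustration of term frequency, assuming a trivial lowercase/whitespace tokenizer in place of the real preprocessing (stop-word removal, stemming):

```python
from collections import Counter

def term_frequencies(text):
    """Raw term frequency: the number of times each term occurs in a
    document. The lowercase/whitespace split is a stand-in for real
    tokenization and preprocessing."""
    return Counter(text.lower().split())

tf = term_frequencies("Data mining finds patterns data mining finds structure")
```

These per-document counts are the raw material for the term-frequency vectors on which the similarity measures and the clustering algorithms operate.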

Chapter 3

INTRODUCTION

3.1 DISSERTATION IDEA

Data mining is the process of non-trivial discovery of implicit, previously unknown, and potentially useful information from data in large databases. Hence it is a core element in knowledge discovery, and the two terms are often used synonymously. Clustering is a data mining technique used for grouping similar items together. Earlier statistical approaches to text mining depended on term frequency; later, a concept-based text mining model was introduced that analyzes terms. Clustering of documents is useful for document organization, summarization, and efficient information retrieval. Initially, clustering was applied to enhance information retrieval techniques. More recently, clustering techniques have been applied in areas that involve browsing collected data or categorizing the results returned by search engines in reply to a user's query.

Clustering is sometimes erroneously referred to as automatic classification; however, this is inaccurate, since the clusters found are not known prior to processing whereas in case of classification the classes are pre-defined.

In clustering, the distribution and the nature of the data determine cluster membership. In classification, by contrast, the classifier learns the association between objects and classes from a so-called training set, i.e. a set of data correctly labeled by hand, and then replicates the learnt behavior on unlabeled data.

Document clustering is the process of categorizing text documents into systematic clusters or groups, such that the documents in the same cluster are similar whereas documents in other clusters are dissimilar. It is one of the vital processes in text mining. Liping (2010) emphasized that the expansion of the internet and of computational processes has paved the way for various clustering techniques. Text mining in particular has gained a lot of importance; it demands tasks such as the production of granular taxonomies and document summarization for deriving higher-quality information from text.

This system provides both algorithms, K-Means and Incremental K-Means. A comparison of the two algorithms is made and their use is demonstrated.

3.2 MOTIVATION OF THE DISSERTATION

An application can be developed for recommendations of news articles to the readers of a news portal. The following challenges gave me the motivation to use clustering of news articles:

The number of available articles is large.

A large number of articles are added each day.

Articles corresponding to same news are added from different sources.

The recommendations have to be generated and updated in real time.

By clustering the articles, we could reduce the search domain for recommendations, as most users are interested in news corresponding to a small number of clusters. This improves time efficiency to a great extent. We could also identify articles covering the same news from different sources. The main motivation of this work has been to investigate possibilities for improving the effectiveness of document clustering by identifying the main causes of ineffectiveness in existing algorithms and finding their solutions.

3.3 LITERATURE SURVEY

It is important to emphasize that getting from a collection of documents to a clustering of the collection is not merely a single operation, but rather a process in multiple stages. These stages include more traditional information retrieval operations such as crawling, indexing, weighting, and filtering. Some of these processes are central to the quality and performance of most clustering algorithms, so it is necessary to consider them together with a given clustering algorithm to harness its true potential. We will give a brief overview of the clustering process before beginning the literature study and analysis.

There are only a few studies reporting the use of clustering algorithms in the Computer Forensics field. Essentially, most of the studies describe the use of classic algorithms for clustering data—e.g., Expectation-Maximization (EM) for unsupervised learning of Gaussian Mixture Models, K-means, Fuzzy C-means (FCM), and Self-Organizing Maps (SOM).

These algorithms have well-known properties and are widely used in practice. For instance, K-means and FCM can be seen as particular cases of EM. Algorithms like SOM, in their turn, generally have inductive biases similar to K-means, but are usually less computationally efficient.

SOM-based algorithms were used for clustering files with the aim of making the examiners' decision-making process more efficient. The files were clustered by taking into account their creation dates/times and their extensions. The underlying assumption is that the clustered results can increase information retrieval efficiency, because it would no longer be necessary to review all the documents found by the user.

K-Means is the most important flat clustering algorithm. The objective function of K-means is to minimize the average squared distance of objects from their cluster centers, where a cluster center is defined as the mean or centroid µ of the objects in a cluster C:

µ(C) = (1/|C|) · Σ_{x ∈ C} x

The ideal cluster in K-means is a sphere with the centroid as its center of gravity; ideally, the clusters should not overlap. A measure of how well the centroids represent the members of their clusters is the Residual Sum of Squares (RSS), the squared distance of each vector from its centroid, summed over all vectors:

RSS = Σ_k Σ_{x ∈ C_k} |x - µ(C_k)|²

K-means can start by selecting K randomly chosen objects, the seeds, as initial cluster centers. It then moves the cluster centers around in space so as to minimize RSS. This is done iteratively by repeating two steps until a stopping criterion is met.

1. Reassign each object to the cluster with the closest centroid.

2. Recompute each centroid based on the current members of its cluster.

Any of the following termination conditions can be used as the stopping criterion:

• A fixed number of iterations I has been completed.

• Centroids µ do not change between iterations.

• Terminate when RSS falls below a pre-established threshold.
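The two-step loop above can be sketched in Java as follows. This is a minimal illustration: the toy vectors, the seed choice (the first K objects), and the iteration cap are assumptions for the example, not the system's actual implementation.

```java
import java.util.*;

/** Minimal K-means sketch: assign to nearest centroid, recompute, stop when stable. */
public class KMeansSketch {

    /** Squared Euclidean distance between two term-frequency vectors. */
    static double dist2(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return s;
    }

    /** Runs K-means; returns the final cluster index of each object. */
    static int[] cluster(double[][] x, int k, int maxIter) {
        double[][] c = new double[k][];
        for (int j = 0; j < k; j++) c[j] = x[j].clone();   // first k objects as seeds
        int[] label = new int[x.length];
        for (int it = 0; it < maxIter; it++) {
            boolean changed = false;
            // Step 1: reassign each object to the closest centroid
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < k; j++)
                    if (dist2(x[i], c[j]) < dist2(x[i], c[best])) best = j;
                if (label[i] != best) { label[i] = best; changed = true; }
            }
            if (!changed && it > 0) break;                 // assignments stable: stop
            // Step 2: recompute each centroid as the mean of its current members
            for (int j = 0; j < k; j++) {
                double[] mean = new double[x[0].length];
                int n = 0;
                for (int i = 0; i < x.length; i++)
                    if (label[i] == j) {
                        n++;
                        for (int d = 0; d < mean.length; d++) mean[d] += x[i][d];
                    }
                if (n > 0) { for (int d = 0; d < mean.length; d++) mean[d] /= n; c[j] = mean; }
            }
        }
        return label;
    }

    public static void main(String[] args) {
        double[][] docs = { {1, 0}, {8, 9}, {1.2, 0.1}, {9, 8} }; // toy term vectors
        System.out.println(Arrays.toString(cluster(docs, 2, 100)));
    }
}
```

The stopping rule shown corresponds to the second termination condition above (no change between iterations); a fixed iteration limit acts as the first.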

Comparison of Various Systems:

K-means suffers from the following drawbacks:

• It has problems when clusters are of differing sizes or densities, or have non-globular shapes.

• It is sensitive to outliers.

• It can produce empty clusters.

The K-Nearest Neighbour (K-NN) algorithm suffers from the following drawbacks:

• Because all the work is done at run time, K-NN can have poor run-time performance if the training set is large.

• K-NN is very sensitive to irrelevant or redundant features, because all features contribute to the similarity and thus to the classification. This can be ameliorated by careful feature selection or feature weighting.

• On very difficult classification tasks, K-NN may be outperformed by more sophisticated techniques such as Support Vector Machines or neural networks.

Likas et al. (2003) proposed the global K-means clustering technique, which creates initial centers by recursively dividing the data space into disjoint subspaces using the K-dimensional tree approach. The cutting hyperplane used in this approach is the plane perpendicular to the maximum-variance axis derived by Principal Component Analysis (PCA). Partitioning is carried out until each leaf node holds fewer than a predefined number of data instances, or until the predefined number of buckets has been generated. The initial centers for K-means are the centroids of the data present in the final buckets. Shehroz Khan and Amir Ahmad (2004) proposed iterative clustering techniques to calculate initial cluster centers for K-means; this process is feasible for clustering techniques for continuous data.

Agrawal et al. (2005) described data mining applications and the various requirements they impose on clustering techniques. The main requirements considered are the ability to identify clusters embedded in subspaces of high-dimensional data, scalability, the comprehensibility of results by end users, and the handling of unpredictable data distributions.

The main limitation of the K-means approach is that it can generate empty clusters, depending on the initial center vectors. This drawback does not cause any significant problem for a static execution of the K-means algorithm, and it can be overcome by executing the algorithm a number of times. In a few applications, however, the empty-cluster issue causes erratic system behavior and affects overall performance. Malay Pakhira et al. (2009) proposed a modified version of the K-means algorithm that effectively eliminates this empty-cluster problem; in their experiments, this algorithm performed better than traditional methods.

Only a few researchers have focused on partitioning categorical data in an incremental mode, and designing an incremental clustering algorithm for categorical data is a vital issue. Li Taoying et al. (2010) proposed an incremental clustering method for categorical data using a clustering ensemble. They first reduce redundant attributes if required, and then use the true values of the different attributes to form clustering memberships.

Chapter 4

PROBLEM DEFINITION AND SCOPE

PROBLEM STATEMENT

This dissertation presents clustering techniques and a comparison of an existing clustering technique, K-means, with the new incremental K-means. The quality of document clustering can be improved by applying new clustering techniques. It also examines the effectiveness of a new clustering algorithm in which noise is reduced by first clustering the features of the data and then clustering the data on the basis of their feature clusters.

GOALS AND OBJECTIVES

Goals:

To develop an efficient technique for document clustering.

Objectives:

To minimize intra-cluster distances between documents, while maximizing inter-cluster distances.

To develop a document clustering algorithm that is more efficient than the existing method, with better clustering accuracy.

To compare existing document clustering techniques with the new one.

To execute query and show the results.

STATEMENT OF SCOPE

Develop an application that recommends news articles to the readers of a news portal. The following challenges motivated the use of clustering of news articles:

The number of available articles was large.

A large number of articles were added each day.

Articles corresponding to same news were added from different sources.

The recommendations had to be generated and updated in real time.

By clustering the articles, we could reduce the search domain for recommendations, as most users were interested in news corresponding to only a few clusters. This improved time efficiency to a great extent, and it also allowed us to identify articles reporting the same news from different sources. The main motivation of this work has been to investigate possibilities for improving the effectiveness of document clustering by identifying the main causes of ineffectiveness in existing algorithms and addressing them.

SOFTWARE CONTEXT

Language : Java 1.5.0

Database : MySQL 5.1.36

Web Server: Apache 2.2.11

Tool : Netbeans IDE 7.0.1

Hardware : Personal Computer with minimum configuration

Documentation Tools : LaTex

MAJOR CONSTRAINT

This system provides a new clustering algorithm for document clustering.

A dataset is needed for document clustering.

The number of clusters has to be defined in advance.

Preprocessing of the documents needs to be done.

APPROACHES FOR SOLVING THE PROBLEMS AND EFFICIENCY ISSUES

Refer Chapter 1 section 1.9, 1.10

4.7 OUTCOMES

Refer Chapter 1 section 1.11 for details.

HARDWARE RESOURCES REQUIRED

Processor – Pentium –IV

RAM – 256 MB(min)

Hard Disk – 20 GB

Monitor – SVGA

SOFTWARE RESOURCES REQUIRED

Operating System – Windows XP

Programming Language – JAVA

Java Version – JDK 1.6 & above.

Netbeans IDE – 7.0.1

Database – MySQL 5.1.36

Web Server – Apache 2.2.11

Documentation Tool – LaTex

AREA OF DISSERTATION

Data Mining

Chapter 5

DISSERTATION PLAN

5.1 PURPOSE

The list of tasks along with the schedule is shown in Table 5.2, and the corresponding Gantt chart in Fig. 5.2. This chart makes it simple to track the status of the project.

5.2. DISSERTATION PLAN

Project was divided into 10 phases:

Figure 5.1: Dissertation Plan

GANTT CHART

Figure 5.2: Gantt Chart

Chapter 6

SOFTWARE REQUIREMENT SPECIFICATION

6.1 INTRODUCTION

6.1.1 Purpose and scope of document

The software requirement specification deals with all software perspectives of the dissertation. It discusses the user classes, use cases, and usage considerations for them. It also focuses on data modeling and functional modeling, along with software performance requirements.

The purpose of this activity is to identify the functional and non-functional requirements of the software system to be built. The objective of this document is to fully understand the user requirements and to determine how well the system performs and whether it will meet the demands placed on it.

6.1.2 Overview of responsibilities of developer

Understand the problem definition of the system.

Understand users' expectations from the system.

Gather requirements for the system.

Analyze requirements and design the model.

Analyze which processes are involved.

Code efficiently, with appropriate data structures and algorithms.

Analyze what data are used or produced during the processes.

Complete the project successfully within the given time.

Developer’s responsibilities with respect to this system:

To handle a large number of available datasets.

To develop and apply the K-means algorithm on the available dataset.

To develop and apply the Incremental K-means algorithm on the available dataset.

To compare both algorithms with respect to clustering purity.

To execute queries and display results.

6.2 PRODUCT OVERVIEW

Clustering algorithms have been studied for decades, and the literature on the subject is huge. Here, basic preprocessing of the documents is done and the existing clustering algorithm is applied. A new clustering algorithm is developed to overcome the drawbacks of the existing one. The two algorithms are compared on the basis of the purity of the formed clusters. Finally, a query is executed and the results are evaluated.
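Since the two algorithms are compared on cluster purity, the measure can be sketched as below. The label arrays in the example are hypothetical; purity is the fraction of documents that fall into their cluster's majority true class.

```java
import java.util.*;

/** Purity of a clustering: for each cluster take the size of its majority
 *  true class, sum over clusters, and divide by the number of documents. */
public class PuritySketch {

    static double purity(int[] cluster, int[] trueClass, int k, int numClasses) {
        int[][] count = new int[k][numClasses];
        for (int i = 0; i < cluster.length; i++) count[cluster[i]][trueClass[i]]++;
        int agree = 0;
        for (int[] row : count) {
            int max = 0;
            for (int c : row) max = Math.max(max, c);
            agree += max;          // majority class size in this cluster
        }
        return (double) agree / cluster.length;
    }

    public static void main(String[] args) {
        int[] cluster   = {0, 0, 0, 1, 1, 1};   // hypothetical cluster assignments
        int[] trueClass = {0, 0, 1, 1, 1, 1};   // hypothetical true classes
        System.out.println(purity(cluster, trueClass, 2, 2)); // 5/6 ≈ 0.833
    }
}
```

A purity of 1.0 means every cluster contains documents of a single class; values near 1/k indicate clusters no better than chance.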

6.3 USAGE SCENARIO

6.3.1 User Profile

General User:

Any authorized user of the system can apply the clustering technique on available data set.

Use Cases

The following modules are available in the system:

Pre-Processing Module

Calculating the number of clusters

Clustering techniques

Removing Outliers

Use Case View

A use case diagram is a type of behavioral diagram defined by the UML and created from a use case analysis. Its purpose is to present a graphical overview of the functionality provided by a system in terms of actors, their goals represented as use cases, and any dependencies between those use cases.

Figure 6.1: Use case diagram

6.4 DATA MODEL AND DESCRIPTION

Different data structures are used which are described in detail in chapter 7, section 7.3.

6.4.1 Data Description

This system proposes a new, efficient technique for document clustering that can be used in many applications.

Following is the sequence for the data flow across different modules:

1. Data from Dataset to preprocessing module.

2. After preprocessing is done, terms are calculated.

3. Centroid and distance calculation is done.

4. Cluster formation is done.

5. Similarity calculation is done.

6. Query is fired and results are evaluated.

6.5 FUNCTIONAL MODEL AND DESCRIPTION

6.5.1 Flow Diagram

A data flow diagram is a graphical representation of all major steps and of how the data flows through the system. The data flow diagrams for the system are shown in Figure 6.2, Figure 6.3, and Figure 6.4.

Level 0 Data Flow Diagram:

Figure 6.2: DFD 0 diagram

Level 1 Data Flow Diagram:

Figure 6.3: DFD 1 diagram

Level 2 Data Flow Diagram:

Figure 6.4: DFD 2 diagram

Description of Function

A processing narrative for functions is presented in activity diagram as shown in Figure 6.5.

Each function is described in detail in Chapter 7, Section 7.2.

6.5.2 Non Functional Requirement

A non-functional requirement is a requirement that specifies criteria that can be used to judge the operation of a system, rather than specific behaviors. This should be contrasted with functional requirements, which specify behavior or functions.

Accessibility:

This system is user friendly.

Accuracy:

This system produces clusters with better accuracy than the existing one.

Efficiency:

The new algorithm is highly efficient, as it returns highly relevant documents for the specified query.

6.5.3 Functional Flow Diagram

Figure 6.5 : Functional Flow Diagram

6.5.4 Software Quality Attributes

Adaptability:

This system is adaptable in any kind of text mining environment.

Maintainability and Flexibility:

This system supports maintainability and flexibility properties.

The system is maintainable by keeping track of the dataset.

Only text files may be present in the dataset.

The system is capable of handling additions to the dataset.

Reusability:

Software reuse is the process of implementing or updating software systems using existing software components. A good software reuse process facilitates the increase of productivity, quality, and reliability, and the decrease of costs and implementation time. An initial investment is required to start a software reuse process, but that investment pays for itself in a few reuses. In short, the development of a reuse process and repository produces a base of knowledge that improves in quality after every reuse, minimizing the amount of development work required for future projects, and ultimately reducing the risk of new projects that are based on repository knowledge. This system can be reused in number of text mining applications.

6.5.5 Design Constraints

This system gives following design constraints:

1. Dataset has to be specified.

2. The number of clusters has to be defined.

6.5.6 Software Interface Description

The system provides a GUI for selecting datasets, applying the K-means and Incremental K-means algorithms, comparing the two algorithms, and executing queries. It provides a user-friendly environment, so that users can use it easily.

6.6 RESTRICTIONS, LIMITATIONS, AND CONSTRAINTS

There are a few constraints of the system:

The design should be user friendly.

Data will be grouped into 4 clusters only.

6.7 VALIDATION CRITERIA

6.7.1 Classes of tests

Refer Chapter 8 for Classes of test.

6.7.2 Estimated Software response

Estimation of KLOC:

The number of lines required for implementation of various modules:

Effort: E = 3.2 * (KLOC)^1.02

E = 3.2 * ( )^1.02

Requirement analysis required: 2 months

Development time required for implementation and testing (in months):

D = E/N

D = /1

D =

D = + = months
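The effort and development-time formulas above can be evaluated as in the sketch below. The KLOC value in the example is purely hypothetical, since the actual line count is left blank in this report.

```java
/** COCOMO-style estimate as used above: E = 3.2 * KLOC^1.02, D = E / N
 *  where N is the number of developers. */
public class EffortEstimate {

    /** Effort in person-months. */
    static double effort(double kloc) { return 3.2 * Math.pow(kloc, 1.02); }

    /** Development time in months for a team of the given size. */
    static double devTime(double kloc, int people) { return effort(kloc) / people; }

    public static void main(String[] args) {
        double kloc = 2.0;  // hypothetical size; the actual KLOC is left blank in the report
        System.out.printf("E = %.2f person-months, D = %.2f months%n",
                effort(kloc), devTime(kloc, 1));
    }
}
```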

Chapter 7

DETAILED DESIGN DOCUMENT

7.1 Introduction

7.1.1 Overview of products

Clustering algorithms have been studied for decades, and the literature on the subject is huge. Here, basic preprocessing of the documents is done and the existing clustering algorithm is applied. A new clustering algorithm is developed to overcome the drawbacks of the existing one. The two algorithms are compared on the basis of the purity of the formed clusters. Finally, a query is executed and the results are evaluated.

7.2 ARCHITECTURAL DESIGN

7.2.1 System Architecture:

Clustering is sometimes erroneously referred to as automatic classification; however, this is inaccurate, since the clusters found are not known prior to processing, whereas in classification the classes are pre-defined. In clustering, the distribution and nature of the data determine cluster membership; in classification, by contrast, the classifier learns an association between objects and classes from a so-called training set, i.e., a set of data correctly labeled by hand, and then replicates the learnt behavior on unlabeled data.

Document clustering is the process of categorizing text documents into systematic clusters or groups, such that the documents in the same cluster are similar whereas documents in other clusters are dissimilar. It is one of the vital processes in text mining. Text mining has gained a lot of importance and involves various tasks, such as the production of granular taxonomies and document summarization, for deriving higher-quality information from text. This system provides both algorithms, K-means and Incremental K-means, and compares them. Finally, a query is executed to show the result.

Figure 7.1 : System Architecture

7.3 SYSTEM MODEL

1) Preprocessing Module

Before running the clustering algorithms on the text datasets, I performed some pre-processing steps. In particular, stop words (prepositions, pronouns, articles, and irrelevant document metadata) were removed, and the Snowball stemming algorithm for Portuguese words was used. I then adopted a traditional statistical approach to text mining, in which documents are represented in a vector space model. In this model, each document is represented by a vector containing the frequencies of occurrence of its words, where words are defined as delimited alphabetic strings of between 4 and 25 characters.

I also used a dimensionality reduction technique known as Term Variance (TV), which can increase both the effectiveness and the efficiency of clustering algorithms. TV selects the attributes (in our case, 100 words) that have the greatest variances over the documents. To compute distances between documents, two measures were used: a cosine-based distance and a Levenshtein-based distance. The latter was used only to calculate distances between file (document) names.
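The two distance measures mentioned above can be sketched as follows. This is a minimal illustration with toy inputs, not the system's exact implementation.

```java
/** The two document distances used above: a cosine-based distance over
 *  term-frequency vectors, and the Levenshtein edit distance over file names. */
public class DistanceSketch {

    /** Cosine distance = 1 - (a · b) / (||a|| * ||b||). */
    static double cosineDistance(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return 1.0 - dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    /** Levenshtein edit distance via dynamic programming. */
    static int levenshtein(String s, String t) {
        int[][] d = new int[s.length() + 1][t.length() + 1];
        for (int i = 0; i <= s.length(); i++) d[i][0] = i;
        for (int j = 0; j <= t.length(); j++) d[0][j] = j;
        for (int i = 1; i <= s.length(); i++)
            for (int j = 1; j <= t.length(); j++) {
                int sub = s.charAt(i - 1) == t.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1,   // deletion
                                            d[i][j - 1] + 1),  // insertion
                                   d[i - 1][j - 1] + sub);     // substitution
            }
        return d[s.length()][t.length()];
    }

    public static void main(String[] args) {
        System.out.println(cosineDistance(new double[]{1, 1}, new double[]{2, 2})); // parallel vectors
        System.out.println(levenshtein("report.txt", "report1.txt"));               // file names
    }
}
```

Cosine distance ignores vector length (two documents with proportional term frequencies are at distance near 0), which is why it suits term-frequency vectors, while Levenshtein counts character edits, which suits short strings such as file names.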

2) Calculating the number of clusters

In order to estimate the number of clusters, a widely used approach consists of getting a set of data partitions with different numbers of clusters and then selecting that particular partition that provides the best result according to a specific quality criterion (e.g., a relative validity index). Such a set of partitions may result directly from a hierarchical clustering dendrogram or, alternatively, from multiple runs of a partitional algorithm (e.g., K-means) starting from different numbers and initial positions of the cluster prototypes.

3) Clustering techniques

The clustering algorithms adopted in this study (the partitional K-means and K-medoids, the hierarchical Single/Complete/Average Link, and the cluster-ensemble algorithm known as CSPA) are popular in the machine learning and data mining fields, and therefore they have been used here. Nevertheless, some of my choices regarding their use deserve further comment. For instance, K-medoids is similar to K-means, but instead of computing centroids it uses medoids, the representative objects of the clusters. This property makes it particularly interesting for applications in which (i) centroids cannot be computed, and (ii) distances between pairs of objects are available, as when computing dissimilarities between document names with the Levenshtein distance.

4) Removing Outliers

I assess a simple approach to removing outliers that makes recursive use of the silhouette. Fundamentally, if the best partition chosen by the silhouette has singletons (i.e., clusters formed by a single object only), these are removed. The clustering process is then repeated until a partition without singletons is found. At the end of the process, all singletons are incorporated into the resulting data partition (for evaluation purposes) as single clusters.
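The silhouette used in this outlier check can be sketched for a single object as below; the object's silhouette compares its mean distance to its own cluster (a) with the smallest mean distance to any other cluster (b). The Euclidean distance and the toy points are assumptions for illustration only.

```java
/** Silhouette of one object: a = mean distance to own cluster,
 *  b = smallest mean distance to another cluster, s = (b - a) / max(a, b). */
public class SilhouetteSketch {

    static double dist(double[] a, double[] b) {
        double s = 0;
        for (int i = 0; i < a.length; i++) { double d = a[i] - b[i]; s += d * d; }
        return Math.sqrt(s);
    }

    static double silhouette(int i, double[][] x, int[] label, int k) {
        double[] sum = new double[k];
        int[] n = new int[k];
        for (int j = 0; j < x.length; j++)
            if (j != i) { sum[label[j]] += dist(x[i], x[j]); n[label[j]]++; }
        double a = n[label[i]] == 0 ? 0 : sum[label[i]] / n[label[i]]; // own cluster
        double b = Double.MAX_VALUE;
        for (int c = 0; c < k; c++)
            if (c != label[i] && n[c] > 0)
                b = Math.min(b, sum[c] / n[c]);                        // nearest other cluster
        return (b - a) / Math.max(a, b);
    }

    public static void main(String[] args) {
        double[][] x = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };  // two well-separated pairs
        int[] label = { 0, 0, 1, 1 };
        System.out.println(silhouette(0, x, label, 2)); // well separated -> close to 1
    }
}
```

Averaging this value over all objects gives the partition-level silhouette used to pick the best partition; singletons (a = 0 here) are the clusters the recursive procedure removes.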

7.4 COMPONENT DESIGN

7.4.1 Class diagrams

Figure 7.2: Class Diagram

7.4.2 Interaction Diagrams

Sequence Diagram

A diagram showing the flow of information through the functions, and the transformation it undergoes, is presented in Fig. 7.3. A sequence diagram is a graphical view of a scenario that shows object interaction in a time-based sequence: what happens first, what happens next. Sequence diagrams establish the roles of objects and help provide the essential information needed to determine class responsibilities and interfaces.

Figure 7.3: Sequence Diagram

7.4.3 Algorithms

Preprocessing Module:

Stop words (prepositions, pronouns, articles, and irrelevant document metadata) have been removed. A typical method to remove stop words is to compare each term with a compilation of known stop words.

Input: A document database D and a list of stop words L

D = {d1, d2, d3, …, dk}, where 1 <= k <= i

tij is the jth term in the ith document

Output: All valid stemmed text terms in D

for (all di in D) do

for (1 to j) do

Extract tij from di

if (tij is in list L) then

Remove tij from di

end if

end for

end for

Word stemming is the process of removing suffixes to produce word stems; a stem represents a natural group of words with similar meaning. It is the process of reducing words to their base form, or stem. For example, the words "connected", "connection", and "connections" are all reduced to the stem "connect". Porter's algorithm is the de facto standard stemming algorithm.
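The stop-word removal and stemming steps can be sketched together as below. The stop-word list and the suffix rules are toy stand-ins for illustration; a real system would use a full stop-word compilation and Porter's or the Snowball stemmer.

```java
import java.util.*;

/** Preprocessing sketch: drop stop words, then strip a few common suffixes.
 *  The suffix list is a toy stand-in for Porter's algorithm. */
public class PreprocessSketch {

    static final Set<String> STOPWORDS = new HashSet<>(
            Arrays.asList("a", "an", "the", "of", "in", "is", "are", "and", "to"));

    /** Strips the first matching suffix, keeping a minimal stem length. */
    static String stem(String w) {
        for (String suf : new String[] {"ions", "ion", "ed", "ing", "s"})
            if (w.endsWith(suf) && w.length() > suf.length() + 2)
                return w.substring(0, w.length() - suf.length());
        return w;
    }

    /** Lowercases, tokenizes on non-letters, removes stop words, stems. */
    static List<String> preprocess(String text) {
        List<String> out = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z]+"))
            if (!w.isEmpty() && !STOPWORDS.contains(w))
                out.add(stem(w));
        return out;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The connections of the connected networks"));
    }
}
```

On the example from the text, "connections" and "connected" both reduce to the stem "connect".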

The similarity between two documents i and j is computed as follows:

For i := 0 to N (total documents)

For j := 0 to N (total documents)

Simvalue := dot(doc[i], doc[j]) / (norm(doc[i]) * norm(doc[j]))

Add Simvalue to the list Build_matrix

Next

Next

where N is the total number of documents and doc[i], for i = 1, 2, …, N, are the document vectors.

Calculating the number of Clusters:

In this step, only the number of clusters, K, is fed as input.

Clustering Techniques:

K-means and the improved method are used.

Steps of K Means method:

The algorithm consists of the following steps:

1. Initialization: In this first step, the data set, the number of clusters, and the centroid defined for each cluster are given.

2. Classification: The distance from the centroids is calculated for each data point, and each data point is assigned to the cluster whose centroid is at minimum distance.

3. Centroid Recalculation: For the clusters generated previously, the centroid is recalculated.

4. Convergence Condition: Some convergence conditions are given below:

4.1 Stop when a given or defined number of iterations is reached.

4.2 Stop when there is no exchange of data points between the clusters.

4.3 Stop when a threshold value is achieved.

5. Repeat: If the above conditions are not satisfied, go to step 2 and repeat the whole process until they are satisfied.

Steps of improved method:

Input: D = {d1, d2, d3, …, di, …, dn} // set of documents

di = {x1, x2, x3, …, xi, …, xm}; k // number of desired clusters

Output: A set of k clusters.

Working Procedure:

1. Calculate the distance of each document (data point) from the origin.

2. Arrange the distances (obtained in step 1) in ascending order.

3. Split the sorted list into K equal-sized subsets, and take the middle point of each subset as the centroid of that set.

4. For all data points, calculate the distance between each data point and all the centroids, and assign each data point to the closest cluster.

5. Recalculate the centroids of all the clusters.

6. For all data points, calculate the distance between each data point and all the centroids again. If this distance is less than or equal to the present nearest distance, the data point stays in the same cluster; otherwise it is shifted to the nearest new cluster.
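Steps 1-3 above (the seed-selection phase of the improved method) can be sketched as follows. The toy vectors are illustrative assumptions, and ties in distance keep their original order.

```java
import java.util.*;

/** Seed-selection sketch for the improved method: sort objects by distance
 *  from the origin, split into K equal parts, and take each part's middle
 *  object as an initial centroid. */
public class ImprovedSeedsSketch {

    /** Euclidean distance of a vector from the origin. */
    static double norm(double[] v) {
        double s = 0;
        for (double x : v) s += x * x;
        return Math.sqrt(s);
    }

    /** Returns K initial centroids chosen from the sorted list of objects.
     *  Any leftover objects (when length is not divisible by k) are ignored
     *  for seeding only; they are still clustered in later steps. */
    static double[][] initialCentroids(double[][] x, int k) {
        double[][] sorted = x.clone();                                   // shallow copy, x untouched
        Arrays.sort(sorted, Comparator.comparingDouble(ImprovedSeedsSketch::norm)); // steps 1-2
        int size = sorted.length / k;
        double[][] seeds = new double[k][];
        for (int j = 0; j < k; j++)
            seeds[j] = sorted[j * size + size / 2];  // middle point of each subset (step 3)
        return seeds;
    }

    public static void main(String[] args) {
        double[][] docs = { {9, 9}, {0, 1}, {1, 0}, {8, 9}, {1, 1}, {9, 8} };
        System.out.println(Arrays.deepToString(initialCentroids(docs, 2)));
    }
}
```

Unlike random seeding, this choice is deterministic and spreads the seeds along the distance-from-origin ordering, which is what lets the improved method avoid the empty-cluster runs discussed earlier.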

7.5 PROGRAM STRUCTURE

7.5.1 Architecture diagram

Refer 7.2 for architecture diagram.

7.5.2 DESCRIPTION FOR COMPONENT

Refer Detailed Design of this chapter for description of components.

Chapter 8

IMPLEMENTATION

8.1 IMPLEMENTATION

This system implements the existing and the new algorithm. Both algorithms are compared on the basis of purity. Finally, a query is executed and the results are displayed.

8.2 SNAP SHOTS

8.2.1. First Screen

8.2.2. Selecting Dataset

8.2.3. Defining number of clusters and assigning documents to clusters.(Randomly)

8.2.4. K-Means: Calculating Term Frequency

8.2.5. Calculating centroid and distance from centroid to all documents

8.2.6. K-Means: Assigning documents to the nearest cluster until documents no longer move from one cluster to another, and viewing documents in the selected cluster

8.2.7. Similarity Measurement

8.2.8. Finding Dissimilarity

8.2.9. Finding Outlier Document

8.2.10. Moving Outlier Document

8.2.11. Comparing purity of both Algorithms

8.2.12. Executing Query

Chapter 9

TEST SPECIFICATION

9.1 INTRODUCTION

This document describes both the test plan and the test procedure, along with the goals and objectives of the test specifications.

9.1.1 Goals and Objectives

Tests are the individual tests specified in a test plan document. Each test is typically described by:

An initial system state.

A set of actions to be performed.

The expected results of the test.

9.2 FORMAL TECHNICAL REVIEWS

Formal technical reviews and inspections of documents or software are performed to identify and remove defects. The formal technical review of my project was carried out at regular intervals in the form of stand-up meetings and brainstorming sessions conducted in the presence of the guide. The process included verification of the checklist developed for the review process; the code review checklist template is as follows:

Does the code conform to Hungarian Notations?

Is the code well-structured, consistent in style and consistently formatted?

Are all variables properly defined with meaningful, consistent and clear names?

Are there any redundant or unused variables?

Does the code consist of comments?

Is the code error free?

Does the code handle exceptions?

Does the code verify class name according to the standard format?

Does the code avoid nested loop variables?

Code refactoring (maximum 40 lines of code in a function).

9.3 TEST PLAN AND SCHEDULE

The purpose of testing is to discover errors. Testing is the process of trying to discover every conceivable fault or weakness in a work product. It provides a way to check the functionality of components, subassemblies, assemblies and/or a finished product. It is the process of exercising software with the intent of ensuring that the software system meets its requirements and user expectations and does not fail in an unacceptable manner. There are various types of test; each test type addresses a specific testing requirement.

1. Test Plan

2. Testing Strategy

3. Test Schedule

9.3.1 Test Plan

Test cases are planned in accordance with the test process and documented with detailed test descriptions. These test cases use cases based on projected operational mission scenarios. The testing process also includes stress/load testing for stability purposes (i.e., at 95 percent CPU use, system stability is still guaranteed). The test process thoroughly tests the interfaces and modules. Software testing includes traceable white box testing, black box testing and other test processes verifying the implemented software against the design documentation and specified requirements.

Unit testing:

Unit testing involves the design of test cases that validate that the internal program logic is functioning properly and that program inputs produce valid outputs. All decision branches and internal code flow should be validated. It is the testing of individual software units of the application, done after the completion of an individual unit and before integration. This is structural testing that relies on knowledge of the unit's construction and is invasive. Unit tests perform basic tests at component level and test a specific business process, application, and/or system configuration. Unit tests ensure that each unique path of a business process performs accurately to the documented specifications and contains clearly defined inputs and expected results.

Integration testing:

Integration tests are designed to test integrated software components to determine whether they actually run as one program. Testing is event-driven and is more concerned with the basic outcome of screens or fields. Integration tests demonstrate that although the components were individually satisfactory, as shown by successful unit testing, the combination of components is correct and consistent. Integration testing is specifically aimed at exposing the problems that arise from the combination of components.

Functional testing:

Functional tests provide systematic demonstrations that functions tested are available as specified by the business and technical requirements, system documentation and user manuals. Functional testing is centered on the following items:-

Valid Input: Identified classes of valid input must be accepted.

Invalid Input: Identified classes of invalid input must be rejected.

Functions: Identified functions must be exercised.

Output: Identified classes of application outputs must be exercised.

Systems/Procedures: Interfacing systems or procedures must be invoked.

Organization and preparation of functional tests is focused on requirements, key functions or special test cases. In addition, systematic coverage pertaining to identify Business process flows, data fields, predefined processes, and successive processes must be considered for testing. Before functional testing is complete, additional tests are identified and the effective value of current tests is determined.

System Test:

System testing ensures that the entire integrated software system meets requirements. It tests a configuration to ensure known and predictable results. An example of system testing is the configuration oriented system integration test. System testing is based on process descriptions and flows, emphasizing predriven process links and integration points.

White Box Testing:

White box testing is testing in which the software tester has knowledge of the inner workings, structure and language of the software, or at least its purpose. It is used to test areas that cannot be reached from a black box level.

Black Box Testing:

Black box testing is testing the software without any knowledge of the inner workings, structure or language of the module being tested. Black box tests, like most other kinds of tests, must be written from a definitive source document, such as a specification or requirements document. It is testing in which the software under test is treated as a black box that cannot be seen into. The test provides inputs and responds to outputs without considering how the software works.

9.3.2 Testing Strategy

Field testing will be performed manually and functional tests are written in detail.

Test objectives

The entry screen, messages and responses must not be delayed.

Features To Be Tested:

All links should take the user to the correct form.

Software integration testing is the incremental integration testing of two or more integrated software components on a single platform to produce failures caused by interface defects.

9.3.3 Test Schedule

Table 9.1: Test schedule

9.4 TESTING TOOLS AND ENVIRONMENT

Tools for test data derivation (e.g., synthesizing data from path condition)

Tools for test evaluation(e.g., various coverage metrics)

Tools for testing other software qualities

9.4.1 Test work products

To support new technology

To support new software processes

To support a particular method or methodology

9.4.2 Software to be tested

The software implementing this system may give incorrect output if the software used in the system is faulty. The following software is used in my system:

Test the WampServer

Test the Netbeans

9.5 TEST CASES AND RESULT

Table 9.2: Test Cases

9.6 RESULT ANALYSIS

The results show that the purity of the clusters formed using the new algorithm is better than that of the existing K-means algorithm. The system can be used in a number of text mining applications.

9.7 ANALYSIS:

At the initial stage, I have implemented the algorithm for the predefined number of clusters only and the results obtained are positive.

CONCLUSION AND FUTURE SCOPE

Conclusion

As clustering plays a vital role in various applications, much research is still being done. Upcoming innovations are driven mainly by the properties and characteristics of existing methods, which form the basis for the various innovations in the field of clustering.

In this system, I have implemented the existing document clustering algorithm, K-means, and the new Incremental algorithm. The purity of the clusters formed using the Incremental algorithm is better than that of the existing one. The new algorithm can be used in many text mining applications.

Future Scope:

The system can be extended to let the user choose the number of clusters. It will be used in a newsgroup application.

Appendices

Appendix A

LIST OF PUBLICATIONS

[1] Priti B. Kudal and Prof. M. M. Naoghare, "A Review of Modern Document Clustering Techniques," International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 3, Issue 10, October 2014.

[2] Priti B. Kudal and Prof. Manisha Naoghare, "An Improved Hierarchical Technique for Document Clustering," International Journal of Science & Research (IJSR), ISSN (Online): 2319-7064, Impact Factor (2012): 3.358, Volume 4, Issue 4, April 2015.

[3] Priti B. Kudal and Prof. Manisha Naoghare, "An Improved Technique for Document Clustering," International Journal of Technical Research and Applications, e-ISSN: 2320-8163, Volume 3, Issue 3 (May-June 2015), pp. 169-172.

LIST OF CONFERENCES

Appendix B

MATHEMATICAL MODEL

Set Theory

Let,

I is a set of input i.e. Dataset.

F is the set of functions used for the implementation.

O is the output.

S= (I, F, O)

I: Input Data set

F: Set of Functions

O: Set of Output

F = {F1, F2, …, F10}

F1: Document Preprocessing

F2: Assigning Cluster IDs

F3: Computing Term Frequency

F4: Centroid & Distance Calculation

F5: Cluster Formation

F6: Incremental Algorithm

F7: Similarity Calculation & Finding Dissimilarity

F8: Moving Document

F9: Calculating Purity & Analysis by Graph

F10: Query Execution

F1: Document Preprocessing

X: Preprocessing of all documents in the selected dataset is done.

F(X): Preprocessing, including stop-word removal and stemming, is performed on all documents in the selected dataset.
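The preprocessing step above can be sketched as follows. The stop-word list and the naive suffix-stripping stemmer here are illustrative assumptions, not the exact lists or stemmer used in the implementation:

```python
# Minimal document preprocessing sketch: stop-word removal and
# naive suffix stripping (a rough stand-in for a real stemmer).
STOP_WORDS = {"the", "is", "a", "an", "of", "and", "in", "to"}

def stem(word):
    # Strip a few common English suffixes (illustrative only).
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def preprocess(text):
    # Lowercase, drop stop words, then stem each remaining token.
    tokens = [w.lower() for w in text.split()]
    return [stem(w) for w in tokens if w not in STOP_WORDS]

print(preprocess("the clustering of documents"))  # -> ['cluster', 'document']
```

A production system would typically use a full stemmer such as Porter's algorithm and a larger stop-word list.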

F2: Assigning Cluster IDs

X: Initially, random cluster IDs are assigned to the files in the selected dataset.

F(X): Initially, random cluster IDs are assigned to the files in the selected dataset.

F3: Computing Term Frequency

X: Occurrence of all terms (words) in a document is calculated.

F(X): Number of times the word has occurred is calculated.
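The term-frequency computation described above can be sketched with a simple counter over the preprocessed tokens of one document; the token list shown is a hypothetical example:

```python
from collections import Counter

def term_frequency(tokens):
    # Count how many times each word occurs in the document.
    return Counter(tokens)

tf = term_frequency(["cluster", "document", "cluster"])
print(tf["cluster"])  # -> 2
```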

F4: Centroid & Distance Calculation

X: The initial cluster centroid and the distance of all data points from the centroid are calculated.

F(X): The initial cluster centroid and the distance of all data points from the centroid are calculated.
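The centroid and distance computations above can be sketched as follows, assuming documents are represented as numeric vectors (e.g. term-frequency vectors); the distance measure shown is Euclidean, one common choice for K-Means:

```python
import math

def centroid(points):
    # Component-wise mean of the points in a cluster.
    dims = len(points[0])
    return [sum(p[i] for p in points) / len(points) for i in range(dims)]

def euclidean(p, q):
    # Straight-line distance between two document vectors.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

c = centroid([[0.0, 0.0], [2.0, 2.0]])   # -> [1.0, 1.0]
d = euclidean([0.0, 0.0], [3.0, 4.0])    # -> 5.0
```

In cluster formation (F5), each document would then be assigned to the cluster whose centroid is nearest under this distance.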

F5: Cluster Formation

X: According to the distance calculated, clusters are formed.

F(X): According to the distance calculated, clusters are formed.

F6: Incremental Algorithm

X: Incremental Algorithm

F(X): Distances are arranged in ascending order; center calculation and cluster formation are done.

F7: Similarity Calculation & Finding Dissimilarity

X: The similarity between documents and the dissimilarity value are calculated.

F(X): According to the selected cluster, the similarity between documents is calculated, and the dissimilarity value is computed.
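The similarity computation above can be sketched using cosine similarity, a standard measure for comparing document vectors; the source does not name the exact measure used, so this is an assumption, with dissimilarity taken as its complement:

```python
import math

def cosine_similarity(u, v):
    # Cosine of the angle between two document vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def dissimilarity(u, v):
    # Complement of similarity: 0 for identical directions, 1 for orthogonal.
    return 1.0 - cosine_similarity(u, v)
```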

F8: Moving Document

X: Document is moved to the related cluster.

F(X): Document is moved to the related cluster.

F9: Calculating Purity & Analysis by Graph

X: Purity is calculated and graph is displayed.

F(X): Purity is calculated and graph is displayed.
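The purity measure above is commonly defined as the fraction of documents that belong to the majority true class of their assigned cluster; a minimal sketch, with hypothetical class labels:

```python
from collections import Counter

def purity(clusters):
    # clusters: one list of true class labels per formed cluster.
    total = sum(len(c) for c in clusters)
    # For each cluster, count only the documents of its majority class.
    majority = sum(Counter(c).most_common(1)[0][1] for c in clusters)
    return majority / total

p = purity([["sports", "sports", "news"], ["news", "news"]])  # -> (2 + 2) / 5 = 0.8
```

A purity of 1.0 means every cluster contains documents of a single class.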

F10: Query Execution

X: Query Execution

F(X): The query is executed. A word is given for search, and the documents containing that word are listed with their weights.
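The query step above can be sketched as a term-frequency lookup over the preprocessed documents; taking the weight to be the raw term frequency is an assumption here, as the source does not define the weighting scheme:

```python
def search(query_word, documents):
    # documents: dict mapping document name -> list of preprocessed tokens.
    results = []
    for name, tokens in documents.items():
        weight = tokens.count(query_word)  # term frequency as the weight
        if weight > 0:
            results.append((name, weight))
    # List the highest-weight documents first.
    return sorted(results, key=lambda r: r[1], reverse=True)

docs = {"d1.txt": ["cluster", "cluster", "mine"], "d2.txt": ["mine"]}
print(search("cluster", docs))  # -> [('d1.txt', 2)]
```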

Figure: Mathematical Model

Appendix C

PLAGIARISM REPORT

References

[1] Y.-S. Lin, J.-Y. Jiang, and S.-J. Lee, "A Similarity Measure for Text Classification and Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 26, no. 7, July 2014.

[2] K. M. Hammouda and M. S. Kamel, "Efficient Phrase-Based Document Indexing for Web Document Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 16, no. 10, October 2004.

[3] H. Chim and X. Deng, "Efficient Phrase-Based Document Similarity for Clustering," IEEE Transactions on Knowledge and Data Engineering, vol. 20, no. 9, September 2008.

[4] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Olivier, "Exploring forensic data with self-organizing maps," in Proc. IFIP Int. Conf. Digital Forensics, 2005, pp. 113-123.

[5] N. L. Beebe and J. G. Clark, "Digital forensic text string searching: Improving information retrieval effectiveness by thematically clustering search results," Digital Investigation, Elsevier, vol. 4, no. 1, pp. 49-54, 2007.

[6] R. Hadjidj, M. Debbabi, H. Lounis, F. Iqbal, A. Szporer, and D. Benredjem, "Towards an integrated e-mail forensic analysis framework," Digital Investigation, Elsevier, vol. 5, no. 3-4, pp. 124-137, 2009.

[7] F. Iqbal, H. Binsalleeh, B. C. M. Fung, and M. Debbabi, "Mining writeprints from anonymous e-mails for forensic investigation," Digital Investigation, Elsevier, vol. 7, no. 1-2, pp. 56-64, 2010.

[8] S. Decherchi, S. Tacconi, J. Redi, A. Leoncini, F. Sangiacomo, and R. Zunino, "Text Clustering for Digital Forensics Analysis."

[9] K. Stoffel, P. Cotofrei, and D. Han, "Fuzzy methods for forensic data analysis," in Proc. IEEE Int. Conf. Soft Computing and Pattern Recognition, 2010, pp. 23-28.

[10] N. L. Beebe and J. G. Clark, "Digital forensic text string searching: Improving information retrieval effectiveness by thematically clustering search results."

[11] Y. Zhao and G. Karypis, "Hierarchical Clustering Algorithms for Document Datasets," Department of Computer Science, University of Minnesota, Minneapolis, MN 55455.

[12] L. Vendramin, R. J. G. B. Campello, and E. R. Hruschka, "Relative clustering validity criteria: A comparative overview," Statist. Anal. Data Mining, vol. 3, pp. 209-235, 2010.

[13] B. K. L. Fei, J. H. P. Eloff, H. S. Venter, and M. S. Olivier, "Exploring forensic data with self-organizing maps," in Proc. IFIP Int. Conf. Digital Forensics, 2005, pp. 113-123.

[14] B. Mirkin, Clustering for Data Mining: A Data Recovery Approach. London, U.K.: Chapman & Hall, 2005.

[15] A. L. N. Fred and A. K. Jain, "Combining multiple clusterings using evidence accumulation," IEEE Trans. Pattern Anal. Mach. Intell., vol. 27, no. 6, pp. 835-850, Jun. 2005.

[16] L. Hubert and P. Arabie, "Comparing partitions," J. Classification, vol. 2, pp. 193-218, 1985.

[17] C. M. Bishop, Pattern Recognition and Machine Learning. New York: Springer-Verlag, 2006.

[18] S. Haykin, Neural Networks: A Comprehensive Foundation. Englewood Cliffs, NJ: Prentice-Hall, 1998.
