
UNIVERSITY “POLITEHNICA” OF BUCHAREST
FACULTY OF ENGINEERING IN FOREIGN LANGUAGES
COMPUTERS AND INFORMATION TECHNOLOGY

DIPLOMA PROJECT

Project coordinator:
S.L. dr. ing. Bujor PĂVĂLOIU

Student:
Roxana-Teodora RĂDULESCU

Bucharest
2013

“POLITEHNICA” UNIVERSITY OF BUCHAREST
FACULTY OF ENGINEERING IN FOREIGN LANGUAGES
COMPUTERS AND INFORMATION TECHNOLOGY

Approved
Director of department:
Prof. dr. ing. George DRĂGOI

DIPLOMA PROJECT THEME FOR:
Roxana-Teodora RĂDULESCU

1. Theme title:
Topic Classification in Social Media

2. Initial design data:
• Design a crawler for web logs
• Acquire data from web logs and put it into a database.
• Perform the classification.
3. Student tasks:
• Implement the crawler
• Create and connect the database
• Test classification algorithms
• Create dictionaries
4. Compulsory graphical material:
• Crawler block scheme
• Database diagram
• UML diagrams
5. The paper is based on the knowledge obtained at the following study courses:
Programming Languages, Introduction to Web Programming, Object Oriented
Programming, Databases, Algorithm Design and Complexity, Software Development
Methods, Neural Networks and Genetic Algorithms, Web Application Development,
Software Design Techniques, Artificial Intelligence.

6. Paper development environment:
Java, JSF, NetBeans IDE 7.2, WampServer, MySQL

7. The paper serves:
Research and didactic purposes.

8. Paper preparation date:
June 2013

Project coordinator: S.L. dr. ing. Bujor PĂVĂLOIU
Student: Roxana-Teodora RĂDULESCU

Academic Honesty Statement

I, Roxana-Teodora RĂDULESCU, hereby declare that the work with the title
“Topic Classification in Social Media”, to be openly defended in front of the diploma
thesis examination commission at the Faculty of Engineering in Foreign Languages,
University “Politehnica” of Bucharest, as a partial requirement for obtaining the title of
Engineer, is the result of my own work, based on my study.

The thesis, simulations, experiments and measurements that are presented are made
entirely by me under the guidance of the scientific adviser, without the involvement of
persons who are not cited by name and contribution in the Acknowledgements part.

The thesis has never been presented to a higher education institution or research
board in the country or abroad.

All the information used, including that from the Internet, is obtained from sources that
were cited and indicated in the notes and in the bibliography, according to ethical
standards. I understand that plagiarism is an offense and is punishable under law.

The results from the simulations, experiments and measurements are genuine.
I understand that the falsification of data and results constitutes fraud and is punished
according to regulations.

Roxana-Teodora RĂDULESCU    Date


Table of Contents

1. Introduction
   1.1 Introduction
      1.1.1 Motives
   1.2 Advantages
2. Presentation of the Domain
   2.1 Social Network Sites
   2.2 Social Networks in Romania
      2.2.1 Romanian Social Networks
   2.3 Social Network Analysis
      2.3.1 Matrix representation
      2.3.2 Graph representation
      2.3.3 Population samples and boundaries
      2.3.4 Modality and analysis levels
   2.4 Social Media
      2.4.1 Blogs
      2.4.2 Microblogs
      2.4.3 Blogosphere
      2.4.4 Blogging Platforms
   2.5 Web Search
      2.5.1 Web Mining
         2.5.1.1 Web Structure Mining
         2.5.1.2 Web Usage Mining
         2.5.1.3 Web Content Mining
      2.5.2 Web Crawling
         2.5.2.1 Guidelines
         2.5.2.2 Crawler Architecture
         2.5.2.3 Crawler classification
   2.6 Topic Classification
      2.6.1 Problem statement
      2.6.2 Machine Learning
         2.6.2.1 Pruning
            2.6.2.1.1 Stemming
            2.6.2.1.2 Stoplisting
         2.6.2.2 Feature selection
         2.6.2.3 TF-IDF
      2.6.3 Machine Learning Approaches
         2.6.3.1 Naïve Bayes Classification
         2.6.3.2 K-Nearest Neighbour
         2.6.3.3 Support Vector Machines
         2.6.3.4 Clustering Algorithms
            2.6.3.4.1 Hierarchical Algorithms
            2.6.3.4.2 Partitional Algorithms
         2.6.3.5 Decision Trees
         2.6.3.6 Neural Networks
            2.6.3.6.1 Neural Networks Architecture
            2.6.3.6.2 Learning Methods
            2.6.3.6.3 Types of Neural Networks
   2.7 Evaluating Classifiers
      2.7.1 Table of Confusion
   2.8 State of the Art
      2.8.1 uClassify
3. Designing the Application
   3.1 General overview
      3.1.1 Business Model
   3.2 Requirement Analysis
      3.2.1 Use Case Description
         3.2.1.1.1 Configure Classifier
         3.2.1.1.2 Set URL
         3.2.1.1.3 Get URL Classification
      3.2.2 System Sequence Diagram
      3.2.3 Functional Requirements
      3.2.4 Non-functional requirements
4. Application Design and Implementation
   4.1 Information System Tier
   4.2 Business Tier
      4.2.1 Design principles
         4.2.1.1 Dependency Inversion Principle
         4.2.1.2 Low coupling
         4.2.1.3 High Cohesion
         4.2.1.4 Design Patterns
         4.2.1.5 Factoring
      4.2.2 Persistence Layer
      4.2.3 Crawler Module
      4.2.4 Classification Module
         4.2.4.1 TF-IDF Classifier
            4.2.4.1.1 Obtaining the Keyword Lists
            4.2.4.1.2 Algorithm description
         4.2.4.2 Text Classifier using Weka
            4.2.4.2.1 Available Classifiers
            4.2.4.2.2 Building the Training Set
            4.2.4.2.3 Algorithm Description
   4.3 Web Tier
5. Libraries
   5.1 Jsoup
   5.2 Weka
6. Conclusions
   6.1 Performance Analysis
   6.2 Future Development and Improvements
References
Glossary
Annexes

1. Introduction

1.1 Introduction

Online social activities have seen a dramatic increase in the past decade. The public
availability and continuous growth of social networks provide a rich environment in
which social behaviour and relationships among entities can be observed. Additionally,
social and behavioural scientists consider this a new means of answering questions about
aspects of the economic, social or political structure of the environment.
Moreover, in the past decade, document usage and information consumption in digital
form have increased. Areas like journalism, education, entertainment, literature and
communication have all undergone what can appropriately be called a virtual migration.
Press, books, magazines, scientific publications, etc. have acquired a more manoeuvrable
and convenient form in the digital universe.
Classification has always been a necessary task in any environment, because keeping
track of everything can save a lot of resources, like time, energy and money. Stores,
companies, libraries and educational institutes have always needed to know where items,
registries, employees, books, personal records, etc. were. At first this was done manually,
a tedious task nevertheless. The process of digitalization has also offered the opportunity
to use virtual methods to manage and process this vast amount of documents.
The current work mainly focuses on blogging (the act of posting to a weblog) and all
the content generated by it in Romania. We present a hands-on approach to web mining
and information retrieval in the field of social media, followed by an attempt to classify
the content of the gathered data using various text classification methods. The
classification is made according to a set of representative topics, and the targeted medium
is, as mentioned before, the Romanian blogosphere.

1.1.1 Motives

All over the world, nations have started to evolve and tend to form what have been defined
as knowledge-based societies. “A knowledge-based society refers to the type of society that is
needed to compete and succeed in the changing economic and political dynamics of the
modern world. It refers to societies that are well educated, and who therefore rely on the
knowledge of their citizens to drive the innovation, entrepreneurship and dynamism of that
society's economy.” (OAS, 2013)
Fundamental bases for this informational society are the social network services,
which are virtual communities that offer users the possibility to create personal profiles and
exchange information and opinions, facilitating socialization based on common preferences or
interests. Virtual social networks mainly stand out due to their ability to facilitate information
exchanges among users, becoming an “authoritative and influential source of knowledge”

(Ajith Abraham, 2010). Elements like scalability, interoperability and openness are amplified in
this knowledge-based society. A key ingredient for the success of virtual social networks is
probably the individual's need to affiliate with groups in order to obtain information, create a
social identity or fulfill goals. However, there are also downsides to this phenomenon. The
nature of the interaction creates an informal and artificial medium which lacks the means of
expressing emotions (unlike face-to-face communication). This may allow people to explore
their identities and surpass social inhibitions, but it can also lead to severe psychological
problems related to the neglect of real life and to isolation. Other reasons for affiliation to
social networks can include recreation, seeking friendship/social contact, obtaining support
when dealing with difficult situations, etc.
In the following chapters we will perform a study on social networks created around
blogs (web logs), including representation, modelling and analysis problems, in order to
evaluate and classify Romanian blogs with respect to what we consider to be some
representative topics.
This work brings together two fields of interest: topic classification and social media
analysis. Text classification is not a new subject; published papers on the matter present
various algorithms for solving this problem, yet no definitive answer has been reached. On
the other hand, social media, at least in Romania, has not actually been tackled at this level.
The indisputable expansion of the blogging phenomenon cannot be overlooked, because
it has brought openness, new opportunities and careers to our society.

1.2 Advantages

From a social point of view, topic classification can be considered a point of interest
because it can lead to trend and pattern tracing, studies of group characteristics, community
behaviour and organization, and retrieval of individual documents.
For a computer scientist, topic classification poses a very interesting yet intricate
problem. Not only must one process a continually increasing amount of data, but one must
also deal with the indisputable challenge of parsing natural language. The complexity and
flexibility of human language have made it almost impossible for machines to grasp. Putting
the challenges aside, solving the problem of topic classification can bring benefits like time
and cost reduction in administrative areas, solid study data for the social sciences, and better
indexing and search mechanisms for digital data.

2. Presentation of the Domain

2.1 Social Network Sites

“Social network sites [are] web-based services that allow individuals to (1) construct a
public or semi-public profile within a bounded system, (2) articulate a list of other users with
whom they share a connection, and (3) view and traverse their list of connections and those
made by others within the system.” (Danah M. Boyd, 2008)
As one can easily understand from the above definition, SNSs (Social Network Sites)
can be considered a form of computer-mediated communication (CMC), a communication
transaction that occurs through the use of at least two electronic devices.
The uniqueness of a social network site comes not only from allowing individuals to
meet and connect with total strangers, but rather from the possibility it creates for users to
reveal and articulate their social network. Connections that could otherwise not be possible
are now available at the push of a button. This may not even be the primary goal of a user,
who would rather maintain a constant communication flow with people who are already part
of their network.
The backbone of SNSs consists of visible profiles that display personal information
about the user, perhaps some multimedia content (images, videos) uploaded by the user and,
of course, a list of “friends”: profiles linked to the current one, also users of the system.
Profiles are unique pages, virtual representations of the user, containing his history,
personal/professional development and achievements. After joining an SNS, an individual is
asked to fill out forms containing a series of questions. The profile is generated using the
answers to these questions, which generally include labels such as age, location, interests, and
an “about me” section. Most sites also encourage users to upload a profile photo. Some sites
allow users to personalize their profiles by adding multimedia content.
Profile visibility is the property that allows users to adjust the level of intimacy of
their page, at their discretion. By default, profiles on Friendster and Tribe.net are
crawled by search engines, making them visible to anyone, regardless of whether or not the
viewer has an account. Alternatively, LinkedIn controls what a viewer may see based on
whether she or he has upgraded to a paid account (Business, Business Plus or Executive).
Sites like MySpace allow users to choose if they want their profiles to be public or “Friends
only”. Facebook takes a different approach: by default, users who are part of the same
“network” can view each other's profiles, unless a profile owner has decided to deny
permission to those in their network. Facebook also lets users control not only what and
who can see their profile (called “Timeline”), but also what and who can post on their
page, thus constructing custom intimacy settings.

Figure 1 Timeline for major SNSs launch dates
Most SNSs require confirmation from both sides for “Friendship”, but some do not.
These one-directional ties are sometimes labelled “Fans” or “Followers”. The term
“Friends” can be misleading, because it may only represent a connection and not necessarily
friendship in everyday life. Some people can be very popular online, having hundreds of
friends, while in real life they only know a small fraction of them.

The public display of connections is a crucial component of SNSs. The Friends list
(containing links to the connected profiles) enables viewers to traverse the network graph. Most
SNSs also provide a mechanism for users to leave messages on their Friends' profiles. This
feature typically involves leaving “comments”, although sites adopt various labels for it.
In addition, SNSs often have private messaging and group messaging features.
Moreover, Facebook, for example, also offers the possibility to create and join groups, based
on common interests, activities, professions, etc. The group gets a smaller network
where members can share information among each other, with no access from the exterior.
Beyond profiles, friends, comments, and private messaging, SNSs vary in their
features and target users. Some have photo-sharing or video-sharing capabilities; others have
built-in blogging and instant messaging technology. There are mobile-specific SNSs (e.g.,
Dodgeball, replaced by Google Latitude after being bought by Google), but some web-based
SNSs also support limited mobile interactions (e.g., Facebook, MySpace). Many SNSs target
people from specific geographical regions or linguistic groups. There are even SNSs for dogs
(Dogster) and cats (Catster), although their owners must manage their profiles.
At the moment, Facebook is the number one social network at a global level. Next we
present some reports as posted by (Socialbakers, 2013), a company offering
monitoring and tracking tools for the analysis of social networks, regarding the shift in the
user situation during January-December 2012 worldwide:

Figure 2 Continent ranking on Facebook
The overall ranking of the continents shifted in 2012. North America officially
lost the dominant position and actually dropped down to third place, as it was surpassed
by Europe and Asia. Asia took the leading position from Europe in August, becoming the
largest Facebook continent with over 270 million Facebook users. This was possible due to
countries like India, Japan, Indonesia, Vietnam, South Korea and Thailand.
At the country level, Brazil was the dominant leader in the absolute growth ranking,
becoming the second biggest Facebook country in the world, closely followed by India.

Figure 3 and Figure 4 present the top 10 fastest growing countries of 2012.

Figure 3 Top 10 of the absolute growth of monthly active users (MAU) from January-December 2012

Figure 4 Top 10 of the fastest growing Facebook countries of 2012

LinkedIn (a social network designed for professionals, based on career, jobs,
education and skills) has approximately 168,745,000 users, the top countries being the United
States (78 million users), India (20 million users), the United Kingdom (11.5 million users),
Canada (6.5 million users) and Australia (4 million users).

2.2 Social Networks in Romania

Facebook, the largest social network at a worldwide level, also maintains its position
in Romania. According to (Socialbakers, 2013), at the beginning of 2013 there were over 5.6
million Facebook users in Romania, with a growth of about 661,000 in the previous 6 months.
Romania was then ranked 31st at a global level. Therefore about one
quarter of the Romanian population has a Facebook account, and moreover about two thirds
of Romanian internet users are Facebook members.
In Figure 5 we present some interesting facts regarding the user age distribution and
male/female ratio in Romania (Socialbakers, 2013). The largest age group is 18-24, followed
by 25-34, so high school and university students are the predominant group in this area.
Looking at Bulgaria, a neighbouring country of Romania, the dominant age group is currently
25-34, followed closely by 18-24. There is also an equal gender ratio in Romania (50% male,
50% female users), compared to 48% male and 52% female users in the United Kingdom, 65%
male and 35% female in Turkey, and 49% male and 51% female in France and Bulgaria.

Figure 5 User age distribution and male/female ratio in Romania
In the realm of microblogging, Twitter is the indisputable leader. Recently, many
companies not only offer the latest updates about their products via Twitter, but also post job
opportunities. HR specialists are looking to extend recruitment through social networks in
Romania. Advertisements on Twitter are also an important source of income.

2.2.1 Romanian Social Networks

There are also numerous social networking websites created in Romania, having only
Romanian users. Unfortunately, their evolution is not consistent or tracked; only a few
important sites have managed to gather thousands of users into aggregated communities.
Classic networks based on friendship relations and shared common interests are
represented by Friendsbook (http://friendsbook.ro/), Zang (http://zang.ro/) and Zumzi
(http://zumzi.ro/). Social networks for entertainment are Trilulilu (http://trilulilu.ro/),
CineMarx (http://cinemarx.ro/), Doizece (http://doizece.ro/) and Pixiland (http://pixiland.ro/).
There are also websites dedicated to professional or occupational interests: Regielive
(http://www.regielive.ro/) for students, Afaceri (http://www.afaceri.ro/) and Ikonect
(http://www.ikonect.ro) for business, Rețeapescari (http://reteapescari.ro/) for fishing
enthusiasts. A few websites are grouped around bloggers' communities: OKBlog
(http://www.okblog.ro/), Toateblogurile (http://www.toateblogurile.ro/).

2.3 Social Network Analysis

Social network analysis (SNA) is the process through which one can obtain vital
information about targeted communities, with the aid of specialized tools and algorithms.
Social network analysis views social relationships in terms of network theory, consisting of
nodes (representing individual actors within the network) and connections or links (which
represent relationships between the individuals, such as friendship, kinship, organizational
position, etc.). (Nitin Agarwal, 2011)
Further on we will only refer to content-based social network (CB-SN) analysis. The
objective of CB-SN analysis is to detect the trending themes of collaboration within a
network and monitor how these themes evolve and catalyze the interest of the network
members. Moreover, it is also interesting to observe members' activity patterns, apexes,
topic spans, levels of expertise, etc.

2.3.1 Matrix representation

Usual sociological data consists of a rectangular array of measurements. The rows are
the subjects, while the columns are scores of different attributes or variables (name, gender,
age, birth date, etc.). A simple example is presented in Figure 6. This kind of data structure
allows an easy comparison between the subjects regarding similarities or dissimilarities just
by comparing the rows, or a more extensive one regarding the distribution of variables across
actors, by correlating and comparing columns.
On the other hand, network data is basically a graph that can be represented as a
square matrix (incidence matrix), with both the rows and columns containing the same subjects;
in each cell we can mark or describe the relationship between them. We can analyze this
data in the same way as mentioned above. Figure 7 presents a simple example of network

data. By comparing rows we can see the similarities between the actors in terms of their
choices, and by looking at the columns we observe similarities based on whether they were
chosen by others or not.

First Name Last Name Gender Age Degree Level
Alice Kent F 24 Master
John Randle M 20 Bachelor
Rob Mendel M 21 Bachelor
Figure 6 Matrix representation example
Incidence matrices offer a measure of the relationship between the actors. The
contained elements are not necessarily of binary type; they can take various values depending
on the analysis target (a small sketch in code follows the list):
• Binary, defining the existence (1) or absence (0) of a relationship between two
actors.
• Fuzzy, gradually defining the intensity of the relationship: best friend (1), good
friend (0.5), stranger (0).
• Different numerical values, representing the position in the Friend list, comments
on the subject's page (profile, blog), etc.
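As an illustration only (this is not code from the thesis, and all names are hypothetical), such a matrix with graded tie strengths can be held in a plain two-dimensional array:

public class TieMatrix {
    static final String[] ACTORS = {"Bob", "Carol", "Ted", "Alice"};
    // Cell [i][j] holds the tie strength from actor i to actor j:
    // 0.0 = stranger, 0.5 = good friend, 1.0 = best friend.
    static final double[][] TIES = new double[ACTORS.length][ACTORS.length];

    public static void main(String[] args) {
        TIES[0][3] = 1.0;   // Bob -> Alice: best friend
        TIES[3][0] = 0.5;   // Alice -> Bob: good friend (ties may be asymmetric)
        int matches = 0;    // row comparison: do Bob and Carol choose alike?
        for (int j = 0; j < ACTORS.length; j++) {
            if (TIES[0][j] == TIES[1][j]) matches++;
        }
        System.out.println("Bob/Carol row agreement: " + matches + "/" + ACTORS.length);
    }
}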

Figure 7 (Hanneman, 2005)
2.3.2 Graph representation

A first major element in network analysis is the actors’ location in the overall
network. Figure 8 presents the graph representation of the previous data matrix. The actors of

the network become nodes in the graph, while a connection is represented by an edge. This
representation is fairly small, but we can see it includes up to third-level links (Alice -> Bob
-> Ted -> Carol). Nevertheless, data analysts may also be interested in holistic patterns (in the
presented case: compare the number of zeros and ones, compare the cells above and below the
diagonal to check for choice reciprocity, etc.). Graph evolution, cycles, distances, etc. can
represent interesting elements to explore in a social graph.

Figure 8 Graph representation example
Formally, a graph is an ordered pair of sets, denoted G = (V, E), where V is the set of
vertices and E = {(v1, v2) | v1, v2 ∈ V} is the set of edges.
If for each edge (v1, v2) ∈ E the edge (v2, v1) also belongs to E, then G is an
undirected graph. Otherwise the graph is called directed, and each edge is graphically
represented by an arrow starting from the source node (the tail of the edge) and pointing to the
destination node (the head of the edge).
Some graph-specific definitions and properties can be taken into consideration for
social network analysis as well (a short code illustration follows the list):
• Adjacency – the property of two vertices being connected by an edge; if (v1, v2) ∈ E,
then we say that v1 and v2 are adjacent.
• Incidence – the property of an edge of connecting two vertices; if (v1, v2) ∈ E, then the
edge (v1, v2) is incident with node v1.
• Degree (or valency) of a vertex – the number of edges incident to the vertex v,
denoted by deg(v). The maximum degree of a graph G is denoted by Δ(G), and the
minimum degree by δ(G).
• In degree of vertex v – the number of edges entering (heads of the edges) vertex v,
denoted by d-(v).
• Out degree of vertex v – the number of edges exiting vertex v (tails of the edges),
denoted by d+(v).
• Neighborhood of a vertex v – the set of vertices adjacent to v.
• Graph dimension – the total number of vertices of the graph.
• Graph degree – the total number of edges of the graph.
• Graph density – measures how many edges are in E compared to the maximum
possible number of edges between the vertices in V; for an undirected graph,
D = 2|E| / (|V|(|V| - 1)).
• Path – a way to get from an origin vertex to a destination vertex by traversing edges in
the graph (for directed graphs the edges must keep the same orientation).
• Simple path – a path in which each edge appears only once, but the vertices can
repeat.
• Elementary path – a path in which all the vertices are distinct.
• Cycle – a path in which the origin and the destination vertex are one and the same.
• Simple cycle – a cycle in which each edge appears only once, but the vertices can
repeat.
• Elementary cycle – a cycle in which all the vertices are distinct, except the first and last
one.
• Length of a path/cycle – the number of edges that it uses.
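To make these definitions concrete, here is a small directed-graph sketch in Java; the class and method names are ours for illustration, not the thesis implementation:

import java.util.*;

// Directed graph kept as two adjacency maps, so that both the in degree
// d-(v) and the out degree d+(v) defined above are cheap to read.
public class SocialGraph {
    private final Map<String, Set<String>> out = new HashMap<>();
    private final Map<String, Set<String>> in = new HashMap<>();

    public void addEdge(String tail, String head) {
        out.computeIfAbsent(tail, k -> new HashSet<>()).add(head);
        in.computeIfAbsent(head, k -> new HashSet<>()).add(tail);
    }

    public int outDegree(String v) { return out.getOrDefault(v, Collections.emptySet()).size(); }
    public int inDegree(String v) { return in.getOrDefault(v, Collections.emptySet()).size(); }

    // Neighborhood of v: every vertex adjacent to v, ignoring edge direction.
    public Set<String> neighborhood(String v) {
        Set<String> n = new HashSet<>(out.getOrDefault(v, Collections.emptySet()));
        n.addAll(in.getOrDefault(v, Collections.emptySet()));
        return n;
    }
}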
Many analysis techniques used on network data are applied in the same way as for
conventional data (calculating distances, correlations). Actually, it is possible to consider
network data a special form of conventional data, but there are also fundamental differences
between them. The focal point of the analysis is not the actor and his attributes, but rather the
whole structure of connections and how the actors are described by their ties. Here we
can mark the fundamental difference between the two data sets: conventional data focuses on
subjects and their attributes, while network analysis considers the actors and the relations
among them.
2.3.3 Population samples and boundaries

One can easily understand the amplitude a social graph can reach, which permanently
increases the challenge of exploring it. Moreover, one must also consider the dynamicity
of the data, because a network can be forever shifting. We arrive here at concepts like
populations and boundaries.
Focusing on connections and actors when analysing a network makes it impossible to
select a node independently; we must always include all the ties and related actors, so the
tendency is to study entire populations at a time. Each time, though, the elements of the group
tend to fall within some boundary. We can draw two types of delimitation: ones defined by
the actors themselves (these include cases like same classroom, organization, community or
neighbourhood, thus having a real-life equivalent representation) and others taking into
consideration certain criteria, defining artificial clusters (certain family income, job,
academic degree level).
Boundary expansion can be done by replicating populations, thus studying several at
the same time. This allows hypothesis testing by comparing populations. A second way for
network studies to expand their scope is by including multiple levels of analysis (modalities).

2.3.4 Modality and analysis levels

Each individual often finds himself enclosed in more than one social structure at a
time. One can start from the family level, move up to the educational or work environment,
organizations, associations, etc. A simple way to explain this view is to consider the
individual person as being part of nested networks (networks embedded in networks that are
embedded in networks, etc.), a structure called “multi-modal”. A “mode” is represented by
any of the networks forming the intricate structure. For example, we can study patterns among
students from the same group (this being a single bounded population, as described in the
previous chapter). But this group exists within a faculty, so we can replicate the population
considering groups with different specializations or the same specialization in different years
of study. Moreover, the faculty is part of a university, which can be considered along with
other institutions from the same city, region, country, etc., thus taking the pattern study to
different modalities.
This hierarchical design leads to different levels of analysis (macro-level: global,
international, national; meso-level: state, city, community; micro-level: neighbourhood,
family, person). In social network analysis it is always interesting to understand how the
individual is embedded within this kind of structure or, vice versa, how structures emerge
from relationships among individuals.

2.4 Social Media

Social media is a concept frequently thrown around nowadays, but it does not really
have a definition that comprises the complexity of the notion. A better understanding of the
term may come from breaking it down. Media represents an instrument of communication,
like radio, newspapers or television; thus social media should be a social instrument of
communication.
When talking about social media in the context of Web 2.0, we must introduce the
concept of interactivity. A social media website does not simply offer information, but also
allows the user to interact by voting on articles, adding comments, rating them, etc. Unlike
regular media, social media is a two-way road; it does not limit your ability to share your
thoughts on the matter and communicate.
Social media is a very broad term that has actually evolved from the concept of social
networking and the necessity of sharing growing amounts of information and data. We
present some examples of social media websites for a better understanding of the concept:
1. Social Bookmarking (Del.icio.us, Blinklist, Simpy) – Interact by tagging websites and
searching through websites bookmarked by other people.
2. Social News (Digg, Propeller, Reddit) – Interact by voting for articles and
commenting on them.
3. Social Networking (Facebook, LinkedIn, Last.FM) – Interact by adding friends,
commenting on profiles, joining groups and having discussions.
4. Social Photo and Video Sharing (YouTube, Flickr) – Interact by sharing photos or
videos and commenting on user submissions.
5. Wikis (Wikipedia, Wikia) – Interact by adding articles and editing existing articles.
6. Blogging platforms (Blogspot, WordPress) – Interact by adding comments and voting
on articles.
Next we continue to develop the concept of “blogging” and present some blogging
platforms available to the general public free of charge.
2.4.1 Blogs

The word “blog” comes from “web log” (basically a journal on the Web) and it can be
considered a personal diary, where every entry is a post on the Internet. A blog is
essentially a web site composed of multiple articles, displayed in reverse chronological order.
The entries can contain text, pictures and links towards other blogs or sites. Blogging has
rapidly become more than a medium of self-expression; it has essentially become an
environment for communication, collaboration and debate, and, why not, an income source,
through advertisement placement.
At first blogs usually had a single author, but nowadays a collaborative blog is not
uncommon (especially when the collaboration emerges from a common
interest/hobby/passion of the participants). Moreover, the targeted area of the blog may not be
unique; authors can find it easier to write about a variety of topics and expand their
audience/followers.
For Internet users who do not directly participate in content creation, following a
blog, or multiple blogs depending on the area of interest, can represent a rich and accurate
source of information, an easy way to keep up to date and a comfortable environment in
which one can interact by posting further questions or comments on the matter in question.
2.4.2 Microblogs

Microblogs are a special form of blogs characterized by short and concise posts (~140
characters). Microblogs can lead to a very fast spread of information because they are easily
received on mobile devices as well. Awareness must be kept here, because this type of post
can also encourage users to lose attention or interest in the quality of the messages they
send, which can become dull, meaningless, simple routines of the user, with no value
whatsoever for readers. Classical examples of microblogs are Twitter and Tumblr.
2.4.3 Blogosphere

The blogosphere represents the set of all blogs, and it facilitates the emergence of
virtual communities, glued together by common interests. In the current paper we will take
into consideration a part of the Romanian blogosphere for our study. We find that a topic study
over this area can lead to interesting results and conclusions about Romanian bloggers.

2.4.4 Blogging Platforms

1. WordPress.com
WordPress.com is perhaps the most feature-rich blogging service available at this moment. It
uses the popular open source web software WordPress, and offers many features in its free
version: traffic stats, anti-spam filters, SEO, gorgeous themes and more.

2. Blog.com
Blog.com is a WordPress-powered blogging platform. It offers many beautiful
premium themes in its free accounts, as well as advanced plug-ins that one can usually expect
only on a self-hosted blog. The free storage space offered on Blog.com is 2GB in the basic
plan, which is less than the 3GB offered by WordPress.com.

3. Blogger
Google Blogger is a well-known blogging service that offers many features to its
users. For example, Blogger comes with a Template Designer user interface that lets you
tweak your blog's appearance as much as you want.

4. TypePad Micro
TypePad Micro's interface is easy to use, and you can import/export content from
many other blogging platforms. However, in terms of add-ons and themes offered, TypePad
Micro mostly obliges users to use a paid plan in order to obtain a decent customization level.

5. Tumblr
Tumblr is a popular microblogging platform. It comes with many outstanding and
interesting features such as audio blogging (for sharing your music, for example), free custom
domains, hundreds of amazing blogging themes and more.

2.5 Web Search

Web usage has seen a tremendous increase, up to the point where it now claims a
consistent fraction of humanity as partisans, relying on an open client-server design: (1)
the server communicates with the client via HTTP (the hypertext transfer protocol), which
is lightweight and simple, asynchronously carrying a variety of payloads (text, images,
audio and video files) encoded in a markup language called HTML (hypertext markup
language); (2) the client is generally a browser, an application within a graphical user
environment.
“The Web is unprecedented in many ways: unprecedented in scale, unprecedented in
the almost-complete lack of coordination in its creation, and unprecedented in the diversity of
backgrounds and motives of its participants.” (Christopher D. Manning, 2009) Each of these
characteristics contributes to making web search different from searching traditional
documents. Moreover, the task can even be considered harder, due to the lack of structure
and patterns.
2.5.1 Web Mining

Data mining (DM) is the central step in KDD (Knowledge Discovery in Databases,
the process that has the purpose of discovering relations within data sets), corresponding to
the application of algorithms in order to identify internal data patterns. It is an extended
interdisciplinary field, involving artificial intelligence, machine learning, statistics, database
management, etc. However, the task of extracting information is not always simple, especially
when dealing with unstructured data sets.
Web mining derives from data mining and represents the use of DM techniques in
order to acquire information from Web documents and services. As mentioned before, there is
a tight relationship between Web mining, artificial intelligence (machine learning, NLP) and
advanced data analysis. Each subject will be explained further on, and personalized to the
present work's approach.
We can also trace some major differences between Web mining and data mining. The
latter is usually carried out on databases, which one can consider a controlled and
structured environment, while Web mining deals with Web pages, thus processing documents
which lack structure or consistency in their HTML tags. Secondly, the amount of
processing can dramatically increase in Web mining compared to data mining. Additionally,
a database is usually a restricted environment, a private area in which one needs certain levels
of access rights in order to legally obtain information. On the other hand, Web mining
performs its tasks in a public environment, with no access rights required. However, when
dealing with automatic data retrieval, there can be some constraints in the form of the
“Robot Exclusion Protocol”, and of course the process should be polite and non-invasive and
should not damage the site in any way, for example by overloading it through increased
traffic generation.
Next we present a general guideline for Web mining, i.e. what steps one should follow in
order to obtain an eloquent result in the field.
1. Resource gathering: the task of retrieving the necessary documents from the Web.
This task can be done automatically or semi-automatically depending on the targeted
resources. Human involvement is not unheard of, due to the dynamicity and interactivity of
the medium.
2. Information selection and preprocessing: automatically selecting and preprocessing
specific information from retrieved Web resources. Before applying any algorithm, any
existing noise should be discarded from the data, in order to obtain a clean result.
3. Generalization: automatically discover general patterns across multiple documents,
according to the topic of interest/research. In order to implement this step, machine learning
or other data mining algorithms are usually used.
4. Analysis: validation and/or interpretation of the mined patterns, or even extension
of the patterns to other models. Again, humans can play an important role here, because an
interactive knowledge discovery medium is also allowed.
When dealing with the concept of Web mining we also come across others, like
information retrieval (IR) and information extraction (IE). Information retrieval has as its main
goal indexing documents, with the purpose of easing the search process in the digital ocean of
resources. On the other hand, IE represents the task of processing and transforming a set of
documents into a more digestible and easier to handle form, possibly even using IR in the
process. Thus IR looks to select the most relevant document given a certain query, while IE
tries to select the most relevant facts from the documents, working at a finer granularity than
the former.
Furthermore, there are two different views to be considered at this point. Looking at
the first two steps of the Web mining process, we could say the first one can integrate IR,
whilst the second one uses IE to procure the necessary form for the gathered resources, both
concepts thus being pre-processing stages for Web mining. However, one can also argue that
Web mining can be part of Web IR (but not all indexing tasks use data mining techniques)
and can be used to improve Web IE, by looking at a larger scale. An IE system is almost
impossible to build manually, due to the large scale and dynamicity of the Web
environment. Thus it often uses machine learning and mining techniques in order to learn
extraction patterns and obtain the documents automatically.
Further on we categorize Web mining into three main areas (from Web Mining
Research: A Survey): Web structure mining, Web usage mining and Web content mining.
2.5.1.1 Web Structure Mining

Web structure mining pursues the goal of discovering the model underlying the link
structure of the Web by using graph theory. This field researches the topology of the
hyperlinks of a Web page and can be used to categorize pages and generate relationships
between them. According to the type of Web structural data, Web structure mining can be
divided into:
1. Extracting patterns from hyperlinks in the Web page.
2. Mining the document structure, analysing the tree-like structure of a page in order to
describe HTML or XML tag usage.
2.5.1.2 Web Usage Mining

Web usage mining proceeds to the extraction of data generated by the Web surfing
session (for example, the user's history). Until now we have only spoken about primary data
mining, but Web usage mining goes beyond that: it retrieves secondary data derived
from the users' interaction with the Web. Web usage mining can discover usage patterns in
the user data, browsing behaviours that show what he/she likes, prefers or is interested in. The
Web usage data can include the data from Web server access logs, proxy server logs, browser
logs, user profiles, registration data, user sessions or transactions, cookies, user queries,
bookmark data, mouse clicks and scrolls, and any other data resulting from interactions.
We can further classify Web usage mining into three categories, depending on the kind
of usage data considered:
• Web Server Data: the user logs are collected by the Web server. Typical data
includes IP address, page reference and access time.
• Application Server Data: commercial application servers have significant
features to enable e-commerce applications to be built on top of them with
little effort. A key feature is the ability to track various kinds of business
events and log them in application server logs.
• Application Level Data: new kinds of events can be defined in an application,
and logging can be turned on for them, thus generating histories of these
specially defined events. It must be noted, however, that many end
applications require a combination of one or more of the techniques applied in
the categories above.

2.5.1.3 Web Content Mining

Web content mining describes the discovery, extraction and processing of data,
information and knowledge from the content of a Web page. A web page can nowadays
engulf a wide range of data types, like text, images, audio, video, metadata and hyperlinks.
Usually we deal here with unstructured data such as free text, semi-structured
data such as HTML documents, and more structured data such as data in tables or
database-generated HTML pages. The research around applying data mining techniques to
unstructured text is termed knowledge discovery in texts (KDT).
The diversity and the lack of structure that characterize the ever-expanding
information sources on the Web (such as hypertext documents) make automated discovery,
organization and search very difficult. These factors have prompted researchers
to develop more intelligent tools for information retrieval (such as intelligent web agents), as
well as to extend existing techniques so that they provide a higher level of organization for
the semi-structured data available on the web.
We can discuss two approaches to Web content mining: agent-based and
database.
• Agent-Based Approach: represents the creation of AI systems that discover,
acquire and organize Web data automatically or semi-automatically.
• Database Approach: focuses on integrating and organizing the data into
structured collections (like relational databases).
When extracting Web content information using Web mining, there are four typical
steps (a short parsing sketch follows the list):
1. Collecting – acquire the content from the Web.
2. Parsing – extract usable data from the HTML pages or other document types.
3. Analyzing – tokenize, rate, rank, classify, cluster, filter, sort, etc.
4. Producing – use the result for further reports, studies, etc.
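As a hint of what the parsing step looks like in practice, the snippet below uses jsoup, the HTML parser adopted later in this work (section 5.1); the HTML fragment is made up for the example:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class ParseDemo {
    public static void main(String[] args) {
        // A made-up fragment standing in for a collected blog page.
        String html = "<html><body><h1>Alegeri 2012</h1><p>Primul articol.</p></body></html>";
        Document doc = Jsoup.parse(html);       // step 2: parse the markup
        System.out.println(doc.body().text());  // usable text: "Alegeri 2012 Primul articol."
    }
}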

Table 1, presented in (Raymond Kosala, 2000), gives a general image of the above
Web mining categories; note, however, that one of the two Web content mining views is
renamed: information retrieval instead of agent-based.

Table 1 Web mining categories (Raymond Kosala, 2000)

2.5.2 Web Crawling

A Web crawler (also called Internet bot, Web spider, automatic indexer or Web
scutter) is a software application that gathers pages from the Web for various purposes. The
task is performed automatically, thus saving a lot of time (compared to it being done
manually). Unfortunately, the lack of semantics on today's Web complicates this task,
because the content of web pages is merely machine-readable, not machine-understandable.
Thus this task has to be carefully carried out in order to sort and store the
necessary metadata and information.
Web crawling is the process of automatically scanning Web pages with a crawler in
order to extract various data from them. Many applications, especially search engines, use
crawlers to extract and classify Web page content, easing information retrieval
for users. Crawlers can also come in handy for actions like website maintenance,
automatically performing tasks like link and HTML tag validation. A negative example of
crawler use is the harvesting of e-mail addresses for advertisement campaigns.

2.5.2.1 Guidelines

In order to build the crawler for this work, a few of the principles stated in (Christopher
D. Manning, 2009) will be used as guidelines:
• Robustness: the Web contains servers that generate spider traps, which create web
pages that mislead crawlers into getting stuck in an infinite fetching loop. Crawlers
must be designed to be resilient to such traps.
• Politeness: Web servers have both implicit and explicit policies regulating the rate
at which a crawler can visit them. This standard is known under the name of
“Robot exclusion protocol”. Usually a file named robots.txt, placed at the root of
the URL, specifies the website's policies for crawling.
• Extensibility: crawlers should be designed to be extensible in many ways, to cope
with new data formats, new fetch protocols, and so on. This demands that the
crawler architecture be modular.

2.5.2.2 Crawler Architecture

The basic architecture (Figure 9) consists of the following modules; a sketch in code
follows the figure:
1. The URL frontier, containing URLs yet to be fetched in the current crawl (in the case of
continuous crawling, a URL may have been fetched previously but is back in the frontier
for re-fetching).
2. A fetch module that uses the HTTP protocol to retrieve the web page at a URL.
3. A parsing module that extracts the text and set of links from a fetched web page.
4. A duplicate elimination module that determines whether an extracted link is already in the
URL frontier or has recently been fetched.

Figure 9 Block representation of basic crawler architecture
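A minimal sketch of this architecture, assuming jsoup for the fetch and parse modules; the seed URL, page limit and delay are placeholder values, and robots.txt handling (section 2.5.2.1) is omitted for brevity:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import java.util.*;

public class MiniCrawler {
    public static void main(String[] args) throws Exception {
        Queue<String> frontier = new ArrayDeque<>();   // 1. URL frontier
        Set<String> visited = new HashSet<>();         // 4. duplicate elimination
        frontier.add("http://example.com/");           // placeholder seed

        while (!frontier.isEmpty() && visited.size() < 50) {
            String url = frontier.poll();
            if (!visited.add(url)) continue;           // already fetched
            Document doc = Jsoup.connect(url).get();   // 2. fetch + 3. parse
            System.out.println(url + " -> " + doc.title());
            for (Element link : doc.select("a[href]")) {
                String next = link.attr("abs:href");   // resolve to absolute URL
                if (!next.isEmpty() && !visited.contains(next)) frontier.add(next);
            }
            Thread.sleep(1000);                        // crude politeness delay
        }
    }
}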


2.5.2.3 Crawler classification

As mentioned before, crawlers can be used for a large variety of purposes. We
present next an approximate classification of Web spiders, although most of the time they
are hybrid versions of these classes:
• Commercial crawler – used to search for the best offer for a product and present
it on specialized Web sites, or even to compare it with the firm's internal products
and prices.
• Download crawler – malware which interferes with a user's Internet connection
in order to automatically download Web pages.
• Spam crawler – also called a spambot; collects email addresses and phone numbers
in order to spread advertisements.
• Testing crawler – used for maintenance purposes: to validate links and verify
HTML code for sites.
• Copyrights crawler – searches the Internet for copied content, in order to help the
rightful author obtain compensation for the infringement.
• Spy crawler – spyware, i.e. software that gathers information about persons or
organizations without their knowledge or consent and can send the gathered
information to rival entities.
• Hacking crawler – searches for vulnerable systems and exploits them.
• DDoS crawler – computers and networks which are part of a so-called botnet
(or zombie army) and can be used in the distribution of DDoS (Distributed Denial
of Service) attacks on other sites or servers. The owners of the systems are not
aware of the takeover. Moreover, the computers (called zombies in this case) can
be accessed and controlled externally by the owner of the botnet at any moment.

2.6 Topic Classification

Topic classification (text classification, text categorization, or topic spotting)
represents the analysis of a text and its labelling under a topic based on its content. Text
categorization can be considered a pattern classification task for text mining (a subclass of data
mining) that can lead to better management of textual information systems.
There are two types of approaches to the topic classification problem:
• Rule-based approach – classification rules are manually defined in the form of if-
then-else clauses and the documents are classified accordingly. This method has
high precision, but poor recall, especially because of its lack of flexibility and the
enormous amount of manual labour required.
• Machine learning approach – classification rules or equations are automatically
generated using sample labelled documents. Although the precision is a bit lower
than in the previous case, the increased flexibility makes it the preferred solution
for topic classification problems.


2.6.1 Problem statement

In text classification, we are given a description d ∈ X of a document, where X is the
document space, and a fixed set of classes C = {c1, c2, ..., cJ}. Classes are also called
categories or labels.
Typically, the document space X is some type of high-dimensional space, and the
classes are human defined for the needs of an application.
We are given a training set D of labelled documents ⟨d, c⟩, where ⟨d, c⟩ ∈ X × C.
Using a learning method or learning algorithm, we then wish to learn a classifier or
classification function γ that maps documents to classes:

γ : X → C

(Christopher D. Manning, 2009)
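In code, the function γ reduces to an interface like the following (a hypothetical sketch, not the application's actual class):

import java.util.Map;

// gamma : X -> C expressed as a Java interface; the names are illustrative.
public interface TopicClassifier {
    // Learn gamma from a training set D of (document, class label) pairs.
    void train(Map<String, String> labelledDocuments);
    // Apply gamma: map a document d in X to a class c in C.
    String classify(String document);
}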

2.6.2 Machine Learning

Machine learning is a subfield of artificial intelligence which deals with the
construction and study of systems that can learn from a set of given data, through different
learning algorithms. Machine learning is used in a variety of applications nowadays, like
Web search, spam filters, recommender systems, ad placement, credit scoring, fraud
detection, stock trading, drug design and many others.
In order to build a classification system based on machine learning, one should
consider the following components, as presented in (Domingos, 2012):
• Representation – the classifier must have a machine-understandable representation
in a chosen formal language, so that it can be handled by a computer. The classifier
is a system that receives feature values as input and outputs a class; thus the chosen
representation for the features is also an important matter.
• Evaluation – the performance of the classifier must be evaluated using evaluation
functions (also called scoring functions or objective functions). They can either be
internal, used during the learning process, or external, used for optimization.
• Optimization – one needs a function to search among classifiers for the highest-
scoring ones. The chosen optimization technique is key to a well-performing
classifier.
Some examples for each component are offered in Table 2. Note that not all the
combinations between the three columns are possible; for example, discrete representations
naturally go with combinatorial optimization, and continuous ones with continuous
optimization.

Representation                   Evaluation              Optimization
Instances                        Accuracy/Error rate     Combinatorial optimization
  K-nearest neighbor             Precision and recall      Greedy search
  Support vector machines        Squared error             Beam search
Hyperplanes                      K-L divergence            Branch-and-bound
  Naïve Bayes                    Likelihood              Continuous optimization
  Logistic regression            Posterior probability     Unconstrained
Graphical models                 Information gain            Gradient descent
  Bayesian networks              Cost/Utility                Conjugate gradient
  Conditional random fields      Margin                      Quasi-Newton methods
Sets of rules                                              Constrained
  Propositional rules                                        Linear programming
  Logic programs                                             Quadratic programming
Neural networks
Decision trees
Table 2 Learning algorithm components (Domingos, 2012)

Figure 10 presents a general view of the machine learning process, applied to this
particular work. After the document content is fetched from the database (the content was
previously acquired by the web crawler), the text is tokenized and each word goes through
pruning algorithms: the stopwords are eliminated and the remaining words are stemmed.
Feature selection is applied to further trim the content to the most important words, and
finally the chosen learning algorithm is applied.

Figure 10 General process of text classification
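
To make the flow in Figure 10 concrete, the following sketch strings the steps together in
Java. The helper types (Stoplist, Stemmer, FeatureSelector, TopicClassifier) are hypothetical
stand-ins for the project's actual components, not their real names:

import java.util.ArrayList;
import java.util.List;

// Hypothetical helper contracts standing in for the project's classes.
interface Stoplist { boolean contains(String word); }
interface Stemmer { String stem(String word); }
interface FeatureSelector { List<String> select(List<String> stems); }
interface TopicClassifier { String classify(List<String> features); }

// Minimal sketch of the pipeline in Figure 10.
public class ClassificationPipeline {
    public static String classify(String rawContent, Stoplist stoplist, Stemmer stemmer,
                                  FeatureSelector selector, TopicClassifier classifier) {
        // 1. Tokenize the fetched content on non-letter characters.
        String[] tokens = rawContent.toLowerCase().split("[^\\p{L}]+");
        // 2. Pruning: drop stopwords and stem the remaining words.
        List<String> stems = new ArrayList<String>();
        for (String token : tokens) {
            if (token.isEmpty() || stoplist.contains(token)) continue;
            stems.add(stemmer.stem(token));
        }
        // 3. Feature selection trims the list to the most informative terms.
        List<String> features = selector.select(stems);
        // 4. The chosen learning algorithm assigns a topic.
        return classifier.classify(features);
    }
}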

2.6.2.1 Pruning
As the traditional definition of the word suggests, pruning represents the process of
trimming, of removing what is considered superfluous or undesirable. In our
context this task is carried out by two actions: stemming and stoplisting.
2.6.2.1.1 Stemming

Stemming is the process of transforming words into their stems (e.g. "choosing" and
"chose" become "choose"). This simplifies any problem or process of dictionary matching or,
in our case, of topic modelling. Moreover, we can also choose to keep only word roots in
order to further simplify the context.
In the present context, because we only consider text in Romanian, the project
incorporates a Romanian stemmer written in Java, modelled by the Snowball project1. This
offers a great advantage, removing the necessity of having an extensive vocabulary of the
Romanian language, as well as a vocabulary of personalities, locations, etc. The content is
stemmed, in the end obtaining in the word list both common and proper
nouns (representing unique entities like cities, politicians, writers, locations, etc.). We
consider the incorporation of a stemmer in the application to bring a tremendous advantage
for the pruning process, removing the necessity of additional tables in the database, not to
mention the fact that a full list of Romanian proper nouns is not available, and its
construction would have been a manual, tedious, time-consuming task.
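
As an illustration, the snippet below shows how the Snowball-generated Romanian stemmer
is typically invoked; the class and method names (romanianStemmer, setCurrent, stem,
getCurrent) follow the Java stemmer code distributed by the Snowball project, and the example
words are illustrative:

import org.tartarus.snowball.ext.romanianStemmer;

// Minimal usage sketch of the Snowball Romanian stemmer.
public class StemDemo {
    public static void main(String[] args) {
        romanianStemmer stemmer = new romanianStemmer();
        for (String word : new String[] {"manastirea", "bisericile", "calatoriile"}) {
            stemmer.setCurrent(word);   // load the word to be stemmed
            stemmer.stem();             // apply the Romanian stemming rules
            System.out.println(word + " -> " + stemmer.getCurrent());
        }
    }
}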
2.6.2.1.2 Stoplisting

As anyone can imagine, any block of text also contains uninformative words like
pronouns and articles (e.g. "you", "the"). Stoplisting is the process of removing all these
uninformative words, which are useless for the purpose presented here. Figure 11 presents a
few examples from the list of Romanian stopwords used in the application.

Figure 11 Extract from the list of Romanian stopwords

1 http://snowball.tartarus.org/

Moreover, there are two other sets of words that can be removed: the ones that occur
seldom and the ones that are present in more than 10% of the blogs, because their
distribution would not make them proper candidates for offering meaningful topics. In our
work, however, this type of filtering is done using feature selection, explained in the
following section.
2.6.2.2 Feature selection

It is easy to understand how words can have varying importance in relation to a given
subject. For example, the word "minister" has a significantly lower importance for the topic
"Religion and Spirituality" than the word "God", due to its ambiguity (it may refer to a
Christian minister, a diplomatic rank below ambassador, or the head of a government
department). But how can one automatically discover a word's weight in relation to a given
class?
As a solution to the above mentioned problem we chose TF-IDF (term frequency –
inverse document frequency), a numerical statistic that reflects the importance of a word in a
collection of documents in relation to a given set of topics. By applying TF-IDF weighting we
obtain a list of words from a document ranked in order of importance for a chosen class.
2.6.2.3 TF-IDF

As we mentioned before, in order to further improve our domain (the list of words
retrieved from a blog) we used TF-IDF weighting for each word. The TF-IDF value increases
proportionally to the number of times a word appears in the document, but is offset by the
frequency of the word in the corpus, which helps to control for the fact that some words are
generally more common than others.
TF-IDF is the product of two statistics: term frequency and inverse document
frequency. In the case of TF ($tf(t,d)$, where $t$ is the weighted term and $d$ is the
document) there are various methods for determining a value:
 Raw term frequency – the number of times term $t$ occurs in document $d$:
$$tf(t,d) = f(t,d)$$
 Boolean term frequency:
$$tf(t,d) = \begin{cases} 1, & \text{if } t \text{ occurs in } d \\ 0, & \text{otherwise} \end{cases}$$
 Logarithmically scaled frequency (adding one in case the frequency is 0):
$$tf(t,d) = \ln(f(t,d) + 1)$$
 Augmented frequency:
$$tf(t,d) = 0.5 + \frac{0.5 \cdot f(t,d)}{\max\{f(w,d) : w \in d\}}$$
For the current work we computed and tested the results with the raw term frequency,
the logarithmically scaled frequency and the augmented frequency. We decided to choose
between the last two options, and because for our documents the difference between the two
computed values was insignificant, the final choice was the logarithmically
scaled frequency, because it is easy to compute.
The inverse document frequency is a measure of whether the term is common or rare
across all documents. It is obtained by dividing the total number of documents by the number
of documents containing the term, and then taking the logarithm of that quotient:
$$idf(t,D) = \ln\frac{|D|}{|\{d \in D : t \in d\}| + 1}$$
where $|D|$ represents the cardinality of $D$, the total number of documents, and
$|\{d \in D : t \in d\}|$ represents the number of documents in which term $t$ appears.

Finally, TF-IDF is computed as:
$$tf\text{-}idf(t,d,D) = tf(t,d) \cdot idf(t,D)$$
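
To illustrate, here is a minimal sketch of the weighting defined above, using the
logarithmically scaled term frequency and the smoothed inverse document frequency. The
corpus is modelled simply as a list of tokenized documents, and all names are illustrative:

import java.util.List;

// Sketch of the TF-IDF weighting defined above: logarithmically scaled term
// frequency and an IDF smoothed by adding 1 to the document count.
public class TfIdf {

    // tf(t,d) = ln(f(t,d) + 1)
    public static double tf(String term, List<String> document) {
        int f = 0;
        for (String token : document) {
            if (token.equals(term)) f++;
        }
        return Math.log(f + 1.0);
    }

    // idf(t,D) = ln(|D| / (df(t) + 1))
    public static double idf(String term, List<List<String>> corpus) {
        int df = 0;
        for (List<String> document : corpus) {
            if (document.contains(term)) df++;
        }
        return Math.log((double) corpus.size() / (df + 1));
    }

    // tf-idf(t,d,D) = tf(t,d) * idf(t,D)
    public static double tfIdf(String term, List<String> document, List<List<String>> corpus) {
        return tf(term, document) * idf(term, corpus);
    }
}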

Tables 3 and 4 present the first 10 words for the topic "Religion and spirituality",
sorted by term frequency and by TF-IDF respectively.

Word         TF    TF-IDF
nr           262   7.724646
dumnezeu     235   12.580938
euro         230   7.544793
aceast       225   5.955066
sau          224   3.754155
spun         196   5.804193
of           184   3.618475
manastir     168   15.618092
biser        160   13.033545
manast       160   14.088644
Table 3 Terms sorted by TF

Word         TF    TF-IDF
iisus        92    17.731633
invier       55    17.379408
stareț       29    17.04216
dejan        27    16.696462
protopop     27    16.696462
pilat        45    16.530114
psalm        45    16.530114
bucium       43    16.338194
șinc         25    16.325134
brancoveanu  22    15.710818
Table 4 Terms sorted by TF-IDF

One can easily notice that Table 4 presents the words in a more stable order, with
more "weight" when the relation with the topic is taken into consideration. We do not argue
that the words from Table 3 are not important; one can notice that the significant words
have much higher TF-IDF scores than the ones not related to the subject. But here we show
that the weighting system really makes an important difference and can actually be used for
automatic feature selection, whilst the TF method would also require a lot of manual
selection, as it does not eliminate more general words like "euro" or "spun" here.
We also have to mention that, while looking over the selected keyword lists, we
noticed many English words related to the subjects they were selected for. This was possible
only due to the incorporated stemmer and a good numeric statistic for term weighting.
2.6.3 Machine Learning Approaches

In the next sections we present and explain some of the representations mentioned in
Table 2 that are also used in the application built for this work.
2.6.3.1 Naïve Bayes Classification

The Naive Bayes classification method is one of the simplest and most frequently
used for classification problems. This approach describes a statistical method for solving
such problems based on Bayes' Theorem. The Bayesian classifier is called naïve because it
assumes the independence of the features.
Bayes Theorem
Let $d_i$, where $i \in \{1,\ldots,N\}$, be one of the $N$ documents that need to be classified. We
define it using a vector $v$ with $M$ elements: $d_i = (d_{i,1}, d_{i,2}, \ldots, d_{i,M}) = (v_1, v_2, \ldots, v_M)$. The
document will be associated to a class $C_j$, $j \in \{1,\ldots,K\}$, chosen from the $K$ existing classes.
Using Bayes' theorem, the probability that document $d_i$ belongs to a class $C_j$ is given by the
relation:
$$P(C_j \mid d_i) = \frac{P(d_i \mid C_j)}{P(d_i)} P(C_j) \quad (1)$$
Document $d_i$ will belong to the class for which the membership probability is the
highest:
$$class(d_i) = \arg\max_{C_j} P(C_j \mid d_i) \quad (2)$$
Using (1) and the fact that $P(d_i)$ is the same for all the classes, the classification of the
document $d_i$ is given by:
$$class(d_i) = \arg\max_{C_j} P(d_i \mid C_j) P(C_j) \quad (3)$$
We can divide Bayesian classifiers into two categories, according to the method of
computing the probability $P(d_i \mid C_j)$: Bernoulli multivariate models and multinomial models.
The multivariate models represent the document as a vector $v = (v_1, v_2, \ldots, v_M)$, where $M$ is
the dimension of the vocabulary (number of keywords) and each binary element $v_i$ flags the
presence or absence of the keyword $w_i$ in the document. The main drawback of this model
comes from the fact that it does not take into account the number of occurrences of each
keyword. In the multinomial model the value of each element is given by the occurrence
frequency of that word in the document, disregarding the word order.
The Bayesian classifier has, of course, some disadvantages:
– The size of the document affects the classification, because the longer the
document, the more terms participate in the parameter estimation
– The categories which contain fewer documents are hard to represent, a feature
selection method being necessary.
Some important advantages on the other hand are:
 It is a popular choice, being well documented everywhere
 It is fast and easy to implement
 It gives excellent results
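
As an illustration of decision rule (3), the sketch below scores a document against each
class in log space (to avoid numerical underflow), assuming the priors P(C_j) and the
conditional probabilities P(w|C_j) were estimated from the training set beforehand; all names
are illustrative:

import java.util.List;
import java.util.Map;

// Sketch of decision rule (3) for a multinomial Naive Bayes classifier.
public class NaiveBayesScorer {

    public static String classify(List<String> words,
                                  Map<String, Double> prior,                   // class -> P(C_j)
                                  Map<String, Map<String, Double>> condProb) { // class -> (word -> P(w|C_j))
        String best = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (String cls : prior.keySet()) {
            // log P(C_j) + sum of log P(w|C_j) over the words of the document
            double score = Math.log(prior.get(cls));
            Map<String, Double> probs = condProb.get(cls);
            for (String w : words) {
                Double p = probs.get(w);
                score += Math.log(p != null ? p : 1e-9); // small floor for unseen words
            }
            if (score > bestScore) { bestScore = score; best = cls; }
        }
        return best; // argmax over C_j of P(d|C_j) P(C_j)
    }
}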
2.6.3.2 K-Nearest Neighbour

The K-NN classifier is based on the idea of associating an element to the class where
most of its neighbours are placed. It is one of the simplest classification methods, being a
non-parametric lazy learning algorithm, where the approximations are done locally and the
computations are performed only at classification time. Moreover, it does not make any
assumptions about the underlying data distribution. Being a lazy algorithm means that it does
not use the training data points to do any generalization; thus there is no explicit training
phase, or it is very minimal.
For the classification process we compute the distance between the document to be
classified, represented by the characteristic vector $v = (v_1, v_2, \ldots, v_M)$, and its already
classified neighbours. For a given constant $K$, we take the $K$ nearest (classified) neighbours
of the document and we determine the class of the majority of them. The performance of this
classifier depends on only two parameters: the number $K$ of neighbours and the similarity
coefficient. A simple approach is to select $K = \sqrt{n}$.
Some advantages of the K-NN classifier are:
 Easy to implement
 Robust to noisy inputs
 Well adapted for multi-class classification
Disadvantages:
– For N documents represented by M-dimensional vectors, finding the K nearest
neighbours takes O(MN) time, thus limiting the applicability for sets with large
dimensions. This problem is tackled by using efficient algorithms and
dedicated data structures (k-dimensional trees).
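
A minimal sketch of the K-NN decision procedure described above, using Euclidean distance
and a majority vote; the labelled training vectors are assumed given, and all names are
illustrative:

import java.util.*;

// Sketch of K-NN majority voting over documents as dense feature vectors.
public class KnnClassifier {

    static class Labeled {
        final double[] vector;
        final String label;
        Labeled(double[] vector, String label) { this.vector = vector; this.label = label; }
    }

    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }

    public static String classify(final double[] query, List<Labeled> training, int k) {
        // Sort the training set by distance to the query document...
        List<Labeled> sorted = new ArrayList<Labeled>(training);
        Collections.sort(sorted, new Comparator<Labeled>() {
            public int compare(Labeled a, Labeled b) {
                return Double.compare(distance(query, a.vector), distance(query, b.vector));
            }
        });
        // ...then take a majority vote among the K nearest neighbours.
        Map<String, Integer> votes = new HashMap<String, Integer>();
        String best = null;
        for (Labeled n : sorted.subList(0, Math.min(k, sorted.size()))) {
            Integer c = votes.get(n.label);
            votes.put(n.label, c == null ? 1 : c + 1);
            if (best == null || votes.get(n.label) > votes.get(best)) best = n.label;
        }
        return best;
    }
}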

2.6.3.3 Support Vector Machines

The Support Vector Machines (SVM) classifier was initially designed for a two-class
linearly separable learning task (linear binary classification). The idea was to find a
hyperplane that can separate the two classes of given samples with a maximal margin that
could offer the best generalization ability. By this we mean that not only does the classifier
have a high accuracy on the training data, but it also offers a high predictive accuracy for
future data. The margin can be defined as the amount of space between the two classes, as
determined by the hyperplane. Geometrically, the margin represents the shortest distance
from the closest data points to the hyperplane.

Figure 12 Optimal hyperplane in SVC (Xindong Wu, 2009)
Let w denote the weight vector, b the bias of the optimal hyperplane and x a sample.
The corresponding hyperplane can be defined as:
$$w^T x + b = 0 \quad (1)$$
Let $g(x) = w^T x + b$ be the discriminant function, also called x's functional margin
given w and b. The distance from x to the optimal hyperplane can be defined as:
$$r = \frac{g(x)}{\lVert w \rVert} \quad (2)$$
Consequently, the goal is to find the parameters w and b of an optimal hyperplane, in
order to maximize the margin of separation. That is why the SVM classifier is also called a
maximal margin classifier.
Sometimes, however, the data samples are not linearly separable in the initial space. In
order to surpass this problem, a projection into a higher order space is done with the help of a
function Φ, and there we find the optimal hyperplane.
The extension from a binary classifier can be done in multiple ways. The simplest
solution is to binary separate each class from all the other classes with an optimal hyperplane.
For C classes we compute C hyperplanes, with linear complexity. Another solution is to
binary separate all the possible pairs of classes by determining the optimal hyperplane. For C
classes we have C(C-1)/2 pairs and the same number of hyperplanes, with quadratic
complexity. The increase in complexity is compensated by the reduced dimension of the
hyperplanes.
A few of the advantages of the SVM classifier:
 SVMs are less prone to overfitting
 They offer a fast alternative method for training and classifying
 They can take a large number of documents as input
Disadvantages:
– The SVM implementation is quite complex
– Performance is inversely proportional to the number of input
documents
For our application we use the Weka library to implement this type of classifier.
Nevertheless, we use an optimized version of the algorithm: Support Vector Machines using
Sequential Minimal Optimization (SMO). Training an SVM requires the solution of a large
quadratic programming (QP) optimization problem. The SMO approach is to break this QP
problem into a series of smallest possible QP problems. These problems are solved
analytically, avoiding a time-consuming numerical QP optimization as an inner loop. The
amount of memory required for SMO is linear in the training set size, which allows SMO to
handle very large training sets. Because large matrix computation is avoided, SMO scales
somewhere between linear and quadratic in the training set size for various test problems.
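
As a usage illustration, the snippet below trains Weka's SMO implementation on an ".arff"
file and classifies an instance. The Weka classes shown (SMO, Instances, DataSource) are part
of the Weka 3 API; the file name is illustrative, and the training set is assumed to be already
in numeric/nominal form:

import weka.classifiers.functions.SMO;
import weka.core.Instance;
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;

// Sketch of training and using Weka's SMO-based SVM classifier.
public class SmoDemo {
    public static void main(String[] args) throws Exception {
        Instances train = DataSource.read("training.arff");
        train.setClassIndex(train.numAttributes() - 1);   // last attribute is the topic

        SMO svm = new SMO();                // sequential minimal optimization trainer
        svm.buildClassifier(train);

        Instance unknown = train.instance(0);             // stand-in for a new document
        double classIndex = svm.classifyInstance(unknown);
        System.out.println("Predicted topic: " + train.classAttribute().value((int) classIndex));
    }
}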
2.6.3.4 Clustering Algorithms

The goal of clustering algorithms is to map the data into clusters, groupings of the data
based on similarities. Unlike classification and prediction, which analyse class-labelled data
items, clustering analyses the data without any labels and tries to generate them afterwards.
There are many clustering algorithms available; we will only mention the most used ones.
2.6.3.4.1 Hierarchical Algorithms
These methods create a hierarchical decomposition of the data items in the form of a
tree-like diagram called a dendrogram (Figure 13). There are two approaches available here.
The agglomerative approach (bottom-up) starts by considering each separate item to be a
group and successively merges the objects together, until only one group remains. The
divisive approach (top-down) starts with all the objects in the same cluster and continues
splitting until all the items are in separate clusters.

Figure 13 Matlab Dendrogram Plot example

2.6.3.4.2 Partitional Algorithms
These methods involve segmenting the data items into k partitions, over a number of
iterations, optimizing some criterion.
K-means Algorithm
This is the most popular algorithm of this category. It basically selects k random
objects as cluster means (or centres) and then works towards optimizing the squared error
criterion, defined as:
$$E = \sum_{i=1}^{k} \sum_{x \in C_i} \lVert x - m_i \rVert^2, \quad \text{where } m_i \text{ is the mean of cluster } C_i$$
The main steps of the algorithm are:
a) Assign initial means $m_i$
b) Assign each data item x to the cluster $C_i$ with the closest mean
c) Recompute the clusters' means
d) Repeat until the criterion function converges (i.e. there are no more new assignments),
as in the sketch below.
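
A minimal sketch of steps a)–d) for one-dimensional data, with randomly chosen initial
means; all names are illustrative:

import java.util.Random;

// Sketch of the K-means steps a)-d) above for one-dimensional data.
public class KMeans {

    public static double[] cluster(double[] items, int k, int maxIterations) {
        Random rnd = new Random();
        double[] means = new double[k];
        for (int i = 0; i < k; i++) {                       // a) assign initial means
            means[i] = items[rnd.nextInt(items.length)];
        }
        int[] assignment = new int[items.length];
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            for (int n = 0; n < items.length; n++) {        // b) assign to closest mean
                int best = 0;
                for (int i = 1; i < k; i++) {
                    if (Math.abs(items[n] - means[i]) < Math.abs(items[n] - means[best])) best = i;
                }
                if (assignment[n] != best) { assignment[n] = best; changed = true; }
            }
            for (int i = 0; i < k; i++) {                   // c) recompute cluster means
                double sum = 0; int count = 0;
                for (int n = 0; n < items.length; n++) {
                    if (assignment[n] == i) { sum += items[n]; count++; }
                }
                if (count > 0) means[i] = sum / count;
            }
            if (!changed) break;                            // d) stop when converged
        }
        return means;
    }
}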

2.6.3.5 Decision Trees

The Decision Tree (DT) algorithm is one of the most popular choices for learning from
feature-based examples. A DT is a classification schema that generates a tree structure and a
set of rules, modelling the different classes, from a given dataset. Each internal node defines a
test on an attribute, each branch represents an outcome of the test and the leaf nodes present
the class distribution.
Some major advantages of DTs are:
 DTs are capable of generating and modelling understandable rules (expressible
in natural language)
 They can handle both numerical and categorical attributes
 They are robust when facing a noisy data set
A few weaknesses of DTs are:
– The process of growing a DT is computationally expensive.
– DTs are predisposed to overfitting, some branches being too specific for the
given training set. Some pruning methods are in order after the tree creation
to eliminate such branches. These methods are implemented in algorithms like
C4.5.

ID3 Algorithm
The ID3 algorithm (Iterative Dichotomiser 3) was introduced by Ross Quinlan
(1986) for constructing decision trees from data. ID3 is the precursor of the C4.5
algorithm. In ID3, each node corresponds to a splitting attribute and each branch is a possible
value of that attribute. At each node, the splitting attribute is selected to be the most
informative among the attributes not yet considered in the path from the root. The algorithm
uses the criterion of information gain to determine the goodness of a split. The attribute with
the greatest information gain is taken as the splitting attribute, and the dataset is split for all
distinct values of the attribute.
The ID3 algorithm begins with the original set S as the root node. On each iteration,
it goes through every unused attribute of the set S and calculates the entropy
H(S) (or the information gain IG(S)) of that attribute. It then selects the attribute which has the
smallest entropy (or largest information gain). The set S is then split by the selected
attribute to produce subsets of the data. The algorithm continues to recurse on each subset,
considering only attributes never selected before.

Figure 14 ID3 classifier example for Weather set (School, 2003)
C4.5
C4.5 is an extension of the ID3 algorithm. The decision trees generated by C4.5 can
be used for classification, and for this reason C4.5 is often referred to as a statistical
classifier.
C4.5 builds decision trees from a set of training data in the same way as ID3, using
the concept of information entropy. The training data is a set S = s1, s2, ... of already
classified samples. Each sample si consists of a p-dimensional vector (x1,i, x2,i, ..., xp,i), where
xj represents an attribute of the sample, together with the class si belongs to.
At each node of the tree, C4.5 chooses the attribute of the data that most effectively
splits its set of samples into subsets enriched in one class or the other. The splitting criterion
is the normalized information gain (difference in entropy). The attribute with the highest
normalized information gain is chosen to make the decision. The C4.5 algorithm then recurses
on the resulting subsets, using the remaining attribute list.
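
To make the splitting criterion concrete, the sketch below computes the entropy H(S) and
the information gain of an attribute over a small labelled dataset; the representation of
samples is illustrative:

import java.util.*;

// Sketch of the entropy and information gain computations used by ID3/C4.5.
public class InformationGain {

    static class Sample {
        final Map<String, String> attributes;
        final String label;
        Sample(Map<String, String> attributes, String label) {
            this.attributes = attributes; this.label = label;
        }
    }

    // H(S) = -sum over classes of p(c) * log2 p(c)
    static double entropy(List<Sample> set) {
        Map<String, Integer> counts = new HashMap<String, Integer>();
        for (Sample s : set) {
            Integer c = counts.get(s.label);
            counts.put(s.label, c == null ? 1 : c + 1);
        }
        double h = 0;
        for (int count : counts.values()) {
            double p = (double) count / set.size();
            h -= p * (Math.log(p) / Math.log(2));
        }
        return h;
    }

    // IG(S, A) = H(S) - sum over values v of A of (|S_v| / |S|) * H(S_v)
    static double informationGain(List<Sample> set, String attribute) {
        Map<String, List<Sample>> partitions = new HashMap<String, List<Sample>>();
        for (Sample s : set) {
            String v = s.attributes.get(attribute);
            if (!partitions.containsKey(v)) partitions.put(v, new ArrayList<Sample>());
            partitions.get(v).add(s);
        }
        double remainder = 0;
        for (List<Sample> subset : partitions.values()) {
            remainder += ((double) subset.size() / set.size()) * entropy(subset);
        }
        return entropy(set) - remainder;
    }
}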
2.6.3.6 Neural Networks

"Artificial neural networks (ANNs) are a non-linear, data driven, self adaptive approach,
as opposed to the traditional model based methods. They are powerful tools for modelling,
especially when the underlying data relationship is unknown. ANNs can identify and learn
correlated patterns between input data sets and corresponding target values. After training,
ANNs can be used to predict the outcome of new independent input data. ANNs imitate the
learning process of the human brain and can process problems involving non-linear and
complex data even if the data are imprecise and noisy." (School, 2003)
The most important feature of neural networks is their adaptive behaviour, or learning
from examples. They do not need any programming to solve a problem.
Characteristics:
– NNs present mapping abilities: they map input patterns to their associated output
patterns.
– NNs learn from examples; their architecture allows them to be trained on
known examples of a problem and afterwards to apply inference capabilities
on new instances.
– NNs also have generalization abilities, predicting new outcomes from past
trends.
– NNs can handle incomplete or noisy patterns very well, being fault tolerant.
A neural network consists of several layers of neurons, connected to each other, so
that they can receive input, process it and pass the output to the subsequent layer. A neuron is
a real function of the input vector $(p_1, \ldots, p_n)$. The output is obtained from the relation:
$$a_j = f(x_j) = f\left(b + \sum_{i=1}^{n} w_{ij} p_i\right)$$
where f is the transfer function, generally a sigmoid (logistic or hyperbolic tangent), that is
applied to the weighted sum of the inputs plus the bias b. The weights and the bias are both
adjustable scalar parameters of the neuron. The bias is similar to a weight, except that it has a
constant input of 1. A representation is given in Figure 15.

Figure 15 Artificial neuron with one input and bias
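
The relation above translates directly into code: the sketch below computes
a = f(b + Σ wᵢpᵢ) with a logistic sigmoid transfer function; the input values are illustrative:

// Sketch of a single artificial neuron with a logistic sigmoid transfer function.
public class Neuron {

    static double sigmoid(double x) {
        return 1.0 / (1.0 + Math.exp(-x));    // logistic transfer function
    }

    public static double output(double[] inputs, double[] weights, double bias) {
        double x = bias;                      // the bias acts as a weight with constant input 1
        for (int i = 0; i < inputs.length; i++) {
            x += weights[i] * inputs[i];      // weighted sum of the inputs
        }
        return sigmoid(x);
    }

    public static void main(String[] args) {
        double[] p = {0.5, -1.2, 3.0};
        double[] w = {0.8, 0.1, -0.4};
        System.out.println("a = " + output(p, w, 0.2));
    }
}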

2.6.3.6.1 Neural Network Architectures

We defined NNs as systems composed of interconnected processing elements
(neurons), designed after the structure of the cerebral cortex of the brain. Next we
discuss two widely used NN architectures.

Feed Forward Networks
The information flow inside a feed forward network has only one direction, from the
input layer to the final output layer (through one or more hidden layers). There are no loops;
thus the output of any layer only affects the subsequent layers. An example is presented in
Figure 16.

Figure 16 A multilayer feed forward network (School, 2003)

Recurrent Networks
These networks contain at least one feedback loop, that is, layers can have feedback
connections, and there can even exist neurons with self-feedback links (the output of the
neuron is fed back into itself as input). An example is presented in Figure 17.

Figure 17 A recurrent network (School, 2003)

2.6.3.6.2 Learning Methods
We can classify the learning algorithms into three basic types: supervised, unsupervised
and reinforced.
Supervised learning
This type of learning implies that each input pattern used to train the network is
associated with an output pattern, which is the target or desired pattern. During the
learning process, a comparison is made between the network's computed output and the
correct expected output, to determine the error. The error can then be used to change the
network parameters, which results in an improvement in performance.
Unsupervised learning
In this learning method, the target output is not presented to the network. The system
is forced to learn on its own by discovering and adapting to structural features in the input
patterns.
Reinforced learning
This type of learning involves not mentioning the expected answer but only indicating
whether the computed output is correct or incorrect. The information provided should guide the
network in the learning process. A reward is given for a correct computed answer and a
penalty for a wrong answer.
2.6.3.6.3 Types of Neural Networks
There are many types of artificial neural networks; we do not expand on each of them,
but only on the model used from the Weka library in this project. A few important classes
include:
 Feed Forward Networks (Multilayer Perceptron)
 Radial Basis Function Networks
 Kohonen Self Organizing Networks
 Recurrent Networks (Hopfield, Boltzmann)
Multilayer Perceptron
A multilayer perceptron (MLP) is a feed forward artificial network and one of the
most popular forms of neural networks. A few characteristics of the MLP are:
 it has any number of inputs
 it has one or more hidden layers with any number of units
 it uses linear combination functions in the input layers
 it generally uses sigmoid activation functions in the hidden layers
 it has any number of outputs with any activation function
 it has connections between each pair of consecutive layers in the network

An MLP with just one hidden layer can learn to approximate almost any function to
any degree of accuracy (a statistical analogy is approximating a function with nth order
polynomials). Because of this, MLPs are known as universal approximators. Usually one
hidden layer is sufficient provided there is enough data, but there can be situations
where a network with two or more hidden layers requires fewer hidden units and weights
than a network with one hidden layer, so using extra hidden layers can sometimes improve
generalization.
2.7 Evaluating Classifiers

Because this is a research project, we also aim to measure the performance
achieved by the learning algorithms. There are multiple ways to fulfill this task.
2.7.1 Table of Confusion

For supervised learning with only two possible classes, all measures of performance
can be based on four numbers: true positives tp, false positives fp, true negatives tn and false
negatives fn. They are arranged in a table called the table of confusion (or confusion matrix),
with two rows and two columns.
 True positive – the number of instances correctly classified under class C
 False positive – the number of instances incorrectly labelled as class C
 False negative – instances that should have been classified under class C, but were
not
 True negative – the number of all the remaining instances correctly classified as not
being part of class C
                  predicted
              positive   negative
truth positive   tp         fn
      negative   fp         tn
Table 5 Table of Confusion
The entries in the confusion matrix are all integers. The sum of the four entries,
tp+tn+fp+fn = n, is the number of test examples. From these entries many statistics can be
computed:
 Accuracy – what percentage of the predictions were correct
$$a = (tp + tn)/n$$
 Precision – what percentage of the positive predictions were correct
$$p = tp/(tp + fp)$$
 Recall – what percentage of the positive cases were caught
$$r = tp/(tp + fn)$$
The confusion matrix can also be extended to apply to multi-class learning
algorithms, as in our case:
                  predicted
              Topic 1  ...  Topic n
truth Topic 1
  ...
      Topic n
Table 6 Extended confusion matrix
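
The three statistics above translate directly into code; the sketch below computes them
from the four entries of a binary table of confusion (the counts used in main are illustrative):

// Sketch of the accuracy, precision and recall statistics defined above.
public class ConfusionMetrics {
    final int tp, fp, fn, tn;

    ConfusionMetrics(int tp, int fp, int fn, int tn) {
        this.tp = tp; this.fp = fp; this.fn = fn; this.tn = tn;
    }

    double accuracy()  { return (double) (tp + tn) / (tp + tn + fp + fn); } // a = (tp+tn)/n
    double precision() { return (double) tp / (tp + fp); }                  // p = tp/(tp+fp)
    double recall()    { return (double) tp / (tp + fn); }                  // r = tp/(tp+fn)

    public static void main(String[] args) {
        ConfusionMetrics m = new ConfusionMetrics(40, 10, 5, 45);
        System.out.printf("a=%.2f p=%.2f r=%.2f%n", m.accuracy(), m.precision(), m.recall());
    }
}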
2.8 State of the Art

The mentioned fields of study have arisen only in recent years, thus the present
stage of research in this branch is still in an early phase: experiments are conducted, approaches
and algorithms are being tested and results are being analysed. In Romania, the subject is
even less developed due to its high complexity and its interdisciplinary character (information
engineering, sociology, anthropology, neural networks, data mining etc.).
2.8.1 uClassify

uClassify2 is a free text classifier web service. The site offers a variety of services,
like: language detection, Web page categorization, written text gender and age recognition,
mood, sentiment, spam filtering and automatic e-mail support. All the services are offered for
English texts and depend on the account package the user signed up for (Free, Indie $99/year,
Professional $1999/year, Enterprise $5999/year).
uClassify allows users that created an account on the Web site to configure their own
classifiers, by choosing categories and training the networks. Moreover, they offer support for
Web categorization. The interface is shown in Figure 18. The available topics are: Arts,
Business, Computers, Games, Health, Home, Recreation, Science, Society and Sports, and the
user can enter either the URL or directly the text.

2 http://www.uclassify.com/

Figure 18 uClassify Web Categorization
uClassify also provides an API3 for text classification that uses URL or XML requests
and provides XML or JSON responses. In order to use the API, the user has to have an active
account, because API keys are needed. After creation, each account is provided with two
keys: one for reading responses and one for writing requests. Moreover, uClassify also offers
API wrappers for Java, PHP, Python and Ruby.

3 http://www.uclassify.com/UrlApiDocumentation.aspx

3. Designing the Application
3.1 General overview

We present here the general overview of the application from a business point of
view: how the user will interact with the system and what to expect from it.
3.1.1 Business Model

A business model captures knowledge about how the business functions and should also
serve as a basis for the information system's model, ensuring consistency and accurate
requirements being passed on to the software design. The first step in designing the
application is to model the domain from which the data will be extracted. A domain model is
a representation of real-world conceptual classes, not of software components. This
conceptual model will also be the basis for the database design. Figure 19 presents the business
model for our application.

Figure 19 Business model
 Domain – the host of the entered URL is considered to be the domain in our model.
For example, for http://nastase.wordpress.com/ the domain is the entire platform where
all the posts are presented: nastase.wordpress.com.
 Blogroll – each domain is considered to have links towards other blogs, in our model
a list of Blogroll items.
 Blogpost – this item represents a post from the blog. The domain may contain more
than one blog post.
 Location – this item refers to the suffix of the domain. For example ".ro" has
Romania as location.
 Category – represents the topics used for classification. Each domain will be assigned a
class, thus all its subcategories (posts) will belong to the same class.
 Keyword – each category will be defined by a number of keywords.
 Stopword – the list of words that need to be eliminated from the blog content.
3.2 Requirement Analysis

There is only one type of stakeholder interacting with our application: the user. The
application is accessible to anyone, anywhere. No account is needed in order to use the various
classification algorithms put at the public's disposal. Next we describe the functional and
non-functional requirements of the system.
3.2.1 Use Case Description

Figure 20 Use case diagram
Figure 20 presents the business use case diagram. The user's part in this application is
minimal, but what we offer him the possibility to do is unprecedented, especially in the
context of the Romanian language in the text classification field. The diagram shows how
simple it is for the user to interact with the system. The interface is simple, straightforward
and does not pose any problems for any user type.
3.2.1.1.1 Configure Classifier

Use case name: Configure classifier
Short description: This use case offers the user the possibility to select, from a list of
machine learning approaches, the one desired in order to classify the blog content.
Main flow of events:
1. The user requires the system to display the application page
2. The system allows the user to view the list of machine learning algorithms available
3. The user chooses the desired algorithm
Stakeholders: User – user of the topic classification application
Triggers: User requires the main application page
Preconditions: Internet connection is established and the system is connected with the database
Post conditions: User configuration is saved and ready to be implemented when requested.

3.2.1.1.2 Set URL

Use case name: Set URL
Short description: This use case offers the user the possibility to enter the URL of the
blog he wants to classify.
Main flow of events:
1. The user enters the URL
Alternative paths:
1. The URL entered at step 1 may be malformed, so the configuration has to be repeated
Stakeholders: User – user of the topic classification application
Triggers: User requires the main application page
Preconditions: Internet connection is established and the system is connected with the database
Post conditions: User configuration is saved and ready to be implemented when requested.
3.2.1.1.3 Get URL Classification

Use case name: Get URL Classification
Short description: This use case offers the user the possibility to obtain the classification
of the chosen blog.
Main flow of events:
1. The user starts the classification process
Alternative paths:
1. Various errors may appear during the classification process (unable to connect to
the URL or to retrieve the content, etc.), so the user may have to return to the
previous use case to reselect a URL.
Stakeholders: User – user of the topic classification application
Triggers: User requires the application to start the classification
Preconditions: Internet connection is established and the system is connected with the database
Post conditions: The system returns the topic to which the blog belongs according to the
classification algorithm.
3.2.2 System Sequence Diagram

A System Sequence Diagram (SSD) is a visualization of the interactions implied
by the scenarios of the above presented use cases.

Figure 21 SSD

3.2.3 Functional Requirements

Functional requirements define the expected behaviour of the system, envisioning
the outputs that should be produced in response to certain inputs.

Requirement                                                              Grade
1. The system must allow users to add URLs                                  9
2. The system must allow users to select the desired classifier             9
3. The system must classify the entered URL with the chosen classifier    10
4. The system must allow users to visualize the results of the
   classification                                                           9
Table 7 Functional Requirements

3.2.4 Non-functional requirements

 Documentation – due to the nature of the application, we consider that documenting
the process behind the application is very important for educational and research
purposes. The system will offer a full description of its capabilities and how it was
built.
 Portability – the application is available to anyone with a browser and Internet
access.
 Performance – being a research project, the performance and accuracy of the system
and of its algorithms are very important. Extensive testing takes place in order to
properly adjust and document the whole project.
 Usability – the usability of the system is extremely high; the UI is very simple and
straightforward: just enter the URL, choose the machine learning algorithm and start
the classification.

4. Application Design and Implementation

In the next sections we present a detailed design of the application and how it was
implemented. We have chosen a bottom-up modular approach for our description. Figure 22
presents an overview of the whole application.

Figure 22 Application overview
Further on we describe each tier presented in Figure 22, giving a detailed design and
implementation and accentuating the most important tier of the application: the business tier.

4.1 Information System Tier

Figure 23 Entity Relationship Diagram
Figure 23 presents the ERD (Entity Relationship Diagram) for TopicDB, the
application's database in MySQL. We will thoroughly describe each table from the database.
 The Domain table is the central figure of our database. We store here all the
information about the site: the domain name, the activation date (if any), the
robots (if any), and the description (taken from the META tags of the HTML
page). Additionally, the domain has a location (depending on the suffix of the
address) and a category, either known from the beginning, as in the case of the
domain training sets, or waiting to be classified.

 The Location table is a small table containing all the possible suffixes of web pages,
plus a short description of them. The list was acquired by crawling the
resources from Computer Hope4. For this purpose a separate application was built,
but since it is not the focus of this project we do not present it here.

4 http://www.computerhope.com/jargon/num/domains.htm

 The Blogroll table offers the application the ability to extend towards Web structure
mining and the construction of social graphs between the entities. Each domain
contains multiple blogroll items, representing links towards external Web pages
recommended by the owner of the blog. The table holds the address of the Web
page linked to the current blog; if the page is also in the database, the id of the
domain is mentioned in Id_destination, and the TypeBlogRoll variable should hold
information regarding the activity of the destination (whether it is active or not).

 The Blogpost table is probably the most important one in the database. Each domain
can contain up to 5 Blogpost items (that was the depth limit set for the crawler).
Each post is described by the page address, the date when it was posted (if
available to the crawler), the content of the post, and whether it was processed or
not. The CRC (Cyclic Redundancy Check) is a verification code for the page,
computed based on the raw content, needed to see if the page has changed in any
way and should be re-crawled. The title and the description, if any, are acquired
from the META tags of the HTML page.

 The Category table describes each topic, by name and id. As one can see from the
ERD, each domain is attributed a category. Figure 24 presents the categories
chosen for the application.

Figure 24 Categories
 The Keyword table presents a set of words describing each category. The keywords
have a frequency and a weight (computed using the TF-IDF measure presented in
section 2.6.2.3). We present a set of 20 words for each topic, from a total of about
5000 words (we note that in the process we dropped the Photography
category, because we focused on text resources and not image resources, and that
the Personal Journal is an extensive category, a hybrid of all the others). In chapter
4.2.4 we present a more detailed report on the keyword – category relation.

Table 8 Activism Table 9 Business and finance

Table 10 Art Table 11 Fashion

Table 12 Gastronomy Table 13 Literature

Table 14 Politics Table 15 Religion and Spirituality

Table 16 Travel
 The Stopword table holds the list of Romanian stopwords. Each entry is described
by an id and the word itself. Some examples were already presented in chapter
2.6.2.1.2, Figure 11.
4.2 Business Tier

The Business Tier is the heart of the application. It contains all the modules that store
elements in the database, crawl the Web pages and apply the classification algorithms. We
will detail each module in the following chapters, but first we state the general principles
used as guidelines when designing the project.

Figure 25 Package diagram of the business tier

4.2.1 Design principles
4.2.1.1 Dependency Inversion Principle

The Dependency Inversion Principle states that we should decouple high level modules
from low level modules, introducing an abstraction layer between the high level classes and
the low level classes. Furthermore, it inverts the dependency: instead of writing our
abstractions based on details, we should write the details based on abstractions. In order to
implement this principle we used dependency frameworks in the form of various design
patterns: Factory, Controller, Chain of Responsibility.
4.2.1.2 Low coupling

The coupling between components should be minimized by lowering the number of items
passed between components: pass only strictly needed data (data coupling), do not pass
whole structures when only a small part of them is actually used (stamp coupling), and avoid
the use of control flags (control coupling) and global data (common coupling).
4.2.1.3 High Cohesion

One should always try to maximize the cohesion inside a component, meaning all the
elements inside the component should be contributing to the same task or goal. The
component should be built for one single purpose. If this principle is not respected, then the
component in question must be split into parts that perform one task each.
4.2.1.4 Design Patterns

Whenever dealing with a design problem, one should try to find existing design patterns as
solutions for the application. The use of design patterns will be accentuated in the following
chapters when presenting the class diagrams for each module.
4.2.1.5 Factoring

The use of factoring is a very important principle. One should always avoid
duplicating functionalities inside components: cut out the common part and put it into a
reusable component. In the following chapters we will present many classes with this purpose.
4.2.2 Persistence Layer

This layer deals with persisting (storing and retrieving) data from the database. For
this purpose we use JPA (Java Persistence API), the Java framework for managing relational
data. All the entities are grouped in the "model" package, while the controllers are in the
"controller" package.

Figure 26 Class diagram for the "model" package
We already presented each element from Figure 26 in chapter 4.1; the entities are
basically mapped classes of the tables from the database, each column being represented by a
corresponding attribute. Next we present the "controller" package, which contains JPA
controllers for each entity, plus a main controller that acts as a façade for all the others
(Façade pattern).

Figure 27 Class diagram for the "controller" package

Observation: The class ComputeCRC is part of the "functions" package and is a result
of the Factoring and High Cohesion principles. Actually, all the classes from that package are
built with the same considerations. A detailed view of the main controller:

Figure 28 MainController class
Needless to say, in this package we use the Controller pattern and, for the
MainController class, the Singleton pattern. We need only one instance of the
MainController object throughout the whole system, because we are basically talking about
the connection to the database. The Façade pattern mentioned earlier is necessary here in
order to keep a low coupling between the components and to avoid obliging them to know
and use such a complex system. The MainController deals with any operation of storing into
or retrieving from the database; its methods are structured according to the entity they refer to.
4.2.3 Crawler Module

The Crawler Module extends over two packages: "crawler" and "extractor". The latter
also uses function classes from the "functions" package. Figure 29 presents the class
diagram for this module. The crawler uses the LinkRetrieval and RobotParser classes in order
to enrich its crawling list and to verify access to the URLs before crawling them. The
RobotParser retrieves the "robots.txt" file from the root of the domain and memorizes the
disallowed locations of the site. The crawler respects the banned areas when retrieving
content. The data extraction part is done in the "extractor" package.

Figure 29 Crawler Module class diagram
For the data extraction operation the Crawler calls on the DataExtractor class, a Proxy
for the extraction module. The DataExtractor instantiates the object based on the request of
the Crawler. The AbstractExtractor is in its turn an AbstractFactory.
The extraction operation is done with the help of the Jsoup library. More information
about the library can be found in chapter 5.1. Each class extracts the data from its
corresponding HTML tags and classes. The extraction process is not an easy task, due to the
wide variety of HTML pages, and due to the fact that the class attributes of the tags are never
named in the same way.
BlogpostExtractor looks inside all the <div> tags that have a class attribute containing
"content" or "entry" in their name, but not "comment" or "widget", and extracts the content.
The selection criterion proved to work on most of the Web sites, especially if they belong to
the WordPress or Blogspot platforms.

BlogrollExtractor uses the following query on the HTML page: "ul[class=xoxo
blogroll] li a[href]". Thus it selects all the list items that contain anchor elements and that are
inside unordered lists having the class attribute equal to "xoxo blogroll". Afterwards the
content of each selected item is parsed and verified, as in the sketch below.
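
As an illustration, the snippet below runs the quoted CSS query with Jsoup (the connect/
select calls shown are the Jsoup library's API); the URL is illustrative and the real extractor
performs additional checks:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

// Sketch of a blogroll extraction with Jsoup, using the query quoted above.
public class BlogrollSketch {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("http://nastase.wordpress.com/").get();
        // Anchors inside list items of unordered lists with class "xoxo blogroll"
        for (Element link : doc.select("ul[class=xoxo blogroll] li a[href]")) {
            System.out.println(link.text() + " -> " + link.attr("href"));
        }
    }
}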
For a better understanding of how the crawling process works, Figure 30 presents an
activity diagram of this task.

Figure 30 Crawler activity diagram
The crawling process starts from a given link and a maximum depth. For the purpose
of the project we considered a depth of 6 to be sufficient (domain + 5 blogposts). The crawler
begins by extracting the domain and blogroll from the main page and afterwards getting a full
list of banned paths and also internal links to crawl for the site. One by one the addresses are
crawled and the content of the posts is sent to the database through the corresponding
extractor class. This process is presented in detail in the sequence diagram from Figure 31.

Figure 31 Crawler Sequence Diagram

The LinkRetrieval class parses each link and, before sending it to the Crawler, makes
sure that the links are not empty, not page anchors, not "mailto" links, not shares towards
SNSs and not Javascript links. Moreover, because we noticed that links towards posts contain
the date of the post, another condition we imposed (especially for "Blogspot" and
"WordPress" blogs) is that the link should contain a date described by the regular expression
((19|20)\\d\\d)/(0?[1-9]|1[012]) (year/month), as illustrated below.
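
For illustration, the sketch below applies that regular expression, written as a Java string
literal, to candidate links; the class name and example URLs are illustrative:

import java.util.regex.Pattern;

// Sketch of the year/month link filter described above.
public class LinkDateFilter {
    private static final Pattern POST_DATE =
            Pattern.compile("((19|20)\\d\\d)/(0?[1-9]|1[012])");

    public static boolean looksLikePostLink(String url) {
        // Accept only links whose path contains a year/month component.
        return POST_DATE.matcher(url).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikePostLink("http://example.wordpress.com/2013/06/post")); // true
        System.out.println(looksLikePostLink("http://example.wordpress.com/about"));        // false
    }
}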

The RobotParser class retrieves the "robots.txt" file from the root of the site and
searches for the disallowed paths within the site, specified under the "*" User-agent. When
the RobotParser receives a link to check from the Crawler, it loops through the disallowed
paths and checks the permissions. Here is a sample "robots.txt" file from the page
http://eusimuntele.blogspot.ro/:
User-agent: Mediapartners-Google
Disallow:

User-agent: *
Disallow: /search
Allow: /

Sitemap:
http://eusimuntele.blogspot.com/feeds/posts/default?orderby=UPDATED

The first clause applies to the Mediapartners-Google bot (used for AdSense ad-serving
purposes); its empty Disallow field allows that bot to crawl everything, so it does not present
any interest for our case. The next clause states that no crawler is allowed to access the
/search path of the Web site. The RobotParser class adds this to its list and makes sure the
Crawler respects this prohibition.
4.2.4 Classification Module

The Classification Module is composed of two classifiers. The first one is a custom-
built text classifier using the TF-IDF technique (chapter 2.6.2.3) and the second one is a
multi-option classifier built using the Weka library (chapter 5.2). We only use the Classifier
class from Weka, because the feature selection was done by our own classes.
4.2.4.1 TF-IDF Classifier

The TF-IDF weighting algorithm originates from information retrieval, where it was used
to return a ranking of documents for a given query, without providing a threshold to define
a decision rule for class membership. So in order to use this method for text classification it
needs to be adapted.
4.2.4.1.1 Obtaining the Keyword Lists

For each topic ("Activism", "Business and finance", "Art", "Travel", "Gastronomy",
"Literature", "Fashion", "Politics", "Religion and spirituality") we extracted from the database
about 10 domains, with 5 blogposts each.
The DocumentConstructor class handles the content and transforms it into a
Document object. The Document parses the content using the FeatureSelection class. The
Document is then taken by the TermFrequency class, along with a DocumentFrequency
object, which computes and stores a score for each word.

Diploma Thesis, Roxana -Teodora R ĂDULESCU , Faculty of Engineering in Foreign Languages, UPB, 2013

56 In order to obtain a n acceptable keyword list we manually surfed through the terms
and eliminated any remaining email addresses, links . We cut each keyword list at about 50 0-
600 words, and then introduce them into the database in the ―keyword ‖ table. In the end we
obtained a small version of the Freebase project5, a community -curated database of wel l-
known places, things and people.
Our FeatureSelection class unites all the changes that have to be made on the input
text and applies them , in the performFeatureSelection() method . The text is tokenized,
stopwords are eliminated and t he remaining words are stemmed. The RomanianStemmer
class is an open source stemmer in java created by the Snowball project (see chapter
2.6.2.1.1 ). A detailed description of the stemming algorithm for the Romanian languag e can
be found on the Snowball project Web site6.

Figure 32 Feature Selection and Pruning
Figure 32 presents the class diagram for the stemming and stoplisting implementation.
Each Document object that contains the input of the blog we want to classify uses the
FeatureSelection class in order to parse it. This makes the classification process more
accurate and easier, because we eliminate a lot of words and shorten the content.

5 http://www.freebase.com/
6 http://snowball.tartarus.org/algorithms/romanian/stemmer.html

4.2.4.1.2 Algorithm description

Figure 33 TF-IDF classifier class diagram

Figure 33 presents the class diagram of the TF-IDF Classifier. The TermFrequency
and DocumentFrequency classes, used in order to compute the weights of the keywords from
the training set, are no longer needed for the classification process. It is a simple and
straightforward approach, once the keyword lists were built. A small sample of the keyword
list was presented in chapter 4.1. Any input goes through the DocumentConstructor –
Document – FeatureSelection chain, and then each token is searched in the database. The
result may be a list of keywords under multiple topics. We retrieve the weight for each word
and add the value to a score array, under the correct category. The class membership is
defined by the maximum value of the array. We present next the sequence diagram of the
process.

Figure 34 TF-IDF classifier sequence diagram

4.2.4.2 Text Classifier using Weka

Figure 35 Weka-based classifier class diagram
Because we offer a variety of classification techniques through the Weka library, we
needed two more classes in order to construct a Factory pattern, with the ClassifierFactory
class, and a Controller for the whole ensemble, the ClassifierController class. The
ClassifierFactory receives from the controller the String containing the type of classifier
chosen by the user, the factory creates the corresponding classifier and then sets it on the
TextClassifier class. This class uses the available training sets, according to the set classifier,
and offers a score vector representing the distribution over the topic set ("Activism",
"Business and finance", "Art", "Travel", "Gastronomy", "Literature", "Fashion", "Personal
journal", "Politics", "Religion and spirituality").
4.2.4.2.1 Available Classifiers
 Naive Bayes – described in chapter 2.6.3.1.
 Naive Bayes Updateable – this is the updateable version of Naive Bayes. This
classifier uses a default precision of 0.1 for numeric attributes when
buildClassifier is called with zero training instances.
 Complement Naive Bayes – the fastest and most efficient version from the
Bayesian classifier family. It improves accuracy by using data from all
categories except the category it is focused on. Moreover, it tackles the
non-uniformity of the data distribution.
 K Nearest Neighbour – presented in chapter 2.6.3.2.
 Decision tree J48 – class for generating an unpruned or a pruned C4.5 decision
tree, described in chapter 2.6.3.5.
 Support Vector Machines using Sequential Minimal Optimization – presented in
chapter 2.6.3.3.
 Multilayer Perceptron – the neural network described in chapter 2.6.3.6.3.
4.2.4.2.2 Building the Training Set
In order to train the classifiers we build several ―.arff‖ files, the default format for
Weka training sets. The largest training set contains all the keywords and their corresponding
weights from chapter 4.2.4.1.1 . A distribution of the terms over the topics is presented in
Figure 3 6.

Figure 36 Training set 1 – Word distribution

The topics are represented in the chart from left to right, starting with “Activism”.
Notice that no training instances were added for the “Personal journal” topic, because we
consider that it can comprise any other topic not mentioned here; thus an instance that does
not fit our selected topics is bound to belong to this remaining category. This training set is
used for most of the presented classifiers. For the J48 decision tree classifier we restricted the
training set somewhat: after some tests we found that an optimal training set contains
about 100 words per category. For the MLP classifier we had to further restrict the training
set to 50 words per category in order to keep the training time acceptable.
Next we take a look at the “.arff” file. We constructed the training set with two
attributes: the word and the class. Each data line uses Weka's sparse format: for example,
{0 photograph,1 art} sets attribute 0 (the text) to “photograph” and attribute 1 (the topic) to
“art”, while the trailing value in braces is the weight associated with the word.
@relation TextClassificationProblem
@attribute text string
@attribute topic {activism,'business and finance', art, travel, gastronomy,
literature, fashion, 'personal journal', politics, 'religion and
spirituality'}
@data

{0 photograph,1 art},{7.473194}
{0 expoz,1 art},{7.317021}
{0 desen,1 art},{7.267049}

{0 zapad,1 travel},{10.276278}
{0 echip,1 travel},{10.185519}
{0 cetat,1 travel},{10.061025}
{0 baraj,1 travel},{10.034141}

{0 verdetur,1 gastronomy},{11.161501}
{0 omlet,1 gastronomy},{11.153796}
{0 piept,1 gastronomy},{11.136763}

{0 politician,1 politics},{7.508693}
{0 buget,1 politics},{7.473194}
{0 basesc,1 politics},{7.316656}

{0 iisus,1 'religion and spirituality'},{17.731633}
{0 invier,1 'religion and spirituality'},{17.379408}

4.2.4.2.3 Algorithm Description

The classification is done in the TextClassifier class. After the ClassifierFactory sets
the classifier (the available choices are described in chapter 4.2.4.2.1), the TextClassifier first
uploads the training instances, meaning that it reads the “.arff” files described in the
previous section and creates the instances for the training set of the class. The next step is to
build the classifier and train it, if this has not been done yet. To do so we apply a
StringToWordVector filter to the data, which converts the string attributes into a set of
attributes representing word occurrence information from the text contained in the strings.
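
A condensed sketch of this training step follows; the variable and method names are illustrative, and the project's full methods are listed in the Annexes:

import weka.classifiers.Classifier;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.StringToWordVector;

// Condensed sketch of the training step (full methods in the Annexes).
static Classifier train(Classifier classifier, Instances trainingData) throws Exception {
    StringToWordVector filter = new StringToWordVector();
    filter.setInputFormat(trainingData);              // declare the input format
    Instances filteredData = Filter.useFilter(trainingData, filter); // words -> attributes
    classifier.buildClassifier(filteredData);         // train the chosen Weka classifier
    return classifier;
}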

Figure 37 Weka-based classifier sequence diagram

4.3 Web Tier

The implementation of this tier was made with JavaServer Faces technology. JSF is a
server-side component framework for building Java technology-based web applications. One
of the greatest advantages of JavaServer Faces technology is that it offers a clean separation
between behavior and presentation for web applications. Together with JSF we also used the
PrimeFaces library, a lightweight open-source component suite for JavaServer Faces 2.0.
The first task was to create the JSF Managed Bean components. Because the user–
system interaction is so simple, we only need one Managed Bean: ClassifierBean. We present
the methods' descriptions from the Javadoc in Figure 38.

Figure 38 ClassifierBean method description
The ClassifierBean class implements the java.io.Serializable interface. It contains the list
of possible classifiers, the link the user wants to classify, the result of the classification and a
possible image generated from the score vector, displaying a pie chart of the topic ranking in
the parsed blog. After the user enters the URL and chooses the learning algorithm, the
ClassifierBean validates the URL, removes the “www” prefix, then calls the appropriate
classifier controller, depending on the user's decision, and after the labeling is done, updates
the results. There are two classifier controllers: one for the TF-IDF classifier, another for the
Weka classifiers. The latter also needs a switch-case in order to establish the code for the
requested classifier.
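
A skeleton of such a bean could look as follows. This is a minimal sketch using JSF 2.0 annotations; the field names, the URL handling and the two controller classes are illustrative rather than the project's literal code:

import java.io.Serializable;
import javax.faces.bean.ManagedBean;
import javax.faces.bean.SessionScoped;

// Minimal sketch of the managed bean (names illustrative).
@ManagedBean
@SessionScoped
public class ClassifierBean implements Serializable {

    private String link;            // URL entered by the user
    private String classifierType;  // choice from the dropdown, e.g. "TF-IDF", "NB"
    private String result;          // label(s) produced by the classification

    public void classify() {
        // Strip the protocol and the "www" prefix before crawling.
        String url = link.replaceFirst("^(https?://)?(www\\.)?", "");
        if ("TF-IDF".equals(classifierType)) {
            result = new TfIdfClassifierController().classify(url);          // hypothetical
        } else {
            result = new ClassifierController().classify(url, classifierType); // hypothetical
        }
    }

    public String getLink() { return link; }
    public void setLink(String link) { this.link = link; }
    public String getClassifierType() { return classifierType; }
    public void setClassifierType(String t) { this.classifierType = t; }
    public String getResult() { return result; }
}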
The index.xhtml page is a basic JSF page. It contains an input text and a dropdown
element, along with a button. The libraries used:

<html xmlns="http://www.w3.org/1999/xhtml"
      xmlns:h="http://java.sun.com/jsf/html"
      xmlns:f="http://java.sun.com/jsf/core"
      xmlns:c="http://java.sun.com/jsp/jstl/core"
      xmlns:p="http://primefaces.org/ui">

Figure 39 index.xhtml
The page is also designed to display various statistics, in the form of pie charts,
generated with the JFreeChart library.

5. Libraries
5.1 Jsoup

Jsoup – Java HTML Parser
Jsoup is a Java library for working with real-world HTML. It provides a very
convenient API for extracting and manipulating data, using the best of DOM, CSS, and
jQuery-like methods.
Jsoup implements the WHATWG HTML5 specification7, and parses HTML to the
same DOM as modern browsers do.
• scrape and parse HTML from a URL, file, or string
• find and extract data, using DOM traversal or CSS selectors
• manipulate the HTML elements, attributes, and text
• clean user-submitted content against a safe white-list, to prevent XSS attacks
• output tidy HTML
Jsoup is designed to deal with all varieties of HTML found in the wild, from pristine
and validating to invalid tag-soup; Jsoup will create a sensible parse tree.
Example – Fetch the Wikipedia homepage, parse it to a DOM, and select the
headlines from the In the news section into a list of Elements (online sample8):
Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
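
In the same spirit, the sketch below shows how a crawler might extract the text of a blog post with Jsoup. The URL and the CSS selectors are purely illustrative, since every blog template differs:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class BlogScrapeDemo {
    public static void main(String[] args) throws Exception {
        // Illustrative only: the URL and the selectors depend on each blog's template.
        Document page = Jsoup.connect("http://example-blog.ro/some-post").get();
        String title = page.select("h1.post-title").text();    // post title
        String text  = page.select("div.post-content").text(); // plain post text
        System.out.println(title + " -> " + text.length() + " characters");
    }
}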
Open source – Jsoup is an open source project distributed under the liberal MIT
license9. The source code is available at GitHub10.
Getting started
Download the Jsoup jar (version 1.7.2)11.
Read the cookbook introduction12

Source: http://jsoup.org/
5.2 Weka
WEKA – Waikato Environment for Knowledge Analysis

7 http://www.whatwg.org/specs/web-apps/current-work/multipage/
8 http://try.jsoup.org/~LGB7rk_atM2roavV0d-czMt3J_g
9 http://jsoup.org/license
10 https://github.com/jhy/jsoup/
11 http://jsoup.org/download
12 http://jsoup.org/cookbook/introduction/

Project Objectives
• make ML techniques generally available;
• apply them to practical problems that matter to New Zealand industry;
• develop new machine learning algorithms and give them to the world;
• contribute to a theoretical framework for the field.
Weka is a collection of machine learning algorithms for data mining tasks. The algorithms
can either be applied directly to a dataset or called from your own Java code. Weka contains
tools for data pre-processing, classification, regression, clustering, association rules, and
visualization. It is also well-suited for developing new machine learning schemes.
Weka is open source software issued under the GNU General Public License.
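
To illustrate “called from your own Java code”, here is a minimal, self-contained sketch of the kind of evaluation that produces a confusion matrix. The file name is a placeholder, and a run on our string-valued training sets would additionally need the StringToWordVector filter from chapter 4.2.4.2.3:

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.Random;
import weka.classifiers.Evaluation;
import weka.classifiers.bayes.NaiveBayes;
import weka.core.Instances;

public class WekaDemo {
    public static void main(String[] args) throws Exception {
        // Load a training set in ARFF format (file name is a placeholder);
        // assumes nominal/numeric attributes, not raw strings.
        Instances data = new Instances(new BufferedReader(new FileReader("trainingSet.arff")));
        data.setClassIndex(data.numAttributes() - 1);  // last attribute is the class

        // Ten-fold cross-validation of a Naive Bayes classifier.
        NaiveBayes nb = new NaiveBayes();
        Evaluation eval = new Evaluation(data);
        eval.crossValidateModel(nb, data, 10, new Random(1));
        System.out.println(eval.toSummaryString());
        System.out.println(eval.toMatrixString());     // confusion matrix
    }
}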
Main features of Weka:
• 49 data preprocessing tools
• 76 classification/regression algorithms
• 8 clustering algorithms
• 15 attribute/subset evaluators + 10 search algorithms for feature selection
• 3 algorithms for finding association rules
• 3 graphical user interfaces
– “The Explorer” (exploratory data analysis)
– “The Experimenter” (experimental environment)
– “The KnowledgeFlow” (new process-model-inspired interface)
Source: www.cs.waikato.ac.nz/ml/index.html

6. Conclusions

6.1 Performance Analysis

For the performance analysis of our classifiers we use the techniques described in
section 2.7. We built the confusion matrix for each classifier and, for each topic under each
learning algorithm, we computed the accuracy, precision and recall. We present the results
together with some graphic interpretations of the data.
First, the distribution of the domains over the chosen topics is represented in Figure 40.
From this point on we eliminate from our discussion Photography (image-oriented content
blogs) and Personal Journal (a hybrid of the chosen topics). The total number of domains
in the training set is 219.

Figure 40 Distribution of the domains over the chosen topics
Next, we present all the computed confusion matrices for the following classifiers:
TF-IDF, NB (NBU is identical to NB), CNB, KNN and SMO. In each matrix the row gives
the actual topic and the column the predicted one; column headings are abbreviated
(Act. = Activism, B&F = Business and finance, Trav. = Travel, Gastr. = Gastronomy,
Lit. = Literature, Fash. = Fashion, Pol. = Politics, R&S = Religion and spirituality).
TF-IDF                      Act.  B&F  Art  Trav.  Gastr.  Lit.  Fash.  Pol.  R&S
Activism                     14    0    0     0      1      0     0      5     2
Business and finance          0   15    0     1      0      0     3      1     0
Art                           0    0    6     1      0      1     3      1     2
Travel                        0    0    0     9      0      0     0      1     0
Gastronomy                    0    0    0     1     34      0     0      0     0
Literature                    0    0    0     3      0     21     0      1     8
Fashion                       0    4    0     0      0      0    20      1     0
Politics                      1    0    0     1      0      0     0     16     1
Religion and spirituality     0    0    1     0      0      0     1      3    36
Table 17
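
As a worked check on Table 17 (TF-IDF, topic Activism, evaluated one-vs-rest over all 219 domains): TP = 14, FP = 1 (the Politics blog predicted as Activism), FN = 8, and TN = 219 − 14 − 1 − 8 = 196. These values reproduce the Activism column of Table 22:

\[ \text{precision} = \frac{TP}{TP+FP} = \frac{14}{15} \approx 0.933, \qquad \text{recall} = \frac{TP}{TP+FN} = \frac{14}{22} \approx 0.636, \qquad \text{accuracy} = \frac{TP+TN}{219} = \frac{210}{219} \approx 0.959 \]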

KNN                         Act.  B&F  Art  Trav.  Gastr.  Lit.  Fash.  Pol.  R&S
Activism                     14    0    0     0      1      0     1      5     1
Business and finance          0   16    0     1      0      0     3      0     0
Art                           0    1    7     2      1      0     1      1     1
Travel                        0    0    0     9      0      0     0      1     0
Gastronomy                    0    0    0     0     35      0     0      0     0
Literature                    2    0    0     3      0     18     0      1     9
Fashion                       1    6    0     0      0      0    18      0     0
Politics                      2    0    0     1      1      0     0     15     0
Religion and spirituality     4    0    0     0      0      2     0      3    32
Table 18

NB/NBU                      Act.  B&F  Art  Trav.  Gastr.  Lit.  Fash.  Pol.  R&S
Activism                     13    0    0     0      2      0     0      6     1
Business and finance          0    7    0     0      1      0    11      0     1
Art                           2    3    5     0      2      0     0      1     1
Travel                        0    1    0     9      0      0     0      0     0
Gastronomy                    0    0    0     0     35      0     0      0     0
Literature                    2    0    0     1      4     16     0      0    10
Fashion                       1    3    0     0      0      0    21      0     0
Politics                      1    0    0     1      1      0     0     16     0
Religion and spirituality     4    0    0     0      2      1     0      3    31
Table 19

CNB                         Act.  B&F  Art  Trav.  Gastr.  Lit.  Fash.  Pol.  R&S
Activism                     16    0    0     0      0      0     1      4     1
Business and finance          0   14    0     0      0      1     4      1     0
Art                           0    0    9     3      0      0     1      1     0
Travel                        0    0    1     8      0      0     0      1     0
Gastronomy                    0    0    0     0     34      1     0      0     0
Literature                    2    0    1     3      0     20     0      0     7
Fashion                       1    4    0     0      0      0    20      0     0
Politics                      2    0    0     1      0      1     0     15     0
Religion and spirituality     2    0    0     2      0      1     0      5    31
Table 20

SMO                         Act.  B&F  Art  Trav.  Gastr.  Lit.  Fash.  Pol.  R&S
Activism                     12    0    1     0      0      1     1      7     0
Business and finance          0   15    0     1      0      1     3      0     0
Art                           1    0   10     1      0      0     1      1     0
Travel                        0    0    1     7      0      0     0      2     0
Gastronomy                    0    0    0     0     34      0     1      0     0
Literature                    0    1    2     2      0     22     0      1     5
Fashion                       1    4    1     0      0      0    19      0     0
Politics                      2    0    0     1      0      1     0     15     0
Religion and spirituality     2    0    3     0      0      2     0      6    28
Table 21

Next we gather all the measures and group them by the topic they describe, so that
each table contains, for one topic, all the measurements of every classifier.
Activism    KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.922  0.913  0.941  0.959   0.927
Precision   0.609  0.565  0.696  0.933   0.667
Recall      0.636  0.591  0.727  0.636   0.545
Table 22

B&F         KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.950  0.909  0.954  0.959   0.954
Precision   0.696  0.500  0.778  0.789   0.750
Recall      0.800  0.350  0.700  0.750   0.750
Table 23

Art         KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.968  0.959  0.968  0.959   0.950
Precision   1.000  1.000  0.818  0.857   0.556
Recall      0.500  0.357  0.643  0.429   0.769
Table 24

Travel      KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.963  0.986  0.950  0.963   0.963
Precision   0.563  0.818  0.471  0.563   0.583
Recall      0.900  0.900  0.800  0.900   0.700
Table 25

Gastronomy  KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.986  0.945  0.995  0.991   0.995
Precision   0.921  0.745  1.000  0.971   1.000
Recall      1.000  1.000  0.971  0.971   0.971
Table 26

Literature  KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.922  0.918  0.936  0.941   0.927
Precision   0.900  0.941  0.606  0.955   0.815
Recall      0.545  0.485  0.952  0.636   0.667
Table 27

Fashion     KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.945  0.932  0.950  0.945   0.945
Precision   0.783  0.656  0.769  0.741   0.760
Recall      0.720  0.840  0.800  0.800   0.760
Table 28

Politics    KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.932  0.941  0.927  0.927   0.904
Precision   0.577  0.615  0.556  0.552   0.469
Recall      0.789  0.842  0.789  0.842   0.789
Table 29

R&S         KNN    NB     CNB    TF-IDF  SMO
Accuracy    0.909  0.895  0.918  0.918   0.918
Precision   0.744  0.705  0.795  0.735   0.848
Recall      0.780  0.756  0.756  0.878   0.683
Table 30

In order to better visualize the results presented above we include a set of charts
representing the accuracy of each of the discussed classifiers for each topic. Then we also
compute an overall accuracy for each classifier, as the mean of its nine per-topic accuracies;
all the values were above 93%.

Figure 41

Figure 42

Figure 43

Figure 44

Figure 45

Figure 46

Figure 47

Figure 48

Figure 49
            KNN    NB     CNB    TF-IDF  SMO
Accuracy    94.4%  93.3%  94.9%  95.1%   94.3%
Table 31 Overall accuracy of the classifiers

Figure 50 Overall accuracy of each classifier

The ranking of our classifiers was the following:
1) TF-IDF
2) CNB
3) KNN
4) SMO
5) NB/NBU
Another interesting set of charts presents the accuracy achieved on each topic by each
classifier. We notice that while Gastronomy is the indisputable leader, the weakest topic is
not so easy to determine, as each learning algorithm behaves quite differently on the harder
topics. We present the charts in order of the classifier ranking.

Figure 51

Figure 52

Figure 53

Figure 54

Figure 55

6.2 Future Development and Improvements

The present work is an attempt to lay the foundations of a Romanian text
categorization instrument. Because this is mainly a research project, we seeded multiple
directions for future development.
To begin with, the categories we have selected are surely not enough to draw a full-
scale map of the Romanian blogosphere. The training set can certainly be expanded with more
topics, such as music, games, IT, etc. For this we can either offer users the possibility to add
keywords to our database, or to classify blogs. The list of Romanian blogs is available at
http://www.zelist.ro/, which was the source for our materials as well. We also mention here that the
automatic keyword selection was done after we gathered a list of manually labeled blogs. This
labor was done with the help of FILS students, coordinated by professor Bujor Păvăloiu.
Furthermore, we can distinguish two types of sources on the Web. First, there are
the blogs themselves and their meta-information (blog username, topics, posts' length,
posts' frequency, dates, etc.), which was already targeted by the current project; secondly,
there is the social network generated by the dynamic linking between blogs. These views
enable the study of the blogosphere along various dimensions. We included in the project the
ability to extract a blogroll from each analysed blog. This makes it possible to start a project in
web structure mining, using graph theory to construct a social graph of the blogosphere,
or of parts of it. Such tools can be very useful not only for social scientists, but
for the bloggers as well: they can visualize their links, the influencers of a network
(nodes that create a dense area around them), interesting topics, etc.
Besides the new research directions the project opens, we can also discuss
improvements to the work itself. Putting aside the necessary extension of the topic
coverage, two very important aspects remain. The crawler module can
definitely be extended and adapted, to make sure it can successfully extract data from any
blog. There is a potential problem here: generalizing the module can lead to noisier
input, a risk that should be mitigated in any way possible. The second aspect that should be
considered is the improvement of the classifier. For now, the research project took into
consideration several machine learning techniques and studied their performance. In
the end it would be better to choose the best one and improve and personalize it as much as
possible.
All in all, the path uncovered by the present work is only at its beginning, and it does
not only extend straight ahead, but also offers the possibility to branch into other directions.
We consider the analysis of the Romanian social media to be a very interesting and
resourceful research area, with many facts waiting to be discovered and modeled.


Glossary

A
AI – Artificial Intelligence
ANN – Artificial Neural Network

B
Botnet – collection of internet-connected programs communicating with other similar
programs in order to perform tasks. The word stems from the words robot and network

C
CB-SN – Content Based Social Network
CRC – Cyclic Redundancy Check, error-detecting code used for storage devices to detect
accidental changes to raw data
CNB – Complement Naïve Bayes, machine learning technique

D
DDoS attack – distributed denial-of-service attack, an attempt to make a machine or
network resource unavailable to its intended users
Dendrogram – tree diagram used to present the arrangement of the clusters produced by
hierarchical clustering
DM – Data Mining

E
ERD – Entity Relationship Diagram

H
HTML – HyperText Markup Language
HTTP – HyperText Transfer Protocol
Hyperlink – Structural component that connects the web page to a different location

I
IE – Information Extraction
IR – Information Retrieval

J
JPA – Java Persistence API
JSF – JavaServer Faces, a server-side component framework for building Java
technology-based web applications
Jsoup – Java library for working with real-world HTML

K
KDD – Knowledge Discovery in Databases
KDT – Knowledge Discovery in Text

KNN – K Nearest Neighbor

M
Malware – short for malicious software, used to disrupt computer operation and gather
sensitive information
Metadata – data about data
ML – Machine Learning

N
NLP – Natural Language Processing
NN – Neural Network
NB – Naïve Bayes, machine learning algorithm
NBU – Naïve Bayes Updatable, machine learning algorithm

P
Profile – representation of a user and of his/her social links inside a social network site

Q
QP – Quadratic Programming, the problem of optimizing (minimizing or maximizing) a
quadratic function of several variables subject to linear constraints on these variables

R
Regex – regular expression that matches a pattern to a string

S
SMO – Sequential Minimal Optimization (of the SVM classifier)
SNA – Social Network Analysis
SNSs – Social Network Sites
SSD – System Sequence Diagram
Stemming – process for reducing words to their stem, base or root form
Stoplisting – process of filtering out stop words
SVM – Support Vector Machines

T
TF-IDF – Term Frequency Inverse Document Frequency

W
WEKA – Waikato Environment for Knowledge Analysis
WHATWG – Web Hypertext Application Technology Working Group, community of
people interested in evolving HTML and related technologies
WM – Web Mining

Annexes
Weka Classifier – TextClassifier.java – important methods

public void addCategory(String category) {
category = category.toLowerCase();
classValues.add(category);
}

public void uploadTrainingInstances() {
try {
BufferedReader br;
String filename;

if (classifier.getClass().equals(J48.class)) {
filename = "trainingSet2.arff";
} else if (classifier.getClass().equals(MultilayerPerceptron.class)) {
filename = "trainingSet3.arff";
} else {
filename = "trainingSet.arff";
}
br = new BufferedReader(new InputStreamReader(
this.getClass().getClassLoader().getResourceAsStream("resources/" + filename),
"UTF-8"));
trainingData = new Instances(br);
// setting class attribute
trainingData.setClassIndex(trainingData.numAttributes() - 1);
} catch (IOException ex) {
Logger.getLogger(TextClassifier.class.getName()).log(Level.SEVERE, null, ex);
}
upToDate = false;
}

public void addData(String topic) throws IllegalStateException {
if (!setup) {
throw new IllegalStateException("Must use setup first");
}
List<Keyword> trainSet = MainController.getInstance().findKeywordByCategory(topic);
topic = topic.toLowerCase();
if (trainSet.size() > 0) {
for (Keyword keyword : trainSet) {
// Make message into instance.
Instance instance = makeInstance(keyword.getKeyword(), trainingData);
instance.setWeight(keyword.getWeight());

// Set class value for instance.
instance.setClassValue(topic);
// Add instance to training data.
trainingData.add(instance);
}
upToDate = false;
}
}

private void buildIfNeeded() throws Exception {
if (!upToDate) {
// Initialize filter and tell it about the input format.
filter.setInputFormat(trainingData);
// Generate word counts from the training data.
filteredData = Filter.useFilter(trainingData, filter);
// Rebuild classifier.
classifier.buildClassifier(filteredData);
upToDate = true;
}
}

public double[] classifyMessage(String message) throws Exception {
message = message.toLowerCase();

buildIfNeeded();
//get a copy of the instance structure.
Instances testset = trainingData.stringFreeStructure();
Instance testInstance = makeInstance(message, testset);
// Filter instance.
filter.input(testInstance);
filter.batchFinished();
Instance filteredInstance = filter.output();

return classifier.distributionForInstance(filteredInstance);
}

private Instance makeInstance(String text, Instances data) {
// Create instance of length two.
Instance instance = new SparseInstance(2);
// Set value for message attribute
Attribute messageAtt = data.attribute("text");
instance.setValue(messageAtt, messageAtt.addStringValue(text));
// Give instance access to attribute information from the dataset.
instance.setDataset(data);
return instance;
}

public void setupAfterCategorysAdded() {
attributes.add(new Attribute("topic", classValues));
// Create dataset with initial capacity of 5000, and set index of class.
trainingData = new Instances("TextClassificationProblem", attributes, 5000);
trainingData.setClassIndex(trainingData.numAttributes() - 1);
setup = true;
}
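
For orientation, a hypothetical call sequence over the methods above might look as follows; the constructor and the classifier wiring (via the ClassifierFactory) are not part of this excerpt, so treat it purely as a reading aid:

// Hypothetical reading aid only; constructor and classifier setup not shown here.
TextClassifier tc = new TextClassifier();
tc.uploadTrainingInstances();                 // load the ".arff" training set
double[] scores = tc.classifyMessage("muzeu expozitie pictura");
// scores[i] = estimated probability that the text belongs to topic i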
