TWITTER TWEETS ANALYSIS BASED ON HASH-TAGS
Limboi Sergiu George, group 258
ABSTRACT
Twitter analysis is an important area that reveals domains of interest and expresses the
feelings and opinions of people about major topics. This paper addresses the problem of
generating news stories starting from messages posted on Twitter. We present an unsupervised
learning approach, based on a clustering technique, that groups tweets
around hash-tags. Lastly, we provide an analysis of the results that shows important tendencies
about different subjects and helps in anticipating future events.
Categories and Subject Descriptors
H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval [9]
Keywords
Twitter, News, Clustering, Topic Detection and Tracking
1. INTRODUCTION
Twitter is a popular social medium for communicating with other people, expressing
feelings and opinions, and broadcasting news. The advantages of such a powerful tool are [1]: its
availability on different electronic devices, the opportunity to have a large pool of friends, and the
ability to send short, concise messages (called tweets) to other users on a
variety of subjects. Nowadays it is a challenge to gather all relevant data and to detect and summarize
news on a specific topic. It is also hard for a user to find other users with interesting
tweets, because the user has to read through status updates and follow the links attached to each
tweet in order to obtain more information.
The goal of this paper is to present a method to collect, preprocess and group messages
from Twitter on different topics or news stories. The problem belongs to the Topic
Detection and Tracking area [2], which covers story (news) detection, cluster detection and
tracking. The grouping is based on Twitter hash-tags, which identify possible topics.
Hash-tags [4] are keywords prefixed with the “#” symbol that can appear in a tweet. Twitter
users use this notation to categorize their messages and to make them easier
to find in search, so hash-tags can be seen as indicators of tweet topics. For grouping
tweets we use an unsupervised learning approach, clustering [4], more precisely the k-means
technique, in order to obtain news stories.
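As a minimal sketch of how hash-tags could be pulled out of a tweet, the snippet below uses a regular expression. The pattern (a “#” followed by letters, digits or underscores) and the function name extract_hashtags are assumptions made for illustration; the paper does not describe its own extraction code.

```python
import re
from typing import List

# Assumed pattern: '#' followed by one or more word characters.
HASHTAG_RE = re.compile(r"#(\w+)")


def extract_hashtags(tweet: str) -> List[str]:
    """Return the hash-tags of a tweet, lowercased, in order of appearance."""
    return ["#" + tag.lower() for tag in HASHTAG_RE.findall(tweet)]


example = "A protester is ejected for interrupting #donaldtrump #ctpolitics #election2016"
print(extract_hashtags(example))
```

Lowercasing the tags is a design choice that makes #NBA and #nba count as the same topic indicator.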
Twitter analysis, and in particular the clustering of tweets by topic, is very important
because it reveals the main subjects that people are interested in. We can determine the topics for a
certain period (based on publication date), and the results can help us better understand the
impact of an event or news item and how people reacted to it. Because tweets include rich
structured information (meta-data such as source location, description, photo, etc.) about
the users involved in the communication process, we can identify the domains of interest of
different geographical regions. Most messages carry little information on their own, but the
aggregation of millions of messages can generate important knowledge.
Another aspect that affects the grouping of tweets is the set of
characteristics of news in Twitter [2]: tagging a user, embedding a link, re-tweeting and using a hash-tag. Tagging a
user helps to identify conversations between users. A link inserted in a message, called an artifact
[1], connects the tweet to related material on the Internet (such as an article, photo or video) and is useful for finding
more information about a topic. Re-tweeting means posting a tweet again, and it indicates the
popularity of a message. Last but not least, hash-tags let us group related messages
together.
The rest of the paper is organized as follows. Section 2 presents the background of the
problem, highlighting different approaches from the literature for grouping
news from Twitter. Section 3 introduces our approach, describing the proposed system, the
essential steps and the results. Section 4 concludes the paper and suggests
directions for future research.
2. BACKGROUND & RELATED WORK
Twitter is an online social networking service that lets users send and read short
messages (up to 140 characters) called “tweets” [6]. The analysis of Twitter is a research area of
high and growing interest, because some research problems are still poorly defined and
new difficulties are described day by day. In recent years, researchers have focused on issues such as
event detection, topic mining, sentiment analysis and opinion mining.
Paul and Dredze [7] studied the mining of public health information from Twitter.
Messages like “I got flu” are common, and knowing this about one specific user is not interesting,
but millions of such messages can be revealing, for example for tracking the influenza rate in specific
countries. The medical (health) topics extracted from tweets can therefore provide valuable insights into
a population. Sentiment analysis [6] of tweets aims to classify the polarity
of a given text, sentence or feature. Grouping tweets by sentiment (positive, negative or neutral)
is a first step toward measuring public opinion, such as political sentiment, which helps
in tracking political opinions and predicting election results.
Popescu and Pennacchiotti [3] approached the event detection problem. They presented
a method for automatically detecting events that engage large social media audiences. The methods
focus on detecting controversial events, which are events that provoke a public discussion in
which members express opinions or disbeliefs. The main concept is the Twitter
snapshot, a triple consisting of a target entity, a given time period and a set of tweets
about the entity from that period. Controversial event detection is modeled in two
steps: the first assigns a controversy score to each snapshot, and the second
ranks the snapshots according to that score.
Phuvipadawat and Murata [2] suggested a method for collecting and grouping news
based on popularity and reliability. The timeline aspect is taken into consideration, and the
number of re-tweets and the hash-tags play an essential role. The approach consists of two steps: story
finding and story development. Story finding involves sampling (messages are fetched through
the Twitter streaming API), indexing (an index is built on the content of the messages) and grouping
(messages that are similar to each other are grouped together to form a news story). Each group is
evaluated for reliability and popularity. Reliability is determined from the
number of followers of all the users who posted messages in the group. Popularity is
determined from the number of re-tweets within the group. In the story development step,
each news story is re-ranked appropriately as time passes.
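The two group scores described above can be illustrated with a short sketch. The text only says that reliability comes from follower counts and popularity from re-tweet counts, so the plain sums used here, the Message class and the function names are assumptions for illustration and not the exact formulas of [2].

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    user: str
    followers: int  # follower count of the posting user
    retweets: int   # number of times this message was re-tweeted


def reliability(group: List[Message]) -> int:
    """Sum of followers over the distinct users who posted in the group."""
    followers_by_user = {m.user: m.followers for m in group}
    return sum(followers_by_user.values())


def popularity(group: List[Message]) -> int:
    """Total number of re-tweets within the group."""
    return sum(m.retweets for m in group)


group = [
    Message("alice", followers=100, retweets=3),
    Message("alice", followers=100, retweets=1),
    Message("bob", followers=50, retweets=0),
]
print(reliability(group), popularity(group))
```

Counting each user once in the reliability score avoids inflating it when one account posts many messages in the same group.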
TwitterStand [1] is a system built for capturing tweets that correspond to the latest
breaking news. The difficulty is that tweets are not sent according to a schedule, they
tend to be noisy, and they arrive as the news is happening. The issues addressed are: removing the
noise, determining tweet clusters, and determining the relevant locations associated with the tweets.
The key strategies are: using online algorithms (algorithms that process the input
one element at a time), separating news from noise, adapting the algorithms to
keep up with the dynamic system (new users, deleted users, etc.), identifying core groups of people
who tweet about news, and obtaining rich content (links to related material about the news). TwitterStand
uses an online clustering method, the leader-follower algorithm [1]. In this approach
there is no re-clustering, and a list of active clusters is maintained (clusters whose time centroid is less than 3
days old). Another important process is geo-tagging, which maps the resulting clusters
to geographical regions. Geo-tagging consists of toponym recognition and toponym
resolution. Toponym recognition finds all textual references to geographic
locations (toponyms) in the text. Toponym resolution determines the correct geographic
coordinates for each recognized toponym among all possible interpretations.
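To make the idea of online clustering without re-clustering more concrete, here is a rough sketch in the spirit of the leader-follower scheme. It is not the TwitterStand implementation: the Jaccard distance over word sets, the 0.7 threshold, the Cluster class and the way the "leader" is kept as a union of member words are all assumptions chosen for brevity. Only the 3-day activity window comes from the text.

```python
from typing import Iterable, List, Set, Tuple

MAX_AGE_DAYS = 3.0  # clusters with an older time centroid are retired
THRESHOLD = 0.7     # assumed distance cut-off, not taken from [1]


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


class Cluster:
    def __init__(self, words: Set[str], day: float) -> None:
        self.words = set(words)  # simplified "leader": union of member words
        self.days = [day]        # arrival times of the members, in days

    @property
    def time_centroid(self) -> float:
        return sum(self.days) / len(self.days)


def process(stream: Iterable[Tuple[Set[str], float]],
            clusters: List[Cluster]) -> List[Cluster]:
    """Single pass: every (words, day) item is seen once, never re-clustered."""
    for words, day in stream:
        # Keep only the active clusters.
        clusters[:] = [c for c in clusters
                       if day - c.time_centroid < MAX_AGE_DAYS]
        best = min(clusters,
                   key=lambda c: jaccard_distance(words, c.words),
                   default=None)
        if best is not None and jaccard_distance(words, best.words) <= THRESHOLD:
            best.words |= words       # the item follows an existing leader
            best.days.append(day)
        else:
            clusters.append(Cluster(words, day))  # the item starts a new cluster
    return clusters


stream = [
    ({"quake", "japan"}, 0.0),
    ({"quake", "tokyo", "japan"}, 0.1),
    ({"nba", "finals"}, 0.2),
    ({"nba", "game"}, 2.0),
]
active = process(stream, [])
print([sorted(c.words) for c in active])
```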
Petrovic, Osborne and Lavrenko [8] proposed an algorithm for detecting news events
in a stream of Twitter posts. The method is based on locality-sensitive hashing and addresses
the first story detection (FSD) problem: given a sequence of stories, the goal of FSD is to
identify the first story to discuss a particular event. In the streaming model of computation,
items (tweets) arrive continuously in chronological order and are
processed in bounded space and time. Locality-sensitive hashing hashes each query
point into buckets in such a way that the probability of collision is much higher for points that
are close to each other. The stream of documents is unbounded and arrives at a very fast rate, so
there is a limit on the amount of space and time available for processing a document. In other words, only
one pass over the data is allowed, and the decision has to be made immediately after a new tweet
arrives.
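One common way to obtain such a hash is the random-hyperplane scheme for cosine similarity, sketched below. This is an illustration of the general technique only; the number of bits, the vector representation of tweets and all names (make_hyperplanes, signature, docs) are assumptions and are not taken from [8].

```python
import random
from collections import defaultdict
from typing import Dict, List, Tuple


def make_hyperplanes(dim: int, bits: int, seed: int = 0) -> List[List[float]]:
    """Draw `bits` random hyperplanes (normal vectors) in `dim` dimensions."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(bits)]


def signature(vec: List[float], planes: List[List[float]]) -> Tuple[int, ...]:
    """One bit per hyperplane: the side of the plane on which the vector lies."""
    return tuple(
        int(sum(v * p for v, p in zip(vec, plane)) >= 0.0) for plane in planes
    )


planes = make_hyperplanes(dim=3, bits=8)

# Toy document vectors (hypothetical term weights).
docs = {
    "t1": [1.0, 0.0, 2.0],
    "t2": [2.0, 0.0, 4.0],    # same direction as t1
    "t3": [-1.0, 3.0, -2.0],
}

# Documents with the same signature land in the same bucket.
buckets: Dict[Tuple[int, ...], List[str]] = defaultdict(list)
for name, vec in docs.items():
    buckets[signature(vec, planes)].append(name)

print(dict(buckets))
```

Vectors that point in the same direction always share a signature, while vectors separated by a large angle are likely to differ in at least one bit, so a new tweet only needs to be compared with the few documents in its own bucket.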
3. TWITTER ANALYSIS FOR TWEETS FROM AMERICA
This section presents our approach for grouping tweets from the American continent
based on hash-tags.
3.1. Tweets
The main concepts used in a Twitter environment are: user, friend, follower,
tweet, hash-tag and re-tweet. A user is a person or a system that can post messages on Twitter
[1]. Twitter defines a friend-follower relationship. For example, consider two users
a and b. User a has the option to receive all the tweets written by user b. In that case, b becomes
a friend of a, and a is a follower of b. The reverse relation is not mandatory, because user b is
not forced to receive the messages from user a. A user is also described by several properties:
name, source location, list of friends and followers, number of tweets, photo and a short
description.
Twitter uses several abbreviations, among them RT and DM. RT means re-tweet,
that is, posting a message again, and DM stands for Direct Message, used when you want to send a
message to a specific user. Usually, a specific user is addressed by prefixing the user name with the “@” symbol
(e.g. @john).
A tweet is a short (at most 140 characters) and simple message posted on
Twitter. Such a message has two important elements: facts and emotions. Emotions consist
of symbols like “!”, sensational adjectives (e.g. crazy, amazing, shocking) and sensational phrases
(e.g. oh my God!). The facts are text-based and give the details of the news in terms of what, where,
when or how. These terms help us identify keywords, namely significant nouns and verbs. The
nouns are those found in conventional news: names of places, events or famous people. Examples of
significant verbs are: win, rescue, fire, etc.
The facts also include hash-tags (keywords prefixed with the “#” symbol) and
hypertext-based elements: related media (maps, photos, videos) and related websites.
3.2. Input
Data is a valuable research resource for testing and evaluating experiments and
approaches. The Twitter dataset used in our experiment is an American tweets dataset
[5]. It is a free collection of 200,000 tweets posted in America over 48 hours.
Re-tweets are excluded from this input, and only the original tweets are processed. An entity (or
instance) of the dataset is characterized by the following features: Twitter id, date, hour,
username, nickname, biography, tweet content, country, place, profile photo, number of
followers, number of following, tweet language and tweet url.
An entity is therefore modeled as an object with a list of attributes:
E = (id, date, hour, username, nickname, bio, content, country, place, photo, followers, following,
language, url)
Example of entity:
E1:
Twitter id: 72131
Date: 2016-04-16
Hour: 12:44
Username: Bill Schulhoff
Nickname: BillSchulhoff
Bio: Husband
Tweet content: #007 #jamesbond #blacktie #redcarpet #vip #gambling #youonlylivetwice
#diamondsareforever… https://t.co/ZXRyEC9Dlg
Country: US
Place: East Patchogue, NY
Photo: http://pbs.twimg.com/profile_images/378800000718469152/535032cf772ca04524e0fe075d3b4767_normal.jpeg
Followers: 386
Following: 705
Language: en
Url: http://www.twitter.com/BillSchulhoff/status/721318437075685382
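The entity E could be represented in code roughly as follows. The class name TweetEntity and the chosen field types are assumptions for illustration; the field names mirror the attribute list of E and the values are copied from the example entity E1.

```python
from dataclasses import dataclass


@dataclass
class TweetEntity:
    id: str
    date: str
    hour: str
    username: str
    nickname: str
    bio: str
    content: str
    country: str
    place: str
    photo: str
    followers: int
    following: int
    language: str
    url: str


# The example entity E1, with values taken from the listing above.
e1 = TweetEntity(
    id="72131",
    date="2016-04-16",
    hour="12:44",
    username="Bill Schulhoff",
    nickname="BillSchulhoff",
    bio="Husband",
    content="#007 #jamesbond #blacktie #redcarpet #vip #gambling "
            "#youonlylivetwice #diamondsareforever… https://t.co/ZXRyEC9Dlg",
    country="US",
    place="East Patchogue, NY",
    photo="http://pbs.twimg.com/profile_images/378800000718469152/"
          "535032cf772ca04524e0fe075d3b4767_normal.jpeg",
    followers=386,
    following=705,
    language="en",
    url="http://www.twitter.com/BillSchulhoff/status/721318437075685382",
)
print(e1.nickname, e1.country, e1.followers)
```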
3.3. Data pre-processing
To build our system, we perform several pre-processing steps that involve normalization
and tokenization. The dataset is loaded into the system (Excel files are parsed) and each instance
is processed as follows:
Words are tokenized on whitespace and converted to lowercase.
Rare terms (terms with minimum frequency in tweets) are eliminated.
Non-alphanumeric characters (except Twitter symbols like # and @) are removed.
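The three steps listed above can be sketched as follows. The frequency threshold MIN_FREQ, the restriction to ASCII letters and digits, and the order in which the steps are applied (character cleaning before rare-term removal) are assumptions; the paper does not give these details.

```python
import re
from collections import Counter
from typing import List

MIN_FREQ = 2  # assumed threshold; the paper does not state a value


def tokenize(text: str) -> List[str]:
    """Lowercase, split on whitespace, keep only a-z, 0-9, '#' and '@'."""
    tokens = []
    for raw in text.lower().split():
        cleaned = re.sub(r"[^a-z0-9#@]", "", raw)
        if cleaned.strip("#@"):  # drop tokens that are empty or only symbols
            tokens.append(cleaned)
    return tokens


def preprocess(tweets: List[str], min_freq: int = MIN_FREQ) -> List[List[str]]:
    """Tokenize every tweet, then remove terms rarer than `min_freq`."""
    tokenized = [tokenize(t) for t in tweets]
    freq = Counter(tok for toks in tokenized for tok in toks)
    return [[tok for tok in toks if freq[tok] >= min_freq] for toks in tokenized]


sample = ["Go #NBA, go!!!", "#nba finals tonight"]
print(preprocess(sample))
```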
3.4. Proposed system
The system consists of several parts that together cover the entire process of grouping
tweets based on hash-tags, in order to obtain relevant clusters that reveal the main topics on
the American continent over a certain period of time.
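Figure 1. System components: Input (Dataset), Tweet retrieval, Data pre-processing, Clustering algorithm, Resulting clusters based on hash-tags.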
Figure 1 shows the entire system for grouping American tweets. The input is a
dataset with 200,000 tweets collected via the Twitter API. The tweet retrieval component
reads and parses the dataset and loads the instances into the system. The data pre-processing part
facilitates the clustering step. The clustering algorithm performs the actual grouping
of tweets, and the final result consists of clusters of tweets, each cluster being
defined by a hash-tag (an important topic or news story).
3.5. Clustering
An important step in obtaining news stories from tweets is the clustering process. For this
stage we used the k-means technique [10]. K-means is an unsupervised method whose
main element is the centroid (the point at the center of a cluster). An entity belongs
to a cluster if it is closer to the centroid of that cluster than to any other centroid. The k-means algorithm is
described as follows:
i) Define k centroids, one for each cluster. Initially, the centroids are chosen randomly.
ii) Repeat
    Assign entities to clusters based on the current centroids
    Update the centroids based on the new assignments in each cluster
Until convergence is reached
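The two steps above correspond to the classical k-means loop, which can be sketched for two-dimensional points as follows. This is a generic illustration with Euclidean distance, not the hash-tag based variant used in the experiment; the function names and the toy data are assumptions.

```python
import random
from typing import List, Tuple

Point = Tuple[float, float]


def sq_dist(p: Point, q: Point) -> float:
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points: List[Point], centroids: List[Point],
            max_iter: int = 100) -> List[int]:
    """Return the cluster index of every point once assignments stop changing."""
    centroids = list(centroids)  # work on a copy of the initial centroids
    labels: List[int] = []
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(len(centroids)), key=lambda j: sq_dist(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:  # convergence: nothing changed
            break
        labels = new_labels
        # Update step: each centroid moves to the mean of its members.
        for j in range(len(centroids)):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return labels


points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]

# Step i): choose k = 2 initial centroids at random from the data.
initial = random.sample(points, 2)
print(k_means(points, initial))
```

Passing the initial centroids as an argument keeps the random choice of step i) separate from the loop of step ii), which makes runs easy to reproduce.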
3.6. Experiment
The first step in building the system is reading the dataset and loading the instances into the
system. The dataset has 200,000 tweets retrieved from the American continent. After the data is
pre-processed, we run the clustering algorithm. In order to compute the centroids we have to
define a similarity measure that indicates whether an entity is similar to (has features in common with) other
entities. For this, we use the Jaccard distance [10]:
$$d(A, B) = 1 - \frac{|A \cap B|}{|A \cup B|}$$
where A and B are tweets viewed as unordered sets of words, hash-tags included.
This distance has the following properties:
It is small if A and B are similar
It is large if they are not similar
It is 0 if they are the same
It is 1 if they are completely different
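These properties are easy to check with a small implementation of the Jaccard distance over sets of words. The function name and the convention of returning 0 for two empty sets are assumptions made for this sketch.

```python
from typing import Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    """1 - |A ∩ B| / |A ∪ B| for two sets of words."""
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty tweets count as identical
    return 1.0 - len(a & b) / len(union)


same = {"#nba", "finals"}
print(jaccard_distance(same, same))                              # identical sets
print(jaccard_distance({"#nba", "finals"}, {"#nba", "game"}))    # partly similar
print(jaccard_distance({"#nba"}, {"#hollywood"}))                # nothing shared
```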
Using this measure, we can compute the centroids. In our approach each centroid is represented by
a hash-tag. The clustering process starts with some predefined hash-tags extracted from the data
set (e.g. #election2016, #nba, #hollywood, etc.). The k value of the k-means algorithm is set to 10.
After the clustering is finished we obtain 10 clusters, each of them defined by a hash-tag
(which represents the news story). A cluster can also contain other hash-tags.
Example:
Cluster -> centroid = #election2016 and the following tweets:
{A protester is ejected for interrupting #donaldtrump #ctpolitics #election2016,
#2016presidentialelection #hilary #hilaryclinton, Kicking of our campaign for this election year
#election2016}
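A single assignment step with hash-tag centroids might look like the sketch below. It is only an approximation of the experiment: each seed hash-tag is treated as a one-word set, tweets at distance 1 from every seed are left unassigned, and no centroid update is performed, so a tweet that does not contain the seed tag itself (such as the second tweet in the example above) would not be captured by this step alone. The function names and the toy tweets are assumptions.

```python
from typing import Dict, List, Set


def jaccard_distance(a: Set[str], b: Set[str]) -> float:
    union = a | b
    return 1.0 - len(a & b) / len(union) if union else 0.0


def assign_to_hashtags(tweets: List[Set[str]],
                       seeds: List[str]) -> Dict[str, List[Set[str]]]:
    """Put each tweet into the cluster of the nearest seed hash-tag."""
    clusters: Dict[str, List[Set[str]]] = {seed: [] for seed in seeds}
    for words in tweets:
        best = min(seeds, key=lambda s: jaccard_distance(words, {s}))
        if jaccard_distance(words, {best}) < 1.0:  # shares the seed tag
            clusters[best].append(words)
    return clusters


tweets = [
    {"a", "protester", "is", "ejected", "#donaldtrump", "#ctpolitics", "#election2016"},
    {"#nba", "finals"},
    {"lunch", "time"},
]
seeds = ["#election2016", "#nba", "#hollywood"]
result = assign_to_hashtags(tweets, seeds)
print({seed: len(members) for seed, members in result.items()})
```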
These results are important because they show which messages were posted about a
certain topic, in which countries the topic was discussed (Canada, USA, Mexico) and in
what period of time the messages were posted (based on date and hour). We can also run
another clustering step that takes the countries as centroids. In this case we can observe what
kinds of messages represent the domains of interest in a specific country. All in all, this experiment
presents some tendencies for the American continent based on hash-tags from messages posted
on Twitter.
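A simple way to look at the domains of interest per country is to group the tweets by their country attribute and count the hash-tags in each group. The sketch below does this with plain counting rather than a second clustering run; the function name and the sample data are hypothetical.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, List, Tuple


def hashtags_by_country(
    tweets: Iterable[Tuple[str, List[str]]]
) -> Dict[str, Counter]:
    """Count how often each hash-tag appears in each country."""
    counts: Dict[str, Counter] = defaultdict(Counter)
    for country, tags in tweets:
        counts[country].update(tags)
    return dict(counts)


# Hypothetical (country, hash-tags) pairs for illustration only.
sample = [
    ("US", ["#election2016", "#nba"]),
    ("US", ["#nba"]),
    ("CA", ["#nba"]),
    ("MX", ["#hollywood"]),
]
per_country = hashtags_by_country(sample)
for country, counter in per_country.items():
    print(country, counter.most_common(1))
```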
4. CONCLUSION AND FUTURE WORK
In this paper we proposed an unsupervised approach to the Twitter analysis problem.
We grouped messages sent on Twitter based on the hash-tags that characterize
the tweets. Our experiment on an American dataset reveals the subjects (news stories) that
interest people from this geographical zone and the opinions and feelings expressed
about these topics.
In the future we want to process larger datasets and to apply some supervised learning
techniques. We also plan to integrate our system with APIs that provide geographical content, in
order to visualize the regions associated with the desired clusters. Another
improvement could be combining the clustering with SOM (Self-Organizing Maps), which
is another unsupervised learning technique.
5. REFERENCES
[1] Sankaranarayanan, J., Samet, H., Teitler, B. E., Lieberman, M. D., Sperling, J.,
TwitterStand: News in tweets, Proceedings of the 17th ACM SIGSPATIAL International
Conference on Advances in Geographic Information Systems, GIS '09, ACM, New York, NY,
pp. 42-51, 2009.
[2] Phuvipadawat, S., Murata, T., Breaking news detection and tracking in Twitter,
IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent
Technology (WI-IAT), Vol. 3, Toronto, ON, pp. 120-123, 2010.
[3] Popescu, A. M., Pennacchiotti, M., Detecting controversial events from Twitter,
Proceedings of the 19th ACM International Conference on Information and Knowledge
Management, CIKM '10, ACM, New York, NY, pp. 1873-1876, 2010.
[4] Rosa, K. D., Shah, R., Lin, B., Gershman, A., Frederking, R., Topical Clustering of Tweets,
Proceedings of the ACM SIGIR: 2011, New York, NY, pp. 67-75, 2011.
[5] ***, Follow the hashtag, http://followthehashtag.com/datasets/
[6] ***, Sentiment Analysis of Twitter Data,
http://www.slideshare.net/sumit786raj/sentiment-analysis-of-twitter-data
[7] Paul, M., Dredze, M., You Are What You Tweet: Analyzing Twitter for Public Health, 5th
International AAAI Conference on Weblogs and Social Media, Barcelona, Spain, pp. 265-272,
2011.
[8] Petrovic, S., Osborne, M., Lavrenko, V., Streaming first story detection with application to
Twitter, Proceedings of HLT '10, Human Language Technologies: The 2010 Annual Conference of
the North American Chapter of the Association for Computational Linguistics, pp. 181-189, Los
Angeles, California, 2010.
[9] ***, Computing Classification System 1998,
http://scidok.sulb.unisaarland.de/ccs_ebene3.php?buchstabe=H.3&anzahl=11&la=en
[10] ***, Tweets clustering,
http://www3.nd.edu/~dwang5/courses/spring15/assignments/A2/Assignment2_SocialSensing.html