
CHAPTER 1
Introduction
1. Internet of Things
The Internet, one of the most important technologies that has succeeded in influencing our lifestyle, was created by people for people. The dynamics of change are taking the Internet to a new level: the new Internet links not only people but also things. This Internet of Things (IoT) is able to connect all smart devices so that they can be monitored and controlled remotely. The Internet of Things is a concept about the interconnection of uniquely identifiable computing devices integrated into the existing Internet infrastructure. IoT is usually expected to offer advanced connectivity to devices, systems and services that goes beyond machine-to-machine (M2M) communication and covers a variety of protocols, domains and applications. The interconnection of these embedded devices (including smart objects) is expected to usher in automation in almost all areas, while also enabling advanced applications such as smart grids.
By "things" in IoT one can understand a variety of devices, such as heart-monitor implants, biochip transponders on farm animals, in-car sensors, or field-operated devices that help search-and-rescue firefighters. Current market examples include smart thermostat systems and washing/drying machines that use WiFi for remote monitoring.
The Internet of Things does not rely only on computers to exist. Any object, even the human body, can become part of the Internet of Things if it is equipped with certain electronic components. These components vary depending on what the object has to perform, but they fall into two large categories:
the object must be able to capture data, usually through sensors;
the object must be able to transmit this data elsewhere via the Internet.
A sensor and a connection, therefore, are the two primary electronic parts of an object included in the Internet of Things. According to industry analysts, in 2015 there were between 10 and 20 billion objects connected to the Internet. This ecosystem of connected objects forms the foundation of the Internet of Things. The number of objects connected in 2015 was small in comparison with how many will be connected in 2020. Estimates vary, but the general prediction is
that the number of connected objects will reach 40-50 billion by 2020, including everything from pens to dwellings, machinery and industrial equipment. Integration with the Internet implies that devices will use an IP address as a unique identifier. However, due to the limited IPv4 address space (which allows only about 4.3 billion unique addresses), IoT objects will need to use IPv6 to fit into the extremely large address space required. Objects in IoT will not only be devices with sensory capabilities, but will also provide actuation capabilities (e.g., Internet-connected light bulbs or locks). To a large extent, the future of the Internet of Things will not be possible without IPv6 support; therefore the worldwide adoption of IPv6 in the coming years will be essential for the successful development of IoT. The integration of many IoT devices into the IT environment calls for low-cost computing platforms. In fact, to minimize the impact of such devices on the environment and on power consumption, low-power radios are likely to be used to connect to the Internet. Such low-power radio devices do not use WiFi or the known cellular network technologies, and they remain an active research area. But IoT will not be built only from embedded devices, because higher-end computing devices will be needed to perform more complex service tasks (routing, switching, data processing, etc.). In addition to the multitude of new application domains for Internet-connected automation, IoT will also generate large amounts of data from various locations, aggregated very quickly, thus increasing the need for better indexing, storage and processing of such data. Various applications have different scenarios and implementation requirements that were typically met with proprietary implementations. However, since IoT involves Internet connectivity, most devices that offer IoT services will need to work using standardized technologies.
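As a rough illustration of the address-space argument above, the short Python sketch below compares the number of unique IPv4 and IPv6 addresses with the device-count estimates quoted in this section. The figures for 2015 and 2020 are only the mid-points of the ranges mentioned in the text, not additional data.

# Illustrative arithmetic for the IPv4 vs. IPv6 address-space argument.
ipv4_addresses = 2 ** 32          # about 4.3 billion unique addresses
ipv6_addresses = 2 ** 128         # about 3.4 * 10^38 unique addresses

devices_2015 = 15e9               # mid-point of the 10-20 billion estimate
devices_2020 = 45e9               # mid-point of the 40-50 billion estimate

print(f"IPv4 space: {ipv4_addresses:,} addresses")
print(f"IPv6 space: {ipv6_addresses:.2e} addresses")
print(f"IPv4 can cover the 2020 estimate: {ipv4_addresses >= devices_2020}")
print(f"IPv6 can cover the 2020 estimate: {ipv6_addresses >= devices_2020}")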
1.0.1. The benefits of this technology.
Smart homes equipped with such devices can greatly help to save energy by reducing unnecessary consumption, ensuring optimal heating and lighting according to the occupants' activities and weather conditions. Sensors located in these homes can alert in real time about various damages or incidents such as flooding, fire, or the intrusion of a stranger.
Intelligent cars will be connected to traffic monitoring systems and will know how to avoid traffic jams and find the closest available parking spaces. Any technical failures will be detected by the sensors and reported in real time to the owner.
Smart cities will provide citizens with services that can make a major contribution to protecting the environment. Important resources (water, gas, electricity or heat) can be monitored and controlled by automation and distributed more efficiently
in line with real needs, and distribution companies can be
alerted immediately if infrastructure problems arise.
The medical system will also benefit from IoT technology. Continuous monitoring of the vital signs of the patient's body may contribute to their safety and to informing doctors in time when health problems arise. Sensors can help administer prescription drugs, bringing enormous benefits to hospitals, care homes and the elderly, where the careful attention of caregivers is at least as important as timely intervention.
1.0.2. The disadvantages and risks of this technology.
Connectivity issues. IoT technology involves devices that are interconnected and dependent on each other. When a device is compromised, fails, or delivers erroneous data, the effect will propagate to all systems that depend on it directly or indirectly. The effect may be relatively minor (when the planning application no longer triggers the alarm for an appointment) or major (when a sensor that monitors the vital functions of a patient does not inform the doctor about a medical incident).
Security issues. Eager to launch a new product on the market as soon as possible, designers of IoT equipment often neglect security issues. For this reason, more and more smart devices (phones, TVs, surveillance cameras, or refrigerators) are involved in massive cyber attacks.
Confidentiality issues of stored data. Most of these systems are vulnerable to cyber attacks, lacking the required security applications or security policies that impose passwords of satisfactory complexity, and they do not protect stored data properly. Badly conceived, these systems may allow intruders to take a look at the recorded data without being detected.
Energy waste. A study published by the International Energy Agency shows that the approximately 14 billion Internet-connected devices currently in the world are wasting an enormous amount of electricity due to inefficient technologies, and this problem will worsen by 2020, when the electricity wasted by these devices is expected to increase by about 50%. The waste of energy is due to the fact that the devices use more electricity than they should to maintain the connection and communicate with the network.
The Internet of Things (IoT) will usher in a new economic era for the whole world. The prospects offered by IoT do not refer only to simple improvements of processes and economic models, but rather to the transformation of the domains in which they are applied. The IoT economy will
revolutionize the way in which economic organizations carry out production, operation and development. And the change is happening more quickly than in any previous industrial revolution. At the same time, the Internet of Things will produce significant challenges in all sectors and for all industries. Although it solves problems that have affected business for decades, if not centuries, it will nevertheless create completely new procedural and ethical dilemmas. Concerns about the privacy of personal data, cyber security, as well as product ownership and liability, will grow with the development of new applications specific to the Internet of Things. Economic organizations will have to begin implementing IoT technology if they want to survive in the long run, but they will also have to implement strategies to address the many risks associated with IoT.
1.1. Internet Provider.
The Internet is a worldwide system of interconnected networks that provides data communication services such as opening a remote work session, transferring files, e-mail, and discussion groups. The Internet is a way of connecting existing computer networks that greatly extends the possibilities of each participating system. This network is not only an inexhaustible source of information, but at the same time a new form of communication between people. From an informational point of view, the Internet represents a huge reservoir of information that can be stored and transmitted electronically: text, images, movies, sound. This information is available free of charge or for a fee, depending on whether it is public or private.
The Internet offers several kinds of services, among which:
WWW (World Wide Web), the most developed service. Its existence is based on the concept of hypertext, materialized in the markup language called HTML (HyperText Markup Language) and on programs able to interpret this language, called web browsers.
Email, the most used service, allows users who have access to it to exchange messages anywhere in the world.
FTP (File Transfer Protocol) allows files to be transferred between computers connected to the Internet.
UseNet, discussion groups on the most diverse topics.
Telnet allows access to a server on the Internet as if the user were sitting in front of it.
An Internet service provider (ISP) is an organization that provides services for accessing, using, or participating in the Internet. Internet service providers may be organized in various forms, such as commercial, community-owned, non-profit, or otherwise privately owned. Typically, ISPs also provide their customers with the ability to communicate
with one another by providing Internet email accounts, usually with numerous email addresses at the customer's discretion. Other services, such as telephone and television services, may be provided as well. The services and service combinations may be unique to each ISP.
1.2. Big Data.
Big Data refers to huge volumes of data that cannot be processed effectively with the traditional applications that exist. The processing of Big Data begins with raw data that is not aggregated and is most often impossible to store in the memory of a single computer. Big Data is a familiar term that describes voluminous amounts of structured, semi-structured and unstructured data that have the potential to be mined for information. Although big data does not refer to any specific quantity, the term is often used when speaking about petabytes and exabytes of data. Big data, the contemporary use of parallel processing to derive value from large-scale, heterogeneous data sets, has begun a transformational shift across society that has already changed the way business operates and academia evaluates performance, and promises to reshape society at large. Big Data can be defined with an extension of the 3V model, comprising volume, velocity, value, veracity and variety:
Volume concerns the amount of data generated, on the order of up to zettabytes, and it is estimated to increase by around 40 percent every year.
Veracity. IBM coined veracity as the fourth V, which represents the unreliability inherent in some sources of data. For example, customer sentiments in social media are uncertain in nature, since they entail human judgment. Yet they contain valuable information. Thus the need to deal with imprecise and uncertain data is another facet of big data, which is addressed using tools and analytics developed for the management and mining of uncertain data.
Value refers to the worth of data and the cost associated with it while it is being generated, collected and analyzed from different quarters. Accordingly, data itself can be a commodity that can be sold to third parties for revenue, which aids budget decision making when estimating the storage cost of the data.
Variety deals with various types of data, semi-structured and
unstructured data such as audio, video, webpage, and text, as
well as traditional structured data.
Velocity refers to the era of streaming data: data collection and analysis must be performed at a much faster rate and in a time-efficient manner, so that the timely commercial value of big data can be maximized.

1.2.1. Role Of Big Data.
(1) Improving Healthcare and Public Health
Big data is widely used in the field of medicine and healthcare. As technology advances, the cost of health care keeps increasing, and big data is a great help with this issue. It even helps physicians keep track of each patient's history. The link to a patient's history can be accessed only by the patient and his particular physician. The computing power of big data analytics enables us to decode entire DNA strings in minutes and will allow us to find new cures and better understand and predict disease patterns. Researchers can mine the data to see which treatments are more effective for particular conditions, identify patterns related to drug side effects or hospital readmissions, and gain other important information that can help patients and reduce costs. Recent technological advances in the industry have improved the ability to work with such data, even though the files are enormous and often have different database structures and technical characteristics.
(2) Big Data in Data Mining
Datameer's decision trees automatically help users understand what combination of data attributes results in a desired outcome. Decision trees illustrate the strengths of relationships and dependencies within data and are often used to determine what common attributes influence outcomes such as disease risk, fraud risk, purchases and online signups. The structure of the decision tree reflects the structure that is possibly hidden in your data (a small illustrative sketch follows this list).
(3) Personal Quantification and Performance Optimisation
Big data is not just for companies and governments but also for all of us individually. We can now benefit from the data generated by wearable devices such as smart watches or smart bracelets.
(4) Manufacturing and Natural Resources
In the natural resources industry, big data allows for predictive modeling to support decision making, which has been utilized to ingest and integrate large amounts of data from geospatial data, graphical data, text and temporal data. Areas of interest where this has been used include seismic interpretation and reservoir characterization.
(5) Big Data Contributions to the Public Sector
Big data provides a wide range of facilities to government sectors, including energy exploration, fraud detection, health-related research, financial market analysis and environmental protection.
(6) Improving and Optimising Cities and Countries
Big data is used to improve many aspects of our cities and countries. For example, it allows cities to optimise traffic flows based on real-time traffic information as well as social media and weather data. A number of cities are currently piloting big data analytics with the aim of turning themselves into Smart Cities, where the transport infrastructure and utility processes are all joined up: a bus would wait for a delayed train, and traffic signals would predict traffic volumes and operate to minimise jams.
(7) Financial Trading
High-Frequency Trading (HFT) is an area where big data finds a lot of use today. Here, big data algorithms are used to make trading decisions. Today, banks perform their own credit score analysis for existing customers using a wide range of data, including checking, savings, credit card, mortgage, and investment data.
(8) Communications, Media and Entertainment
Since consumers expect rich media on demand in different formats and on a variety of devices, some big data challenges in the communications, media and entertainment industry include:
Understanding patterns of real-time media content usage
Leveraging mobile and social media content
Collecting, analyzing, and utilizing consumer insights
Organizations in this industry simultaneously analyze customer data along with behavioral data to create detailed customer profiles that can be used to:
Recommend content on demand
Create content for different target audiences
Measure content performance
(9) In Smart Phones
People now carry facial recognition technology in their pockets. Users of iPhone and Android smart phones have applications at their fingertips that use facial recognition technology for various tasks.
(10) Contributions to Learning
Along with online learning, there are many examples of the use of big data in the education industry.
Adaptive learning: beyond just reshaping coursework and the grading process, data-driven classrooms have opened up an understanding of what children learn, when they study it and to what depth. Companies produce digital courses that use big-data-fuelled predictive analytics to identify what a learner is learning and which components of a lesson plan suit them best in those situations.
Problem control: sometimes a student submits his friend's homework instead of his own. In that situation, instead of being punished he receives appreciation, and the other, innocent student gets the punishment. In such situations, big data enables cross-checking of the assignments in order to find out whose writing matches the assignment's writing.
In a different use case of big data in education, it is also used to measure teachers' effectiveness in order to ensure a good experience for both students and teachers. Teachers' performance can be fine-tuned and measured against student numbers, subject matter, student demographics, student aspirations, behavioral classification and several other variables.
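The decision-tree idea mentioned in point (2) above can be illustrated in a few lines of Python. This is only a generic sketch using scikit-learn on invented data; the attribute names, values and labels are hypothetical and are not taken from Datameer or from any real data set. It only shows how a tree exposes which combinations of attributes lead to an outcome such as elevated disease risk.

# Minimal decision-tree illustration on hypothetical patient records.
# Requires scikit-learn; the data below is invented for demonstration only.
from sklearn.tree import DecisionTreeClassifier, export_text

# Attributes: [age, smoker (0/1), body-mass index]
X = [[25, 0, 22.0], [47, 1, 31.5], [61, 1, 28.0],
     [33, 0, 24.5], [55, 0, 29.0], [68, 1, 33.0]]
# Outcome: 1 = elevated disease risk, 0 = low risk (invented labels)
y = [0, 1, 1, 0, 0, 1]

tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Print the learned rules, i.e. which attribute combinations drive the outcome.
print(export_text(tree, feature_names=["age", "smoker", "bmi"]))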
1.2.2. Big data analytics. Big data analytics examines large amounts of data to uncover hidden patterns, correlations and other insights. It helps organizations harness their data and use it to identify new opportunities. The following techniques represent a relevant subset of the tools available for big data analytics:
(1) Text analytics (text mining) refers to techniques that extract information from textual data. Social network feeds, emails, blogs, online forums, survey responses, corporate documents, news, and call center logs are examples of textual data held by organizations. Text analysis involves information retrieval, lexical analysis to study word frequency distributions, pattern recognition, tagging/annotation, information extraction, data mining techniques including link and association analysis, visualization, and predictive analytics. The overarching goal is, essentially, to turn text into data for analysis via the application of natural language processing (NLP) and analytical methods. Text analytics methods include:
(a) Sentiment analysis
Analysing the opinion or tone of what people are saying about your company on social media or through your call centre can help you respond to issues faster, see how your product and service are performing in the market, find out what customers are saying about competitors, and so on.
(b) Information extraction (IE) is the task of automatically extracting structured information from unstructured and/or semi-structured machine-readable documents. In most cases this activity concerns processing human language texts by means of natural language processing (NLP). Recent activities in multimedia document processing, such as automatic annotation and content extraction from images/audio/video, can be seen as information extraction.
(c) Question answering (QA) aims to automatically answer natural language questions posed by humans.
(d) Text summarization techniques automatically produce a succinct summary of one or more documents. The resulting summary conveys the key information in the original text(s). The main idea of summarization is to find a subset of the data which contains the "information" of the entire set. Document summarization tries to create a representative summary or abstract of the entire document by finding the most informative sentences, while in image summarization the system finds the most representative and important (i.e. salient) images. For surveillance videos, one might want to extract the important events from the uneventful context. There are two general approaches to automatic summarization: extraction and abstraction. Extractive methods work by selecting a subset of existing words, phrases, or sentences in the original text to form the summary (a small sketch of this approach is given at the end of this list). In contrast, abstractive methods build an internal semantic representation and then use natural language generation techniques to create a summary that is closer to what a human might express. Such a summary might include verbal innovations.
(2) Audio analytics
Audio analytics analyzes and extracts information from unstructured audio data.
(3) Video analytics
Video analytics is the capability of automatically analyzing video to detect and determine temporal and spatial events. The increasing prevalence of closed-circuit television (CCTV) cameras and the booming popularity of video-sharing websites are the two leading contributors to the growth of computerized video analysis.
(4) Predictive analytics
Predictive analytics is a form of advanced analytics that uses both historical and new data to forecast behavior, activity and trends. It involves applying statistical analysis, analytical queries and automated machine learning algorithms to data sets to create predictive models that place a numerical value (or score) on the likelihood of a particular event happening. In practice, predictive analytics can be applied to almost all disciplines, from predicting the failure of jet engines based on the stream of data from several thousand sensors, to predicting customers' next moves based on what they buy, when they buy, and even what they say on social media.
(5) Social media analytics. How Facebook is using big data
Apart from Google, Facebook is probably the only company that possesses this high level of detailed customer information. The more users use Facebook, the more information they amass.
Apart from analyzing user data, Facebook has other ways of determining user behavior.
Facial recognition: one of Facebook's latest investments has been in facial recognition and image processing capabilities. Facebook can track its users across the internet and other Facebook profiles with image data provided through user sharing.
Tracking cookies: Facebook tracks its users across the web by using tracking cookies. If a user is logged into Facebook and simultaneously browses other websites, Facebook can track the sites they are visiting.
Analyzing the Likes: a recent study showed that it is viable to accurately predict a range of highly sensitive personal attributes just by analyzing a user's Facebook Likes. Work conducted by researchers at Cambridge University and Microsoft Research shows how the patterns of Facebook Likes can very accurately predict your sexual orientation, satisfaction with life, intelligence, emotional stability, religion, alcohol and drug use, relationship status, age, gender, race, and political views, among many others.
Tag suggestions: Facebook suggests whom to tag in user photos through image processing and facial recognition.
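As mentioned in the text-summarization paragraph above, here is a minimal sketch of the extractive approach: sentences are scored by the frequency of the words they contain and the top-scoring ones are kept. It is a toy illustration in plain Python on an invented snippet, not a production summarizer.

# Toy extractive summarizer: score sentences by word frequency, keep the top k.
import re
from collections import Counter

def extractive_summary(text, k=2):
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    freq = Counter(re.findall(r"[a-z']+", text.lower()))
    # A sentence's score is the sum of the frequencies of its words,
    # normalized by sentence length so long sentences are not always favored.
    def score(sentence):
        tokens = re.findall(r"[a-z']+", sentence.lower())
        return sum(freq[t] for t in tokens) / max(len(tokens), 1)
    ranked = sorted(sentences, key=score, reverse=True)[:k]
    # Keep the selected sentences in their original order.
    return " ".join(s for s in sentences if s in ranked)

doc = ("Big data analytics examines large amounts of data. "
       "It uncovers hidden patterns and correlations. "
       "Organizations use these insights to identify new opportunities.")
print(extractive_summary(doc, k=2))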
1.3. The most visited sites.
YouTube.com is the most used video site. Users use it to watch videos, movies, music and tutorials from different domains, and upload their own video files in order to promote or transmit certain information. According to its own data, YouTube has over one billion unique visitors per month. Users also upload about 100 hours of video per minute.
Wikipedia.org, the online encyclopedia created and updated by
users, has reached over 4.5 million pages of its own content. Wikipedia
has about 21.3 million registered users on the site and has become a
source of reference information in many areas of activity.

Taobao.com is a Chinese site, the most popular consumer-to-consumer sales site. The Taobao.com site, created by Alibaba Group in 2003, offers small businesses the opportunity to create online stores and develop small businesses selling new and second-hand products. According to last year's figures, about 760 million products are listed on the site.

Amazon
Amazon is structured first and foremost as a virtual market. From an online store that at the beginning dealt only with books, its founder Jeff Bezos had the vision to turn it into the world's largest virtual online market. eBay had previously gained great momentum, but Amazon has now left it behind. The Amazon structure is by now almost mature and includes a world-class, flexible and easy-to-use tool for the global sale of goods. Amazon is, in theory, structured in three parts. Genuine Amazon: this category includes the original Amazon products, such as books, ebooks and digital products; this side of Amazon is an internal matter that does not intrude on the customer. Amazon Retail: in this case Amazon works just like any normal physical store, "buying" wholesale products and reselling them with a commercial markup (in reality it is more complicated than that: Amazon makes separate agreements with every major supplier). In this category we will find
brands such as Apple, Samsung and Nike, but also smaller suppliers. In this version, Amazon deals with everything related to selling, delivery, customer satisfaction, and so on. Amazon Marketplace: as the name says, it is the real virtual market created by Amazon; here Amazon is practically the hub of the market, where everyone displays their merchandise and waits for customers. Amazon is the world's largest online sales platform today. It is both a direct retailer, selling its own products, and a reseller, that is, it sells the products of others. For a retailer or marketer, Amazon Retail and Amazon Marketplace are two different channels embedded in the same platform that expose millions of products to 310 million active Amazon customers around the world. Amazon Retail works similarly to all major stores and is very similar to any retail or wholesale chain. Although it sells most of its products, in this case Amazon is the one who "buys" the product (it pays only after it has been sold), manages it, sells it and handles the delivery logistics. Suppliers negotiate wholesale prices for the items that Amazon purchases. In this case, the seller only focuses on keeping the customer happy and willing to buy; Amazon does the rest. Price margins are set, the product is sold directly by Amazon, the order is sent to the fulfillment center and the transaction is completed. Returns are sent back to Amazon, not to the vendor. Amazon Retail products usually do not compete with each other for the shopping cart. Amazon Retail is excellent for high-performance products. Amazon Marketplace is the platform that allows anyone to sell directly to the end user or online customer. Amazon receives a commission for each sale. The seller takes over the marketing responsibilities and manages the prices. Some suppliers use Marketplace to introduce new products with greater control over prices. It is also a good way to sell slow-moving products such as last year's models or accessories. Sales on Amazon Marketplace are made under two models. In the first, the trader (or seller) assumes responsibility and transportation costs to the consumer; by default, merchandise shipped by merchants is not eligible for Prime, except for proven and reliable merchants fulfilled by Amazon. In the second, Amazon (or the seller in this case) relies on Amazon's higher-performance logistics: the seller sends larger quantities of ordered products to the Amazon centers spread across the country (not in Romania; Amazon does not have distribution centers open in Romania, and deliveries are made by Amazon's partner couriers). Amazon may also charge for sales advertising; storage fees, shipping fees and other charges may apply. The seller is responsible for securing the stock of products in the Amazon logistics centers so that the products maintain a constant presence in the warehouse. Amazon Marketplace and Amazon Retail play an important role in providing everyday items that are harder to find in countries like America, but also in Western European countries where an online shopping culture has already formed. Amazon Marketplace is a collection
of retailers and independent sellers selling their products through the Amazon system. These retailers use Amazon services such as logistics, promotion and management, and in return pay Amazon a percentage of sales. Very often, Marketplace retailers can go beyond national borders to access new markets by selling their products around the world through the Amazon infrastructure.
IMDb
IMDb, also known as the Internet Movie Database, is an online database of information related to world films, home videos and video games, television programs and internet streams, including cast, personnel and fictional character biographies, production crew, plot summaries, fan reviews, ratings and trivia. The movie and talent pages of IMDb are accessible to all internet users, but a registration process is necessary to contribute information to the site. All volunteers who contribute content to the database technically retain copyright on their
contributions, but the compilation of the content becomes the exclusive property of IMDb, with the full right to copy, modify and sublicense it, and contributions are verified before posting. IMDb does not provide an API for automated queries. However, most of the data can be downloaded as compressed plain text files, and the information can be extracted using the command-line interface tools provided [25]. There is also a Java-based graphical user interface (GUI) application available that is able to process the compressed plain text files, allowing the information to be searched and displayed. As one adjunct to the data, IMDb offers a rating scale that allows users to rate films on a scale of one to ten.
2. Natural Language Processing
The processing of natural language, or computational linguistics, is an interdisciplinary science that combines computer science, linguistics and artificial intelligence and is aimed at researching written and spoken language. Natural Language Processing (NLP) is an analytic method used to extract information from (typically) unstructured texts. NLP is a special application of machine learning; popular algorithms can be found in texts on machine learning and artificial intelligence. Some practically achievable objectives are: to evaluate media resources that represent or partially represent learning objects and catalog them as machine-readable data; real-time (or near real-time) evaluation of students' (virtual) classroom discussions, questions, assignments, etc. for sentiment, structure, content and complexity; and to support careful reading of students' written products for in-depth evaluation, to detect plagiarism, and so forth.
2.1. Keyword Extraction.
Automatic keyword extraction is the process of selecting the words and phrases from a text document that can best project the core sentiment of the document without any human intervention, depending on the model [1]. Extracting keywords is one of the most important tasks when working with text. Readers benefit from keywords because they can judge more quickly whether the text is worth reading. Website creators benefit from keywords because they can group similar content by its topics. Algorithm programmers benefit from keywords because they reduce the dimensionality of the text to its most important features.
A keyword extraction algorithm has three components:
(1) Candidate selection: at this stage, all possible words, phrases, terms or concepts that could potentially be keywords are extracted.
(2) Property calculation: for each candidate, properties that indicate whether it may be a keyword are calculated.
(3) Scoring and selecting keywords: all candidates can be scored either by combining the properties into a formula or by using a machine learning technique to determine the probability of a candidate being a keyword. A score or probability threshold, or a limit on the number of keywords, is then used to select the final set of keywords.
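The three components above can be sketched in a few lines of Python. The candidate selection rule, the frequency-based property and the fixed cut-off used here are deliberate simplifications chosen only for illustration; a real extractor would use richer candidates and properties.

# Minimal keyword-extraction pipeline: candidates -> properties -> scoring.
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "for"}

def extract_keywords(text, top_n=5):
    # 1. Candidate selection: every word that is not a stopword.
    tokens = re.findall(r"[a-z']+", text.lower())
    candidates = [t for t in tokens if t not in STOPWORDS and len(t) > 2]
    # 2. Property calculation: here, simply the frequency of each candidate.
    properties = Counter(candidates)
    # 3. Scoring and selection: keep the top_n highest-scoring candidates.
    return [word for word, _ in properties.most_common(top_n)]

sample = ("Keyword extraction selects the words and phrases that best "
          "represent the core content of a document.")
print(extract_keywords(sample))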
2.1.1. Keyword Extraction Methods.
In supervised keyword extraction approaches, the keyword extraction task is treated as a binary classification problem: a classifier determines whether each word or phrase in the document is a keyword. The drawback of supervised approaches is the need for a labeled corpus; the quality of the training corpus directly affects the performance of the model, and thus the results of keyword extraction. Unsupervised keyword extraction methods include linguistic analysis, statistical methods, topic methods, and network graph based methods. These methods are used to extract keywords from an unlabeled corpus. Compared to supervised approaches, the major advantage of unsupervised methods is that no manually labelled corpus is needed. Unsupervised keyword extraction methods include:
(1) TF-IDF
Tf-idf stands for term frequency-inverse document frequency, and the tf-idf weight is often used in information retrieval and text mining. This weight is a statistical measure used to evaluate how important a word is to a document in a collection or corpus. The importance increases proportionally to the number of times a word appears in the document, but is offset by the frequency of the word in the corpus. Variations of the tf-idf weighting scheme are often used by search engines as a central tool in scoring and ranking a document's relevance given a user query.
Advantages: no need for a labeled corpus. Easy to implement, widely applied.
Drawbacks: cannot extract semantically meaningful words. The keywords are not comprehensive. Not accurate enough.
The tf-idf weight is composed of two terms (a short sketch implementing these formulas follows this list):
TF: Term Frequency, which measures how frequently a term occurs in a document. Since every document is different in length, it is possible that a term would appear many more times in long documents than in shorter ones. Thus, the term frequency is often divided by the document length (i.e. the total number of terms in the document) as a way of normalization:
TF(t) = (number of times term t appears in the document) / (total number of terms in the document).
IDF: Inverse Document Frequency, which measures how important a term is. While computing TF, all terms are considered equally important. However, it is known that certain terms, such as "is", "of", and "that", may appear many times but have little importance. Thus we need to weigh down the frequent terms while scaling up the rare ones, by computing:
IDF(t) = log_e(total number of documents / number of documents containing term t).
(2) TextRank
TextRank is a graph-based ranking model for text processing which can be used to find the most relevant sentences in a text and also to find keywords.
Advantages: no need for a labeled corpus. Strong ability to transfer to texts on other topics.
Drawbacks: ignores the semantic relevance of keywords. The extraction of low-frequency keywords is poor. High computational complexity.
Identify relevant keywords
In order to find relevant keywords, the TextRank algorithm constructs a word network. This network is built by looking at which words follow one another: a link is set up between two words if they follow one another, and the link gets a higher weight if these two words occur more frequently next to each other in the text.
Identify relevant sentences
In order to find the most relevant sentences in a text, a graph is constructed where the vertices represent each sentence in a document and the edges between sentences are based on content overlap, namely by calculating the number of words that two sentences have in common.
(3) LDA
In natural language processing, latent Dirichlet allocation (LDA) is a generative statistical model that allows sets of observations to be explained by unobserved groups that explain why some parts of the data are similar.
Advantages: no need for a labeled corpus. Can obtain semantic keywords and handle polysemy. Easy to apply to various languages.
Drawbacks: tends to extract general keywords which cannot represent the topic of the corresponding text well.
(4) RAKE
RAKE, short for Rapid Automatic Keyword Extraction, is a domain-independent keyword extraction algorithm which tries to determine key phrases in a body of text by analyzing the frequency of word appearance and its co-occurrence with other words in the text.
Advantages: no need for a corpus. Very fast and the complexity is low. Easy to implement.
Drawbacks: cannot extract semantically meaningful words. Not accurate enough.
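The TF and IDF formulas given for method (1) above translate directly into code. The sketch below follows those definitions on a tiny invented corpus, using the natural logarithm as in the IDF formula; libraries such as scikit-learn provide more complete implementations.

# TF-IDF exactly as defined above: TF(t) = count(t in d) / len(d),
# IDF(t) = log_e(N / number of documents containing t).
import math
from collections import Counter

def tf_idf(documents):
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))          # document frequency per term
    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

docs = ["the cat sat on the mat", "the dog sat on the log", "cats and dogs"]
for w in tf_idf(docs):
    # Show the three highest-weighted terms of each document.
    print(sorted(w.items(), key=lambda kv: kv[1], reverse=True)[:3])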
2.1.2. Keyword-Based Patent Analysis.
Keyword-based analysis has been applied to a wide range of patent mining tasks. Technology evolution analysis, technology theme generation, technology breakthrough innovation, and technology transformation are important contents of patent mining. Another application scenario of using keywords for patent analysis is technology subject clustering. To cluster technical topics, one of the commonly used approaches is based on keywords. The purpose of patent technology clustering is to discover the distribution of technology themes.
2.2. Dialogue Systems.
A dialog system is a computer system intended to converse with a human, with a coherent structure. Dialog systems have employed text, speech, graphics, haptics, gestures and other modes for communication on both the input and output channels. Advances in dialogue systems are overwhelmingly driven by deep learning techniques, which have been employed to enhance a wide range of big data applications such as computer vision, natural language processing, and recommender systems. For dialogue systems, deep learning can leverage a massive amount of data to learn meaningful feature representations and response generation strategies, while requiring a minimal amount of hand-crafting.
There are two types of dialogue systems: conversational systems and command-based systems.

Command-based vs. conversational dialogue systems:
Metaphor: voice interface metaphor (command-based) vs. human metaphor (conversational).
Language: constrained command language vs. unconstrained spontaneous language.
Utterance length: short utterances vs. mixed.
Semantics: simple semantics with less context dependence vs. complex semantics with more context dependence.
Syntax: more predictable vs. less predictable.
Language models: strict grammar with a possibly large vocabulary vs. less strict grammar with a possibly smaller vocabulary.
Language coverage challenge: how to get the user to understand what could be said vs. how to model everything that people say in the domain.
2.3. Task-Oriented Dialogue Systems.
The typical structure of a pipeline-based task-oriented dialogue system consists of four key components:
Dialogue state tracker. It manages the input of each turn along with the dialogue history and outputs the current dialogue state.
Language understanding. Known as natural language understanding (NLU), it parses the user utterance into predefined semantic slots.
Natural language generation (NLG). It maps the selected action to its surface form and generates the response.
Dialogue policy learning. It learns the next action based on the current dialogue state.
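A minimal sketch of how those four components interact in a single turn is shown below. The restaurant-booking slots, the keyword-matching "NLU" and the hand-written policy rules are hypothetical placeholders chosen for illustration; real systems learn these parts from data.

# Toy task-oriented dialogue turn: NLU -> state tracking -> policy -> NLG.
SLOTS = ("cuisine", "area")  # hypothetical slots for a restaurant-booking task

def nlu(utterance):
    """Parse the utterance into predefined semantic slots (keyword matching)."""
    parsed = {}
    if "italian" in utterance.lower():
        parsed["cuisine"] = "italian"
    if "centre" in utterance.lower() or "center" in utterance.lower():
        parsed["area"] = "centre"
    return parsed

def track_state(state, parsed):
    """Merge the new turn into the dialogue state."""
    state.update(parsed)
    return state

def policy(state):
    """Choose the next action based on the current dialogue state."""
    for slot in SLOTS:
        if slot not in state:
            return ("request", slot)
    return ("offer", state)

def nlg(action):
    """Map the selected action to a surface form."""
    kind, arg = action
    if kind == "request":
        return f"What {arg} would you like?"
    return f"I found an {arg['cuisine']} restaurant in the {arg['area']}."

state = {}
for user_turn in ["I want Italian food", "somewhere in the centre"]:
    state = track_state(state, nlu(user_turn))
    print(nlg(policy(state)))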
2.4. Spoken dialogue system components.
Spoken dialogue systems are indeed complex systems, incorporating a wide range of speech and language technologies, such as speech recognition, natural language understanding, dialogue management, natural language generation and speech synthesis. These technologies not only have to operate together in real time, but they also have to operate together with a user, who may have individual needs and behaviours that should be taken into account.

The speech recogniser (ASR) takes a user's spoken utterance (1) and transforms it into a textual hypothesis of the utterance (2). The natural language understanding (NLU) component parses the hypothesis and generates a semantic representation of the utterance (3), normally without looking at the dialogue context. This representation is then handled by the dialogue manager (DM), which looks at the discourse and dialogue context to, for example, resolve anaphora and interpret elliptical utterances, and generates a response on a semantic level (4). The natural language generation (NLG) component then generates a surface representation of the utterance (5), often in some textual form, and passes it to a text-to-speech synthesizer (TTS) which generates the audio output (6) to the user.
2.5. Machine Translation.
Machine translation refers to software that can perform instant translation of content from one language to another. The escalating growth of mobile and social media, along with the expanding e-commerce industry, are major reasons contributing to the need for translation and localization services. Although the concepts behind machine translation technology and the interfaces used to access it are relatively simple, the science and technologies behind it are extremely complex and bring together several leading-edge technologies, in particular deep learning (artificial intelligence), big data, linguistics, cloud computing, and web APIs.
2.5.1. Rule-Based Machine Translation Technology.
Rule-based machine translation relies on countless built-in linguistic
rules and millions of bilingual dictionaries for each language pair.
2.5.2. Statistical Machine Translation Technology.
Statistical machine translation utilizes statistical translation models whose parameters stem from the analysis of monolingual and bilingual corpora. Building statistical translation models is a quick process, but the technology relies heavily on existing multilingual corpora. A minimum of 2 million words for a specific domain, and even more for general language, are required.
2.5.3. Rule-Based Machine Translation vs. Statistical Machine Translation.
The advantage of RBMT is that a good engine can translate a wide range of texts without the need for large bilingual corpora, as in statistical machine translation. However, the development of an RBMT system is time-consuming and labor-intensive and may take several years for one language pair.
The key advantage of statistical machine translation is that it eliminates the need to handcraft a translation engine for each language pair and create linguistic rule sets. With a large enough collection of texts, you can train a generic translation engine for any language pair and even for a particular industry or domain of expertise. Statistical machine translation is the core of the systems used by Google Translate and Bing Translator, and is the most common form of MT in use today.

Rule-based machine translation vs. statistical machine translation:
Quality: consistent and predictable quality (RBMT) vs. unpredictable translation quality (SMT).
Out-of-domain texts: good out-of-domain translation quality vs. poor out-of-domain quality.
Grammar: knows grammatical rules vs. does not know grammar.
Resources: high performance and robustness vs. high CPU and disk space requirements.
Consistency: consistency between versions vs. inconsistency between versions.
Fluency: lack of fluency vs. good fluency.
Exceptions: hard to handle exceptions to rules vs. good at catching exceptions to rules.
Costs: high development and customization costs vs. rapid and cost-effective development, provided the required corpus exists.
2.5.4. Neural Machine Translation.
Neural machine translation, or NMT for short, is the use of neural network models to learn a statistical model for machine translation. The key benefit of the approach is that a single system can be trained directly on source and target text, no longer requiring the pipeline of specialized systems used in statistical machine learning.
The steps a neural network translation goes through are the following:
Each word, or more specifically the 500-dimension vector representing it, goes through a first layer of neurons that encodes it into a 1000-dimension vector.
The process is repeated several times, each layer allowing better fine-tuning of this 1000-dimension representation of the word within the context of the full sentence.
The final output matrix is then used by the attention layer, which uses both this final output matrix and the output of the previously translated words to decide which word from the source sentence should be translated next.
The decoder (translation) layer translates the selected word into its most appropriate target-language equivalent.
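The attention step described above can be made concrete with a small numerical sketch. The dimensions here (four source words, 8-dimensional vectors instead of the 500/1000-dimensional ones mentioned in the text) are scaled down for readability, and the random vectors merely stand in for real encoder outputs; the point is only to show how attention weights over the source words are computed and combined, not to reproduce any particular NMT system.

# Toy attention: score each encoded source word against the decoder state,
# turn the scores into weights with a softmax, and build a context vector.
import numpy as np

rng = np.random.default_rng(0)
encoder_outputs = rng.normal(size=(4, 8))   # 4 source words, 8-dim encodings
decoder_state = rng.normal(size=(8,))       # state after the previous target word

scores = encoder_outputs @ decoder_state    # one score per source word
weights = np.exp(scores - scores.max())
weights /= weights.sum()                    # softmax: attention weights sum to 1

context = weights @ encoder_outputs         # weighted mix of source encodings

print("attention weights:", np.round(weights, 3))
print("most attended source position:", int(weights.argmax()))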
2.5.5. Measuring Quality of MT.
Various automatic evaluation methods are available to measure the similarity between an MT translation and a translation produced by a human translator. Some examples:
Bilingual Evaluation Understudy (BLEU) computes the n-gram precision rather than the word error rate.
Position-independent error rate (PER) calculates the word error rate by treating each sentence as a bag of words and ignoring the word order.
Metric for Evaluation of Translation with Explicit Ordering
(METEOR) takes stemming and synonyms into consideration.
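As an illustration of how such automatic scores are computed, the sketch below implements the bag-of-words comparison behind PER in a simplified form (single reference, unigram counts, no separate length penalty); BLEU and METEOR follow the same pattern with n-grams, stemming and synonym matching added.

# Simplified position-independent error rate: compare hypothesis and reference
# as bags of words, ignoring word order.
from collections import Counter

def simple_per(hypothesis, reference):
    hyp = Counter(hypothesis.lower().split())
    ref = Counter(reference.lower().split())
    matches = sum(min(hyp[w], ref[w]) for w in ref)
    ref_len = sum(ref.values())
    hyp_len = sum(hyp.values())
    # Words missing from the hypothesis or inserted into it count as errors.
    errors = max(ref_len, hyp_len) - matches
    return errors / ref_len

print(simple_per("the cat is on mat", "the cat is on the mat"))  # ~0.17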
2.6. Information retrieval.
Information retrieval is the activity of obtaining information resources relevant to an information need from a collection of information resources, and the part of information science that studies this activity. The objective of such processing is to facilitate rapid and accurate search of text based on keywords of interest.
2.6.1. Text Information Retrieval.
One of the most common and well-known applications of information retrieval is the retrieval of text documents from the internet. With its recent growth, the internet is fast becoming the main medium of communication for business and academic information. Thus it is essential to be able to tap the right document from this vast ocean of information.
A search engine is a server that sends a robot (crawler) to surf the Internet on its own and capture the title, keywords and content of the pages that make up websites. All pages found are then recorded in a database. Pages may also be registered manually, by a person, through a form. The moment a user searches with the engine for a specific phrase or word, the search engine looks into this database and, depending on certain priority criteria, creates a list of results that it displays as a results page; usually the interface is also a web page accessed through an address (a minimal sketch of this index-and-rank flow appears after the optimization steps below). Search engine optimization is basically the intervention made in the source code of the pages, aimed at developing and focusing on the keywords that are representative of the site's object of activity, so that the site obtains top positions in the search results and, consequently, more visitors. Search engines (Google, Yahoo, MSN, AltaVista, Alexa, Jeeves, etc.), in continuous competition, use their own methods to index a site. Some search engines focus on the text content of the site; others read the meta tags where site information is found, but most search engines use a combination of page content, meta tags, link popularity, etc. to determine the site's importance and placement in their listings. Optimization steps:
Keyword optimization: the keyword optimization step is the most important, keywords being the ones that bring success to a website on the internet. Keywords cannot be chosen randomly and should be researched carefully so that they are representative of the site's object of activity. Keyword optimization consists of thorough analysis and research; keywords entered correctly in the page text and in the source code of the pages will result in an increase in visitor traffic.
Optimize the page title: the title of the page is one of the most important elements of a web page. Title optimization will bring extra targeted visitors, interested in what the site sells, offers, shows, etc. Optimizing the title of a page is also a stage that requires analysis and study.
Optimize the site text: the text optimization step consists of introducing special tags into the source code of the pages so that certain words or even sentences become more important than the rest of the text. At this optimization stage the text on the page can also be reassessed.
Optimize the site links: link optimization can consist of inserting links or removing some from the page. Linking pages together is one of the most important elements in the optimization process.
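The crawl-index-query flow described before the optimization steps (pages recorded in a database, then matched and ranked against the user's phrase) can be sketched with a tiny in-memory inverted index. The pages and the ranking rule used here (a simple count of matching query words) are invented for illustration; real engines combine many more signals, as noted above.

# Tiny in-memory "search engine": index page text, then rank pages for a query.
from collections import defaultdict

pages = {  # hypothetical crawled pages: URL -> captured title + content
    "example.org/iot": "internet of things sensors and connectivity",
    "example.org/bigdata": "big data analytics and storage",
    "example.org/nlp": "natural language processing and text analytics",
}

index = defaultdict(set)          # word -> set of URLs containing it
for url, text in pages.items():
    for word in text.split():
        index[word].add(url)

def search(query):
    words = query.lower().split()
    # Rank by how many of the query words a page contains.
    scores = {url: sum(url in index[w] for w in words) for url in pages}
    return [url for url, s in sorted(scores.items(), key=lambda kv: -kv[1]) if s]

print(search("text analytics"))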
Google vs. Bing
Both sites look pretty similar when it comes to basic search results. The main differences between Google and Bing:
Bing's video search is significantly better than Google's. Instead of giving you a vertical list of videos with small thumbnails, it gives you a grid of large thumbnails that you can click to play without leaving Bing.
Bing gives more autocomplete suggestions than Google does in most cases: Google only gives four, while Bing gives eight.
Google's shopping suggestions show up more often than Bing's do, and they're generally much better.
Google's Image Search interface feels a bit smoother to use, though Bing has one or two more advanced options, such as Layout, and it lets you remove certain parts of your search term with one click.
2.6.2. Multimedia Information Retrieval.
In this era of information overload, the amount of information available to us is simply so large that it is virtually impossible for us to deal with it in an efficient manner. One solution to this problem is to set up databases for multimedia data. Hundreds of television and radio broadcasts would then be covered by a database application which keeps track of the information available.
3. Readability Metrics
For as long as people have originated, shared, and studied ideas through written language, the notion of text difficulty has been an important aspect of communication and education. As part of this systematic approach, text readability has been more formally defined as the sum of all elements in textual material that affect a reader's understanding, reading speed, and level of interest in the material (Dale & Chall, 1949). These elements may include features such as the complexity of sentence syntax; the semantic familiarity to the reader of the concepts being discussed; whether there is a supporting graphic or illustration; the sophistication of logical arguments or inference used to connect ideas; and many other important dimensions of content. In addition to text characteristics, a text's readability is also a function of the readers themselves: their educational and social background, interests and expertise, and motivation to learn, as well as other factors, can play a critical role in how readable a text is for an individual or population.
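The text-side elements listed above (sentence complexity, word familiarity, and so on) are typically approximated in practice with simple surface statistics. The sketch below computes two such statistics, average sentence length and average word length, on an invented snippet; how these numbers are combined into a readability score is formula-specific and is not prescribed by the text.

# Simple surface statistics often used as inputs to readability formulas.
import re

def surface_features(text):
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    avg_sentence_len = len(words) / max(len(sentences), 1)          # words per sentence
    avg_word_len = sum(len(w) for w in words) / max(len(words), 1)  # characters per word
    return {"avg_sentence_length": avg_sentence_len,
            "avg_word_length": avg_word_len}

sample = ("Readability depends on the text and on the reader. "
          "Longer sentences and rarer words usually make a text harder.")
print(surface_features(sample))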
3.1. Automated Readability Assessment. After many studies with different methodologies, there is no consensus on which statistical learning model is more appropriate for the task of readability prediction. The choice of learning technique often depends on multiple factors, such as the nature of the annotated reading difficulty of the training data, the audience and the specific application.
3.1.1. Readability assessment as a machine learning problem.
