
FACULTY OF AUTOMATION AND COMPUTER SCIENCE
COMPUTER SCIENCE DEPARTMENT

Semantic Oriented Aspect-Based Opinion Mining
Using Dependency Relations

LICENSE THESIS

Graduate: Flavia-Gabriela REBREAN

Supervisor: Prof. S.L. Dr. Ing. Emil-Stefan CHIFU

2018

FACULTY OF AUTOMATION AND COMPUTER SCIENCE
COMPUTER SCIENCE DEPARTMENT

DEAN,                                        HEAD OF DEPARTMENT,
Prof. dr. eng. Liviu MICLEA                  Prof. dr. eng. Rodica POTOLEA

Graduate: Flavia-Gabriela REBREAN

LICENSE THESIS TITLE

1. Project proposal: The purpose of the project is to classify opinions (sentiments)
based on dependency relations extracted from different documents, using an
unsupervised neural network as the opinion classifier.

2. Project contents: Presentation page, advisor's evaluation, title of chapter 1,
title of chapter 2, ..., title of chapter n, bibliography, appendices.

3. Place of documentation: Technical University of Cluj-Napoca, Computer
Science Department

4. Consultants:

5. Date of issue of the proposal: May 1, 2018

6. Date of delivery: September 10, 2018

Graduate: ____________________________

Supervisor: ____________________________

FACULTY OF AUTOMATION AND COMPUTER SCIENCE
COMPUTER SCIENCE DEPARTMENT


Declaration on own responsibility regarding
the authenticity of the license thesis

The undersigned ________________________________________________________
_________________________________________________________________,
identified with _______________ series _______ no. ___________________________,
CNP _______________________________________________, author of the thesis
________________________________________________________________________
________________________________________________________________________
____________________________________________, prepared in view of taking the
final examination of the license studies at the Faculty of Automation and
Computers, Specialization ________________________________________ of the
Technical University of Cluj-Napoca, session _________________ of the academic year
__________, declare on my own responsibility that this work is the result of my own
intellectual activity, based on my own research and on information obtained from sources
that have been cited in the text of the thesis and in the bibliography.
I declare that this thesis does not contain plagiarized portions and that the
bibliographic sources have been used in compliance with Romanian legislation and with
the international conventions regarding copyright.
I also declare that this thesis has not been presented before another license
examination committee.
In case false statements are later found, I will bear the administrative sanctions,
namely the annulment of the license examination.

Date

_____________________                              Name, Surname

                                   _______________________________

                                             Signature

Table of Contents

Chapter 1. Introduction
1.1. Project context

Chapter 2. Project Objectives

Chapter 3. Bibliographic Research
3.1 Self-organized maps
3.2 SOM, an unsupervised neural network
3.3 The learning algorithm
3.4 How to view clusters

Chapter 4. Analysis and Theoretical Foundation
4.1 Web mining based on self-organized maps
4.1.1 Representation of words and documents
4.1.2 Architecture of the web mining system
4.1.3 Implementation of the web mining system
4.2 Self-organized maps that grow hierarchically
4.2.1 GHSOM, a hierarchical SOM neural network
4.2.2 The training process
4.2.3 The thresholds for convergence
4.3 Enrich-GHSOM neural network
4.3.1 The Enrich-GHSOM Learning Algorithm
4.3.2 Architecture and implementation
4.3.3 Internal build-up of the initial arborescent state
4.3.4 Evaluation of the classification of Enrich-GHSOM
4.3.5 Transformation from the is_a format to the .unit format
4.4 Text-based enrichment of taxonomies
4.4.1 Learning ontologies
4.4.2 Taxonomic enrichment based on text
4.4.3 Other approaches to enriching taxonomies
4.4.4 Classification of terms based on distributional similarity
4.4.5 Degree of automation
4.5 A neural model for unsupervised taxonomy enrichment
4.5.1 Extraction of terms
4.5.2 Linguistic analysis
4.5.3 Identifying terms
4.5.4 Counting terms
4.5.5 Taxonomy enrichment
4.5.6 Centroid vector

Chapter 5. Detailed Design and Implementation
5.1 Lex and YACC
5.2 Stanford Parser
5.3 Implementing lexers and parsers
5.4 GHSOM unit output file

Chapter 6. Testing and Validation
6.1 Validation
6.2 Testing

Chapter 7. Conclusions
Bibliography
Appendix 1

Chapter 1. Introduction

Over the years, social media sites have become more and more popular.
Millions of people spend much of their time searching the internet. When they decide to
buy a new product, they research it even more thoroughly, looking for the pros and cons
of buying it. Reviews are of utmost importance: they express feelings about a certain
product, and the writer's attitude may be positive or negative. Sentiment analysis is the
process of computationally identifying and classifying the opinions expressed in such
reviews.
Our goal is to classify opinions (sentiments) based on dependency relations
extracted from different documents, using a neural network.

1.1. Project context

If we classify words, nouns or nominal groups, their vectors are transposed with
respect to the document vectors, since both are extracted from the same word/document
occurrence matrix: in the first case (words, nouns or nominal groups) they are the rows,
and in the second case (the small documents) they are the columns, or vice versa,
depending on how the matrix is oriented.
Following the same idea, in our case the rows are no longer words, nouns or
nominal groups, but pairs (aspect, opinion-bearing word), and the matrix counts the
number of occurrences of such a pair in the various documents.
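To make this representation concrete, the short C fragment below builds a toy occurrence
matrix whose rows are (aspect, opinion-bearing word) pairs and whose columns are
documents. The pairs and the counts are invented for illustration only; they are not taken
from the real corpus used in this thesis.

#include <stdio.h>

/* Illustrative sketch: an occurrence matrix whose rows are
 * (aspect, opinion-bearing word) pairs and whose columns are documents.
 * Row i is the distributional vector of pair i; column d is the vector
 * of document d. All values below are invented example data. */

#define NUM_PAIRS 3
#define NUM_DOCS  4

struct pair {
    const char *aspect;    /* e.g. "image_quality" */
    const char *opinion;   /* e.g. "excellent"     */
};

int main(void)
{
    struct pair pairs[NUM_PAIRS] = {
        { "image_quality", "excellent" },
        { "battery",       "poor"      },
        { "design",        "awkward"   }
    };

    /* counts[i][d] = how many times pair i occurs in document d */
    int counts[NUM_PAIRS][NUM_DOCS] = {
        { 2, 0, 1, 0 },
        { 0, 3, 0, 1 },
        { 1, 0, 0, 2 }
    };

    for (int i = 0; i < NUM_PAIRS; i++) {
        printf("(%s, %s):", pairs[i].aspect, pairs[i].opinion);
        for (int d = 0; d < NUM_DOCS; d++)
            printf(" %d", counts[i][d]);
        printf("\n");
    }
    return 0;
}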
We are going to use an unsupervised neural network as the opinion classifier. The
learning algorithm of a neural network can be either supervised or unsupervised. An
unsupervised neural network has no target outputs, so we cannot predict in advance what
the result of the learning process will look like.
The Self-Organizing Map (SOM) is a very popular model of unsupervised neural
network. The aim of the SOM is to analyze high-dimensional data. This type of
unsupervised neural network is a data mining method used to visualize complex data sets
of large volume. The results are presented as a graphical representation of the data
clustering. The purpose of cluster analysis is to place objects into clusters or groups so
that the objects within a given cluster are similar to each other in some way and different
from the objects in other clusters. The neural learning engine is in charge of the clustering
process. The advantages of this type of clustering are noise tolerance, robustness, and a
higher speed compared to other clustering methods, such as hierarchical clustering. The
purpose of the SOM is to map a large-volume data set into a two-dimensional output
space, which corresponds to a graphical display of the data. The map is a two-dimensional
rectangular grid with a neuron in each node. A cluster on a SOM map is a contiguous
group of neurons in an area of the map in which all neurons host similar data items.

Figure 1. An everyday example of a cluster

The main advantage of self-organized maps is their capability to organize large
amounts of data in a way that brings to light the structure and the latent relationships of
the data set. This project presents its own web mining system, based on the SOM map.
Web mining is a data mining process applied to web documents: we extract the web
documents and discover patterns in them. The map is organized so that the user can
browse through it.
Our web mining system can handle a large collection of textual documents
(a corpus). Documents with similar content occupy the same position or neighboring
positions on the map, depending on the degree of semantic similarity between them. The
user can navigate the document map in order to obtain relevant documents, and can
quickly discover what is new in a cluster because the cluster gathers all the related
information.
The SOM-based model we are going to use is the GHSOM (Growing Hierarchical
SOM), a more complex and well-known model used in web mining. An important
precondition for web mining is having a solid ontology.
In previous work, the GHSOM model was extended to Enrich-GHSOM, which is
used in ontology engineering. We are using this model in order to enrich an ontology
belonging to one domain. Enriching a taxonomy means adding new ontological concepts
by attaching them as daughter nodes of existing ontology nodes. Taxonomy enrichment
works as a process of classifying the extracted terms against the taxonomy of a given
ontology, by associating each term or named entity with one taxonomic target node. The
association is based on similarity, according to a similarity metric defined in the
distributional vector space. The SOM map of every node can be expanded horizontally by
inserting a row or a column of neurons. The expansion of the tree hierarchy is carried out
from top to bottom, advancing from the root to the leaves.

A taxonomic tree is an essentially symbolic knowledge structure. It is converted
into a neural representation as the initial state of the hierarchical self-organized map; this
is a symbolic-to-neural conversion. The taxonomy enrichment process itself is carried out
by the unsupervised training of the neural network.
Because our system operates on the taxonomic structure of an ontology,
conversions between the symbolic and the neural representations were realized in both
directions, in keeping with the hierarchical structure of the self-organized Enrich-GHSOM
neural network on which the system is based.

Chapter 2. Project Objectives

Figure 2: An example ontology for a digital camera

Testimonials, or reviews, are the main trigger factor for 20-50% of all purchases
made, even surpassing the price. The position of the testimonials matters, just as the
position of every element on the page counts. And no, there is no single universal
placement that will give the best results every time for everyone: for some visitors a
higher position works better, for others a lower one. Some visitors will react to a certain
title, while others will not.
It has turned out that the photos of those who leave reviews do more than just
increase the confidence in that review. Big names in the industry, including Google, use
large and clear pictures to show an overview, and in the software industry it is common
practice to display the reviewers' photos.
For even greater credibility, the adjectives influence the entire outcome of a
review. Their impact on customers is well known and they are extremely hard to forge, so
they offer the greatest credibility, which also leads to a higher conversion rate. The
stronger the degree of comparison of the adjective, the more credible the review is, for
example: "extraordinary images".
Public figures benefit from greater trust, which is why the reviews they leave are
more credible. Getting a review from a public person does not necessarily have to be very
difficult to achieve; all you have to do is ask. You will be surprised how many things
people are willing to offer if they are asked.

When analyzing the opinions regarding a certain aspect, the sentences are
represented as bags-of-words; in our project these words are extracted from the provided
XML files. Only the nouns and the adjectives are of interest to us, because they have the
most impact on the outcome of a review.
Believe it or not, a few (really just a few) negative reviews help. All trader scores
are made up of positive and negative reviews, and receiving a negative review increases
the credibility of the positive ones.
As regards sentiment analysis (also called opinion mining, review mining,
appraisal extraction or attitude analysis), in an attempt to encompass the whole body of
work that has been carried out in the field, we can define it as the task of detecting,
extracting and classifying opinions, sentiments and attitudes.

Chapter 3. Bibliographic Research
3.1 Self-organized maps
The SOM architecture is one of the most popular neural network models. SOM is
an unsupervised machine learning model and a customizable scheme for visualizing
knowledge. It is also considered a clustering method.
Each input data item is represented as a vector with numerical attributes in a
vector space. The vector space considered is R^n; in this space any vectors can be added
and any vector can be scaled. The similarity between input data items is defined as a
similarity between the vector representations of those data items, according to a metric of
similarity between vectors. Each data item in the input space is mapped onto one of the
neurons of the map. The result is a reduced-dimension space, represented as a rectangular
SOM map consisting of a two-dimensional grid with a neuron in each node.
The SOM map discovers the latent relationships in the input data set. Data inputs
that are similar are placed in neighboring neurons of a cluster on the SOM map.
SOM is a data mining method and a representation for complex data of large
volume. SOM is capable of describing different aspects of a phenomenon in any field; the
only condition is that the field can be represented as vectors with numerical attributes.
SOM clusters words that are similar to each other: instead of working with, say,
687 distinct words, one works with 110 classes of mutually similar words. We look for
words that are quasi-synonyms, similar to each other, and count them together. The
document vectors will then no longer have 687 components (columns), but only the 110
merged ones corresponding to the classes of similar words. Similar documents should
likewise be clustered, so that instead of 238 documents there remain, say, 100 clusters.
The size m x n of the SOM, i.e., of the rectangular map, is thus made smaller; not
100 x 150 but, for example, 120 x 30, in which case fewer classes of similar words, or in
our case of similar documents, come out.

3.2 SOM, an unsupervised neural network
Learning takes place by using unlabeled input data, as in any unsupervised
machine learning process. The SOM map consists of a regular two-dimensional
(rectangular) grid of processing units called neurons. Each neuron has an associated
weight vector.
Learning in self-organized maps is an unsupervised regression process. The fully
trained map is a representation of the entire input data space, with optimal precision. This
representation uses a small set of weight vectors associated with the neurons on the map.
The weight vectors of all neurons have the same number of attributes as the training
vectors, namely domain-specific numerical attributes.


3.3 The learning algorithm
In paper [1], the author presents the algorithm for processing an input vector. At
each iteration t, the algorithm processes one input vector x(t). Among the neurons of the
map, the winning neuron c is searched for as the neuron that best matches the current
input vector, i.e., the neuron whose weight vector m_c(t) is the most similar to x(t)
according to a certain similarity metric:

||x(t) - m_c(t)|| <= ||x(t) - m_i(t)||, for every neuron i of the map.

The current input vector is mapped onto the chosen neuron. Then the weight vectors of all
the map neurons, or of a subset consisting of the neurons around the winning neuron c,
i.e., the neurons located in the vicinity of the chosen neuron, are adjusted towards the
input vector.
The similarities between the data items in the multidimensional data space are thus
also found in the two-dimensional space of the SOM map, by mapping similar items onto
neurons that are geographically close on the SOM map.
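As an illustration of this competitive step, the following C sketch finds the winning
neuron of a rectangular map for one input vector using the Euclidean distance and then
adapts the weight vectors of the winner and of its grid neighbors towards the input. The
map size, learning rate and neighborhood radius are arbitrary example values, not the
ones used in the actual system, and the fragment is meant to be called from a larger
training loop that first initializes the weights randomly.

#include <math.h>
#include <stdlib.h>

#define ROWS 10
#define COLS 15
#define DIM  300              /* number of attributes of each vector */

static double weights[ROWS][COLS][DIM];   /* weight vector of each neuron */

/* Squared Euclidean distance between an input vector and a weight vector. */
static double sq_dist(const double *x, const double *w)
{
    double d = 0.0;
    for (int k = 0; k < DIM; k++) {
        double diff = x[k] - w[k];
        d += diff * diff;
    }
    return d;
}

/* One training iteration: map x onto the most similar neuron (the winner)
 * and adapt the winner and its neighbors towards x. */
void som_train_step(const double *x, double learning_rate, int radius)
{
    int win_r = 0, win_c = 0;
    double best = sq_dist(x, weights[0][0]);

    /* competitive step: search the best-matching neuron */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            double d = sq_dist(x, weights[r][c]);
            if (d < best) { best = d; win_r = r; win_c = c; }
        }

    /* adaptation step: adjust the neurons in the vicinity of the winner */
    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            if (abs(r - win_r) > radius || abs(c - win_c) > radius)
                continue;
            for (int k = 0; k < DIM; k++)
                weights[r][c][k] += learning_rate * (x[k] - weights[r][c][k]);
        }
}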

Figure 3. Difference between rectangular and hexagonal topology

In the case of a rectangular map topology, a neuron has four direct neighbors,
provided that it is not situated on one of the map's extremities, i.e., not on one of the rows
or columns forming the sides of the map. Another type of topology is the hexagonal
topology, in which a neuron has six direct neighbors. The advantage of the hexagonal
topology is the greater number of direct neighbors adjacent to a winning neuron compared
to the rectangular topology. Consequently, a larger number of neurons are directly
affected by the weight vector adaptation, which leads to a finer approximation of the input
data space.
After training a SOM map, the weight vector of each neuron of the map is a model
learned from all the data items mapped onto that neuron: at the end of the training, the
weight vector results as an average of the vectors of all these data items. The result is a
map that preserves the topology of the input space. Neurons with similar weight vectors
are geographically close on the SOM map, while neurons with dissimilar weight vectors
are far apart from each other on the map. On a SOM document map, these proximities are
due to the semantic similarity of the documents mapped onto those neighboring neurons.
The weight vector of a neuron is similar to the vectors of the items mapped onto that
neuron, because it represents an average of all those vectors. A set of neighboring neurons
whose weight vectors are very similar forms a cluster.

3.4 How to view clusters

The problem is to discover a place on the map whose neurons contain all the data
in a cluster; we must find the extremities of the cluster on the map. This is done by
applying the U-matrix algorithm to the SOM map. The U-matrix algorithm induces shades
of gray on the displayed map image to express how similar or dissimilar the adjacent
neurons are. The lighter the shade of gray, the more similar the weight vectors of adjacent
neurons are; the darker the shade, the bigger the dissimilarity. The higher (i.e., the darker)
the hill separating two adjacent clusters, the more dissimilar the two clusters are in the
multidimensional input space.
For each neuron of the SOM map, we calculate the sum of the dissimilarities
between its weight vector and the weight vectors of its neighboring neurons in each
direction. The dissimilarity metric is considered to be the Euclidean distance.
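A minimal sketch of this idea follows, assuming the same rectangular weight grid as in the
previous fragment: for every neuron we sum the Euclidean distances to its four direct
neighbors, and the resulting value can then be rendered as a shade of gray (a larger value
means a darker shade, i.e., a hill between clusters). The sizes are example values only.

#include <math.h>

#define ROWS 10
#define COLS 15
#define DIM  300

static double euclid(const double *a, const double *b)
{
    double d = 0.0;
    for (int k = 0; k < DIM; k++) {
        double diff = a[k] - b[k];
        d += diff * diff;
    }
    return sqrt(d);
}

/* Fill umat[r][c] with the summed dissimilarity between neuron (r, c)
 * and its direct neighbors; this is the value displayed as a gray shade. */
void compute_u_matrix(double weights[ROWS][COLS][DIM],
                      double umat[ROWS][COLS])
{
    static const int dr[4] = { -1, 1, 0, 0 };
    static const int dc[4] = { 0, 0, -1, 1 };

    for (int r = 0; r < ROWS; r++)
        for (int c = 0; c < COLS; c++) {
            double sum = 0.0;
            for (int n = 0; n < 4; n++) {
                int rr = r + dr[n], cc = c + dc[n];
                if (rr < 0 || rr >= ROWS || cc < 0 || cc >= COLS)
                    continue;
                sum += euclid(weights[r][c], weights[rr][cc]);
            }
            umat[r][c] = sum;
        }
}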

Chapter 4. Analysis and Theoretical Foundation
4.1 Web mining based on self-organized maps
Applying self-organized maps to a data space consisting of natural-language texts means
doing data mining on data in text format. A self-organized SOM map has the role of
clustering the numerical vectors given as input and of producing a topological ordering.
The data items to be processed consist of strings of words. The SOM map will group the
words into categories. We must take into account that phonetic similarity has no
connection to the meaning of words.

4.1.1 Representation of words and documents
We have clustered words with the help of SOM maps. We have counted the
occurrences of each word in each document from a corpus of documents. Each word is
represented as a vector in a vector space R^n.
Most of the time we get a symmetric situation: the semantic content of each
document is represented as a function of the meanings of the words in the document
(bag-of-words). In order to be part of the vocabulary, a word must have at least one
occurrence in the corpus, i.e., at least one occurrence in a document of the corpus.
The vector space in which the vector of each word is defined is a distributional
vector space. The hypothesis of distributional similarity states that the meanings of
semantically similar words are expressed by similar vectors in the distributional vector
space. The vector space of documents is the same as the distributional vector space. We
must take into account that there are semantic relationships not only between words, but
also between words and the contexts in which they appear.
In the context of statistical methods for text mining, the term co-occurrence is
used: when counting the words that appear together with a given word, we say that the
co-occurrences of that word are counted.
There are various metrics of vector similarity. In order to use a vector
representation of a word, we must represent the meaning of the word as a measure of the
use of that word in different contexts. A semantic category of words gets associated with
a neuron on the map of word categories, the neuron hosting all the words of that category.
If the attributes of the vector representation of a word encode grammatical features
of the word, then the SOM map displays grammatical categories of words.

4.1.2 Architecture of the web mining system

The architecture of our web mining system is implemented as a pipeline with
multiple preprocessing and machine learning stages. There are two main phases, each of
them having as its final purpose the organization of a self-organized map: a SOM map of
word categories and a SOM document map. The results of the first phase are used both in
the training of the word category map and in the training of the document map. This first
phase is the word counting phase, which comes in direct contact with the text of the
corpus. At this stage, the words are first identified, as strings of letters, in the text
documents of the corpus.
Then the word counting takes place, resulting in a word/document occurrence
matrix. This matrix keeps the number of occurrences of each word in each document of
the corpus. From the word/document matrix, the vector representations of all the words
appearing in the corpus and the vector representations of all the corpus documents are
extracted. The rows of the word/document matrix are considered the word vectors, and
the columns of the same matrix are the document vectors.
The semantic representation learned in a neuron is an averaging of the meanings
of the words mapped onto it, which are similar to each other (bag-of-words). We also need
to take into account the dimensionality of the data items. In applications on large textual
data sets, a reduction of the dimensionality to 200-300 leads to the best results in terms of
acquiring the meanings of words: in a narrower or wider range, there are no more than
200-300 main concepts that give meaning to the different words, and more than 200-300
main concepts leads to noise in the semantic acquisition.
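The word counting step itself is implemented in this thesis with LEX and bash scripts (see
the next subsection). Purely as an illustration, the standard-C sketch below does the same
job for one document: it treats every maximal string of letters (optionally containing
hyphens) as a word and counts its occurrences in that document. The vocabulary size and
the file handling are deliberately simplified.

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <ctype.h>

#define MAX_WORDS 10000
#define MAX_LEN   64

static char vocab[MAX_WORDS][MAX_LEN];
static int  count[MAX_WORDS];      /* occurrences in the current document */
static int  vocab_size;

/* Increment the counter of a word, adding it to the vocabulary if new. */
static void add_word(const char *w)
{
    for (int i = 0; i < vocab_size; i++)
        if (strcmp(vocab[i], w) == 0) { count[i]++; return; }
    if (vocab_size < MAX_WORDS) {
        strncpy(vocab[vocab_size], w, MAX_LEN - 1);
        count[vocab_size++] = 1;
    }
}

/* Count the occurrences of each distinct word in one text document.
 * A word is a maximal sequence of letters, possibly containing hyphens. */
static void count_words(FILE *doc)
{
    char buf[MAX_LEN];
    int len = 0, ch;

    while ((ch = fgetc(doc)) != EOF) {
        if (isalpha(ch) || (ch == '-' && len > 0)) {
            if (len < MAX_LEN - 1)
                buf[len++] = (char)tolower(ch);
        } else if (len > 0) {
            buf[len] = '\0';
            add_word(buf);
            len = 0;
        }
    }
    if (len > 0) { buf[len] = '\0'; add_word(buf); }
}

int main(int argc, char **argv)
{
    if (argc < 2) { fprintf(stderr, "usage: %s document.txt\n", argv[0]); return 1; }
    FILE *doc = fopen(argv[1], "r");
    if (!doc) { perror("fopen"); return 1; }
    count_words(doc);
    fclose(doc);
    for (int i = 0; i < vocab_size; i++)
        printf("%s %d\n", vocab[i], count[i]);
    return 0;
}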

4.1.3 Implementation of the web mining system
Our web mining system is implemented in the C language and in bash scripts. We
also used the LEX lexical analyzer generator to implement the word counting step, a step
that reads all the documents in a corpus and counts the occurrences of the distinct words
in each of these documents.
An important problem was to discover and view clusters of documents on the
SOM document map within our web mining system. We must choose the right size for the
SOM document map in order to obtain a map with the highest visual expressiveness. The
map size means the total number of neurons on its rectangular grid. The granularity of a
self-organized map is defined by the average number of data items hosted in a neuron. If
a map is too small, i.e., if it has too few neurons, then it is too coarse; consequently, such
a map might hide some important differences that need to be detected for cluster
separation purposes. But if a map is too large, then, with too many neurons, it is too
detailed: apart from the differences important for cluster delimitation, such a map also
displays very small differences that are often not important for clustering.
The accuracy of a cluster is the proportion of the documents belonging to the
newsgroup associated with the cluster out of all the documents in the cluster
(Figure 3.3.1). The cluster coverage is the proportion of the documents of the associated
newsgroup that end up in the cluster, out of all the documents of that newsgroup
(Figure 3.3.2).

accuracy = correct / predicted

Figure 3.3.1 Accuracy of a cluster

cluster_coverage = correct / real

Figure 3.3.2 Cluster coverage formula

Self-organized maps are a consistent paradigm for web mining, as they define an
overview of a set of web documents. A SOM document map is an ordered semantic
layout of the documents in the set.
Our web mining system is a prototype system that carries out the clustering of the
text documents in a corpus, based on the criterion of semantic similarity between
documents.
In order to evaluate future experimental results, we must use a validation strategy
based on the comparison with a standard metric, e.g., the pair of measures cluster
accuracy and cluster coverage.
4.2 Self-organized maps that grow hierarchically
The main relationship between the terms of a thesaurus is the taxonomic
relationship (is-a). SOM maps have a limited capacity to discover and illustrate
hierarchical clusters in data sets. A solution to this problem is represented by hierarchical
SOM maps.

4.2.1 GHSOM, a hierarchical SOM neural network

GHSOM is an extension of the machine learning architecture based on
self-organized SOM maps. The GHSOM model consists of a set of SOM maps arranged
as nodes in a tree hierarchy. The SOM map in any tree node can grow horizontally during
training, by inserting a new row or a new column of neurons.
The quantization error of a neuron is the arithmetic mean of the dissimilarities
between its weight vector and the vectors of all the data items mapped onto the neuron
(mean quantization error).
The SOM map in any tree node can also grow vertically during training, giving
birth to daughter nodes. Expanding a neuron into a daughter SOM map works as a zoom
into the data subspace mapped onto the parent neuron. All the data items mapped onto the
parent neuron are propagated to the daughter map. New node expansions can continue
recursively on the daughter nodes, i.e., on all the nodes of the tree that are on the next
level of depth in the tree. Using a term specific to neural networks, the termination of the
GHSOM training is called convergence.


Figure 4: The GHSOM arborescent model

4.2.2 The training process

At the end of the training, the first SOM map remains the global root of the fully
trained GHSOM arborescent model. SOM training proceeds down the tree hierarchy,
starting from the global root, each depth level of the tree being considered successively
from top to bottom. For each SOM map, training involves a process in which the data
from the training set are mapped onto the neurons of the map through a competitive,
similarity-based process.
If after this process the mean quantization error of the data on the SOM map is
above the threshold τ1, then a new row or a new column of neurons is inserted into the
map. The new row or column is inserted next to the map neuron that holds the maximum
quantization error among all map neurons (the error neuron of the map). Among the four
neurons adjacent to the error neuron, the neighbor that is most dissimilar to the error
neuron in terms of the weight vector is then searched for, and the new row or column is
inserted between these two neurons.
The weight vector of each neuron in the inserted row or column is initialized with
the mean of the weight vectors of the two pre-existing neurons between which the new
neuron lies.
Following a new training run, the same data set is now distributed over a larger
number of neurons, because of the newly inserted row or column of neurons.
The process is repeated, expansion followed by re-training, until the mean
quantization error of the map data drops below the threshold. In fact, τ1 is a fractional
threshold relative to the mean quantization error of the data mapped onto the parent
neuron of the SOM map. Thus, the training of a SOM map ends when the inequality
below is satisfied. As one descends to lower levels in the GHSOM hierarchy, the lower
levels explain an ever finer portion of the data deviation, a deviation that becomes less
and less visible.
MQE ≤ τ1 * mqe_parent_neuron

Any current SOM map can have several daughter maps, each daughter originating
in a distinct neuron of the map. The birth of a daughter SOM map from a neuron i occurs
only if the threshold τ2 is exceeded for that neuron:

mqe_i > τ2 * mqe_0

The GHSOM training process starts from an initial configuration consisting of a
single SOM map with a single neuron that hosts the entire input data set (its mean
quantization error is the mqe_0 used above).
All the daughter SOM maps born in the current training epoch are on the next
level of depth and will be trained in the next epoch. Any leaf of the fully trained GHSOM
tree is a SOM map in which all neurons satisfy the threshold τ2.

4.2.3 The thresholds for convergence

In the GHSOM model, the threshold values τ1 and τ2 control the granularity of the
learned tree hierarchy in terms of depth and branching factor.
If τ1 has a large value, then, even after the complete training of the SOM maps, the
neurons in these trained maps remain with a large quantization error. Thus, it will be
harder to satisfy the threshold τ2, which comes into play immediately after the training of
these SOM maps. Conversely, regardless of the value of τ2, a small value for τ1 means a
threshold τ1 that is harder to satisfy; this results in many insertions of new rows and
columns during the training of the SOM maps, until the threshold τ1 is satisfied.
Each level of the tree representing a learned GHSOM model produces a more
detailed clustering of the data space compared to the parent level. The training set of a
SOM map is also partitioned by training the map, as a clustering over the map's neurons.
This algorithm corresponds to a top-down process of hierarchical clustering of the items
in the input data space, a process emulated by the GHSOM neural network.
The threshold τ1 is considered to be a fraction of the mean quantization error of the
data mapped onto the parent neuron of the SOM map being trained, a neuron located on a
SOM map on the level above. τ2 is considered to be a fraction of the mean quantization
error of the entire training set of the GHSOM network, regardless of the level on which
the current SOM map is trained. As the GHSOM maps descend to lower levels, the data
deviation becomes smaller and smaller.
In this way, the process eventually arrives at SOM maps without daughter nodes,
which represent the leaves of the final arborescence of a fully trained GHSOM network.
This phenomenon ensures the convergence of the GHSOM neural network, by growing
the arborescence only to a limited depth.
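To summarize where the two thresholds act, the small runnable C sketch below encodes
the two growth decisions as functions (hypothetical names and example numbers, not the
actual GHSOM source): a map keeps inserting rows or columns while its mean
quantization error is above τ1 times the mqe of its parent neuron, and a neuron gives birth
to a daughter map when its own mqe is still above τ2 times mqe_0.

#include <stdio.h>

enum map_action { MAP_CONVERGED, MAP_INSERT_ROW_OR_COLUMN };

/* Horizontal growth rule: training of a map ends when
 * MQE_map <= tau1 * mqe_parent. */
enum map_action horizontal_growth(double mqe_map, double tau1, double mqe_parent)
{
    return (mqe_map > tau1 * mqe_parent) ? MAP_INSERT_ROW_OR_COLUMN
                                         : MAP_CONVERGED;
}

/* Vertical growth rule: neuron j expands into a daughter map when
 * mqe_j > tau2 * mqe_0 (mqe_0 is the mqe of the single layer-0 neuron). */
int needs_daughter_map(double mqe_neuron, double tau2, double mqe_0)
{
    return mqe_neuron > tau2 * mqe_0;
}

int main(void)
{
    double tau1 = 0.6, tau2 = 0.05;          /* example threshold values    */
    double mqe_parent = 4.0, mqe_0 = 10.0;   /* example quantization errors */

    printf("map action: %s\n",
           horizontal_growth(3.1, tau1, mqe_parent) == MAP_CONVERGED
               ? "converged" : "insert a row or a column");
    printf("neuron expands vertically: %s\n",
           needs_daughter_map(0.9, tau2, mqe_0) ? "yes" : "no");
    return 0;
}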

4.3 Enrich-GHSOM neural network
We have extended the GHSOM model by adding the facility to force the growth of
the tree hierarchy from the top down along the branches of a predefined arborescent
hierarchy. The Enrich-GHSOM neural network performs a classification of data items
over a given hierarchical tree structure.
The classic GHSOM model could grow during training only by starting from a
single node. The descending growth process in our extended model starts from a given
initial tree structure, into which it inserts new nodes attached as successors to any of its
intermediate nodes and leaves. The Enrich-GHSOM neural network has the role of
enriching a predefined initial arborescence with these newly inserted nodes, hence the
name Enrich-GHSOM.

4.3.1. The Enrich-GHSOM Learning Algorithm

The algorithm has two inputs: a predefined initial tree structure and a training data
space consisting of vector-represented data items. Each node of the given arborescence is
represented as a numerical vector, namely the data vector of the predefined label of the
node.
We adopted the rectangular topology as the SOM map topology for all the SOM
nodes of the network, as in the GHSOM neural network. The training data items are
propagated downward through the given tree structure. When the propagation process
reaches a SOM node of the predefined tree, the neurons of that map which have a
predefined daughter node are initialized with the vectors of those daughter labels, while
the weight vectors of the neurons without a predefined daughter node on the same SOM
map are initialized with random values. The training data items are then classified onto
the initialized neurons of the map: data items that are similar to the predefined parent
neurons are each mapped onto the most similar such neuron. In the case of the predefined
initialized neurons, this does not give birth to new daughter maps; it only propagates the
training process to the pre-existing, predefined daughter nodes of these neurons. Each set
of training data mapped onto one of these predefined initialized neurons propagates one
level lower, to the predefined daughter SOM node of that neuron. The training continues
recursively on those predefined daughter SOM maps on the next level of depth in the
hierarchy.

Algorithm: Enrich-GHSOM Neural Network Training
------------------------------------------------
Input:  predefined initial arborescence;
        training data space;
Output: enriched arborescence;
begin
  level i = 0
  do {
    // the training epoch associated with level i:
    for all (SOM maps on level i) {
      // step 1:
      Train the SOM map.
      // SOM training converges by satisfying the threshold τ1.
      // step 2:
      for all (neurons j of the current SOM map) {
        if (neuron j has been predefined) {
          Promote the data set hosted in neuron j
          to the predefined daughter map of neuron j.
        } else {              // neuron j was initialized randomly
          if (mqe_j > τ2 * mqe_0) {
            Give birth from neuron j to a new daughter map,
            exactly as in GHSOM.
          }
        }
      }
    }
    i = i + 1
  } while (there is at least one SOM map on level i)
end
Figure 5: Enrich-GHSOM Neural Network Training

When training a SOM map of the predefined tree, the data items that are not
similar to any of the predefined parent neurons are mapped onto other neurons of the
same SOM map. Their propagation from top to bottom is no longer a propagation through
the predefined hierarchy; it creates new nodes, which enrich the predefined hierarchy.
In the first phase of each epoch, all the SOM maps on the level associated with
that epoch are trained. Obviously, only the SOM maps belonging to the predefined tree
can contain neurons that have been predefined. The exit condition of the main loop of the
training algorithm corresponds to the convergence of the Enrich-GHSOM neural network:
the network converges when the main loop reaches a depth level immediately below the
depth of the deepest leaf of the Enrich-GHSOM tree.
4.3.2 Architecture and implementation

The architecture of our Enrich-GHSOM neural model is implemented as a pipeline
linking several processing phases. The data items to be classified are the training data for
the neural network.
The initial hierarchical tree state of the Enrich-GHSOM neural network must be
specified in an ASCII file. At the end of the training process, the GHSOM model saves
the learned tree hierarchy in .unit files. Enrich-GHSOM uses the same external format for
the final state of the neural network, i.e., for the enriched tree hierarchy. So the format in
which Enrich-GHSOM saves its final state is the same as the one from which it reads its
initial state. After all, the initial state and the final state of the Enrich-GHSOM neural
network are both arborescences, with the difference that the final state is the initial
arborescence enriched with new nodes.
This also supports the idea of interrupting the learning process before the neural
network converges. The intermediate tree state reached by the neural network when the
training is interrupted is saved in a file with the same syntax. Later, learning can be
continued by restoring the arborescence saved in the file at the moment of the
interruption. For Enrich-GHSOM, this predefined hierarchy actually corresponds to an
arborescent hierarchy "already learned in some previous steps". In the classic GHSOM
model, the learning process could only ever start from a single node, and at the end of the
training that node became the root of the learned tree hierarchy.

4.3.3. Internal build-up of the initial arborescent state

The implementation of our Enrich-GHSOM neural network model is written in the
C language. We used the yacc and lex generators of syntactic and lexical analyzers to
parse and interpret the .unit files. The role of this analysis is to set the hierarchical neural
network internally into the desired initial tree state "read" from a .unit file. Such a .unit
file contains an enumeration of all the initial tree nodes as SOM maps. The second input
of the initial-arborescence construction step is precisely the vector representation of the
data set of the arborescence node labels. These vectors become the predefined initial
weight vectors in the internal representation of the initial arborescence state.
By sequentially reading the entire description in the .unit file, information about
each SOM node (map) is collected. A node may be the parent of other SOM nodes, which
originate from neurons of that node. During the analysis, each non-root SOM node is
identified twice: once when its own SOM node description is read, and once more when
the predefined initialization neuron from which it originates is described. All this
information is gathered, reaching complete information about each node; this includes,
for each node, information about its parent node and its daughter nodes. Having this
information, we build step by step the entire predefined initial arborescence.
The final tree state learned by the Enrich-GHSOM neural network is saved in a
file with the same syntax as the file from which the neural network reads its initial state.
The difference between the input and the output file is that the first is the predefined
initial arborescence that must be enriched, while the second contains the final
arborescence resulting from enriching the initial one by inserting new nodes.
The input .unit file contains the descriptions, as SOM maps, of all the predefined
initial arborescence nodes. Each such SOM node contains in its description an
enumeration of its neurons. Below are the relevant fragments of the input file from which
the internal representation of the initial Enrich-GHSOM arborescence has been
constructed. The map has the size 3x2, so it contains six neurons; this is the initial size of
the SOM map, which can grow during the training of the SOM map by inserting new
rows and columns of neurons.
Only the intermediate nodes of the predefined arborescence actually contain
predefined initialized neurons, since they are parent neurons of predefined daughters.
$TYPE rect
$XDIM 3
$YDIM 3
$POS_X 0
$POS_Y 0
$NODE_ID 1_1_0/0_0/0
$MAPPED_VECS
$NR_SOMS_MAPPED 0

$POS_X 1
$POS_Y 0
$NODE_ID 1_1_0/0_1/0
$MAPPED_VECS
design#and#usability+design_flaw###major_design_flaw+easy###use+interface+backlit###not+quick_spin###dial+button+lock-out_button###awkward+easy###buttons+weight+bulky###quite+great_feel###weight
$NR_SOMS_MAPPED 1
$URL_MAPPED_SOMS output/cuvinte_2_2_1_0.map

$POS_X 2
$POS_Y 0
$NODE_ID 1_1_0/0_2/0
$MAPPED_VECS
image#quality+blurry_pictures###blurry+excellent###image_quality+resolution+terrible###grain+4mp_resolution###4mp+light+problems###focusing+awesome###light_auto-correction+noise+lot###noise+less_noise###less
$NR_SOMS_MAPPED 1
$URL_MAPPED_SOMS output/cuvinte_3_2_2_0.map


$POS_X 0
$POS_Y 1
$NODE_ID 1_1_0/0_0/1
$MAPPED_VECS
lens+feels###fragile+quality###lens+focus+slow###focus+focus_mode###helpful
$NR_SOMS_MAPPED 1
$URL_MAPPED_SOMS output/cuvinte_4_2_0_1.map

$POS_X 1
$POS_Y 1
$NODE_ID 1_1_0/0_1/1
$MAPPED_VECS viewfinder+obstruct###viewfinder+awesome###optical_zoom
$NR_SOMS_MAPPED 1
$URL_MAPPED_SOMS output/cuvinte_5_2_1_1.map

Fragment of the input file corresponding to the predefined initial arborescence

The output file has a structure similar to that of the input file. In addition to the
input file, it contains, in the description of each neuron (under $MAPPED_VECS), an
enumeration of all the data vectors hosted in the neuron after the training process. The
weight vector of each neuron, resulting from the training, also reflects the weight
adjustments made during training.
The final state is just the arborescence of the initial state, enriched by the insertion
of new nodes. By comparing the two arborescence states, the enrichment performed by the
Enrich-GHSOM network can be observed; more specifically, the new nodes inserted
during training are revealed. For each newly inserted node, the place where it was
inserted is revealed, that is, the parent node of the predefined initial arborescence under
which the new node was attached as a child node.

4.3.4 Evaluation of the classification of Enrich-GHSOM

We have implemented a program that compares the input file and the output file.
To analyze these files we used the yacc and lex generators as well. The purpose is to find
an association, i.e., a mapping of each data item to a single node of the initial
arborescence; in short, to obtain a classification of the data items over the predefined
initial arborescence.
In the output file, each neuron of each SOM map (node) lists all the data items
hosted in that neuron at the end of the Enrich-GHSOM learning process. There is a
duplication, in the sense that the same subset of data is also stored in the parent neuron,
and this happens for all the SOM nodes of the Enrich-GHSOM arborescence. Because of
this, we could extract from the output file, for each data item, its top-down propagation
path through the predefined arborescence.
The path starts at the root and ends at the last node of the predefined arborescence
that was chronologically reached by the data item. Obviously, the final node of the path is
also the deepest node of the predefined arborescence among the nodes visited by the data
item; it can be an intermediate node or a leaf node of the predefined arborescence.
During the top-down propagation through the Enrich-GHSOM arborescence, each
data item first traverses, starting from the root, only SOM nodes of the predefined
arborescence, and then leaves the predefined arborescence. The important thing is the
moment when the data item leaves the predefined arborescence, i.e., the last SOM node it
has passed through that still belongs to the predefined Enrich-GHSOM arborescence. The
program that compares the input and the output file returns a list of pairs (data_label,
node_label), that is, the association of each data item to the node into which the item has
been classified.

4.3.5 Transformation from the is_a format to the .unit format
The Enrich-GHSOM neural network input file allows a convenient specification of
the tree hierarchy. But for trees with a large number of nodes, manual editing becomes
cumbersome and time consuming; this is already the case for trees with about 20 nodes.
Thus, it became necessary to generate automatically this input file containing the
predefined initial tree state.
The taxonomies against which new terms are classified are given as ASCII files
containing is_a(concept, parent_concept) assertions. Such a taxonomy is a tree structure
that becomes the initial arborescent state read by the Enrich-GHSOM neural network.

Example:
is_a(feline, mammal)
is_a(bear, mammal)

We have implemented, using yacc and lex, a translator that converts arborescent
taxonomies of any size into the specific input format that the Enrich-GHSOM neural
network can read. The is_a taxonomic tree is converted into an external neural
representation, as the initial state in external format for the hierarchical self-organized
map. The external format of the initial state is the .unit format. The is_a-to-unit translator
therefore performs a symbolic-to-neural conversion.
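The actual translator is built with yacc and lex; purely as a hedged illustration of its first
step, the C sketch below reads is_a(child, parent) assertions with sscanf and records the
parent of each concept, which is enough to reconstruct the simple tree that is afterwards
converted to the .unit format. Error handling and dynamic sizing are omitted, and all
names here are placeholders rather than the real implementation.

#include <stdio.h>
#include <string.h>

#define MAX_NODES 1000
#define NAME_LEN  64

static char node_name[MAX_NODES][NAME_LEN];
static int  node_parent[MAX_NODES];      /* index of the parent, -1 for roots */
static int  node_count;

/* Return the index of a concept, registering it on first use. */
static int find_or_add(const char *name)
{
    for (int i = 0; i < node_count; i++)
        if (strcmp(node_name[i], name) == 0)
            return i;
    strncpy(node_name[node_count], name, NAME_LEN - 1);
    node_parent[node_count] = -1;
    return node_count++;
}

int main(void)
{
    /* Read assertions such as:  is_a(feline, mammal)  */
    char line[256], child[NAME_LEN], parent[NAME_LEN];

    while (fgets(line, sizeof line, stdin)) {
        if (sscanf(line, " is_a( %63[^, ] , %63[^) ] )", child, parent) == 2) {
            int c = find_or_add(child);
            int p = find_or_add(parent);
            node_parent[c] = p;
        }
    }

    /* Print each concept together with its parent (the simple tree). */
    for (int i = 0; i < node_count; i++)
        printf("%s -> %s\n", node_name[i],
               node_parent[i] >= 0 ? node_name[node_parent[i]] : "(root)");
    return 0;
}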

After analyzing the file containing the is_a arborescence, a simple tree containing
all the nodes referenced in the is_a assertions is built internally. This simple tree is then
traversed, and during the traversal the other information that must be present in the .unit
input format into which the arborescence is converted is also deduced. Each SOM map
has a number of rows and columns and contains an enumeration of its neurons, each
neuron with its coordinates on the map. A predefined daughter SOM map corresponds to
the simple-tree node whose label is exactly the predefined initialization label of the
predefined parent neuron of that daughter map. The set of predefined parent neurons
functions as an index list of the daughter nodes. The dimension, i.e., the total number of
neurons, of the current SOM map is set to a value proportional to the number of
predefined initialized neurons (predefined parents), that is, to the number of sons of the
simple-tree node corresponding to the current SOM map (node).
The Enrich-GHSOM neural network is an extension of the GHSOM hierarchical
self-organizing model. It is a new neural model that performs a hierarchical growth
starting from a predefined initial arborescence. In other words, the automatic learning
process of the neural network acquires new knowledge based on the background
knowledge represented by the initial arborescence. Our new neural network model thus
makes a contribution to the field of neural network algorithms.
4.4. Text-based enrichment of taxonomies
An ontology is defined as a formal model of a domain, that is, a formal and
explicit specification of a conceptualization.

4.4.1. Learning ontologies

Learning ontologies requires human supervision to a great extent. The field of
ontology learning draws on various aspects of natural language processing, data mining
and web mining, machine learning and knowledge representation. Ontology learning is
also known as ontology generation, ontology mining, or ontology extraction. Some
techniques build an ontology from scratch, while others enrich or adapt an existing
ontology; adapting an ontology can also mean aligning it with other ontologies.
Our aim is to enrich the taxonomic structure of ontologies by inserting new
concepts. Whether new concepts or new instances are added to a given ontology, the
process is the same from an algorithmic point of view. Enriching an existing ontology has
several advantages over creating a completely new ontology from scratch. First, better
results are obtained in terms of the quality of the resulting ontology if one starts from an
initial, well-established, correct ontological structure. Second, the new concepts inserted
into the initial ontology are forced to adhere to its taxonomic branches.

4.4.2. Taxonomic enrichment based on text

In this thesis we focus only on enriching the taxonomy of a given ontology. The
hierarchical backbone structure of the existing ontology, i.e., its taxonomy, is enriched
with new concepts specific to a domain, concepts extracted from a domain-specific
corpus. Practically, not only new concepts are inserted: new relations are introduced
together with them, but only taxonomic ones.

Figure 6: The text-based taxonomy enrichment system

The text-based enrichment process of the taxonomy of a domain ontology has two
inputs: an existing taxonomy, which plays the role of background (previous) knowledge,
and a textual corpus specific to the domain. The purpose of this taxonomic enrichment is
to automatically adapt the given taxonomy to the domain-specific corpus.
Any new node inserted during the taxonomic enrichment process may become a
son node either of an intermediate node or of a leaf node of the given taxonomy.
The extracted terms are classified against the taxonomy of the given ontology, by
associating each term with one target node of the taxonomy. The association is based on
similarity, according to a similarity metric defined in the distributional vector space.

4.4.3 Other approaches to enriching taxonomies

There are four main categories of approaches to taxonomy enrichment: methods
based on distributional similarity and on the classification of terms into an existing
taxonomy, approaches using lexico-syntactic templates, linguistic approaches, and finally
combined methods. Our enrichment approach belongs to the first category.

4.4.4 Classification of terms based on distributional similarity

The classification of terms onto the nodes of the given taxonomy is done
according to a similarity metric defined in a distributional vector space R^n. It is about the
similarity between a term to be classified and a target (taxonomic) node with which the
term can be associated by classification. In the distributional vector space, each term is
represented as a vector containing information about the various contexts in which that
term appears in the corpus.
In a top-down version of this classification, a downward search is performed on
the existing taxonomy to find a target node to associate with each new term. The search is
based on the similarity between the new term and the concepts in the taxonomy nodes.
The target node is found as the node whose concept is most similar to the new term. The
term will be inserted below its target node as its direct successor.
In our system the classification is conducted by the Enrich-GHSOM neural
network. Following the classification, each term attaches itself as a hyponym of a node in
the taxonomy, and that node can be either an intermediate node or a leaf.
While in Enrich-GHSOM some neurons are initialized with predefined values and
the others are initialized randomly, in GHSOM all the neurons on all the SOM maps are
initialized randomly.
When the training data set is noisy, as is the case in our experiments, the GHSOM
neural model is more powerful than a hierarchical clustering algorithm in terms of speed,
noise tolerance, and robustness. The noise in the data sets that we classify in our
experiments is due to the large dimensionality of these data, which means a large number
of data items embodied as linguistic entities of the nominal group type extracted from a
corpus of texts. These data items are represented as multidimensional vectors. In our
experiments, the number of dimensions, that is, of vector attributes, is also very large,
being equal to the number of corpus documents from which all these linguistic data items
have been extracted.
However, the GHSOM model is mathematically more complex than hierarchical
clustering: self-organized SOM maps and, implicitly, the GHSOM neural network have
many more parameters.

4.4.5 Degree of automation

Most approaches require human intervention throughout the learning process to
help correctly place the new concepts inserted into the taxonomy. Our taxonomic
enrichment system is unsupervised, with the exception of a possible manual pruning of
the final enriched taxonomy. Pruning means removing from the taxonomy the concepts
that are not specific to the domain.
4.5 A neural model for unsupervised taxonomy enrichment
The unsupervised enrichment of a taxonomy (ontology) means mapping new
terms, found (identified, extracted) in the text, onto the nodes of the existing ontology; the
mapped terms are then inserted as daughter nodes of the nodes onto which they were
mapped. This insertion means enriching the taxonomy with new nodes.
In the present work we instead find pairs in the text, and these pairs are mapped
onto the nodes of an ontology of aspects/opinions; ideally, each pair should be mapped
onto the leaf that expresses a positive or a negative opinion about the aspect expressed in
its parent node. So no new nodes are inserted (as in the enrichment of ontologies); we are
only interested in whether the mapping went right, onto the correct positive or negative
opinion leaves.
The linguistic analysis and the pair identification are done with the Stanford
parser. The enrichment phase thus becomes a mapping phase, after which we check how
correct the mapping is, that is, whether the pairs were mapped onto the positive or
negative opinion leaves onto which they really should have been mapped.
Using centroids reduces the number of zeros (by averaging the vectors), but the
most important thing is that the centroid allows the construction of vectors associated
with the intermediate nodes of the aspect/opinion ontology. For any intermediate node,
the centroid vector (i.e., the arithmetic mean vector) of all the nodes below it (from all its
sub-branches), including the leaves, is computed. Only the leaves have an initial vector;
the intermediate nodes do not have initial vectors, because they do not correspond to
something extracted and counted from the text as a vector.
The leaves, in turn, express opinions (a positive one and a negative one) about the
aspect that is the father of the two leaves, and the vector of a leaf is given by the vector of
a pair (extracted from the corpus) that we considered to express a positive (or negative)
opinion about that aspect.
An example:
worrying###charging
worrying###without
wrist_strap###neck_strap
wrist_strap###wrist_strap
wrist_strap###a
customizable###beautiful_design
contours_right###contours
contours_right###the
intuitive###but
exposure_control###exposure_control
exposure_control###programming
exposure_control###like
exposure###the
unresponsiveness###its
most_purposes###most
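A minimal sketch of the bottom-up centroid computation described above follows,
assuming a node layout similar to the one used in the previous C fragment (hypothetical,
not the actual implementation): the vector of an intermediate node is set to the arithmetic
mean of the vectors of all the leaves below it, while the leaves keep the pair vectors
extracted from the corpus.

#include <string.h>

#define DIM 300

struct taxo_node {
    double vec[DIM];                /* leaf: pair vector; inner node: centroid */
    struct taxo_node **children;
    int n_children;
};

/* Accumulate the leaf vectors found in the subtree rooted at n. */
static int sum_leaf_vectors(const struct taxo_node *n, double *sum)
{
    if (n->n_children == 0) {       /* a leaf: its own vector contributes */
        for (int k = 0; k < DIM; k++)
            sum[k] += n->vec[k];
        return 1;
    }
    int leaves = 0;
    for (int i = 0; i < n->n_children; i++)
        leaves += sum_leaf_vectors(n->children[i], sum);
    return leaves;
}

/* Set the vector of every intermediate node to the centroid (arithmetic
 * mean) of the vectors of all the leaves in its subtree. */
void compute_centroids(struct taxo_node *n)
{
    if (n->n_children == 0)
        return;
    for (int i = 0; i < n->n_children; i++)
        compute_centroids(n->children[i]);

    double sum[DIM];
    memset(sum, 0, sizeof sum);
    int leaves = sum_leaf_vectors(n, sum);
    for (int k = 0; k < DIM; k++)
        n->vec[k] = leaves > 0 ? sum[k] / leaves : 0.0;
}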

The system is capable of enriching the taxonomic structure of a given ontology. It
is implemented as a pipeline with several linguistic processing stages and learning steps.
There are two main phases, the term extraction phase and the taxonomy
enrichment phase, with an intermediate phase that counts the extracted terms. This is the
succession of the major processing phases in our taxonomic enrichment system. The term
extraction phase consists of three linguistic processing steps. The term counting phase
follows; its purpose is to build a proper vector representation for each extracted term, so
that the terms can be classified by the Enrich-GHSOM neural network in the taxonomy
enrichment phase. Finally, the taxonomy enrichment phase is a machine learning process
based on the Enrich-GHSOM neural network.

The system has two inputs: an existing taxonomy and a domain-specific textual
corpus. The output of the system is a taxonomy obtained by enriching the existing
taxonomy given as input.
Figure 7: Enrich-GHSOM from input to output

4.5.1 Extraction of terms

The term extraction phase is based on a linguistic analysis of a domain-specific
corpus of texts. The names of the newly inserted concepts are terms representing nominal
groups (noun phrases). These linguistic structures are identified by a text mining process
applied to the corpus of domain-specific texts. Our taxonomic enrichment system is based
on several processing resources offered by the Stanford Parser.
Natural language processing modules are used to implement generic algorithms
for processing texts in natural language. The only condition imposed on these processing
resources is that the texts are written in English.
The following are used in the linguistic analysis: a token-based word breaker
(tokenizer), a sentence splitter, and a part-of-speech tagger. The part-of-speech tagger
finds and assigns to each word the right part of speech, that is, the syntactic category to
which the word correctly belongs. We have designed some regular expressions, defined
over the parts of speech of the words, that describe which sequences of words can form a
correct nominal group.
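The exact regular expressions used in our system are not reproduced here; purely as a
hedged illustration, the C fragment below scans a part-of-speech tagged sentence and
extracts maximal sequences of adjectives and nouns that end in a noun, which is a typical
shape for such nominal groups. The simplified tag set (DT, JJ, NN, VBD) and the sample
sentence are assumptions made only for this example.

#include <stdio.h>
#include <string.h>

struct token {
    const char *word;
    const char *tag;   /* simplified Penn Treebank style part-of-speech tag */
};

static int is_adj(const char *t)  { return strcmp(t, "JJ") == 0; }
static int is_noun(const char *t) { return strncmp(t, "NN", 2) == 0; } /* NN, NNS... */

/* Print every maximal (adjective|noun)* noun sequence as a candidate
 * nominal group; a simple stand-in for the regular expressions over
 * parts of speech mentioned above. */
static void extract_nominal_groups(const struct token *toks, int n)
{
    int start = -1, last_noun = -1;

    for (int i = 0; i <= n; i++) {
        int in_group = i < n && (is_adj(toks[i].tag) || is_noun(toks[i].tag));
        if (in_group) {
            if (start < 0) start = i;
            if (is_noun(toks[i].tag)) last_noun = i;
        }
        if ((!in_group || i == n - 1) && start >= 0) {
            if (last_noun >= start) {            /* must end in a noun */
                for (int j = start; j <= last_noun; j++)
                    printf("%s%s", toks[j].word, j < last_noun ? "_" : "\n");
            }
            start = -1;
            last_noun = -1;
        }
    }
}

int main(void)
{
    /* "the excellent image quality impressed every reviewer" (toy example) */
    struct token toks[] = {
        { "the", "DT" }, { "excellent", "JJ" }, { "image", "NN" },
        { "quality", "NN" }, { "impressed", "VBD" }, { "every", "DT" },
        { "reviewer", "NN" }
    };
    extract_nominal_groups(toks, 7);
    return 0;
}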

There are two phases chained into a pipeline. First, in a linguistic analysis phase,
the corpus is annotated with linguistic information with the help of the processing
resources. Then, in the second phase, the terms are identified; in this second phase some
extraction rules are applied to the linguistic information discovered in the first phase.
There is a chain of processes operating on all the documents in the corpus. Each
pipeline process writes (marks) linguistic information inside the text documents in the
form of annotations. An annotation is a description that is attached to a particular segment
of the analyzed text; it can be attributed to a word, a sequence of words, a sentence or a
paragraph. No process removes existing annotations. After the last pipeline process, the
corpus documents contain all the annotations written cumulatively by all the processes.

4.5.2 Linguistic analysis

The most important step is to annotate the words with their parts of speech. It is, in fact, the last step of the linguistic analysis and is based on the information provided by the previous steps of this analysis, run as a pipeline. The first step is the lexical analyzer (tokenizer). It scans the text and determines the elementary building blocks of the text, i.e. the tokens. A document represented as a text file is decomposed into tokens. Tokens are the fundamental symbols, i.e. the basic lexical units that make up a text in natural language. They are atomic, that is, they have no internal structure into which they could be further broken down.
A word is a string of letters that may also contain the hyphen (dash), but it cannot contain other punctuation marks. A number is a string of digits; the part-of-speech tagger assigns to all numbers the numeral part of speech. Special symbols are currency unit symbols (for example $, ₤, €) and other special characters like & and #. The punctuation marks are round brackets and the other characters known as such (semicolon, colon, comma, question mark, exclamation mark, suspension points, dash, quotes, apostrophes, etc.). A space is a string consisting only of blank, tab and newline characters. In most languages, spaces and punctuation marks play the role of separators between the other major lexical units, namely words, numbers and special symbols.
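Purely as an illustration of these token classes (the patterns and names below are ours, not the tokenizer actually used by the pipeline), a flex-style sketch could look like this:

%option noyywrap
 /* Illustrative token classes only: words, numbers, special symbols, punctuation, spaces. */
WORD      [A-Za-z]+(-[A-Za-z]+)*
NUMBER    [0-9]+(\.[0-9]+)?
SPECIAL   [$&#]
PUNCT     [][();:,.?!'"-]
SPACE     [ \t\n]+

%%
{WORD}     { printf("WORD: %s\n", yytext); }
{NUMBER}   { printf("NUMBER: %s\n", yytext); }
{SPECIAL}  { printf("SPECIAL: %s\n", yytext); }
{PUNCT}    { printf("PUNCT: %s\n", yytext); }
{SPACE}    { /* spaces only separate the other lexical units */ }
.          { /* anything else is ignored */ }
%%

int main(void) { yylex(); return 0; }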
The second step in the pipeline is another resource, the sentence splitter. It segments the text into sentences, annotating in the text the beginning and the end of each sentence. The sentence delimitation annotations are required for the next step in the pipeline, that is, for the part-of-speech annotator.
The part-of-speech annotator is the last and most important step of the pipeline that implements the linguistic analysis.
The part-of-speech annotator assigns to each word a tag representing the part of speech of that word. The annotator uses the linguistic information from the previous pipeline step, namely the segmentation of the text into sentences, because the identification of the parts of speech is closely related to the position of the words in the sentences. A part-of-speech annotator is language dependent. Such a tagger knows how to correctly identify the part of speech of each word in a text document written in a certain language only after it has been pre-trained on a large and relevant corpus of texts written in that language. The accuracy of an annotator trained in an unsupervised manner is weaker than that obtained with supervised training.

In supervised training, each word in the training corpus already has a tag associated with its correct part of speech. The Brill tagger learns annotation rules in two steps. In the first step, starting from the annotated corpus, general rules are learned that predict the most appropriate part of speech for an unknown word, based on the frequency of annotation with a particular label in the corpus texts.
Once such a general rule is applied to an unknown word, it applies to all occurrences of that word in the text being annotated. In the second step, the annotator learns context rules in order to improve its accuracy; such a rule may, for example, change the tag of a word from verb to noun when the word immediately preceding it in the text is an adjective. Unlike general rules, context rules operate at the level of each occurrence of a particular word in the annotated text, since different occurrences of the same word may appear in different syntactic contexts.
The part-of-speech annotator annotated every word of the sentence as shown below.
Example: battery life is over 4.5 hours, compared to about 2.5 hours for the g2 or the 2 hours for most nikons.
This means it has identified two verbs (is, compared), seven noun tokens (battery, life, g2, nikons and hours, which appears three times), a superlative adjective (most), prepositions (over and for, which appears twice), two occurrences of the determiner the, cardinal numbers (4.5, 2.5 and 2) and a coordinating conjunction (or).
battery/NN life/NN is/VBZ over/IN 4.5/CD hours/NNS ,/, compared/VBN to/TO about/RB 2.5/CD hours/NNS for/IN the/DT g2/NNS or/CC the/DT 2/CD hours/NNS for/IN most/JJS nikons/NNS ./.
Although the POS tagger is trained for English, it has nevertheless tagged a preposition that exists in several European languages, including Romanian. The tags also encode certain morphological information: the first letters of the tag indicate the part of speech itself, and the last letters provide morphological or other additional information. The annotations shown in the example above appear as XML style annotations.

4.5.3 Identifying terms

Linguistic templates are applied to the domain corpus. Nominal groups (noun phrases) are the parts of the text with the greatest likelihood of containing candidate concepts for the ontology. The system identifies linguistic structures from the syntactic category of nominal groups. A program that identifies certain syntactic structures in the text by applying such templates is a chunker, so the search for terms is performed by a chunker for structures in the nominal group syntactic category (a noun phrase chunker). The chunker identifies the nominal groups in texts already annotated with parts of speech in the previous phase of the pipeline, that is, in the linguistic analysis phase.
The nominal group identification templates indicate both the parts of speech of the words that can form a nominal group and the order in which these component words are allowed to appear in the text (they are shallow syntactic templates).
The determination of the parts of speech of the words in the documents, made in the linguistic analysis phase, is only a shallow, robust syntactic analysis. It is independent of the full, deep syntactic correctness of the analyzed texts. Shallow syntactic analyzers are the most common choice in information retrieval and, generally, in statistical text mining methods. In such a representation it is not the syntax or the order in which the words appear in a document that matters, but the actual presence of those words in the document.
The templates for identifying terms as nominal groups check the order in which the words of a nominal group should correctly appear in English. They check this order even though they cannot build any syntactic tree structure of the nominal groups. The order of words having certain parts of speech in a nominal group is defined by means of regular expressions over the accepted parts of speech. A full-depth syntactic analysis would have to be modeled with a stack automaton [LC98], whereas a finite automaton that models a robust (shallow) parser is computationally more efficient than a stack automaton that models a full-depth parser. Shallow parsing is therefore the most common solution for analyzing natural language texts in information retrieval and in text mining methods in general, since large volumes of texts are analyzed in these areas and a reduced computational cost becomes essential.
Each extraction rule consists of a regular expression and an action. The regular expression is the left-hand side of the rule and plays the role of a shallow template for identifying nominal groups based on the words already annotated in the text with parts of speech (the annotation took place in the previous phase, the linguistic analysis). The right-hand side of the rule creates the annotations that delimit, in the text, the beginning and the end of each of the nominal groups identified by the regular expression on the left of the rule.
The capitalized names in the regular expressions are macros, similar to the definitions that appear in regular expressions for the LEX lexical analyzer generator [LS75, LC98]. Nominal groups in English are word sequences that begin with zero or more determiners (DET). The determiners are followed by zero or more adjectives, nouns and possessives, in any order; this attribute string modifies the head noun of the nominal group. A sketch of such a template is given below.
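Purely as an illustration (assuming whitespace-separated word/TAG tokens as in the tagged example given earlier; the macro names DET, ADJ, NOUN, POSS, NG and the patterns are ours, not the system's actual rules), such a shallow template could be written in a lex-style notation as follows:

%option noyywrap
 /* Hypothetical shallow noun-group template over word/TAG tokens such as "battery/NN". */
W       [A-Za-z0-9_.$-]+
DET     {W}"/DT"
ADJ     {W}"/JJ"("R"|"S")?
NOUN    {W}"/NN"("S"|"P"|"PS")?
POSS    {W}"/POS"
NG      ({DET}" ")*(({ADJ}|{NOUN}|{POSS})" ")*{NOUN}

%%
{NG}      { printf("NOMINAL GROUP: %s\n", yytext); }
.|\n      { /* text outside a nominal group is skipped */ }
%%

int main(void) { yylex(); return 0; }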
For instance, the typed dependency relations produced by the Stanford Parser for the example sentence tagged above are the following:
compound(life-2, battery-1)
nsubj(hours-6, life-2)
cop(hours-6, is-3)
case(hours-6, over-4)
nummod(hours-6, 4.5-5)
root(ROOT-0, hours-6)
case(hours-12, compared-8)
mwe(compared-8, to-9)
advmod(2.5-11, about-10)
nummod(hours-12, 2.5-11)
advcl(hours-6, hours-12)
case(g2-15, for-13)
det(g2-15, the-14)
nmod:for(hours-12, g2-15)
cc(hours-12, or-16)
det(hours-19, the-17)
nummod(hours-19, 2-18)
advcl(hours-6, hours-19)
conj:or(hours-12, hours-19)
case(nikons-22, for-20)
amod(nikons-22, most-21)
nmod:for(hours-19, nikons-22)

4.5.4 Counting Terms

This intermediate phase counts the extracted terms. Its purpose is to build a proper vector representation for each extracted term, so that the terms can be classified by the Enrich-GHSOM neural network in the taxonomy enrichment phase. Multiple occurrences of the same term are counted in each corpus document. For each term, a bag-of-words style vector is constructed; each term is thus encoded with contextual content information in a distributed vector space R^n.
The identification and counting of each occurrence of a particular term, linguistically represented by a nominal group, takes place on the corpus documents that have already passed through the term identification phase. The counting is performed by an analyzer, implemented using the lex generator, which reads the annotated documents saved in XML format after running the Stanford Parser. It is a robust analyzer, which only collects the important information from these XML documents that make up the annotated corpus. The important information is the sequence of words that make up the various nominal groups appearing in the documents.
The information on the number of occurrences of the same term in each corpus document is kept in a table. Each entry in the table represents a term, for which a list of occurrence counters of that term in the various corpus documents is stored and updated.
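A minimal sketch of this bookkeeping is shown below (the structure and function names are illustrative, not the actual implementation): each term keeps one counter per corpus document, and the resulting rows form the term-document count matrix used later as bag-of-words vectors.

#include <stdio.h>
#include <string.h>

#define MAX_TERMS 5000      /* illustrative limit only */
#define MAX_DOCS  239       /* number of corpus documents in our experiments */

/* one table entry: a term and its per-document occurrence counters */
struct term_entry {
    char term[128];
    int  count[MAX_DOCS];
};

static struct term_entry table[MAX_TERMS];
static int n_terms = 0;

/* increment the counter of `term` in document `doc`, inserting the term if it is new */
void count_term(const char *term, int doc)
{
    for (int i = 0; i < n_terms; i++) {
        if (strcmp(table[i].term, term) == 0) {
            table[i].count[doc]++;
            return;
        }
    }
    strncpy(table[n_terms].term, term, sizeof table[n_terms].term - 1);
    table[n_terms].count[doc] = 1;
    n_terms++;
}

int main(void)
{
    count_term("battery_life", 0);
    count_term("battery_life", 0);
    count_term("exposure_control", 3);
    printf("%s appears %d time(s) in document 0\n", table[0].term, table[0].count[0]);
    return 0;
}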

4.5.5 Taxonomy Enrichment

This is the taxonomy extraction, or taxonomy building, phase. While the term extraction phase is relatively standard across the different approaches and tools for learning or enriching ontologies, the taxonomy extraction or enrichment phase differs from one approach to another. Our approach to enriching taxonomies is based on distributional similarity and on the classification of terms into an existing taxonomy.
This phase involves a knowledge acquisition process based on machine learning. As a solution, we have chosen our own extended model of hierarchical self-organizing maps, Enrich-GHSOM; the automatic learning is therefore based on an unsupervised neural network. Choosing a hierarchical neural network is well suited to the structure of the knowledge being enriched, a taxonomy, that is, a hierarchy of concepts.

The taxonomy enrichment algorithm works by "populating" the given taxonomy with the terms collected from the corpus. The Enrich-GHSOM neural network conducts a top-down hierarchical classification of terms along the branches of the given taxonomy. New concepts (nodes) are inserted for these classified terms. Each new concept attaches itself as a child node of an intermediate node or of a leaf node of the given taxonomy and thus becomes a direct hyponym of that target node.
The two inputs are the given taxonomy, expressed in the is_a format, and the vector representation of the terms extracted from the corpus; the output is the enriched taxonomy, expressed in the .unit format. The terms extracted from the corpus are both the terms that give names to the concepts of the given taxonomy and the terms that are classified in order to enrich the given taxonomy. The classified terms become new concepts, new nodes in the enriched taxonomy.
The label data items of the given tree nodes are thus terms that give names to the concepts of the given taxonomy. The names of the concepts and the terms to be classified have a common origin and are of the same nature, because they are terms extracted from the same corpus. We started, however, from a more general hypothesis, namely that the labeled data items of the given tree nodes are of the same kind as the classified items; there is not necessarily a common origin for the node label items and for the classified items.


Figure 8: Neural Model Enrich-GHSOM

First, a symbolic-to-neural translation takes place, by parsing an external textual representation of the given initial taxonomy. This textual representation has the form is_a(concept, parent_concept), or it can be an OWL file. Thus, for the given taxonomy, a representation in the Enrich-GHSOM neural network input format is generated. The internal representation of the initial tree state of the neural network is then obtained by parsing this representation of the taxonomy in the input .unit format.
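For illustration, a fragment of such an is_a textual representation for the camera ontology could look as follows (the concept names are only examples, inspired by the ontology fragment discussed later in this thesis):

is_a(design_and_usability, camera)
is_a(weight, design_and_usability)
is_a(gray_feel_weight, weight)
is_a(bulking_weight, weight)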
In order for the initialized network to be able to classify terms into this initial taxonomic structure, besides the vector representation of the terms to be classified, a representation as a numerical vector for each node of the given initial taxonomy is also required. This node vector plays the role of a predefined weight vector for the SOM neurons of the neural network. It is the vector representation of the node's concept label, the label being a nominal group that gives the name of the concept. The acquisition of this vector takes place in the same way as the acquisition of the vector representation of the terms to be classified, given the common origin of the terms that give the names of the concepts, on the one hand, and of the terms to be classified, on the other, all of them being terms extracted from the same corpus. The data item vectors that are node labels, in the present context the vectors of the terms that give the names of the concepts, are used as input for the stage that builds the initial tree state. This stage is part of the Enrich-GHSOM neural network. The data item vectors to be classified, in this context the vectors of the terms to be classified, are the input for the next stage of the Enrich-GHSOM neural network, namely for the Enrich-GHSOM neural classifier.
The vectors of these labels are obtained by the same calculation method used to obtain the vectors of all the terms extracted from the corpus and subsequently classified in the taxonomy enrichment process. Using the same domain-specific corpus to acquire the attributes of the concept vectors of the given initial taxonomy and of the vectors of the terms to be classified is a reasonable option, since the ambiguity of the meanings of one and the same classified term is diminished. Ambiguity would mean a multiple classification of that term: the term could be associated with at least two distinct target concepts in the given initial taxonomy, resulting in several newly introduced concepts, all carrying the same lexical ambiguity in their concept names, each of them attaching itself as a direct hyponym node of one of the target nodes. The semantic focus induced by the use of the same corpus for collecting both the terms to be classified and the labels of the concepts of the given initial taxonomy reduces this phenomenon of semantic ambiguity. In the focused semantic field, only one of the multiple meanings of a term to be classified remains valid, namely the specific meaning in the corpus focus. It fits semantically with the meanings of the terms that give the names of the concepts of the given taxonomy, terms extracted in their turn from the same specialized corpus. This leads to a classification that is a semantically correct and unambiguous association from the terms to be classified to the concepts of the given taxonomy.

4.5.6 Centroid Vector

One way to reduce the number of zeros in the vector representation of the generic terms that label the generic concepts in the given initial taxonomy is to use the centroid vector: the mean vector of the vectors representing all the concepts in the sub-tree whose root is the given concept, including the root itself. The centroid vector of a concept C is computed according to the formula below, where subtree(C) is the set of concepts in the sub-tree rooted at node C, no_nodes(C) is the number of nodes of the sub-tree dominated by C, and each Ci in the sum stands for the vector of the label that gives the name of the concept Ci.

C = ( Σ_{Ci ∈ subtree(C)} Ci ) / no_nodes(C)

Figure 9 Centroid vector
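As a compilable sketch only (the tree representation and the function names below are ours, not the Enrich-GHSOM internals), the recursive computation of such a centroid could look like this; following the text, only the nodes that actually have a corpus-derived vector (in practice the leaves) contribute to the average:

#define DIM 239   /* vector dimension = the number of corpus documents in our setting */

/* a taxonomy node: an optional label vector and a list of children */
struct node {
    double vec[DIM];
    int has_vec;              /* 1 for nodes (in practice the leaves) that have a corpus-derived vector */
    struct node *child[8];
    int n_children;
};

/* add up the vectors of all nodes in the sub-tree rooted at n;
   returns how many nodes contributed a vector */
static int accumulate(const struct node *n, double sum[DIM])
{
    int count = 0;
    if (n->has_vec) {
        for (int d = 0; d < DIM; d++)
            sum[d] += n->vec[d];
        count = 1;
    }
    for (int i = 0; i < n->n_children; i++)
        count += accumulate(n->child[i], sum);
    return count;
}

/* centroid(C) = sum of the sub-tree vectors / number of contributing nodes */
void centroid(const struct node *c, double out[DIM])
{
    double sum[DIM] = {0};
    int count = accumulate(c, sum);
    for (int d = 0; d < DIM; d++)
        out[d] = count > 0 ? sum[d] / count : 0.0;
}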

In the vector representation of a more generic concept, which has fewer occurrences in the corpus, we therefore also take into account the corpus appearances of all its successors in the taxonomy. These successor concepts from the entire sub-tree of the more generic node have more and more frequent occurrences in the corpus as the depth in the sub-tree increases. The centroid method thus reduces the number of zeros among the occurrence frequencies of the concepts in the various documents of the corpus.
Some of the generic concepts do not appear in the corpus at all, and yet such concepts must be represented vectorially. A distributional statistical vector can still be constructed for such a concept, because the corpus, being specific to the domain of the concept, still contains appearances of the concepts in the sub-tree of the given generic concept. Descending towards the leaves of the sub-tree, the more specific sub-tree concepts are expected to occur more and more frequently in the corpus. The centroid of a concept without corpus appearances is therefore calculated as the average of the vectors of those concepts in its sub-tree that do have corpus appearances, and hence have a statistical distributional vector.
The translator into the .unit format adds to the label of a given node the labels of all the direct and indirect successors of that node, collected from the is_a taxonomic representation. The result is a set of labels associated with the node. In the representation converted into the Enrich-GHSOM neural network input format, the default initialization label of any taxonomic node is actually this set of labels associated with the node. The set contains the labels of all the nodes in the sub-tree rooted at the given node, going down to the sub-tree leaves and including the root itself.
The initial tree state building stage parses and interprets the representation of the initial taxonomy already converted into the input format. For each taxonomic node, the average (centroid) vector of the vector representations of the labels in the set associated with the node is computed. The translator into the .unit format is also able to limit the collection of the direct and indirect successor labels of a node to a maximum depth of these successors in the sub-tree of the node. The purpose of this limitation is to restrict the computation of the centroid vector of a node to a vector average over only the successors down to a maximum set depth.
We have built the taxonomic enrichment system as a prototype system. The system is parameterized both from the point of view of the linguistic category of the textual units that are classified and from the point of view of the vector representation of the classified units and of the concepts of the given taxonomy. Our taxonomic enrichment system is therefore easy to extend and adapt along these parameters.
Noise tolerance: being based on a neural network, the taxonomic enrichment system is tolerant to noise. This is an important feature, as the system relies on a mining process over uncontrolled data sets of high dimensionality; such data sets consist of thousands of web documents.

Chapter 5. Detailed Design and Implementation
5.1 Lex and YACC

The purpose of our application is to create the GHSOM .unit file; the application is an experimental scientific prototype. We are going to use yacc and lex.

Figure 10: Lex and Yacc representation

Yacc is a parser generator, while Lex is a lexical analyzer generator. They are typically used together: you Lex the string input, and Yacc parses the tokenized input provided by Lex. A regular expression can only represent regular languages, and one of the constraints of a regular language is the lack of "memory". Lex's main job is to break up an input stream into more usable elements, or, in other words, to identify the "interesting bits" in a text file. Yacc's job is to analyse the structure of the input stream and operate on the "big picture".
In the Yacc file, you write your own main() function, which calls yyparse() at some point. The function yyparse() is created for you by Yacc and ends up in y.tab.c. yyparse() reads a stream of token/value pairs from yylex(), which needs to be supplied.
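The following minimal skeleton (illustrative only; our own grammar, dependente.y, is discussed in section 5.3) shows how main(), yyparse() and yylex() fit together:

%{
#include <stdio.h>
int yylex(void);
void yyerror(const char *s) { fprintf(stderr, "parse error: %s\n", s); }
%}

%token WORD

%%
input : /* empty */
      | input WORD      { printf("got a WORD token\n"); }
      ;
%%

/* a stand-in scanner: normally yylex() is generated by lex from a .l specification */
int yylex(void)
{
    int c = getchar();
    if (c == EOF) return 0;     /* returning 0 tells yyparse() that the input ended */
    return WORD;
}

int main(void)
{
    return yyparse();           /* yyparse() repeatedly asks yylex() for tokens */
}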


Figure 11 Compilation sequence

The patterns file in the above diagram is a file you create with a text editor. Lex will read your patterns and generate C code for a lexical analyzer, or scanner. The lexical analyzer matches strings in the input, based on your patterns, and converts the strings to tokens. Tokens are numerical representations of strings, and they simplify processing. When the lexical analyzer finds identifiers in the input stream, it enters them in a symbol table. The symbol table may also contain other information, such as the data type (integer or real) and the location of each variable in memory. All subsequent references to identifiers refer to the appropriate symbol table index. The grammar file in the above diagram is a text file you create with a text editor. Yacc will read your grammar and generate C code for a syntax analyzer, or parser. The syntax analyzer uses grammar rules that allow it to analyze the tokens coming from the lexical analyzer and to create a syntax tree. The syntax tree imposes a hierarchical structure on the tokens. For example, operator precedence and associativity are apparent in the syntax tree. The next step, code generation, does a depth-first walk of the syntax tree to generate code. Some compilers produce machine code, while others, as shown above, output assembly language.


Figure 12 Building a compiler with lex / yacc

The figure above illustrates the file naming conventions used by lex & yacc. We will assume our goal is to write a BASIC compiler. First, we need to specify all the pattern matching rules for lex (bas.l) and the grammar rules for yacc (bas.y). The commands to create our compiler, bas.exe, are listed below:

yacc -d bas.y                    # create y.tab.h, y.tab.c
lex bas.l                        # create lex.yy.c
cc lex.yy.c y.tab.c -obas.exe    # compile/link

Yacc reads the grammar descriptions in bas.y and generates a syntax analyzer (parser), which includes the function yyparse, in file y.tab.c. Included in file bas.y are token declarations. The -d option causes yacc to generate definitions for the tokens and place them in file y.tab.h. Lex reads the pattern descriptions in bas.l, includes file y.tab.h, and generates a lexical analyzer, which includes the function yylex, in file lex.yy.c. Finally, the lexer and parser are compiled and linked together to create the executable bas.exe. From main we call yyparse to run the compiler. The function yyparse automatically calls yylex to obtain each token.
During the first phase the compiler reads the input and converts strings in the source to tokens. With regular expressions we can specify patterns to lex, so that it can generate code that will scan and match strings in the input. Each pattern specified in the input to lex has an associated action. Typically an action returns a token that represents the matched string, for subsequent use by the parser. Initially we will simply print the matched string rather than return a token value.
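A tiny flex fragment in that spirit (illustrative only, not our dependente.l) first just prints the matched strings; the commented-out alternative action shows how a token value would instead be returned to the parser:

%option noyywrap
%%
[0-9]+      { printf("NUMBER: %s\n", yytext);  /* with a parser: return NUMBER; */ }
[a-zA-Z]+   { printf("WORD: %s\n", yytext);    /* with a parser: return WORD;   */ }
[ \t\n]+    { /* skip whitespace */ }
.           { /* ignore everything else */ }
%%

int main(void) { yylex(); return 0; }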

5.2 Stanford Parser

The Stanford Parser is a statistical parser. Its aim is to work out the grammatical structure of sentences: which groups of words go together as phrases, and which words are the subject or the object of a verb. Probabilistic parsers use knowledge of language gained from hand-parsed sentences to try to produce the most likely analysis of new sentences. These parsers still make mistakes, as they do not always produce the best analysis. In the 1990s, statistical parsing was considered a breakthrough in natural language processing.
The set of part-of-speech tags and phrasal categories depends on the language and the treebank on which the parser was trained. The parser can be used for English and for many other languages.
In our program, we have taken an xml file which contains the sentence:


“but at the same time , it takes wonderful pictures very easily in " auto " mode , so that
even an average joe like me can use it ! ”

Figure 13: xml file obtained after applying the Stanford Parser
After applying the Stanford Parser, we first obtain the tagging part. The tagging part contains each word from the sentence, followed by a "/" and then the POS (part of speech) tag specific to that word. Here is a list of all possible POS tags that may be found by the Stanford parser:
1. CC Coordinating conjunction
2. CD Cardinal number
3. DT Determiner
4. EX Existential there
5. FW Foreign word
6. IN Preposition or subordinating conjunction
7. JJ Adjective
8. JJR Adjective, comparative
9. JJS Adjective, superlative
10. LS List item marker
11. MD Modal
12. NN Noun, singular or mass
13. NNS Noun, plural

14. NNP Proper noun, singular
15. NNPS Proper noun, plural
16. PDT Predeterminer
17. POS Possessive ending
18. PRP Personal pronoun
19. PRP$ Possessive pronoun
20. RB Adverb
21. RBR Adverb, comparative
22. RBS Adverb, superlative
23. RP Particle
24. SYM Symbol
25. TO To
26. UH Interjection
27. VB Verb, base form
28. VBD Verb, past tense
29. VBG Verb, gerund or present participle
30. VBN Verb, past participle
31. VBP Verb, non -3rd person singular present
32. VBZ Verb, 3rd person singular present
33. WDT Wh-determiner
34. WP Wh-pronoun
35. WP$ Possessive wh -pronoun
36. WRB Wh-adverb
Figure 14: Alphabetical list of part -of-speech tags
We have also worked with Stanford CoreNLP 3.9.1, which can be found as an online demo. If we input the sentence from above, we obtain a graphical result of the parsed sentence, where we can visualize the enhanced dependency relations between the words and their POS tags. Our program also discovers them in the second part of its processing: we find the relation (the word and its position, the dependent and its position), e.g. amod(pictures-10, wonderful-9). But with the help of CoreNLP there is the possibility to actually see the relationships between the words graphically. This demo has found more relations than our Stanford parser run.

Figure 15: Enhanced dependencies resulting after applying the Stanford Parser

5.3 Implementing Lexers and Parser

Dependente.y file
We have the function create_word_with_pos_tag(), with the parameters word and pos_tag, in which we allocate dynamic memory. The input consists of several files; each file contains a sentence followed by the dependency relations in that sentence. The sentence rule is recursive, i.e. a list of word/pos_tag items, and the relations are also taken recursively.
Lex does very little here; yacc is the one that does all the work. Another important function is add_word_with_pos_tag(). In the grammar it receives the value $2: the tagged word is a pointer to a structure. In the implementation, add_word_with_pos_tag() puts the tagged word into the pos_tagged_sentence array; each time a new word is inserted, the counter is incremented (i++). The tagged word, a pointer to a structure, is thus inserted into pos_tagged_sentence every time a new word appears.
Each input file therefore contains a sentence, and then the relationships are taken into consideration. The sentence rule is recursive over pos_tagged_word items, where pos_tagged_word means WORD '/' POS_TAG, i.e. the word and its part of speech. We have the sentence followed by the relations, and the relations rule is recursive as well.

rel : RELATION '(' WORD '-' NUMBER ',' WORD '-' NUMBER ')'

Going further to its handling: with the strcmp function we find out whether the tag stored in pos_tagged_sentence is JJ, NN or JJS. The purpose of strcmp is to compare the string pointed to by the pos_tagged_sentence entry with the strings "JJ", "NN" or "JJS".
The function count_pair() counts the pairs after identifying them. In this function, the pos_tagged_sentence array is used starting only from position 1 onwards. Usually, in the C language, a simple array starts from 0, but here indexing starts from 1: if the code accessed position 0 it would result in a segmentation fault, because there is nothing stored there and parsing could not continue. That is why we make sure that head_position is bigger than 0, head_position being the third argument of count_pair().

count_pair($1, $3, $5, $7, $9)
count_pair(char *relation, char *head, int head_position, char *dependent, int dependent_position)
The value corresponding to a token, assigned to yylval by lex, is accessed in yacc using the symbols $1, $2, and so on. Recall that the values corresponding to the symbols in the handle of a grammar rule may be accessed using $1, $2, etc., according to their position in the production rule.
The attribute value of 'relation' is accessed by the symbol $1, the value of 'head' by $3, 'head_position' by $5, 'dependent' by $7 and 'dependent_position' by $9. The symbol $$ refers to the attribute value of the head of the production. Note that the head of a production must be a non-terminal; hence, it is possible to assign an attribute value to the head of a production by assigning a value to $$, as done in the pos_tagged_word rule below, where the attribute value is assigned through an assignment to $$. The grammar rules of dependente.y are the following:
files : files file
      | file
file : sentence relations
relations : relations rel
          | rel
rel : RELATION '(' WORD '-' NUMBER ',' WORD '-' NUMBER ')'
      { count_pair($1, $3, $5, $7, $9); printf("%s(%s-%d, %s-%d)\n", $1, $3, $5, $7, $9); }
sentence : sentence pos_tagged_word { add_word_with_pos_tag($2); }
         | pos_tagged_word { add_word_with_pos_tag($1); }
pos_tagged_word : WORD '/' POS_TAG { $$ = create_word_with_pos_tag(/*$1, */$3); }

If the head position were 0, we would not index into pos_tagged_sentence at all; pos_tagged_sentence[head_position]->word must be exactly the head word, and ->pos gives its part of speech. If the head position is somehow 0, the head is just displayed as it is.
We wanted to check that the head is the same as pos_tagged_sentence[head_position]->word, and it is, because head_position is the position of that head word. The first time we ran the program we realized the indices were offset, because the array started from 0; this is why we decided to start from 1. We also took into consideration that the Stanford parser itself numbers the words of a sentence starting from 1, which is helpful for debugging. The root is considered to be on position 0, which initially caused a segmentation fault; the real words are found from position 1 onwards, the root not being a proper word but a fictitious one. From there the parser goes on to the position of the next word and sees what position it has.
Also in count_pair() we check is_opinionated_rel(relation). If it simply returns 1 regardless of the relationship, i.e. if we allow all relationships, we get more pairs: 3807 pairs. The ### separator marks a pair made of a noun phrase and its dependent in a dependency relationship, i.e. the head and the dependent, the dependent being subordinated to the head. So we get 3807 pairs if we do not restrict the relation to be just amod and instead generate pairs for all the relationships; when we recompiled it to keep only amod, we got just 308 pairs, much less, less than a tenth.
In the stanford_xml folder we find 239 documents. These are the review documents over which we ran the Stanford parser: the main program was run for all the files with the xml extension in the stanford_xml directory, and the resulting xml output was placed back in stanford_xml. The vector size of 239 corresponds to a matrix with the pairs (dependent, head) on the rows, where the head should eventually be a noun phrase and the dependent an opinion-bearing word, i.e. an adjective, and with the documents on the columns. After going through the Stanford parser, we process the result further towards the neural network.
At count_pair(), is_opinionated_rel() compares the relation name either against all relations or just against some of them: we compared only with amod, or we can compare with others such as nsubj, nmod, case, advmod. If the result is positive, we check that head_position > 0 (it is very important that the head is not on position 0) and that the head and pos_tagged_sentence[head_position]->word are the same. Then we check is_aspect() and, if yes, is_opinion_word(). In is_aspect() the target is expected to be NN or NNS, singular or plural, strictly a noun, although we also put a return 1 in any case to let everything through, and we noticed that more relationships get through this way. In is_opinion_word() we check for JJ, JJR, JJS, i.e. adjective, comparative or superlative, again optionally with a return 1 to let everything through. Practically, count_pair() verifies is_opinionated_rel(); if yes, is_aspect(head_position); if yes, is_opinion_word(dependent_position), that is, the second element of the pair; and if yes, it calls create_pair(head, head_position, dependent, dependent_position), with the arguments taken from count_pair().
In the first phase we extracted single words, but then we realized that we are in fact interested in aspects, i.e. in recognizing the whole noun phrase, not just one word. Dependency relations hold between words; we are not interested in just a noun, we are interested in a noun phrase: NN, NNS and the other related words. We consider opinion words to be adjectives and targets to be nouns or noun phrases, and the dependency relationship actually holds between an adjective and a noun. We have to find the noun phrase that constitutes the whole aspect, not just the noun, when the noun is part of a noun phrase: the aspect is given by the noun phrase, not just by that noun.
So far, all the extracted targets were individual words, such as weight or size; however, many targets are phrases, like battery life. The relationship may be between life and beautiful, but we know the aspect / target is not just life, it is battery life, even if the basic relationship is between the head noun of the noun phrase and the adjective. Since the accepted targets are phrases, we need to identify them starting from the extracted individual words: we extract just life and we have to see that it is actually battery life, even if the dependency relationship is between life and beautiful, between the head word and the adjective. In this way the extraction makes more sense and brings substance to our idea.

battery_duration###battery_duration
battery_life###camera
battery_life###quote_k._reeves
battery_life###drawback
battery_life###battery_life
useful_features###many

As we consider targets to be nouns or noun phrases, we identify the phrases by combining each target word (the aspect word, such as life) with a window of neighbouring words: up to 2 consecutive nouns before and after the target word, and adjectives before the target word. After these checks, at the end, all the predicates return true: we check is_opinionated_rel, if so is_aspect, and if so is_opinion_word on the dependent. That is why we work with dependent_position and head_position, calling is_aspect(head_position) and not is_aspect(head): for the head and the dependent appearing in the relation we do not have the POS tag directly, so we have to remember that the same word was first identified in the sentence together with its position, and its position tells us what it is. In the relation itself we do not know its POS, but we also need its position, because back in the sentence, in pos_tagged_sentence[150], there is an array of pointers to structures, each holding a pointer to a char word and another to a pos_tag, identifying the position as well.
The position information may look redundant, but some words can be repeated several times in the sentence, so the relations give their positions; in longer sentences there is a real possibility of repeated words.
In order to count a pair consisting of a head word and its dependent, we follow a pipeline: check is_opinionated_rel, check is_aspect on head_position, check is_opinion_word on the dependent. If all conditions are met, we create the pair, having as parameters the head, its position, the dependent and its position, and put it in a_ow_pair, in order to return it as the created pair.
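The fragment below is only an illustrative reconstruction of this checking pipeline (the predicate bodies are simplified, and the real dependente.y code also performs the noun-phrase expansion described next):

#include <stdio.h>
#include <string.h>

struct tagged_word { char *word; char *pos; };

/* in the real program this array is filled by add_word_with_pos_tag() while parsing the sentence */
static struct tagged_word *pos_tagged_sentence[150];

/* accept only dependency relations that can link an aspect to an opinion word */
static int is_opinionated_rel(const char *relation)
{
    return strcmp(relation, "amod") == 0;    /* relax to `return 1;` to accept all relations */
}

/* the head should be a noun: NN or NNS */
static int is_aspect(int head_position)
{
    const char *pos = pos_tagged_sentence[head_position]->pos;
    return strcmp(pos, "NN") == 0 || strcmp(pos, "NNS") == 0;
}

/* the dependent should be an adjective: JJ, JJR or JJS */
static int is_opinion_word(int dependent_position)
{
    const char *pos = pos_tagged_sentence[dependent_position]->pos;
    return strncmp(pos, "JJ", 2) == 0;
}

static void create_pair(char *head, int head_position, char *dependent, int dependent_position)
{
    /* in the real program this expands the head into a noun phrase and stores the pair */
    printf("pair: %s (pos %d) / %s (pos %d)\n", head, head_position, dependent, dependent_position);
}

void count_pair(char *relation, char *head, int head_position,
                char *dependent, int dependent_position)
{
    if (head_position > 0 &&
        is_opinionated_rel(relation) &&
        is_aspect(head_position) &&
        is_opinion_word(dependent_position))
        create_pair(head, head_position, dependent, dependent_position);
}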
At the create_pair() function, things tend to get a little bit complicated because of the heuristics involved. We do not want to take into consideration just a word, but a noun phrase: we do not want to find only target, we want to find out beautiful_target. So we look at up to 2 nouns before and 2 nouns after the target word, and if we find an adjective we also look before and after it. When we check the part of speech, we look at pos_tagged_sentence[head_position] to see whether the basic aspect word is an NN, or whether it is an NNS. Then, if there is a word at head_position - 1, we check whether it is also a noun. First we make strcpy to copy the word into a string, and then we concatenate and check for adjectives as dependents; we check whether the words before and after the target are other words that are of interest to us. Usually in English the dependent adjective is before the noun, but not always. For example, if we find life and we obtain the word before it, battery life, we get a more understandable notion; everything must be interpreted at sentence level.
In the case where we find neither a noun nor an adjective within the window of 2 nouns before and after and one adjective before or after, we simply put the single word in the string. But if we have found a noun and also an adjective and we reach the end of the sentence, we stop, we do not search for another noun.
“As we consider targets to be nouns/noun phrases, we identify target phrases
by combining each target word with Q consecutive nouns right before and after
the target word, and K adjectives before the target word. We set Q = 2, K = 1 in our
experiments.”
This was taken from [2], Opinion Word Expansion and Target Extraction through Double Propagation, by Guang Qiu (Zhejiang University, China), Bing Liu (University of Illinois at Chicago), Jiajun Bu and Chun Chen (Zhejiang University, China).
We have followed the same rules with Q = 2 and K = 1: 2 consecutive nouns before and after, and 1 adjective before. We did not consider patterns such as noun-adjective-noun, and we did not keep searching for a noun after finding an adjective, because in English adjectives come before nouns; if the first word on the left is not a noun, it may be an adjective. We did not cover all the possibilities, because there can be endless combinations, but there are many other possibilities, and the cited paper can be taken into consideration for further experiments.
This search takes place as long as the compared position, head_position + 1, is smaller than the maximum. The maximum is the counter i, declared global; after each addition of a word the counter gets incremented, as can be seen in pos_tagged_sentence[i++], with the reminder that indexing always starts from 1.
The procedure is simple: when we find a noun we concatenate it, no matter whether it comes before or after, adding an underscore between the words; we update the head word and then check whether the position two steps further, still strictly smaller than the maximum that can be reached, holds an NN or NNS, and if so we concatenate it too. This is how we obtain noun phrases that are more meaningful.
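The fragment below sketches only the noun-window part of this expansion (the adjective handling and the other details of the real create_pair() code are omitted, and the names are illustrative):

#include <string.h>

struct tagged_word { char *word; char *pos; };

/* illustrative globals: in the real program they are filled while parsing a sentence */
static struct tagged_word *pos_tagged_sentence[150];
static int i = 1;                        /* next free position; word positions start at 1 */

static int is_noun(int p)
{
    return p >= 1 && p < i &&
           (strcmp(pos_tagged_sentence[p]->pos, "NN") == 0 ||
            strcmp(pos_tagged_sentence[p]->pos, "NNS") == 0);
}

/* expand the aspect word at head_position with up to 2 neighbouring nouns on each side,
   writing the resulting noun phrase (words joined by '_') into `phrase` */
void expand_aspect(int head_position, char *phrase, size_t size)
{
    int first = head_position, last = head_position;

    for (int q = 0; q < 2 && is_noun(first - 1); q++) first--;   /* nouns before the target */
    for (int q = 0; q < 2 && is_noun(last + 1); q++)  last++;    /* nouns after the target  */

    phrase[0] = '\0';
    for (int p = first; p <= last; p++) {
        if (p > first) strncat(phrase, "_", size - strlen(phrase) - 1);
        strncat(phrase, pos_tagged_sentence[p]->word, size - strlen(phrase) - 1);
    }
}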
Example:
exposure_control###exposure_control
First we found a pair of words, one of which is the head and one is the dependent, originally two different words. We have found a dependency relationship between control and exposure; both are nouns, and in our heuristic both get expanded into a noun phrase. If we look at control, the heuristic sees exposure in front of it and builds exposure_control; if we look at exposure, it sees control after it and appends that one too. That is a small drawback, but it is not a problem: the parser found the relationship between the two words, and trying to turn both of them into noun phrases simply makes them identical.
This is how it works: first it finds the relationship, by reading the rel rule in the grammar, file / sentence / relations. The sentence is defined and read, the grammar being written so that the sentence is followed by the relations; pos_tagged_word is a word followed by its tag, and after that there is a recursive succession of relations. We find the relations there, each with a head and a dependent. Following count_pair(), $1 is the relation, $3 is the head, $5 is the head position, $7 is the dependent, and $9 is the dependent's position: this is exactly what count_pair() is looking for. In fact, for exposure control the parser finds an amod relation between exposure and control, each of exposure and control being a word, and then considers exposure as head and control as dependent. count_pair() checks whether the relation is opinionated and then creates the pair, and create_pair() tries to find a more aggregated noun phrase, not just a word without meaning. Sometimes this goes to the extreme shown above, but we know the explanation: having found a pair of exposure and control, dependent and head, at control it finds that there is another exposure in front and adds it, and at exposure it finds that there is a control behind and adds it. The words are concatenated using the strcat function.
The leaves contain positive or negative opinions; we talk about good / bad pairs. To calculate the vectors of the intermediate nodes one cannot start from their own vectors, because only the leaves have vectors. The intermediate vectors are therefore obtained as the sum or the vector mean over the successor nodes, in fact only over the leaves, because only the leaves have vectors.
We read the taxonomy and convert it using the converter provided by the project mentor. The taxonomy is in the is_a format and is converted into the .unit format; this happens when running the neural network, resulting in the .unit file. The aim is to obtain the camera ontology.unit from the camera ontology.tax. We have to take into account that the neural network works at a lower level, so for the intermediate nodes, for example the root, which is obviously an intermediate node, the mapped vectors are all the pair labels that are actually the leaves of the sub-tree (of the entire tree, for the root), all concatenated, and their vector mean, the so-called centroid, is computed. Most of our experiments were done with the centroid as sum / number, that is, the arithmetic average, the mean vector. Another option would have been the sum of normalized vectors (N1), but we preferred the first option.

5.4 GHSOM unit output file

Figure 16: GHSOM.unit

Here 6 neurons are considered; the map has size 3 by 3. Each neuron can develop into another map: these 6 neurons develop the parent map into daughter maps. We can see XDIM and YDIM equal to 3. The 6 data items considered are digital_camera###digital, camera###the, camera###this, canon###'s, has###camera, ROOT###has. A neuron can develop into another map of size 2 by 2, which thus contains 4 neurons having the following positions: (0,0), (0,1), (1,0), (1,1).
GHSOM.unit contains 6 vectors; the vector dimension is 239 because we have 239 column components: on the rows are the pairs and on the columns the documents, the documents being those obtained after running the Stanford parser. Each data item is a vector, each with its label, no longer containing ### but just one label.
The final part of the format is the real-valued weight vector, which is the mean of the vectors mapped on that neuron: take, for instance, 4 pairs, each with its vector, and compute their average. On the root map of size 3 by 3 there is a parent neuron with 6 data items mapped on it; that neuron develops into a map of 2 by 2, with 4 items on one neuron, 1 on another and 1 on another.

Figure 17: Data item of a neuron

Here we see the neuron position, its id, the quantization error (how different the vectors mapped on it are from each other) and the weight vector of the neuron. The weight vector is the average of the mapped pair vectors: in the best case there are 4 of them, and the average of the four vectors appears at the end, with real numbers; so it is a weight vector. Daughter maps develop within these neurons, and there we find the four pairs. Here is the camera, and that is its weight vector, and from there a certain tree ordering begins. This is the end goal of the program, the one that generates GHSOM.unit.
For the pairs that were extracted we generated Opinion.dat and Opinion.tv. We generated them both for the case where only amod was used and for the case where all the relationships were taken into account. Not all of them were really relevant: there are other types of relationships which do not all lead to relations between the target and the adjective; only amod, nsubj and similar ones do, and some notations are different in the Stanford parser. Not all of these roughly 4000 relationships are of interest to us.
The classification, when running with the input ontology, is performed from top to bottom according to that ontology, and the purpose of the converter is to translate from the .tax format into the .unit format.
In the input ontology we look, inside the opinion files, for the relevant pairs. Not all of them are found when using only amod, which is too restrictive: not all leaves in the ontology are found there. The leaves contain ### between their two parts, corresponding to the pairs found in the corpus. In the opinion file generated with all the relationships we are able to find almost all of them, but some are not so relevant. The abstract_view_finder leaf, for example, has not been found in the restrictive run, but in the run with fewer restrictions the program is able to find it.

Dependente.l file

At relations, before we start BEGIN(relations), we have yylval.str = strdup(yytext). The relation name is recognized as a lower-case identifier followed by an opening parenthesis, the parenthesis serving only as context and not being consumed together with the name. We need yylval because we may need to return other values from the scanner to the parser; yytext is an array of characters which acts as a buffer for the input currently being parsed.
The situation is repeated when we have either WORD or POS_TAG. For the POS_TAG case, only capital letters are considered, plus the following characters: [A-Z.,:'$-].
A file usually consists of a sentence and its relations. There are some exceptions, some files where an unexpected full stop appears, so it is as if there were 2 sentences, which makes the grammar treat them as if they were two files, while in reality the file contains just one sentence with its relations. The problem is solved at the lex level with yywrap(). It is called at the end of the file, i.e. at EOF, and when it returns true the program is over; it depends on what it returns, 0 or 1. If we have not finished and there are still files to read from the directory, then it returns 0, the lexical-syntactic analysis continues, yacc keeps going, and a new file is prepared and opened. That is the role of yywrap: in yywrap we prepare and open the new file, and after that the next one. For the syntactic parser it does not matter that there were actually many files: the yacc syntax parser works globally, so it does not matter that the lexical analyzer reads several files in turn. The syntactic parser just sees that there were several inputs, no matter who read them, and that each file consists of a sentence and relations. Some inputs are split in two, looking like 2 files although they are actually a single one; this is resolved with the lexical wrapper, yywrap. Such a "file" does not always coincide with a physical file, which means the splitting is wrapped at the lexical level.
That is also why we have the limitation (<) on pos_tagged_sentence[i++]: when we finish inserting, the last i stays one past the last used position, acting as a sort of array size for the vector. When you have an array of 100, the words normally go from 0 to 99, but here they go from 1 to 99, because position 0 is reserved for the root. The root is not a proper word in the sentence, and this is why counting goes from 1 to n - 1.

Dependente.y file
yywrap() is explicitly called in dependente.y. It is typically called at the end of a file, at EOF. At the end of the 239 files we must prepare to open another file and return 0, so that the program does not stop and the lexical analysis can continue from the next file, which is opened just before returning 0. The first thing yywrap does is to try to open a file; yyparse, which is the parser generated by yacc, then parses all of the 239 files. Its goal is to produce the .dat and .tv files at the end, i.e. to provide the working material for building the neural network input file, with one vector per pair, 239 components, and finally the pair label, and so on for the next pair. A minimal sketch of such a yywrap() is shown below.
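Purely as an illustration (the directory handling and the file naming scheme below are simplified and hypothetical, not the actual dependente.y code):

#include <stdio.h>

extern FILE *yyin;                 /* the stream the lex-generated scanner reads from */

#define N_FILES 239                /* number of parsed Stanford XML documents */
static int current_file = 0;

/* called by the scanner at EOF: open the next input file, or stop after the last one */
int yywrap(void)
{
    char name[64];

    if (yyin) fclose(yyin);
    if (++current_file > N_FILES)
        return 1;                  /* 1 = no more input, the analysis stops */

    sprintf(name, "stanford_xml/doc%03d.xml.out", current_file);   /* hypothetical naming */
    yyin = fopen(name, "r");
    if (yyin == NULL)
        return 1;
    return 0;                      /* 0 = keep scanning from the newly opened file */
}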

Opinion.dat.flat.prop3 file

We give opinion.dat.flat.prop3 as input; this is the property file with the running parameters, and then we give the camera ontology .unit file as input. The program knows how to read this format and what the outcome should be. At the input, the whole format is .unit; the difference is that at the output the mapped vectors are also present, while at the input the nodes are empty, with some fictitious vectors that are just labels.
The property file appears as a command-line parameter, while the ontology .unit file is given on standard input, not as a command-line parameter; what the run actually outputs is gshom.unit. This is populated with the data: the data file is input_file = pairs.dat_flat.
The CameraOntology.unit file has the same format, with the coordinates in the array: the matrix is actually the rectangle in which, under each node, each cell grows another rectangle underneath, and so on. That is the idea of a tree of maps: for each node you have a matrix of m x n, so basically each node has m x n subordinate successors. CameraOntology.unit is the input, and what the run produces is gshom.unit, which looks the same, except that the input has no full vectors with the label at the end.
Unlike the output, which has the vectors with the label at the end, the input positions do not carry such vectors. The program, the neural network (the updated gshom), actually reads the property file, from which it learns which input data file to use: the flat data file we copied from opinion.dat into pairs.dat_flat, together with pairs.tv (the template vectors), which actually describe the 239 documents. This neural network reads a camera ontology that currently holds only some data labels, not the full vector data with the label at the end. The vectors of the node labels exist only for the leaves: the network picks up the respective vectors from the data set, that is, from the given pairs, from that given opinion file, where each leaf label corresponds to a pair, the positive opinion or the negative opinion; only those have vectors. It is then natural for the intermediate nodes to take the average over all the successors of the entire sub-tree; after all, only the pos and neg leaves have vectors. Take gray_feel_weight and bulking_weight: one is positive and the other is negative. The negative opinion and the positive one are put together and their vector sum, the centroid, is computed; for the converter that converts from the .tax format, the order is not important (CameraOntology.unit and CameraOntology.tax). If we look for weight: gray_feel_weight is the positive opinion about the intermediate node weight and bulking_weight is the negative one. Weight is in turn a son of design_and_usability, and design_and_usability is the son of camera.
Gray_feel_weight thus belongs to design_and_usability, and even this subcategory belongs to camera.
The camera node also has a positive and a negative leaf of its own: beautiful_camera and left_camera (lost in the fog). The grammar finds pairs just like this, some not so suggestive if you look at the phrase; when you talk about natural language and semantics, there are many stranger things that matter.

Chapter 6. Testing and Validation

6.1 Validation

In this chapter I will present the testing of our program using the unsupervised neural network. The purpose is to see whether the nodes are mapped correctly. Each aspect node has two leaves, each of them holding a data item vector with a positive or a negative opinion about that aspect.
I ran the prototype on reviews of the Canon camera. The accuracy of the results has not been measured; this implies further experiments on the results of the mapping.

Figure 18: Hexagonal topology of the SOM map

The system dynamically creates the user interface as an interactive graphical
display. The display is implemented as a dynamically created HTML file. It reads the text
file containing all document categories, a file that was created by the SOM document
map. First, we see the hexagonal topology of the SOM map , that is, each neuron has six
direct neighbors. Each neuron on the SOM map displayed is labeled with the name of the
document category hosted in the neuron. The graphical interface lets you navigate the
document map from any web browser. The purpose of this navigation is to retrieve
relevant documents in two steps. In the first step, clicking on a map neuron gives access
to an index of the category of documents hosted on that neuron. The index page is also a
dynamically generated HTML file that contains a list of links, one link for each of the
documents hosted in the neuron. Each of these hyperlinks has the name of the associated
document as a text and links to the document itself. In a second step of the navigation,
clicking on a hyperlink with the name o f a document in the index list gives access to the
mapped pairs in that document.


Figure 19: SOM map of categories of words

Similar to viewing and navigating the SOM document map, the graphical user interface also allows navigating the SOM map of categories of words. The interface gives access to the index page of any neuron on the SOM map of categories of words, thus visualizing the words hosted in that neuron, from which the pairs marked with the ### symbol can be formed.

Figure 20: SOM map of categories of words

And the tree structure can go even deeper, into a new layer with mapped words and aspects.


Figure 21: SOM maps of categories of words on a deeper layer

The figure describes the content of the categories of words hosted in some of the neurons of these two clusters of words. The mapping of the two clusters can be deduced from the coordinates of the neuron components exemplified in the figure. It can be noted that, although these two clusters are less stretched on the map and thus contain fewer neurons, each neuron hosts a larger number of words.
An important issue in this experiment was to choose the right size for the SOM document map, in order to reach a map with the highest visual expressiveness. It is about finding an optimal dimension of the map, so that the visualization of the clusters is as obvious as possible. The map size means the total number of neurons on its rectangular grid. For a given data set, different map dimensions mean different levels of granularity. Granularity (also called "graininess", the quality of being grainy) is the extent to which a material or system is composed of distinguishable pieces or grains: it can refer either to the extent to which a larger entity is subdivided, or to the extent to which groups of smaller indistinguishable entities have joined together to become larger distinguishable entities.

6.2 Testing

The parser can read various forms of plain text input and can produce various forms of output, including part-of-speech tagged text, phrase structure trees and dependency relations.
We run it using the main script of the Stanford parser distribution, which includes scripts for the Linux version.


Figure 22: lexparser.sh file

Here the output was generated with the above command. It uses a Java package in order to run. The mx300m option refers to the amount of memory that we allow it to use. The script was predefined in the Stanford distribution with a size of 150, and with that value the results were not conclusive for the xml files that were larger than usual, meaning that they contained more than one sentence: when running with the old setting, the files resulting from the Stanford parser failed to form, having size 0.

Figure 23: Out of memory when running Stanford parser


After solving the problem by giving the parser more memory, all the files run
smoothly. The speed of execution depends very much on the amount of RAM available:
the more memory, the faster the parsing program runs.
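For reference, the following is a minimal sketch of invoking the Stanford lexicalized parser
programmatically and printing the typed dependency relations, following the pattern of the
ParserDemo class shipped with the parser; the sample review sentence is illustrative, and the
program would be launched with an enlarged heap (e.g. java -mx300m ...), just as the modified
lexparser.sh script does.

import edu.stanford.nlp.ling.CoreLabel;
import edu.stanford.nlp.parser.lexparser.LexicalizedParser;
import edu.stanford.nlp.process.CoreLabelTokenFactory;
import edu.stanford.nlp.process.PTBTokenizer;
import edu.stanford.nlp.process.Tokenizer;
import edu.stanford.nlp.trees.*;

import java.io.StringReader;
import java.util.Collection;
import java.util.List;

public class DependencyDemo {
    public static void main(String[] args) {
        // Load the English PCFG model shipped with the Stanford parser.
        LexicalizedParser lp = LexicalizedParser.loadModel(
                "edu/stanford/nlp/models/lexparser/englishPCFG.ser.gz");

        String sentence = "The battery life of this camera is excellent.";  // illustrative review sentence

        // Tokenize the raw sentence and parse it into a phrase structure tree.
        Tokenizer<CoreLabel> tokenizer = PTBTokenizer.factory(new CoreLabelTokenFactory(), "")
                .getTokenizer(new StringReader(sentence));
        List<CoreLabel> words = tokenizer.tokenize();
        Tree parse = lp.apply(words);

        // Convert the phrase structure tree into typed dependency relations.
        TreebankLanguagePack tlp = lp.treebankLanguagePack();
        GrammaticalStructureFactory gsf = tlp.grammaticalStructureFactory();
        GrammaticalStructure gs = gsf.newGrammaticalStructure(parse);
        Collection<TypedDependency> dependencies = gs.typedDependenciesCCprocessed();

        dependencies.forEach(System.out::println);  // prints relations such as nsubj, amod, ...
    }
}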

Figure 24: Successful running of Stanford parser

Chapter 7. Conclusions
In this chapter the final conclusions, contributions and ideas about further
development are discussed. I am satisfied that I was able to learn in more detail about
Text-Based Ontology Learning and about Web mining with self-organizing maps. Even
though I managed to create an experimental prototype, I think that a lot of improvements
can be added in the future. For example:
 we can verify how well the neurons were mapped on the leaves.
 another improvement would be a nicer graphical interface.
 when multiple opinions with possibly opposing polarities about different aspects of
the target entity are expressed in one and the same sentence, we could go granularly
below the sentence level.

We based our project on the similarity between the attribute vectors: the vector
representations of the phrases propagate downward, starting from the root of the
ontological structure of the target object.
Finally, each phrase stops on a leaf node, which always represents a polarity
(positive or negative) of opinion. Following the classification of a sentence onto a leaf, the
predefined polarity of the leaf becomes the opinion polarity attributed to that phrase.
In the grouping process used to identify the aspects, the vectors of the phrases were bags
of words. Nouns are the lexical units that express aspects. For classifying sentiments,
adjectives and adverbs are also relevant.
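A minimal sketch of how such a bag-of-words vector could be built from a POS-tagged phrase
is given below, keeping nouns for the aspect-grouping step and adjectives and adverbs for the
sentiment step; the Penn Treebank tag names and the sample phrase are illustrative, and this
is only an approximation of the actual vector construction.

import java.util.HashMap;
import java.util.Map;

public class BagOfWords {

    // Builds a term-frequency bag of words from (word, POS tag) pairs,
    // keeping only the word classes relevant for the given step.
    public static Map<String, Integer> buildVector(String[][] taggedPhrase, boolean forAspects) {
        Map<String, Integer> bag = new HashMap<>();
        for (String[] tokenAndTag : taggedPhrase) {
            String word = tokenAndTag[0].toLowerCase();
            String tag = tokenAndTag[1];
            boolean keep = forAspects
                    ? tag.startsWith("NN")                           // nouns express aspects
                    : tag.startsWith("JJ") || tag.startsWith("RB");  // adjectives / adverbs carry sentiment
            if (keep) {
                bag.merge(word, 1, Integer::sum);
            }
        }
        return bag;
    }

    public static void main(String[] args) {
        String[][] phrase = {
                {"The", "DT"}, {"battery", "NN"}, {"life", "NN"},
                {"is", "VBZ"}, {"really", "RB"}, {"excellent", "JJ"}
        };
        System.out.println("Aspect bag:    " + buildVector(phrase, true));
        System.out.println("Sentiment bag: " + buildVector(phrase, false));
    }
}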
The descending classification process is based on computing the similarity between
the vector of the phrase being classified and the vectors of the taxonomic nodes traversed
downward.
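The sketch below illustrates such a descending classification under simplified assumptions:
every ontology node already holds a vector (a centroid for intermediate nodes), the similarity
is the cosine measure, and at each level the phrase vector descends toward the most similar
child until it reaches a leaf, whose predefined polarity is returned. The class and field names
are hypothetical.

import java.util.ArrayList;
import java.util.List;

public class DescendingClassifier {

    // A node of the aspect ontology: intermediate nodes hold centroid vectors,
    // leaves hold a predefined opinion polarity (+1 positive, -1 negative).
    static class OntologyNode {
        String name;
        double[] vector;
        int polarity;                          // meaningful only for leaves
        List<OntologyNode> children = new ArrayList<>();

        OntologyNode(String name, double[] vector, int polarity) {
            this.name = name;
            this.vector = vector;
            this.polarity = polarity;
        }
        boolean isLeaf() { return children.isEmpty(); }
    }

    static double cosine(double[] a, double[] b) {
        double dot = 0, na = 0, nb = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            na += a[i] * a[i];
            nb += b[i] * b[i];
        }
        return (na == 0 || nb == 0) ? 0 : dot / (Math.sqrt(na) * Math.sqrt(nb));
    }

    // Propagates the phrase vector downward from the root; the polarity of the
    // leaf where the descent stops becomes the opinion polarity of the phrase.
    static int classify(OntologyNode root, double[] phraseVector) {
        OntologyNode current = root;
        while (!current.isLeaf()) {
            OntologyNode best = current.children.get(0);
            for (OntologyNode child : current.children) {
                if (cosine(phraseVector, child.vector) > cosine(phraseVector, best.vector)) {
                    best = child;
                }
            }
            current = best;
        }
        return current.polarity;
    }

    public static void main(String[] args) {
        // Tiny illustrative ontology with one intermediate node and two leaves.
        OntologyNode posLeaf = new OntologyNode("good battery", new double[]{1, 0, 0}, +1);
        OntologyNode negLeaf = new OntologyNode("bad battery", new double[]{0, 1, 0}, -1);
        OntologyNode root = new OntologyNode("battery", new double[]{0.5, 0.5, 0}, 0);
        root.children.add(posLeaf);
        root.children.add(negLeaf);
        System.out.println(classify(root, new double[]{0.9, 0.1, 0}));  // prints 1
    }
}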
The leaves of the ontological tree are not bare positive / negative labels, but
positive / negative polarities of opinion about the aspects. A bag of words is considered to
be a phrase that expresses a strong positive / negative feeling about the aspect described
in its parent node.
Only the opinion phrases are annotated, with the aspect described in the sentence
and with a positive / negative numerical polarity of the aspect-related sentiment.
We have used an unsupervised neural network for classifying the positive /
negative opinion polarity. It takes into account the opinions mentioned in the opinion
sentences from product reviews.
We built an ontology of aspects / views for digital cameras. It has 42 concepts: a
third are intermediate nodes, i.e. objective descriptions of the aspects, and two thirds are
leaves, i.e. views (opinions) on the aspects. For any intermediate node we build the
centroid vector (mean vector) of the vectors of all the nodes in the subtree rooted at that
node, going down to all the leaves of the subtree.
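A small sketch of how the centroid (mean) vector of an intermediate node could be computed
from the vectors of all the leaves in its subtree; the leaf vectors in the example are
illustrative.

import java.util.List;

public class CentroidBuilder {

    // The centroid (mean vector) of an intermediate node is the component-wise
    // average of the vectors of all leaves in the subtree rooted at that node.
    static double[] centroid(List<double[]> leafVectors) {
        int dims = leafVectors.get(0).length;
        double[] mean = new double[dims];
        for (double[] v : leafVectors) {
            for (int i = 0; i < dims; i++) {
                mean[i] += v[i];
            }
        }
        for (int i = 0; i < dims; i++) {
            mean[i] /= leafVectors.size();
        }
        return mean;
    }

    public static void main(String[] args) {
        // Two illustrative leaf vectors belonging to the subtree of one intermediate node.
        double[] c = centroid(List.of(new double[]{1, 0, 0}, new double[]{0, 1, 0}));
        System.out.println(java.util.Arrays.toString(c));   // [0.5, 0.5, 0.0]
    }
}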
The main focus of our project was the ontology of aspects / views for digital cameras.
Only the leaves receive vectors built directly from the words of some corpus phrases. We do
not have candidate phrases for constructing the vectors of the intermediate nodes, since there
is no phrase annotated as objectively describing an aspect, i.e. annotated with a null
(neutral) sentiment polarity relative to that aspect. Our experiments are based only on
annotated phrases.

I have learned to work with an unsupervised neural network for positive / negative
sentiment polarity classification with respect to the aspects discovered in the phase of
grouping the phrases. For an unsupervised classifier, the accuracy of the learning is good
enough, especially since the classifier does not need annotated training data for a given
semantic domain.
In my opinion, working on this project has enriched my knowledge regarding
Text-Based Ontology Learning, Web mining with self-organizing maps and data mining.

Bibliography
[1] Emil St. Chifu, Tiberiu St. Letia, Viorica R. Chifu, "A Neural Model for Semantic
Oriented Aspect Based Opinion Mining", Technical University of Cluj-Napoca,
Post-Doctoral Programme POSDRU/159/1.5/S/137516, project co-funded from the
European Social Fund through the Human Resources Sectorial Operational Program
2007-2013.
[2] Guang Qiu, Bing Liu, Jiajun Bu, Chun Chen (College of Computer Science, Zhejiang
University; Department of Computer Science, University of Illinois at Chicago),
"Opinion Word Expansion and Target Extraction through Double Propagation",
20 July 2010.
[3] Wei Wei, Jon Atle Gulla (Department of Computer and Information Science,
Norwegian University of Science and Technology), "Sentiment Learning on Product
Reviews via Sentiment Ontology Tree", Proceedings of the 48th Annual Meeting of
the Association for Computational Linguistics, pages 404-413, Uppsala, Sweden,
11-16 July 2010.
[4] Emil Stefan Chifu, Ioan Alfred Letia, "Self-organizing Maps in Web Mining and
Semantic Web", in G.K. Matsopoulos (ed.), Self-Organizing Maps, INTECH, 2010.
[5] Stanford CoreNLP, https://stanfordnlp.github.io/CoreNLP/
[6] A. Sharma and S. Dey, "Using self-organizing maps for sentiment analysis", Cornell
University Library, 2013.
[7] E. Brill, "Some advances in transformation-based part of speech tagging",
Proceedings of the Twelfth National Conference on Artificial Intelligence, pages
722-727, Menlo Park, CA: AAAI Press, 1994.
[8] John R. Levine, Tony Mason, Doug Brown, "Lex & Yacc", 2nd/updated edition,
October 1992.
[9] Danqi Chen, Christopher D. Manning (Computer Science Department, Stanford
University), "A Fast and Accurate Dependency Parser using Neural Networks", 2014.
[10] Sebastian Schuster and Christopher D. Manning, "Enhanced English Universal
Dependencies: An Improved Representation for Natural Language Understanding
Tasks", in LREC 2016.
