Diploma project [606593]
UNIVERSITATEA POLITEHNICA BUCUREȘTI
FACULTATEA DE AUTOMATICĂ ȘI CALCULATOARE
DEPARTAMENTUL CALCULATOARE
Diploma project
Improving WordNet with the help of Word2Vec
Scientific coordinator:
Conf. Dr. Ing. Costin Chiru
Graduate:
Alexandru Ionescu
Bucharest
2017
Contents
1. Introduction ......................................................... 6
2. Word2Vec Skip-Gram Model ............................................. 8
3. WordNet ............................................................. 12
4. Related Work ........................................................ 16
5. Application Architecture ............................................ 18
6. Description of Software and Hardware used ........................... 21
7. Obtained Results .................................................... 24
8. Use Cases ........................................................... 29
9. Analysis of results ................................................. 31
10. Conclusions and Future Work ........................................ 34
11. Bibliography ....................................................... 36
Table of Figures
Figure 1. Pairs of words processed for a sentence when the window is equal to 2 ....... 8
Figure 2. Word2Vec model [10] ......................................................... 9
Figure 3. Model hidden layer related to resulted vectors [10] ........................ 10
Figure 4. Fragment of noun synsets structure [12] .................................... 13
Figure 5. Abram Handler k-part distance final results [14] ........................... 17
Figure 6. Application architecture ................................................... 18
Figure 7. Tabs view open on top 10 Wu Palmer Nouns ................................... 28
Figure 8. Example run noun search for bridal_wreath.n.01 ............................. 29
Figure 9. Example run noun search for imperialism.n.01 ............................... 30
Table of Tables
Table 1. WordNet Statistics – Number of unique words and synsets ..................... 14
Table 2. Top noun deprecated connections according to the absolute difference between Wu Palmer similarity and cosine similarity ......................................... 24
Table 3. Top verb deprecated connections according to the absolute difference between Wu Palmer similarity and cosine similarity ......................................... 24
Table 4. Top verb connections according to absolute difference path – cosine similarity .. 25
Table 5. Top noun connections according to absolute difference lch – cosine similarity ... 25
Table 6. Top verb connections according to absolute difference lch – cosine similarity ... 25
Table 7. Top strong hypernym/hyponym connections according to cosine similarity ...... 26
Table 8. Top weak hypernym/hyponym connections according to cosine similarity ........ 26
Table 9. Weak synonym connections according to cosine similarity ..................... 26
Table 10. Strong synonym connections according to cosine similarity .................. 26
Table 11. Comparison between square error of path related similarities ............... 31
Table 12. Overview of reviewed improvements .......................................... 33
Abstract
Natural language processing has always been a high profile topic in computer science. Some of the problems scientists are trying to solve are: natural language understanding, question answering, sentiment analysis, and the creation of artificial chat bots. To help solve these kinds of problems, a database containing a large number of concepts/words and the connections between them has been created. That database is named WordNet. It was created at Princeton University by people manually adding data.
Maintaining a very large database that is always changing is hard, especially when the database in question is used in different natural language processing algorithms. To keep WordNet up to date, several people are assigned to manually alter its concepts and connections, but those people are prone to human error.
The purpose of this paper is to create a proof of concept regarding the improvement of the human generated database WordNet using computer generated information from Word2Vec.
We will try to change WordNet's content using information from existing corpora. The main method used to achieve this goal is comparing the results of path based similarities between concepts (Wu and Palmer similarity, Leacock and Chodorow similarity, and Path similarity) with the cosine similarity between the Word2Vec vectors of the same concepts.
One way to improve WordNet is by adding new concepts chosen from the Word2Vec corpus which
have strong connections with words existing in WordNet.
Another way to improve WordNet is by updating its existing connections in order to remove links that have become deprecated and create links that have become relevant.
1. Introduction
WordNet [1] is a lexical ontology for English created by the Cognitive Science Laboratory at Princeton University. It is based on grouping words according to their parts of speech (noun, verb, adjective, adverb) and the concepts that they represent in specific contexts. The structure of WordNet is created by adding links between related concepts.
WordNet is used for numerous practical applications, such as: sentiment analysis [2], automated text summarization [3], word sense disambiguation [4], question answering [5], and artificial intelligence for chat bots [6].
One important application built on top of WordNet is WordNet::Similarity [7], which offers the possibility to compute the semantic similarity between two words/concepts. There are multiple ways to compute semantic similarity: using the path between concepts, their information content, their attributes, or various combinations of these. In this paper, we will only address path based similarity.
Semantic similarity is very important in computational linguistics and natural language processing, so having this semantic similarity value as close to reality as possible is crucial for the accuracy of text processing based applications.
However, since WordNet is a human curated database, human errors can occur in the creation of concepts and the relations between them. Also, concepts or the links between them may become deprecated or insufficient over time. Thus, there appears the need for a method of updating the information in WordNet, by adding or deleting concepts and links between them, so that the similarity computed based on WordNet reflects the current reality as accurately as possible.
Since its creation, in 1995, WordNet has undergone multiple modifications aimed at improving its coverage and the correctness of its links. This paper also presents an approach intended to improve the quality of WordNet by signaling concepts that should be added to the ontology, along with concepts whose meaning has changed in the meantime and whose connections should therefore be updated (some of them should be deleted, while others should be added to the database). To do that, we considered the semantic distances provided by another resource, Word2Vec [8], that was built in a different manner. Thus, by combining the information from the two lexical resources, we hope to obtain a more accurate resource, along with a methodology for automatically updating the content of the WordNet database to better reflect the current meanings of the words.
Word2Vec [8] is a resource for computing the similarity of two concepts by analyzing the contexts in which they appear in a very large text corpus. It was created by a group of researchers from Google led by Tomas Mikolov. The algorithm behind Word2Vec returns a numeric vector for each word in the corpus, and these vectors have a certain property: they are grouped so that vectors representing words with similar contexts are closer to each other in the created vector space.
In this paper, we compared the similarities obtained using WordNet path based methods with the ones obtained using the cosine similarity between the multi-dimensional vectors from Word2Vec. Having a high cosine similarity for a pair of words that have a small WordNet similarity means that the path between those words is too long and one or more connections should be added.
Conversely, having a small cosine similarity and a high path based similarity means that there is a possibility that the connections between those concepts have become partially deprecated and should be updated by removing some of them.
The corpus used for building the Word2Vec vectors has significantly more words than WordNet (approximately three million words in Word2Vec compared to approximately two hundred thousand words in WordNet). Thus, to see what words/concepts may be added to WordNet, we can analyze the words from the Word2Vec corpus that are not in WordNet; if such a word has multiple strong connections (according to a given heuristic) with existing WordNet words, we can assume that this word can be added to WordNet.
The proposed method to analyze and improve WordNet is not guaranteed to provide correct suggestions, and it should be used as a tagging system, like the PayPal fraud system [9], for example. PayPal's security algorithms do not say for sure whether a transaction is a fraud or not, but they tag it as suspicious and a human is assigned to check its validity. Similar to the PayPal system, we cannot say for certain whether a connection or a concept/word should be added/deleted, but we can tag connections and concepts and have a human check the validity of making those changes.
2. Word2Vec Skip-Gram Model
Word2Vec [8] trains a neural network with a single hidden layer to solve a specific problem, but afterwards the neural network is not used for the problem for which it was trained. The real purpose is to learn the values of the hidden layer; these values will in the end be the word vectors that the algorithm generates.
2.1 The Fake Problem
The neural network is trained to solve the following problem: given a large corpus of text for training (having a vocabulary of 10,000 words), the network receives as input one word that is randomly chosen from the text. Then, another word is randomly picked from the context of the first word (the words surrounding it), and the network should output, for each word in the vocabulary, the probability that it is this context word.
The context of a word is represented by a window of visibility of fixed size that is given at the start of the algorithm. For example, if we choose the size of the window to be 15, the network will look at the 15 words before the input word and the 15 words after the input word.
The output containing the probabilities gives us, for each word in the dictionary, the probability of it appearing close to the input word. For example, if we give the neural network the word “Sea”, the output probabilities will be bigger for the words “Black” or “Baltic” than for words that have no link with the input, like “tennis” or “football”.
The neural network is trained by being fed pairs of words found in the training document. The following image (Fig. 1) shows examples of word pairs for a given sentence when the processing window is equal to 2.
Figure 1. Pairs of words processed for a sentence when the window is equal to 2
The network will learn statistics from the number of occurrences of each pair. For example, the network has a higher probability of receiving as input multiple (Black, Sea) pairs than (tennis, Sea) pairs. When the training phase is finished, if the network receives the word “Sea” as input, the output will have a higher probability for “Black” than for “tennis”.
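The pair-generation step described above can be sketched as follows. This is a simplified illustration (whitespace tokenization, no subsampling), not the original Word2Vec implementation:

```python
def skipgram_pairs(tokens, window=2):
    """Generate (input word, context word) training pairs, as in Fig. 1."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # the center word is never its own context
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps".split()
print(skipgram_pairs(sentence, window=2)[:4])
```

With a window of 2, interior words contribute up to four pairs each, which is exactly what the network is fed during training.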
2.1.1 Model Details
A neural network is not designed to receive unprocessed text as input. The words given as input have to go through a processing phase, so that the neural network can recognize the information and learn from it. The creators of Word2Vec chose to create a vocabulary containing all the words from the training text. Then, each word is represented using a one-of-K (one-hot) vector representation. This vector has 1 on the position of the word, and 0 on every other position.
Below, in Fig. 2, we have a visual representation of the Word2Vec model. In Fig. 2 we see an example of a word representation, how that word is propagated through the hidden layer, and how the output layer creates the probabilities for each word.
Figure 2. Word2Vec model [10]
There is no activation function on the hidden layer, but the output layer uses softmax. When the network is trained on word pairs, the input is a vector representing the input word, and the output is another vector representing the output word.
When we evaluate the neural network on an input word, the output will be a probability distribution: a vector having the size of the vocabulary, in which each element represents the probability that the word at that index is the word randomly selected from the window of the input word.
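Evaluation of the trained network can be sketched as below. The 4-word vocabulary, 2 hidden neurons, and all weight values are made-up toy numbers (the real model is 10,000 x 300), used only to show the one-hot input, the linear hidden layer, and the softmax output:

```python
import math

def softmax(xs):
    """Turn raw scores into a probability distribution."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

# Toy weights: vocabulary of 4 words, hidden layer of 2 neurons.
W_in  = [[0.1, 0.2], [0.4, 0.1], [0.0, 0.3], [0.2, 0.2]]  # 4 x 2
W_out = [[0.3, 0.1, 0.0, 0.2], [0.1, 0.4, 0.2, 0.0]]      # 2 x 4

def forward(word_index):
    # One-hot input: multiplying by W_in just selects row `word_index`.
    h = W_in[word_index]
    scores = [sum(h[k] * W_out[k][j] for k in range(len(h))) for j in range(4)]
    return softmax(scores)  # one probability per vocabulary word

probs = forward(1)
print(probs, sum(probs))
```

The output is a length-4 vector of positive values summing to 1, i.e. the probability distribution over the vocabulary described above.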
2.1.2 The Hidden Layer
The model published by Google on the official page of Word2Vec consists of vectors of three hundred values for each word. If we want to learn such vectors, the network has to learn a 300-dimensional encoding for each word. In other words, the hidden layer of the network will have 300 hidden neurons, and the weight matrix for this layer will have 10,000 rows and 300 columns, one column for each hidden neuron. The number of neurons in the hidden layer is a hyper-parameter that needs to be chosen (tested) with respect to the application in which the resulting vectors will be used.
In Fig. 3 we can see how the weights of the hidden layer matrix are mapped to the created word vectors.
Figure 3. Model hidden layer related to resulted vectors [10]
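The key property behind Fig. 3 can be verified directly: multiplying a one-hot vector by the hidden layer weight matrix is the same as reading out one row of that matrix, which is why the rows of the trained matrix are the word vectors. The 4x3 matrix below is an arbitrary toy example:

```python
def one_hot(index, size):
    """Build a one-of-K vector with 1.0 at `index`."""
    v = [0.0] * size
    v[index] = 1.0
    return v

def matvec(v, M):
    """Multiply row vector v (length n) by matrix M (n x d)."""
    return [sum(v[i] * M[i][j] for i in range(len(v))) for j in range(len(M[0]))]

# Toy weight matrix: 4 words x 3 hidden neurons.
W = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6], [0.7, 0.8, 0.9], [1.0, 1.1, 1.2]]
assert matvec(one_hot(2, 4), W) == W[2]  # the "word vector" is just row 2
```

This is why, in practice, no actual matrix multiplication is needed to obtain a word's vector: a simple row lookup suffices.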
2.2 Improvements to Word2Vec Skip-Gram
Even though the Word2Vec skip-gram method gives fine results, the high computational time and the inability to compute vectors for pairs of words determined researchers to find ways to improve the way the text is processed.
The solution came with the development of Word2Vec Negative Sampling, which was released in a second paper [11] after the Word2Vec Skip-Gram paper [8].
The following solutions were provided:
1. Groups of two or more words are treated as single words if they appear often enough, meaning that all frequently used expressions are mapped to a Word2Vec vector.
2. The elimination (subsampling) of highly frequent words, to increase the speed of the algorithm by reducing the number of input words.
3. A new method for updating only a small percentage of the model weights in the training phase.
The improvements were seen not only in the speed of the training phase but also in the reduction of the noise in the vector space. The quality of the final vectors improved as well.
2.3 Similarity Using Word2Vec
In order to compute the similarity between two words using the vectors resulting from the method described above, an option is to compute the cosine similarity between the Word2Vec numeric vectors of the corresponding words.
Cosine similarity is a way to compute the similarity of two vectors that have non-zero norm. It is a method where it is not the vectors' magnitude that matters, but their orientation. If they have the same orientation, their similarity will be 1; if they form a 90° angle, their similarity will be 0; and if they have a 180° angle between them, their cosine similarity will be -1.
The formula for computing the cosine similarity is derived from the Euclidean dot product (1):

    A · B = ||A|| ||B|| cos(θ)    (1)

Given two numeric vectors A and B, each with n elements, the following formula (2) computes the cosine similarity between them:

    cos(θ) = (Σ_{i=1..n} A_i B_i) / (sqrt(Σ_{i=1..n} A_i²) · sqrt(Σ_{i=1..n} B_i²))    (2)
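The cosine similarity computation can be sketched directly from its definition (in the actual project a library such as Gensim would compute this over the 300-dimensional Word2Vec vectors; the 2-dimensional vectors below are only for illustration):

```python
import math

def cosine_similarity(a, b):
    """dot(A, B) / (||A|| * ||B||); defined only for non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [1, 0]))   # same orientation -> 1.0
print(cosine_similarity([1, 0], [0, 1]))   # 90 degree angle  -> 0.0
print(cosine_similarity([1, 0], [-1, 0]))  # 180 degree angle -> -1.0
```

The three printed values match the three cases described above: 1 for identical orientation, 0 for orthogonal vectors, and -1 for opposite vectors.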
3. WordNet
WordNet is a human curated database containing English words, grouped in four structures according to their part of speech (nouns, verbs, adjectives, adverbs). In each structure the words are grouped into concepts, called synsets, that contain the words having the same meaning in a given context. Synsets are connected with each other in a tree-like structure using different types of connections.
3.1 WordNet Synsets
WordNet has a tree-like structure; its main element is the synset. A synset (short for synonym set) represents a concept and contains multiple words that may be used to express that concept. The main property of the words contained in a synset is that, in a given context, all words from a synset are interchangeable.
A synset also contains a short definition of the concept, along with an example. Each synset is connected with other synsets using certain relations.
Synset names have the following structure:
- concept_main_name.part_of_speech.index_of_synset
For example: dog.n.01
- synset related to dog
- 'n' is short for noun
- 01 means the first synset for the 'dog' nouns
Below, we have an example of a synset with its main elements:
Synset "dog.n.01":
- Lemmas: "dog", "domestic_dog", "Canis_familiaris"
- Definition: 'a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds'
- Hypernyms: 'canine.n.02', 'domestic_animal.n.01'
- Hyponyms: multiple synsets, one for each breed of dog, for example 'corgi.n.01'
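The three-part naming convention above can be parsed mechanically, which is useful whenever synset identifiers such as dog.n.01 are handled as plain strings (NLTK itself resolves such identifiers via wn.synset('dog.n.01')); this small parser is an illustrative sketch, not part of NLTK:

```python
def parse_synset_name(name):
    """Split a WordNet synset identifier into (main name, part of speech, index)."""
    # rsplit from the right, since the main name may itself contain dots/underscores
    main_name, pos, index = name.rsplit(".", 2)
    return main_name, pos, int(index)

print(parse_synset_name("dog.n.01"))            # ('dog', 'n', 1)
print(parse_synset_name("bridal_wreath.n.01"))  # ('bridal_wreath', 'n', 1)
```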
3.2 WordNet Relations for Nouns
Noun synsets are grouped in a hierarchical structure, having as root node the synset entity.n.01.
The main relations between noun synsets are:
1. Hypernymy:
- A concept X is a hypernym of the concept Y if every Y is a kind of X. Example: "animal" is a hypernym of "dog".
2. Hyponymy:
- The reverse relation of hypernymy: a concept is a hyponym of another concept if the first concept is a kind of the second one.
3. Meronymy:
- A synset X is a meronym of the synset Y if X is a part of Y. Example: "door" is a meronym of "house".
4. Holonymy:
- The reverse relation of meronymy; for example, "building" is a holonym of "window".
5. Coordinate terms:
- Two synsets are coordinate terms if they share a common hypernym.
A visual example of WordNet synsets and connections can be seen in Figure 4.
Figure 4. Fragment of noun synsets structure [12]
3.3 WordNet Relations for Verbs
Verbs are also organized in a tree-like structure. Unlike the noun structure, where there is a root node, entity.n.01, which is a hypernym for all other nodes, the verb trees do not have such a node; however, to allow the computation of similarity using path related heuristics, a virtual/fake node can be added so that there is a path between any two nodes in the verb set.
The relations for verbs are similar to those for nouns:
1. Hypernymy: A verb is a hypernym of another if the action of the second verb is a (kind of) action of the first verb.
2. Troponymy: The equivalent of the hyponymy relation for nouns; a verb is a troponym of another verb if the action represented by the first verb is a type of the action of the second one.
3. Entailment: This is a special relation for verbs. A verb X is entailed by a verb Y if, in order to do the action Y, the action X is a prerequisite. For example, the verb "to play" is entailed by the verb "to win".
4. Coordinate terms: As in the noun structure, two verbs are coordinate terms if they have a common hypernym.
3.4 WordNet Adjectives
WordNet adjectives are not structured in a ranked hierarchy; they have a special arrangement. They are arranged in pairs of strong antonyms, such as "cold" – "hot", which are surrounded by satellites of weaker (less commonly used) antonyms connected by similarity relationships.
3.5 WordNet Adverbs
The special characteristic of adverbs is that some of them are derived from adjectives; those that have this property have a link to their corresponding adjective.
3.6 WordNet Statistics
WordNet is constantly being improved; the following numbers were taken from the official website of WordNet at the time of writing this paper.
Table 1. WordNet Statistics – Number of unique words and synsets

Part of speech | Number of words | Number of synsets | Word–Sense pairs
Noun           | 117798          | 82115             | 146312
Verb           | 11529           | 13767             | 25047
Adjective      | 21479           | 18156             | 30002
Adverb         | 4481            | 3621              | 5580
Total          | 155287          | 117659            | 206941
3.7 WordNet Similarities
There are multiple methods to compute the similarity between two WordNet concepts: methods based on the path between concepts, on their information content, on their attributes, or on various combinations of these.
There are many types of path-based similarities, but out of these, we only considered three in this paper:
1. Path Similarity [7] – A heuristic where the similarity between two concepts is computed considering the shortest path that connects the concepts in the tree (using the hypernym/hyponym relationships). Unlike the noun group, the verbs do not have a root node, so a fake root has been added to allow the computation of the similarity between any two verbs. The value returned is in the [0, 1] interval.
2. Leacock-Chodorow Similarity [7] – Just like in path similarity, the similarity is computed according to the shortest path between concepts, but it is combined with the maximum depth of the tree-like structure in which the concepts occur. The final similarity value is given by the formula -log(p/2d), where p is the length of the shortest path and d is the maximum depth of the tree.
3. Wu-Palmer Similarity [7] – This method considers two different factors: the depths of the two senses in the taxonomy and the depth of their most specific common ancestor in the WordNet tree, which is called the LCS (least common subsumer).
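The three path-based measures can be sketched on a toy taxonomy. This is an illustrative re-implementation, not NLTK's: the four-node taxonomy and the depth convention (edges counted from the root) are assumptions, and NLTK's own implementation differs in details such as counting depth from 1 and adding fake roots for verbs:

```python
import math

# Toy taxonomy: child -> hypernym; "entity" is the root.
hypernym = {"animal": "entity", "dog": "animal", "cat": "animal", "corgi": "dog"}

def ancestors(node):
    """Return the hypernym chain [node, ..., root]."""
    chain = [node]
    while node in hypernym:
        node = hypernym[node]
        chain.append(node)
    return chain

def depth(node):
    return len(ancestors(node)) - 1  # number of edges from the root

def lcs_and_path(a, b):
    """Least common subsumer and shortest hypernym/hyponym path length."""
    anc_a, anc_b = ancestors(a), ancestors(b)
    lcs = next(x for x in anc_a if x in anc_b)
    return lcs, anc_a.index(lcs) + anc_b.index(lcs)

def path_similarity(a, b):
    _, p = lcs_and_path(a, b)
    return 1.0 / (1.0 + p)  # in [0, 1], 1 for identical concepts

def lch_similarity(a, b, max_depth=3):
    _, p = lcs_and_path(a, b)
    return -math.log((p + 1) / (2.0 * max_depth))  # -log(p / 2d) convention

def wup_similarity(a, b):
    lcs, _ = lcs_and_path(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))

print(path_similarity("dog", "cat"), wup_similarity("dog", "cat"))
```

On this toy tree, "dog" and "cat" are two edges apart through their LCS "animal", so their path similarity is 1/3, and Wu-Palmer weighs the LCS depth against the depths of both senses.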
There are also a couple of types of information content based similarity. A particularity of these methods is that they depend on the corpus that was used when creating the information content. The NLTK Python WordNet library has a couple of pretrained dictionaries that can be imported and used as parameters for these similarities.
1. Resnik Similarity [7] – This type of semantic similarity provides a value depending on how related two words are, considering the information content of the LCS.
2. Jiang-Conrath Similarity [7] – Just like in Resnik similarity, the information content of the LCS is used, but the information content of the two compared synsets also matters. Formula (3) is used to compute the final similarity value, where IC is the information content, JCN is the Jiang-Conrath similarity, S1 is the first synset, S2 is the second one, and LCS is the least common subsumer.

JCN = 1 / (IC(S1) + IC(S2) - 2 * IC(LCS))    (3)

3. Lin Similarity [7] – Similar to JCN, Lin similarity takes the same parameters but the final formula is different (4):

LIN = 2 * IC(LCS) / (IC(S1) + IC(S2))    (4)
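Formulas (3) and (4) translate directly into code. The IC values below are made-up numbers standing in for the corpus-dependent dictionaries that NLTK loads; only the formulas themselves come from the text above:

```python
# Hypothetical information-content values for a few synsets; in practice
# these are corpus-dependent and come from NLTK's pretrained IC files.
ic = {"entity": 0.5, "animal": 2.0, "dog": 5.0, "cat": 4.8}

def resnik_similarity(ic_lcs):
    return ic_lcs  # Resnik: just the IC of the least common subsumer

def jcn_similarity(ic_s1, ic_s2, ic_lcs):
    return 1.0 / (ic_s1 + ic_s2 - 2.0 * ic_lcs)  # formula (3)

def lin_similarity(ic_s1, ic_s2, ic_lcs):
    return 2.0 * ic_lcs / (ic_s1 + ic_s2)  # formula (4)

# Comparing "dog" and "cat" with LCS "animal":
print(jcn_similarity(ic["dog"], ic["cat"], ic["animal"]))  # 1 / 5.8
print(lin_similarity(ic["dog"], ic["cat"], ic["animal"]))  # 4.0 / 9.8
```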
4. Related Work
There is not much research analyzing the relation between WordNet similarities and the cosine similarity computed using Word2Vec vectors. One paper [13] made an attempt to improve the accuracy of the Japanese WordNet using Word2Vec. Another paper [14] tried to create, for each word, a list of the top 200 closest words using cosine similarity, and to see the correlation between a word's rank and the relation in WordNet.
4.1 Improving the Japanese WordNet [13]
The Japanese WordNet is the Japanese equivalent of the English WordNet created at Princeton. It was created in 2006, and ever since it has been under continuous development. The structure of the lexical database was copied from the English WordNet; the concept of synsets and links between synsets remains. However, the Japanese WordNet is not merely a translation of the English WordNet: linguistic differences led to the decision to remove certain synsets and add others.
Previous studies [15, 16] have shown that approximately 5% of the entries in the Japanese WordNet may contain errors, some of them being words that are placed in the same synset even though they are not synonyms. Other types of errors are represented by synsets that do not have the right connections or are not connected at all, causing WordNet similarities to give unrealistic values.
A method using cosine similarity between the Word2Vec vectors of the elements of synsets had been tried before, but the results were not that good. In [13], the authors used decision trees to check the validity of hyponymy relationships. The study tried to verify two hypotheses. The first was that the vector space generated by Word2Vec has “noise” that makes the cosine similarity values inaccurate, but that there exists a subspace of the Word2Vec space that is relevant for detecting synonyms. The second hypothesis was that Word2Vec can be used to find the sense of synonyms that are members of the same synset.
As part of their experiment, they generated Word2Vec vectors using documents from Wikipedia. They ran the training algorithm twice: once to get vectors of size 200, and a second time to get vectors of size 800.
The paper's final conclusions were that, using a relatively small number of Word2Vec embedded vectors, it is possible to find words connected in the Japanese version of the WordNet lexical database, and that there is noise in the vector space generated by Word2Vec that limits the use of the cosine similarity score in locating related words.
4.2 Relatedness of Distance in Word2Vec Space with WordNet Relations [14]
Word2Vec creates a vector space of words that are related, but the relations that the cosine similarity reveals are not clear. In his master's thesis [14], Abram Handler investigated the probability that two words that are k-distance apart in Word2Vec have a connection in WordNet.
The k-part rank is generated for each word; in other words, each word has a ranked list of the words that are closest to it. This list is generated by computing the cosine similarity with every other word in the training corpus and then ordering the results. If a word A is in the top 200 rank list of word B, it does not necessarily mean that word B is in the top 200 words of A.
Abram Handler's study aimed to show the correlation between the Word2Vec ranking based on cosine similarity and the relation between the two words in the human curated English word database WordNet.
The experiment was run on nouns, and the relations tested were those of synonymy, hypernymy, hyponymy, meronymy, and holonymy.
His assumption was that if two words are at a small distance in the Word2Vec vector space, there must be a connection between them in WordNet.
For the experiment, he selected 10,000 random words from the NLTK Reuters corpus and determined the average number of connections for each word.
Fig. 5 shows the final result of the experiment and summarizes its conclusions: for distances smaller than 50, synonyms and hypernyms have a high number of connections, after which the number tends to fall; they still have some connections even at 200, but 3 or 4 times fewer. When k is smaller than 50, holonyms, meronyms and hyponyms have some connections, but after that the precision goes near 0.
Figure 5. Abram Handler k-part distance final results [14]
5. Application Architecture
The entire project is divided into two big structures. The data processing structure is represented by the importing of the two main databases used in this application: the official Word2Vec corpus, obtained by training on the Google News corpus, which contains approximately three billion words with three million distinct words or groups of words converted to Word2Vec form, and the human curated WordNet database, which is loaded using the Python NLTK tool. The Word2Vec corpus is basically a dictionary connecting the textual form of a word with its Word2Vec vector form. The Word2Vec form can be used to compute the cosine similarity between two words.
After the two corpora have been loaded, several experiments are made on the two datasets.
5.1 Main Architecture Diagram
Figure 6. Application architecture
In Fig. 6 we see the main flow of the operations that occur in the application. The application's general purpose is to improve WordNet using the Word2Vec corpus. In order to do that, both databases are loaded using Python libraries: WordNet is loaded using Python NLTK [17], and the Word2Vec corpus is loaded using the Python Gensim [18] library. After that initial phase, there comes a combine-and-process phase where different tests are made. In order to save the results, a MySQL database is used. The web application that shows the results in graphical form connects to the MySQL database to get the necessary data.
5.2 Combine and Compare Process
The purpose of this phase was to analyze WordNet using the Word2Vec corpus. Using this analysis, we can improve the WordNet database: new concepts/connections can be added/deleted. Also, during this phase we may discover that new types of connections are relevant and should be added to WordNet. In order to achieve these goals, several tests have been performed.
5.2.1 Comparison Between Synonyms in Each Synset
In this test, we computed the cosine similarity for each pair of synonyms from each WordNet synset, to see if any word should not be in that specific synset.
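This test can be sketched as follows. The `vectors` dictionary is a tiny made-up stand-in for the real Word2Vec model, and the 0.5 threshold is an illustrative choice, not the one used in the experiments; the point is flagging synonym pairs whose cosine similarity is suspiciously low:

```python
import math
from itertools import combinations

# Hypothetical stand-in for the Word2Vec model: word -> vector.
vectors = {
    "dog": [0.9, 0.1],
    "domestic_dog": [0.85, 0.15],
    "canis_familiaris": [0.1, 0.9],
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def suspicious_synonyms(lemmas, threshold=0.5):
    """Flag synonym pairs whose cosine similarity falls below the threshold."""
    return [(w1, w2) for w1, w2 in combinations(lemmas, 2)
            if w1 in vectors and w2 in vectors
            and cosine(vectors[w1], vectors[w2]) < threshold]

print(suspicious_synonyms(["dog", "domestic_dog", "canis_familiaris"]))
```

Flagged pairs would then be handed to a human reviewer, in line with the tagging-system approach described in the introduction.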
5.2.2 Comparison of Synsets with First Grade Connections
In this test, for all first-grade connections of hyponymy, hypernymy, and meronymy, we computed the cosine similarity and also all three path-based WordNet similarities.
The purpose of this test was to see if there are any first-grade connections that should be deleted. If a connection has a weak cosine similarity and a strong path similarity, then there is a chance that that specific connection is deprecated or was added incorrectly.
5.2.3 Comparison Between Synsets
This step was the most demanding task from the computational point of view. This step, like the previous one, was performed separately for nouns and for verbs, being constrained by the WordNet structure, which is divided according to the part of speech.
Each synset has been compared with all the other synsets, taking into account the cosine similarity and all three path similarities. For each synset, the top ten connections with major differences between the cosine similarity and the WordNet similarity have been stored in three different tables, one for each type of WordNet similarity.
This algorithm has a complexity of O(n^2), where n is equal to the number of synsets processed. For nouns that number is 80,000, and for verbs n = 13,000.
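The top-ten selection for one synset can be sketched with `heapq`, which avoids sorting all candidate pairs. The similarity functions below are toy lambdas standing in for the real Wu-Palmer and Word2Vec cosine computations:

```python
import heapq

def top_k_differences(synset, others, sim_wordnet, sim_cosine, k=10):
    """Keep the k pairs with the largest |WordNet similarity - cosine similarity|."""
    diffs = ((abs(sim_wordnet(synset, o) - sim_cosine(synset, o)), o) for o in others)
    return heapq.nlargest(k, diffs)  # avoids materializing a full sorted list

# Toy stand-ins: pretend everything looks close in WordNet, while the
# "cosine" similarity varies per word.
wn_sim = lambda a, b: 0.9
cos_sim = lambda a, b: 0.1 * len(b)
print(top_k_differences("dog", ["cat", "boat", "television"], wn_sim, cos_sim, k=2))
```

Run once per synset, this gives exactly the per-synset top-ten rows that are stored in the result tables.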
5.2.4 Words to be Added Test
The purpose of this test was to see which words from the Word2Vec corpus can/should be added to WordNet.
In this step, only the cosine similarity was used. Each word from the Word2Vec corpus which is not in WordNet was compared with all the words from WordNet, using the cosine similarity between the Word2Vec forms of these words. If a new word has multiple strong connections with WordNet words, then that word is added to a “could be added” database for further human supervised analysis.
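The candidate-selection heuristic can be sketched as below. Both the 0.6 threshold and the minimum of two strong links are hypothetical parameters chosen for illustration (the paper leaves the exact heuristic open), and the toy cosine function stands in for the real Word2Vec similarity:

```python
def addition_candidates(new_words, wordnet_words, cosine, threshold=0.6, min_links=2):
    """Suggest a word for addition when it has several strong connections."""
    candidates = []
    for w in new_words:
        strong = [v for v in wordnet_words if cosine(w, v) >= threshold]
        if len(strong) >= min_links:
            candidates.append((w, strong))  # keep the evidence for the reviewer
    return candidates

# Toy cosine standing in for the real Word2Vec similarity.
toy_cos = lambda a, b: 0.8 if a[0] == b[0] else 0.2
print(addition_candidates(["smartphone"], ["system", "screen", "dog"], toy_cos))
```

Storing the supporting strong connections alongside each candidate makes the subsequent human review step straightforward.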
5.3 Database Structure
There are six main tables that have been created and populated for the purpose of this paper. One of them is Distance_noun_wup, a table containing ten columns: Id - int(11), Noun1 - varchar(100), Noun2 - varchar(100), Cosine_dist - double, Path_dist - double, Wup_dist - double, Lch_dist - double, Path_cos - double, Wup_cos - double, Lch_cos - double. This table contains, for each synset, the top ten connections ordered by the absolute value of the difference between the Wu-Palmer similarity and the cosine similarity, represented by the Wup_dist column.
The other five main tables are: Distance_noun_lch, Distance_noun_path, Distance_verbs_path, Distance_verbs_lch, and Distance_verbs_wup. These tables have the same format as the Distance_noun_wup table, the only differences being the part of speech and the similarity function that were used for choosing the top 10 connections.
In each table, all distances are stored to ease the comparison between them.
5.4 Web Interface
The web interface was implemented using HTML5, JavaScript, CSS, and MySQL.
The most important part of the graphical interface is the vis.js [20] framework for JavaScript, which allows the user to draw interactive graph-like structures. In this interface, the graph nodes are the synsets/words and the edges are the connections between them.
The AngularJS framework was used for the implementation of the tabs and to make the AJAX connection between the JavaScript code and the server that contains the data required to generate the local graph.
6. Description of Software and Hardware Used
To run the memory- and time-demanding processing, a Google Cloud server was used.
6.1 Hardware Used
In order to perform the highly computational tasks, a Google Cloud server was used. The motives for using a cloud server were the time required to compute all the similarities and the storage required to save them.
Google Cloud Platform can be used to deploy a large number of servers in a short time, with the protection of over 700 engineers available to solve any problem that may occur.
For the problem at hand, a custom Google Cloud virtual machine was created, having the following characteristics:
1. Virtual machine name: projectWord2Vec
2. CPU platform: Intel Haswell
3. Zone: europe-west1-d
4. Internal IP: 10.132.0.3
5. External IP: 104.199.28.180
6. Machine type:
o 4 vCPUs
o 16 GB memory
To connect to the virtual machine, the Google Compute ssh API was used. The full command for establishing a connection is: gcloud compute --project "massive-hub-166018" ssh --zone "europe-west1-d" "projectWord2Vec".
Google provides a web-based secure shell connection, but it was not used because it is not stable: if the machine is under high computational pressure, the secure shell (SSH) connection fails.
The 16 GB of random access memory was chosen so that the pre-calculated Word2Vec vectors and WordNet can be loaded at the same time, without having to swap anything to disk.
6.2 Software Used
6.2.1 Python NLTK WordNet
NLTK [17] is a Python library for working with human language data. It has easy built-in interfaces for working with multiple datasets and resources, WordNet being one of them. It implements multiple algorithms for text processing, such as tokenization, classification, stemming, and semantic reasoning.
6.2.2 Python Gensim library
The original Word2Vec library, which was released by the team led by Tomas Mikolov, was written in the C programming language, making it difficult to write and test code quickly.
The Gensim Word2Vec [18] library is a Python wrapper for the original C library, making Word2Vec easier to use. There is also a pure-Python Word2Vec implementation, but even though it uses NumPy [19], the Python version is 70x slower than the one implemented in plain C.
Prerequisites for installing the Gensim Python Word2Vec:
1. C compiler
2. At least Python 2.0
This library can process large texts and provide Word2Vec vectors for the words, with the possibility of saving the results in a binary file, which can later be reused without having to go through the training phase. It is also compatible with the corpus made available by Google, which was trained on the Google News corpus and contains 3,000,000 unique vectors.
Example of use: code to load the Google corpus, provided that it is downloaded in the current folder:
model = KeyedVectors.load_word2vec_format('GoogleNews-vectors-negative300.bin', binary=True)
6.2.3 Python database MySQL connector
In order not to lose any data in case of a server or software malfunction, after each synset was processed its top 10 connections according to the different heuristics were added to a MySQL database using a Python database library.
6.2.4 LAMP Server
Although PHP was used only in the presentation part of the project, the virtual machine can be used as a LAMP machine (Linux, Apache, MySQL, PHP).
The operating system is "Debian GNU/Linux 8 (jessie)", which does not have a graphical user interface (GUI) installed, all operations being made using the command line. No software that was not required for this project was installed.
The MySQL relational database is used for the presentation and also in the computing phase, to save incremental progress to the database.
6.2.5 JavaScript
For the presentation of the concepts and the links between them, the Vis JavaScript drawing library1 [20] was used. The JavaScript code makes an Ajax call to the server to get a JSON file containing the data required to draw the graph for a given synset.
1 Vis – http://visjs.org/
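The JSON payload returned to the front end could be built as sketched below. The nodes/edges shape follows the vis.js network data format; the function name, the example synsets, and the similarity values are hypothetical, and the blue/red convention matches the one the interface uses for edges that are stronger in Word2Vec or in WordNet:

```python
import json

def graph_for_synset(center, connections):
    """Build a vis.js-style graph for one synset.

    connections: list of (other_synset, cosine_sim, wordnet_sim) tuples.
    """
    nodes = [{"id": center, "label": center}]
    edges = []
    for other, cos_sim, wn_sim in connections:
        nodes.append({"id": other, "label": other})
        # Blue: stronger in Word2Vec (candidate to add); red: stronger
        # in WordNet (candidate to deprecate).
        color = "blue" if cos_sim > wn_sim else "red"
        edges.append({"from": center, "to": other, "color": color})
    return json.dumps({"nodes": nodes, "edges": edges})

payload = graph_for_synset(
    "burger.n.01",
    [("cheeseburger.n.01", 0.84, 0.50),   # hypothetical values
     ("sultanate.n.01", 0.05, 0.60)])
print(payload)
```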
7. Obtained Results
Multiple analyses were made, each being tested for each type of similarity.
Several ways to improve WordNet regarding the connections between concepts have been made possible. We can see whether a first-grade connection should be deleted, whether one or more connections should be added on the path between two synsets, and whether a connection on the path between two concepts is deprecated.
7.1 Wu-Palmer similarity results
Table 2 presents the top 7 pairs of synsets for which the absolute value of the difference between the Wu-Palmer similarity and the cosine similarity is maximal, which means that one or more connections between those two synsets are wrong or no longer current.
These are just raw numbers. Their usefulness will be discussed in the analysis section of this paper.
Table 2. Top noun deprecated connections according to the absolute difference between Wu-Palmer similarity and cosine similarity
Noun1 | Noun2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim | Path_cos | Wup_cos | LCH_cos
financier.n.01 | cooke.n.02 | -0.1567 | 0.5 | 0.9473 | 2.9444 | 0.6567 | 1.1041 | 3.1012
brooke.n.01 | key.n.07 | -0.1699 | 0.3333 | 0.9090 | 2.5389 | 0.5032 | 1.0789 | 2.7088
frontbencher.n.01 | benton.n.02 | -0.1682 | 0.3333 | 0.9090 | 2.5389 | 0.5015 | 1.0773 | 2.7072
crosby.n.01 | grant.n.05 | -0.1667 | 0.3333 | 0.9090 | 2.5389 | 0.5000 | 1.0758 | 2.7057
owner.n.01 | pulitzer.n.01 | -0.1565 | 0.3333 | 0.9166 | 2.5389 | 0.4899 | 1.0732 | 2.6955
city.n.01 | basel.n.01 | -0.1215 | 0.5 | 0.9473 | 2.9444 | 0.6215 | 1.0689 | 3.0660
general.n.01 | wellington.n.01 | -0.1006 | 0.5 | 0.9677 | 2.9444 | 0.6006 | 1.0683 | 3.0450
Table 3. Top verb deprecated connections according to the absolute difference between Wu-Palmer similarity and cosine similarity
Verb1 | Verb2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim | Path_cos | Wup_cos | LCH_cos
finalize.v.01 | horsewhip.v.01 | -0.2599 | 0.1428 | 0.25 | 1.3121 | 0.4027 | 0.5099 | 1.5721
pioneer.v.03 | disgruntle.v.01 | -0.2226 | 0.125 | 0.2222 | 1.1786 | 0.3476 | 0.4448 | 1.4013
gnarl.v.01 | attend.v.03 | -0.2129 | 0.1428 | 0.25 | 1.3121 | 0.3557 | 0.4629 | 1.5251
announce.v.03 | horsewhip.v.01 | -0.2124 | 0.125 | 0.25 | 1.3121 | 0.3557 | 0.4629 | 1.5251
premier.v.02 | evert.v.01 | -0.2087 | 0.1666 | 0.2222 | 1.1786 | 0.3374 | 0.4346 | 1.3911
top.v.08 | bereave.v.01 | -0.2022 | 0.1 | 0.1818 | 0.9555 | 0.3022 | 0.3840 | 1.1577
pluralize.v.01 | quarter.v.01 | -0.1968 | 0.0833 | 0.1538 | 0.7731 | 0.2801 | 0.3506 | 0.9699
7.2 Path similarity results
Just like the Wu-Palmer results, these are raw numbers.
Table 4. Top verb connections according to the absolute difference between path similarity and cosine similarity
Verb1 | Verb2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim | Path_cos | Wup_cos | LCH_cos
trice.v.01 | raise.v.02 | -0.1033 | 0.5 | 0.8 | 2.5649 | 0.6033 | 0.9033 | 2.6682
mortice.v.02 | join.v.02 | -0.1002 | 0.5 | 0.8 | 2.5649 | 0.6002 | 0.9002 | 2.6651
miter.v.03 | join.v.02 | -0.0889 | 0.5 | 0.8 | 2.5649 | 0.5889 | 0.8889 | 2.6538
burden.v.01 | plumb.v.02 | -0.0871 | 0.5 | 0.8571 | 2.5649 | 0.5871 | 0.9442 | 2.6520
construct.v.01 | groin.v.01 | -0.0852 | 0.5 | 0.8 | 2.5649 | 0.5852 | 0.8852 | 2.6501
steep.v.02 | infuse.v.03 | -0.0667 | 0.5 | 0.9090 | 2.5649 | 0.5667 | 0.9758 | 2.6317
travel.v.01 | hiss.v.02 | -0.0662 | 0.5 | 0.4 | 2.5649 | 0.5662 | 0.4662 | 2.6312
7.3 Leacock-Chodorow similarity results
Table 5. Top noun connections according to the absolute difference between LCH similarity and cosine similarity
Noun1 | Noun2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim | Path_cos | Wup_cos | LCH_cos
financier.n.01 | cooke.n.02 | -0.1567 | 0.5 | 0.9473 | 2.9444 | 0.6567 | 1.1041 | 3.1012
city.n.01 | basel.n.01 | -0.1215 | 0.5 | 0.9473 | 2.9444 | 0.6215 | 1.0689 | 3.0660
businessman.n.01 | cornell.n.02 | -0.1157 | 0.5 | 0.9523 | 2.9444 | 0.6157 | 1.0681 | 3.0602
city.n.01 | aquila.n.02 | -0.1070 | 0.5 | 0.9473 | 2.9444 | 0.6070 | 1.0544 | 3.0515
norn.n.01 | urd.n.01 | -0.1011 | 0.5 | 0.9523 | 2.9444 | 0.6011 | 1.0535 | 3.0455
psychologist.n.01 | rogers.n.03 | -0.1008 | 0.5 | 0.6315 | 2.9444 | 0.6008 | 0.7324 | 3.0453
general.n.01 | wellington.n.01 | -0.1006 | 0.5 | 0.9677 | 2.9444 | 0.6006 | 1.0683 | 3.0450
Table 6. Top verb connections according to the absolute difference between LCH similarity and cosine similarity
Verb1 | Verb2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim | Path_cos | Wup_cos | LCH_cos
champion.v.01 | champ.v.01 | 0.8719 | 0.0625 | 0.1176 | 0.4855 | 0.8094 | 0.7542 | 0.3864
grin.v.01 | smile.v.02 | 0.8604 | 0.1 | 0.4 | 0.9555 | 0.7604 | 0.4604 | 0.0951
differentiate.v.04 | distinguish.v.03 | 0.8512 | 0.1 | 0.1818 | 0.9555 | 0.7512 | 0.6694 | 0.1042
differentiate.v.04 | distinguish.v.01 | 0.8512 | 0.1 | 0.1818 | 0.9555 | 0.7512 | 0.6694 | 0.1042
revitalize.v.02 | rejuvenate.v.01 | 0.8106 | 0.0625 | 0.1176 | 0.4855 | 0.7481 | 0.6929 | 0.3251
hamstring.v.02 | groin.v.01 | 0.8363 | 0.0909 | 0.1666 | 0.8602 | 0.7454 | 0.6697 | 0.0238
calendar.v.01 | calender.v.01 | 0.8367 | 0.1 | 0.1818 | 0.9555 | 0.7367 | 0.6548 | 0.1187
7.4 First-Grade Connections
7.4.1 Hypernymy/Hyponymy
For all pairs of synsets that are in a direct hypernymy/hyponymy relationship, the cosine similarity and all three path-related similarities were computed.
Here are the top five pairs of synsets with a strong cosine similarity.
Table 7. Top strong hypernym/hyponym connections according to cosine similarity
Noun1 | Noun2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim
homer.n.01 | solo_homer.n.01 | 0.9040 | 0.5 | 0.9565 | 2.9444
trout.n.02 | rainbow_trout.n.02 | 0.8903 | 0.5 | 0.9090 | 2.9444
porch.n.01 | front_porch.n.01 | 0.8879 | 0.5 | 0.9333 | 2.9444
professor.n.01 | associate_professor.n.01 | 0.8847 | 0.5 | 0.96 | 2.9444
cardiovascular_disease.n.01 | heart_disease.n.01 | 0.8815 | 0.5 | 0.9411 | 2.9444
Here are the top five pairs of synsets that are in a hypernymy/hyponymy relation with a weak cosine similarity.
Table 8. Top weak hypernym/hyponym connections according to cosine similarity
Noun1 | Noun2 | Cosine Sim | Path Sim | Wup Sim | LCH Sim
addition.n.02 | fluoridation.n.01 | -0.0984 | 0.5 | 0.9523 | 2.9444
seizure.n.04 | impress.n.01 | -0.0967 | 0.5 | 0.9411 | 2.9444
leadership.n.02 | rome.n.02 | -0.0963 | 0.5 | 0.9230 | 2.9444
panel.n.01 | coffer.n.01 | -0.0901 | 0.5 | 0.9333 | 2.9444
mammary_gland.n.01 | dug.n.01 | -0.0867 | 0.5 | 0.9473 | 2.9444
7.4.2 Synonymy relations
Here are the top 5 pairs of words for strong cosine similarity and for weak cosine similarity.
Table 9. Weak synonym connections according to cosine similarity
Synset | Noun1 | Noun2 | Cosine Similarity
bus.n.01 | Coach | Omnibus | -0.1106
upset.n.04 | Upset | Swage | -0.1022
taegu.n.01 | Taegu | Tegu | -0.0934
keystone.n.02 | Key | Headstone | -0.0874
rafter.n.01 | Rafter | Balk | -0.0810
Table 10. Strong synonym connections according to cosine similarity
Synset | Noun1 | Noun2 | Cosine Similarity
shiite.n.01 | Shiite | Shi'ite | 0.9493
taliban.n.01 | Taliban | Taleban | 0.9462
gaza_strip.n.01 | Gaza_Strip | Gaza | 0.9358
united_nations.n.01 | United_Nations | UN | 0.9331
hizballah.n.01 | Hezbollah | Hizbollah | 0.9311
7.6 The Results of the Words to Add Test
The number of words to add is troublesome, as the corpus used contains many variations of a word, as well as words and expressions that are not in the format required by WordNet. The corpus contains a lot of proper nouns representing names of people that appear in the Google News corpus, and email addresses of people or institutions, which do not help the purpose of this paper of improving WordNet by adding words. Therefore, these words had to be filtered out.
The remaining words to be added were found in different domains: for example, in the food-related area, some new fast-food items and different types of drinks that have recently become popular were detected.
Examples of words to add from the food-related area:
1. Real_Fruit_Smoothies: a drink where different types of fruit are mashed or juiced, depending on the fruit, and served as a fresh beverage.
2. Cask_conditioned_beers: a type of beer that is unfiltered and unpasteurized, giving it a different aroma.
3. de_boeuf: a salad, popular in Romania mostly during holiday periods, made from different vegetables and mayonnaise.
4. chicken_chasseur: a French recipe made by combining chicken meat with chasseur sauce.
5. steaks_burgers: a burger where the meat between the halves of the bun is a steak instead of the normal burger meat.
6. cheese_coleslaw: a salad made from cabbage and cheese combined with a sauce, mostly used for picnics.
Although the corpus used was trained on Google News, not being specifically trained on any one field, several medical terms were found. Some examples of medical terms are:
1. metastatic_colorectal_cancer: a type of colon cancer.
2. anxiety_insomnia: a condition where the patient has trouble sleeping because of anxiety.
3. hereditary_blindness: an inherited eye disease which causes blindness, mostly in small children, because of their genetics.
4. medulloblastoma_malignant_brain_tumor: a type of cancer which forms a tumor in the brain.
5. monozygotic_twins: twins that share the same placenta in the early stage of development.
Another field where Word2Vec discovered some new concepts to be added is general science. Also, a lot of technical terms from computer science could be added.
Examples of computer science terms:
1. ARM_processor: a type of processor instruction set architecture.
2. SO_DIMM_memory: a small-outline type of memory module, used for example in laptops.
Examples of general terms are:
1. glacial_meltwater: water released by melting glaciers, a phenomenon driven by global warming; this topic has been very popular in recent years, which is why it is also present in the Google News corpus.
2. synthetic_fiber: fibers researched and produced by scientists to improve on and replace the natural fibers provided by animals and plants.
3. copper_indium_diselenide: a semiconductor material with many applications in solar energy.
4. optical_biosensors: a specific type of biosensor used for the detection and analysis of different chemical substances.
5. magnetic_resonance_imaging: a medical procedure used for scanning the body of a patient with the purpose of detecting hard-to-find tumors.
7.7 Web Application description
The web application is divided into eight tabs (see Fig. 7). Two tabs allow search and visualization using an approach similar to Visuwords2: one tab allows the search of noun synsets, the other allows the search of verb synsets. The other six tabs are for visualization of different relevant data that is cached.
Figure 7. Tabs view open on top 10 Wu-Palmer nouns
The first tab is one where the user can input a noun synset; at the click of the search button, if that noun synset exists, a graph appears with the top connections that have major differences between the similarities computed using Word2Vec and WordNet (see Fig. 8). The connections that are stronger in Word2Vec are colored blue, so that we know the specific connection needs to be added, either directly, or by shortening the path between the two synsets through a connection between other concepts. The connections that are strong in WordNet but weak in Word2Vec are colored red, so that we know those connections are possibly deprecated and should be removed.
The second search tab is similar to the first one, the only difference being that in this tab the queries are made on the set of verb synsets.
The other tabs were made to make it easy to point out major differences between the similarities from WordNet and Word2Vec.
2 Visuwords – https://visuwords.com/content
8. Use Cases
Figure 8. Example run of a noun search for bridal_wreath.n.01
In Fig. 8 we see a demo of the web interface noun search for the synset bridal_wreath.n.01. The graph shows that there are a lot of connections that should be added for that synset.
Figure 9. Example run of a noun search for imperialism.n.01
In Fig. 9 we see a demo of the web interface noun search for the synset imperialism.n.01. The graph shows that five connections should be made stronger, with the following synsets: colonialist.n.01, hegemony.n.01, capitalist.n.01, neoliberal.n.01, imperialist.n.01.
9. Analysis of results
9.1 Path analysis
The path similarity is the most rudimentary of the three path-related similarities in WordNet. The similarity value is computed from the shortest path between the two synsets. For example, two synsets that are in a hyponymy/hypernymy relationship have a path similarity equal to 0.5. The only value bigger than 0.5 is the similarity of a synset with itself, which is 1.
Because of this rudimentary technique, this similarity has the highest difference when compared with cosine similarity, equal to 0.6388.
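The behavior described above follows from the standard path-similarity formula, sim = 1 / (1 + d), where d is the length of the shortest path between the two synsets in the hypernym hierarchy. This is a sketch of the formula, not NLTK's implementation:

```python
def path_similarity(shortest_path_length):
    """Path similarity from the shortest-path length between two synsets."""
    return 1.0 / (1.0 + shortest_path_length)

print(path_similarity(0))  # same synset              -> 1.0
print(path_similarity(1))  # direct hypernym/hyponym  -> 0.5
print(path_similarity(3))  # three edges apart        -> 0.25
```

This makes the coarseness visible: a direct hypernym pair always scores exactly 0.5, regardless of how semantically close the two concepts are.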
9.2 Wu-Palmer analysis
Wu-Palmer similarity is the one that gave the best results at the level of human perception. The most relevant connections were tagged by analyzing the MySQL tables that were created, considering the absolute difference between the Wu-Palmer similarity and the cosine similarity.
This similarity is closer to cosine similarity than path similarity, being more accurate. The difference between cosine similarity and Wu-Palmer similarity is 0.2519.
9.3 Leacock-Chodorow analysis
Leacock-Chodorow similarity is the closest one to the cosine similarity. When normalized, the differences between Leacock-Chodorow similarity and cosine similarity are smaller than 0.2.
The difference between cosine similarity and Leacock-Chodorow similarity is 0.1787.
9.4 Comparison between path-related similarities
In order to compare the WordNet similarities with the cosine similarity, all semantic distances had to be normalized to the [0, 1] interval. The Wu-Palmer similarity is already in the [0, 1] interval, so no step was needed; the path similarity is also in the [0, 1] interval, so again no step was needed; the cosine similarity is in the [-1, 1] interval, so its values were normalized to [0, 1]; Leacock-Chodorow is not in the [0, 1] interval, so we took as maximum the largest value found, 2.944, and as minimum the smallest value found, 0.8, and normalized all the distances to the [0, 1] interval.
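The normalization described above can be sketched as follows: cosine is rescaled linearly from [-1, 1] to [0, 1], and Leacock-Chodorow is min-max normalized using the observed extremes (0.8 and 2.944); Wu-Palmer and path similarity need no rescaling.

```python
# Observed extremes of the Leacock-Chodorow similarity in this study.
LCH_MIN, LCH_MAX = 0.8, 2.944

def normalize_cosine(c):
    """Map cosine similarity from [-1, 1] to [0, 1]."""
    return (c + 1.0) / 2.0

def normalize_lch(s):
    """Min-max normalize Leacock-Chodorow similarity to [0, 1]."""
    return (s - LCH_MIN) / (LCH_MAX - LCH_MIN)

print(normalize_cosine(-1.0), normalize_cosine(1.0))  # endpoints -> 0.0 1.0
print(normalize_lch(0.8), normalize_lch(2.944))       # endpoints -> 0.0 1.0
```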
Table 11. Comparison of the square differences between the path-related similarities and cosine similarity
WordNet Path Similarity | Square difference with Cosine Similarity
Wu-Palmer Similarity | 0.2519
Path Similarity | 0.6388
Leacock-Chodorow | 0.1787
9.5 Improvements of WordNet
After analyzing all these results, there are a few domains where we noticed weak WordNet connections where the cosine similarity was strong, either because the connection did not exist (there was no hypernymy/hyponymy relation) or because the concepts are related, but not through WordNet-like connections.
One field where the Word2Vec analysis revealed a few connections that should be added to WordNet, either directly or on the path between concepts, is the medical field.
Another field where a few connections were missing is food-related concepts: either different types of food are not correctly subordinated, or the ingredients are not hyponyms of the food type.
The flora structure in WordNet would also require a review: quite a few relationships between different plants were found by the Word2Vec method where one plant was similar to another, but in the WordNet taxonomy they were unrelated.
Examples to back that affirmation:
The pair (eggplant.n.01, zucchini.n.01): eggplant and zucchini are related through shape and method of cooking, but they are not related in WordNet.
The pair (potentilla.n.01, snowberry.n.01): two flower species widely used together as wedding decorations, but unrelated in WordNet.
The pair (cabbage.n.01, cauliflower.n.01): both are part of the Brassica oleracea species, which includes cabbage, broccoli, and cauliflower. They have a strong cosine similarity but a weak WordNet similarity.
In the medical field, several connections have been discovered by Word2Vec, such as connections between disease and symptom, between disease and the part of the body affected, and between disease and medicine, which are not in WordNet.
For example:
The pair (acetabulum.n.01, osteochondroma.n.01): osteochondroma is a type of benign tumor that starts in the bones. The acetabulum is a part of the pelvis and a place where that kind of tumor often starts.
The pair (cefotaxime.n.01, parasitemia.n.01): cefotaxime is an antibiotic used to treat bacterial infections. Parasitemia is a condition where parasites are present in the blood. This is an example of a cure-disease connection found by Word2Vec.
The pair (amino.n.01, tyrosine.n.01): tyrosine is a type of amino acid, so a hyponymy/hypernymy relation should be added in WordNet.
The pair (thromboembolism.n.01, myocardial_infarction.n.01): thromboembolism is the obstruction of a blood vessel and is one of the causes of myocardial infarction, commonly known as a heart attack.
In the food area there are also several connections that should be made stronger: foods that are similar or share many ingredients.
For example:
The pair (burger.n.01, cheeseburger.n.01): a cheeseburger is a type of burger to which a slice of cheese is added. A hypernymy/hyponymy relation should be added.
9.6 Table with overview of words added, connections added/cut
After manually reviewing a part of the results, these are some statistics about the changes we recommend for WordNet.
Table 12. Overview of reviewed improvements
Number of words added | Number of connections added | Number of connections deleted
49 | 23 | 15
For the words-added test, five hundred words that were tagged by Word2Vec were manually reviewed, and the conclusion was that only 54 words are relevant to be added to WordNet, giving Word2Vec a roughly 10% relevance precision on words detected as having connections with the WordNet database.
For the number of connections added, the connections were sorted in decreasing order of the cosine similarity value, and the relevant connections were tagged. The same process was used to detect connections that should be deleted.
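The review step above amounts to a simple sort before manual tagging. A minimal sketch, with hypothetical candidate pairs and cosine values:

```python
# Candidate connections: (synset1, synset2, cosine similarity).
candidates = [
    ("burger.n.01", "cheeseburger.n.01", 0.84),
    ("lobster.n.01", "scallop.n.01", 0.71),
    ("adagio.n.01", "allegro.n.01", 0.77),
]

# Sort in decreasing order of cosine similarity, then hand the list
# to a human reviewer who tags the relevant connections.
for noun1, noun2, cos_sim in sorted(candidates, key=lambda c: -c[2]):
    print(f"{cos_sim:.2f}  {noun1} - {noun2}")
```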
10. Conclusions and Future Work
In conclusion, we can say that using Word2Vec could lead to many ways to update WordNet, each of them with future possibilities for improvement.
From the way the Word2Vec vector space is created, we see that words that have similar contexts have different kinds of connections that WordNet has not implemented, for example the disease-medicine connection.
Considering the existing types of connections, Word2Vec can detect new connections that escaped the human annotators who were responsible for adding those connections, as is the case, for example, for the hypernymy/hyponymy connection between burger.n.01 and cheeseburger.n.01.
Old connections that are no longer relevant can be pointed out by comparing WordNet with Word2Vec vectors trained on text corpora that are relevant to the current view of the world. One such example is the connection between Romania.n.01 and sultanate.n.01, which is too strong nowadays.
WordNet can be enhanced by increasing the number of words/concepts, by comparing words trained by Word2Vec with existing words in WordNet. One way to improve this step is to create a mechanism that rejects words that are proper nouns or inflections of other words, to make the work easier for the human who will check the lists provided by Word2Vec. For this test phase, we considered the top 500 words tagged by Word2Vec as having connections with WordNet concepts, and only 54 of them were relevant (may be added to WordNet).
What is covered in this paper is merely a proof of concept that Word2Vec has the power to detect, with certain probabilities, faulty or deprecated manually introduced data in WordNet.
To increase the probability of detecting connections or adding new concepts using Word2Vec, one important factor is the corpus on which the Word2Vec method is trained to create the numeric vectors. For example, in order to improve WordNet's medical-related concepts, a high number of medical books, patient studies, and disease reviews can be processed in order to decrease the noise in the Word2Vec vectors of medical terms.
Another problem in this study was the high number of misses, meaning words that are not present in both WordNet and Word2Vec, making it difficult to compare the similarities provided by each resource. By concentrating the analysis on a specific field (medical, food-related, flora, fauna, certain actions) and collecting data for training the Word2Vec vectors in that specific area, two important results would follow:
1. A decrease in the number of misses.
2. Improved precision of the Word2Vec vectors.
An application can be created where the user specifies a domain in which (s)he wishes to review the WordNet concepts and connections. After this choice, (s)he can either upload training files for Word2Vec or select "automated data gathering", where a special crawler selects specialized papers using Google Scholar and other websites specialized in scientific papers, and Word2Vec is used to create the numeric vectors.
After these steps, the user will be able to obtain several results regarding the number of concepts in WordNet related to that domain compared to the number of concepts discovered by the Word2Vec analysis, and also regarding the connections from WordNet compared to the ones from Word2Vec.
Basically, what this application should do is automate the steps that were performed partly manually in this paper.
Word2Vec is not the only method to create numerical vectors for words, and thus another way to improve the results is to create vectors for the same corpus using several other methods for obtaining word embeddings, and to compare the results obtained.
A future study could test whether computing the average of the distances obtained from multiple types of word vectors could increase the final precision, by performing the same tests as in this paper.
Bibliography
[1] George A. Miller (1995). WordNet: A Lexical Database for English. Communications of the ACM, Vol. 38, No. 11: 39-41.
[2] B. Liu (2012). Sentiment Analysis and Opinion Mining. Morgan & Claypool Publishers.
[3] Mohsen Pourvali and Mohammad Saniee Abadeh (2012). Automated Text Summarization Based on Lexical Chains and Graphs Using the WordNet and Wikipedia Knowledge Base.
[4] Gao N., Zuo W., Dai Y., Lv W. (2014). Word Sense Disambiguation Using WordNet Semantic Knowledge. In: Wen Z., Li T. (eds) Knowledge Engineering and Management. Advances in Intelligent Systems and Computing, vol. 278. Springer, Berlin, Heidelberg.
[5] A. G. Tapeh and M. Rahgozar (2008). A knowledge-based question answering system for B2C eCommerce. Knowledge-Based Systems, vol. 21, no. 8.
[6] Pavel Surmenok (Nov 5, 2016). Natural Language Pipeline for Chatbots [online, accessed on 01.07.2017] https://hackernoon.com/natural-language-pipeline-for-chatbots-897bda41482
[7] Lingling Meng, Runqing Huang and Junzhong Gu (2013). A Review of Semantic Similarity Measures in WordNet.
[8] T. Mikolov, K. Chen, G. Corrado, and J. Dean (2013). Efficient Estimation of Word Representations in Vector Space. In Proceedings of the Workshop at ICLR.
[9] Peter Thiel (2014). Zero to One, chapter on complementary businesses.
[10] Chris McCormick (27 Apr 2016). Word2Vec Resources [online, accessed on 01.07.2017] http://mccormickml.com/2016/04/27/word2vec-resources/
[11] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, Jeffrey Dean (2013). Distributed Representations of Words and Phrases and their Compositionality. Proc. 27th Annual Conference on Neural Information Processing Systems.
[12] https://www.researchgate.net/figure/228957479_fig2_Fig-2-An-example-of-WordNet-nouns-taxonomy [online, accessed on 01.07.2017]
[13] Takuya Hira, Takahiko Suzuki, Nao Wariishi and Sachio Hirokawa (2015). Vector Similarity of Related Words and Synonyms in the Japanese WordNet.
[14] Abram Handler (2014). An Empirical Study of Semantic Similarity in WordNet and Word2Vec.
[15] F. Bond, H. Isahara, S. Fujita, K. Uchimoto, T. Kuribayashi (2009). Enhancing the Japanese WordNet. Proc. of the 7th Workshop on Asian Language Resources (ALR7), pp. 1-8, Association for Computational Linguistics.
[16] Francis Bond, Takayuki Kuribayashi, Hitoshi Isahara, Kyoko Kanzaki, Kiyotaka Uchimoto, Kow Kuroda, Masao Utiyama, Darren Cook. Japanese WordNet [online, accessed on 01.07.2017] http://compling.hss.ntu.edu.sg/wnja/index.en.html
[17] Bird, Steven, Edward Loper and Ewan Klein (2009). Natural Language Processing with Python. O'Reilly Media Inc.
[18] Radim Rehurek. Gensim, a free Python library for scalable statistical semantics [online, accessed on 01.07.2017] https://radimrehurek.com/gensim/
[19] http://www.numpy.org/ [online, accessed on 01.07.2017]
[20] http://visjs.org/ [online, accessed on 01.07.2017]
Annex 1. Words to be added to WordNet
Word to be added – Strong connections with WordNet
1.metastatic_colorectal_cancer – [u'multiple_myeloma', u'carcinoma']
2. anxiety_insomnia – [u'polyuria', u'insomnia', u'paraesthesia', u'sleeplessness']
3. hereditary_blindness – [u'myotonic_dystrophy', u'degenerative_di sorder']
4. medulloblastoma_malignant_brain_tumor – [u'erythema_nodosum', u'generalized_epilepsy']
5. Wild_Mushrooms – [u'salsify', u'celery_root', u'chicken_tetrazzini', u'worcestershire_sauce]
6. tight_tolerances – [u'strain_gage', u'leadless', u'toroida l']
7. metalworking_fluid – [u'zirconium_dioxide', u'portland_cement', u'powder_metallurgy']
8. jointed_legs – [u'premolar', u'condyle', u'caudal_fin', u'stretchability']
9. nail_lacquer – [u'potassium_alum', u'nail_varnish', u'spf']
10. hydroponically_gro wn – [u'medlar', u'webworm', u'anise_hyssop', u'fleabane', u'perilla']
11. feral_colonies – [u'unadoptable', u'peafowl', u'cowbird', u'nonmigratory]
12. NoiseGuard – [u'leadless', u'pyrometer', u'sunblind', u'omnidirectional',u'noiseless']
13. liquefied_am monia – [u'monoxide', u'phthalic_anhydride', u'isobutylene']
14. deodorant_shampoo – [u'dental_floss', u'terrycloth', u'castile_soap']
15. antimatter_atoms – [u'soliton', u'hydrophobicity', u'subatomic_particle', u'spicule]
16. fermented_soybean – [u'hydro genate', u'peroxidase', u'boswellia', u'monosaccharide']
17. stimulatory_effects – [u'norethindrone_acetate', u'hyoscyamine', u'motilin', u'anabolism']
18. cheese_coleslaw – [u'salsify', u'coconut_cream', u'tourtiere', u'deviled_egg']
19. webbed_toes – [u'ameba', u'ulcerate', u'bulgy', u'prehensile']
20. dental_veneers – [u'dental_implant', u'prosthodontic', u'partial_denture', u'orthodontic']
21. SO_DIMM_memory – [u'nonvolatile_storage', u'flash_memory', u'crystal_oscillator']
22. optical_biosensors – [u'radioprotection', u'macromolecular', u'photoelectron']
23. Real_Fruit_Smoothies – [u'chicken_sandwich', u'tamale_pie', u'scallopini']
24. cask_conditioned_beers – [u'tasting', u'varietal_wine', u'pilsener', u'wine']
25. de_boeuf – [u'woodiness', u'chipotle' , u'chervil', u'bowtie_pasta', u'fudge_sauce']
26. chicken_chasseur – [u'peppermint_patty', u'roast_lamb', u'smoked_eel']
27. steaks_burgers – [u'porterhouse_steak', u'steak_au_poivre', u'fricassee']
28. facial_scrub – ['sodium_lauryl_sulphate', u'exfolia te', u'witch_hazel]
29. Calcium_fortified – [u'brickle', u'spanish_rice', u'pantothenic_acid', u'cardamon']
30. bendy_bus – [u'zebra_crossing', u'motorway', u'dustcart']
31. buttery_caramel – [u'demitasse', u'strawberry_jam', u'chive', u'strawberry_ice_cre am']
32. insulating_substrate – [u'membrane', u'magnetization', u'microcrystalline' ]
33. rhino_beetle – [u'sacred_ibis', u'palm_civet', u'carpenter_ant']
34. spinal_fusion_procedure – [u'partial_denture', u'embolectomy', u'implant', u'topical_anesthesia']
35. bread_crumb_topping – [u'phyllo', u'orange_marmalade', u'sabayon', u'lemon_rind']
36. Drug_Application – [u'valdecoxib', u'tobramycin', u'thrombolytic_agent']
37. homemade_sausages – [u'florentine', u'melba', u'piccalilli', u'smoked_haddock']
38. thermonuclear_fusion – [u'nuclear_fission', u'nucleosynthesis', u'thermonuclear']
39. migratory_songbird – [u'sacred_ibis', u'wheatear', u'mouflon', u'rusty_blackbird']
40. ceramic_brakes – [u'gearset', u'sunblind', u'brake_pad', u'hp']
41. glacial_meltwater – [u'congenital_anomaly', u'cervix_uteri', u'immunodeficient']
42. ARM_processor – [u'multiprocessing', u'microprocessor', u'silicon', u'processor']
43. synthetic_fiber – [u'plasticiser', u'butyl_rubber', u'bast_fiber', u'phenolic_resin']
44. glacial_meltwater – [u'glacier', u'lobate', u'water_vapor']
45. magnetic_resonance_imaging – [u'computed_tomography', u'imaging', u'ultrasonography']
46. bone_marrow_biopsies – [u'immunopathology', u'arthropathy', u'osteopetrosis']
47. reduce_nonproductive_inflammation – [u'haematopoietic', u'cyclooxygenase', u'multipotent']
48. omelets_pancakes – [u'quesadilla', u'caramel_apple', u'fudge_sauce']
49. myocardial_infarction_stroke – [u'erythema_nodosum', u'pericardial', u'myocardial']
Annex 2. Connections to be added to WordNet
1. burger.n.01 – cheeseburger.n.01 [hypernymy/hyponymy relationship]
2. dental_implant.n.01 – periodontics.n.01 [hypernymy/hyponymy relationship]
3. electron_accelerator.n.01 – quantum_chromodynamics.n.01
4. wine.n.01 – pinot_noir.n.01 [hypernymy/hyponymy relationship]
5. lobster.n.01 – scallop.n.01 [hypernymy/hyponymy relationship]
6. sesame_oil.n.01 – soy_sauce.n.01 [coordinate terms]
7. gnocchi.n.01 – risotto.n.01 [coordinate terms]
8. adagio.n.01 – allegro.n.01 [coordinate terms]
9. racial_segregation.n.01 – segregation.n.01 [hypernymy/hyponymy relationship]
10. coral.n.01 – coral_reef.n.01 [hypernymy/hyponymy relationship]
11. hesitance.n.01 – reluctance.n.01 [meronymy – holonymy relationship]
12. slalom.n.01 – downhill.n.01 [hypernymy/hyponymy relationship]
13. processor.n.01 – silicon.n.01 [hypernymy/hyponymy relationship]
14. flash_memory.n.01 – memory.n.01 [hypernymy/hyponymy relationship]
15. spokesman.n.01 – dalton.n.01 [hypernymy/hyponymy relationship]
16. widebody_aircraft.n.01 – avionics.n.01 [hypernymy/hyponymy relationship]
17. aspen.n.01 – chlorosis.n.01 [meronymy – holonymy relationship]
18. twinjet.n.01 – avionics.n.01 [hypernymy/hyponymy relationship]
19. gastronomy.n.01 – haricot.n.01 [hypernymy/hyponymy relationship]
20. purple_loosestrife.n.01 – poison_ivy.n.01 [coordinate terms]
21. castor_bean.n.01 – fusarium_wilt.n.01 [hypernymy/hyponymy relationship]
22. cosmetic_dentistry.n.01 – dermatologist.n.01 [hypernymy/hyponymy relationship]
23. bipolar_disorder.n.01 – schizophrenia.n.01 [coordinate terms]
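A minimal sketch (plain Python, not the real WordNet/NLTK API) of how one of the proposed Annex 2 connections could be represented: the noun taxonomy as a mapping from a synset name to the list of its hyponym synset names. The synset names follow WordNet's "lemma.pos.sense" convention; the tiny graph and the helper `add_connection` are illustrative assumptions, not project code.

```python
# Toy taxonomy fragment: synset name -> list of hyponym synset names.
hyponyms = {
    "sandwich.n.01": ["burger.n.01"],
    "burger.n.01": [],
}

def add_connection(graph, hypernym, hyponym):
    """Insert a hypernym -> hyponym edge, creating missing nodes."""
    graph.setdefault(hypernym, [])
    graph.setdefault(hyponym, [])
    if hyponym not in graph[hypernym]:
        graph[hypernym].append(hyponym)

# Entry 1 of Annex 2: link burger.n.01 to cheeseburger.n.01.
add_connection(hyponyms, "burger.n.01", "cheeseburger.n.01")
print(hyponyms["burger.n.01"])  # ['cheeseburger.n.01']
```

After the insertion, cheeseburger.n.01 is reachable as a hyponym of burger.n.01, which is exactly the effect each "hypernymy/hyponymy" entry above would have on the full taxonomy.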
Annex 3. Connections to be deleted from WordNet
1. region.n.01 – hell.n.01 [direct hypernymy/hyponymy relationship]
2. acting.n.01 – heroics.n.01 [direct hypernymy/hyponymy relationship]
3. relation.n.01 – foundation.n.01 [direct hypernymy/hyponymy relationship]
4. helping.n.01 – drumstick.n.01 [direct hypernymy/hyponymy relationship]
5. structure.n.01 – shoebox.n.01 [direct hypernymy/hyponymy relationship]
6. condition.n.01 – silence.n.01 [direct hypernymy/hyponymy relationship]
7. device.n.01 – key.n.01 [direct hypernymy/hyponymy relationship]
8. pedestrian.n.01 – marcher.n.02 [direct hypernymy/hyponymy relationship]
9. set.n.01 – threescore.n.01 [direct hypernymy/hyponymy relationship]
10. power.n.01 – repellent.n.03 [direct hypernymy/hyponymy relationship]
11. romania.n.01 – sultanate.n.01 [not a direct connection, but the path should be shortened]
12. romania.n.01 – billings.n.01 [not a direct connection, but the path should be shortened]
13. technicality.n.01 – mealtime.n.01 [not a direct connection, but the path should be shortened]
14. crack_addict.n.01 – fourier.n.01 [not a direct connection, but the path should be shortened]
15. frenchman.n.01 – spartan.n.01 [not a direct connection, but the path should be shortened]
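A toy illustration (an assumed graph fragment, not the real WordNet data) of why the direct edges in Annex 3 are flagged for deletion: with a spurious direct link in place, two weakly related synsets sit one hop apart, so path-based similarity measures overstate their relatedness; removing the link lengthens the path between them.

```python
from collections import deque

def path_length(graph, src, dst):
    """Shortest number of edges between two nodes (undirected BFS)."""
    seen, queue = {src}, deque([(src, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == dst:
            return dist
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # no path

# Tiny undirected fragment around entry 1 of Annex 3 (region.n.01 - hell.n.01);
# the intermediate node is an illustrative assumption.
edges = {
    "region.n.01": ["hell.n.01", "imaginary_place.n.01"],
    "hell.n.01": ["region.n.01", "imaginary_place.n.01"],
    "imaginary_place.n.01": ["region.n.01", "hell.n.01"],
}
print(path_length(edges, "region.n.01", "hell.n.01"))  # 1 (direct edge)

# Delete the flagged direct connection (both directions).
edges["region.n.01"].remove("hell.n.01")
edges["hell.n.01"].remove("region.n.01")
print(path_length(edges, "region.n.01", "hell.n.01"))  # 2 (via imaginary_place.n.01)
```

After the deletion, the two synsets are still connected through the rest of the taxonomy, but their path distance grows, lowering path-based similarity scores such as Wu-Palmer.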