A System For Detecting Professional Skills From Resumes Written In Natural Language

METHOD BASED ON JAPE RULES AND WORDNET FOR DETECTING PROFESSIONAL SKILLS FROM RESUMES WRITTEN IN NATURAL LANGUAGE

2016

Graduate: Vlad Mircea NESTE

A SYSTEM FOR DETECTING PROFESSIONAL SKILLS FROM RESUMES WRITTEN IN NATURAL LANGUAGE

Project proposal: Method based on JAPE Rules and WordNet for detecting Professional Skills from Resumes Written in Natural Language

Project contents: Introduction, Project Objectives, Bibliographic Research, Analysis and Theoretical Foundation, Method based on JAPE Rules and WordNet for detecting Professional Skills from Resumes Written in Natural Language, Detailed Design and Implementation, User’s Manual, Conclusions, Bibliography and Appendix.

Place of documentation: Technical University of Cluj-Napoca, Computer Science Department

Consultants: Assoc. Prof. Eng. Viorica CHIFU

Date of issue of the proposal: November 1, 2015

Date of delivery: June 30th, 2016

Declaration on Own Responsibility Regarding

the Authenticity of the Bachelor's Thesis

I, the undersigned Vlad Mircea NESTE, identified with ID card series SM no. 435605, CNP [anonymized], author of the thesis A SYSTEM FOR DETECTING PROFESSIONAL SKILLS FROM RESUMES WRITTEN IN NATURAL LANGUAGE. METHOD BASED ON JAPE RULES AND WORDNET FOR DETECTING PROFESSIONAL SKILLS FROM RESUMES WRITTEN IN NATURAL LANGUAGE, elaborated for the examination finalizing the bachelor's studies at the Faculty of Automation and Computer Science, Computer Science (English) specialization of the Technical University of Cluj-Napoca, session _________________ of the academic year 2015-2016, declare on my own responsibility that this work is the result of my own intellectual activity, based on my research and on information obtained from sources that have been cited in the text of the thesis and in the bibliography.

I declare that this work does not contain plagiarized portions and that the bibliographic sources have been used in compliance with Romanian legislation and international copyright conventions.

I also declare that this work has not been previously presented before another bachelor's examination committee.

Should any false statements be discovered later, I will bear the administrative sanctions, namely the annulment of the bachelor's examination.


Chapter 1. Introduction

Chapter 2. Project Objectives

2.1. Problem specification

2.2. General Objectives

2.3. Functional requirements

2.4. Non-Functional requirements

2.4.1. Usability

2.4.2. Performance

2.4.3. Security

2.4.4. Scalability and extendibility

Chapter 3. Bibliographic Research

Chapter 4. Analysis and Theoretical Foundation

4.1. Problem Analysis

4.1.1. Flow of events

4.1.2. Product features

4.2. Use cases

4.2.1. Use Case: User Sign Up

4.2.2. Use Case: User Login

4.2.3. Use Case: CV Creation

4.3. Data Modeling

4.3.1. Ontologies

4.3.2. WordNet

4.3.3. GATE

Chapter 5. Method based on JAPE Rules and WordNet for Detecting Professional Skills from Resumes Written in Natural Language

5.1. Crawler

5.2. Input data pre-processing and JAPE rule definitions

5.3. Skill Detection

5.3.1. Skill Detection Algorithm

Chapter 6. Detailed Design and Implementation

6.1. System Architecture

6.2. Persistent storage and data access layer

6.3. Business Layer

6.3.1. Skill Detection Component

6.3.2. Ontology Update Component

6.3.3. User Management Component

6.3.4. System Security

6.4. Presentation Layer

Chapter 7. Testing and Validation

7.1. Metrics

7.2. System validity

7.3. Experimental Results

7.4. Methods comparison

Chapter 8. User’s manual

8.1. Systems installation

8.2. System usage

Chapter 9. Conclusions

9.1. Contributions and achievements

9.2. Results

9.3. Further Development

Bibliography

Appendix 1

Introduction

In the following pages we discuss a major problem that almost everyone encounters: writing the perfect CV. This is the CV that presents who you truly are and what skills and competencies you have acquired in your lifetime, and that assures every employer that you are suitable for the position they need to fill.

The hiring manager is the buyer, you are the product, and you need to give them a reason to buy. This is a common way of thinking for managers and for people who hire on a daily basis.

“There’s nothing quick or easy about crafting an effective resume”, says Jane Heifetz, a resume expert and founder of Right Resumes. Do not think you are going to sit down and hammer it out in an hour. “You have to think carefully about what to say and how to say it.” [1]

For many years people have tried to close the gap between what they want to say and what they actually say, and one study highlights the main issues encountered when starting to write a CV. “The Employers Skill Survey” follows up the 1999 survey of the same name, commissioned as part of the programme of research to support the work of the National Skills Task Force. It presents the main situations that need a solution: skill-shortage vacancies and internal skill gaps. The first refers to a subset of hard-to-fill vacancies, which remain hard to fill because candidates are poorly skilled, inexperienced or under-qualified; the second refers to areas in which the employer perceives staff to be below the level of proficiency desired by the company. Both also have an impact on business performance: loss of orders or delays in developing new products or services may be considered severe impacts [2].

Another huge problem is writing skills: a high frequency of grammatical errors, lack of variety in grammatical structures, use of inappropriate vocabulary or a limited range of vocabulary, poor spelling, inadequate presentation and deficiency in clear self-expression are just some examples. It is not enough to know yourself; you have to be able to present who you are in a CV in a manner that makes the person reading it truly understand your skills, competencies and work experience, and that shows good quality in everything, even in your writing.

Haifa Al-Buainain, Associate Professor at Qatar University, carried out a study to understand in depth the writing errors made by students. The analysis shows that the students’ “performance errors are systematic and classifiable and it’s the teachers responsibility to adopt, modify or even develop remedial procedures that can minimize them.” [3] This is exactly what our tool does: it helps minimize language errors when end users try to express themselves in an overelaborate way.

The main idea is that we should not have to focus on writing skills, because we could lose the essence of what we want to transmit, which is much more important. The skill-shortage vacancy and the internal skill gap would be easy to close if people showed what they really know and what their skills are.

A good example is Mike Love’s state-of-the-art CV; his advice is simple: “a CV should no longer be a dry list of professional experience. Today the more narrative CVs are more personal. They should read like a story and not a set of ingredients – the ‘why’, ‘how’ and ‘who’ rather than the ‘what’ and the ‘when’.” No employer wants to see fancy fonts, unnecessary details, wordy waffle, quirky comments or candidates trying to be funny. A good CV should be: checked, checked and checked again! [4] Similar advice from professionals all over the world encourages you to present yourself properly and to actually close the gap simply by knowing how to suit yourself to the position you want.

“Use professional language, sign off any correspondence formally and make sure there are no typos! This is my — and most likely lots of other employers’ — number one pet peeve, so ask a friend or family member to help proof read your CV for you before you hit send,” says Karren Brady, LifeSkills Ambassador. Sue Benson, Managing Director at The Market Creative, tells us: “We look for concise articulation of what candidates have done brilliantly and what their skills are, as opposed to just what work they’ve been involved in. Highlighting relevant information at the top of a CV will help, rather than burying it on the second page. Some CVs stand out because they might be visually engaging. And the writing has to be brilliant.” Last but not least, Ashok Vaswani, CEO of Corporate and Personal Banking, helps us understand what a CV should have: “Your CV is your first and golden opportunity to highlight your key strengths, skills and experience to a potential employer. Employers receive hundreds of CVs a day, so make their lives easier by making yours stand out. Use clear, concise and confident language. Don’t forget that employers can also sometimes check your social footprint, so present your best self on Twitter and LinkedIn, and use these channels to your advantage — they are an extension of you so make sure they truly reflect that.” [4]

Once you have sorted out the content and the visuals, the final step is to make sure the length of your CV is right. If your CV is too long, instead of grabbing the employer's attention (remember, he or she potentially has 399 other CVs to go through), you immediately lose it. A CV should be on average two pages long: an overly long CV is boring and time-consuming, and a CV that is too short immediately suggests that you do not have enough experience, which could potentially put you out of the race. [5]

To summarize, the main thing on which we focus is communication. Communication is the key. Competence in oral communication, in speaking and listening, is a prerequisite to students' academic, personal and professional success in life. As individuals mature and become working adults, communication competences continue to be essential. Communication skills are required in most occupations. Employers identify communication as one of the basic competencies every graduate should have, asserting that the ability to communicate is valuable for obtaining employment and maintaining successful job performance. The communication skills essential in the workplace include basic oral and writing skills, and the ability to communicate in work groups and teams with persons of diverse background, including when engaged in problem solving and conflict management. [6]

Our tool tries to solve all of these situations that people confront when starting to write a CV, by understanding the needs of both the employee and the employer and by making it easy to obtain a CV that presents the true you and satisfies all these complex requirements.

Project Objectives

Problem specification

Studies have shown that writing a CV is not as easy as we think, and this gap is produced by the fact that we tend to overstate what we truly are and what we truly know about ourselves. A CV is the first indirect handshake between you and an employer, the first test to pass when you want to present your knowledge; so besides really wanting to impress, you would also like the person who reads your CV to know what you are capable of.

The concept of software that helps you create your resume is hardly a new one. There are numerous software systems available worldwide that guide you from a structural point of view. With such an approach, the end user still encounters the big problem of how to describe what he really knows. The majority of end users have poor knowledge of how to write a consistent and precise resume. Besides the resume there is also the letter of application, which we again tend to write in a belletristic way, without taking into account that we are not evaluated for our writing skills; we are evaluated for the skills we have and for how we can leverage them in our favour and in the company's interest.

Above we presented the end user's problem statement, but let us not forget about the human resources departments inside companies, which are responsible for choosing and hiring the best staff. Big companies receive on average one letter of application per minute, which means 4,800 letters of application in a business day. Besides the fact that reading each letter of application is very time-consuming, we as humans make mistakes and could skip some key elements of a letter of application that make a candidate superior to all the others. We propose a software solution that extracts the skills from a letter of application and presents only the necessary information. In this way, human resources departments can filter the incoming letters and read only the ones most appropriate to the company's requirements.

Having stated these problems, we thought of a solution that helps both end users and companies.

General Objectives

The objective of this project is to offer support for people who want to write an effective and precise resume or letter of application, and for filtering existing letters of application based on the knowledge required.

The system will create a tailored resume based on the input information introduced by the end user. This information includes personal characteristics, information about the knowledge the user possesses and other personal objectives.

To offer this support, the system generates the resume by extracting only the necessary information from the input data; in the end, the user receives the extracted data as a structured and formal output. This solution also helps users who do not possess outstanding English language skills, by discarding unrelated information.

The system was designed to grow together with the users that enter input; it is supposed to learn new skills from each user individually.

To meet these objectives, the chosen approach is based on natural language processing, because the user is not restricted to entering data in a precise structure or to using a restricted vocabulary. From a performance point of view, the challenge was to discover in the input text only the relevant information, the skills. Because we are dealing with words or constructions that represent skills, we decided to use an ontology of skills. Ontologies offer a parent structure, but a new challenge we faced was what to do with the new skills we detect. We decided to use WordNet, an online English taxonomy that offers more information about the words we are testing; in this way we meet the requirement of updating the ontology for every new skill added by a user, so the system updates itself and grows with every user. Another difficult challenge on our objective list was the problem of skills that are neither in the ontology nor in WordNet. This problem is quite difficult because we cannot detect the parent in the ontology hierarchy under which the skill should be added. The solution we came up with was to use the context of the input and to present the user with the possibilities we consider useful.
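To illustrate the WordNet side of this idea, the sketch below looks up a newly detected skill and proposes candidate parent concepts from its hypernym chain. This is only a minimal Python sketch based on NLTK's WordNet interface; the function name, the traversal depth and the choice of following only the first hypernym are illustrative assumptions, not the system's actual implementation.

# Minimal sketch (not the system's implementation): look up a newly detected
# skill in WordNet and propose candidate parent concepts from its hypernym
# chain, so the skills ontology could be extended under a suitable class.
# Assumes NLTK with the WordNet corpus installed (nltk.download('wordnet')).
from nltk.corpus import wordnet as wn

def candidate_parents(skill, max_levels=3):
    """Return hypernym names for the noun senses of `skill`, up to max_levels."""
    parents = []
    for synset in wn.synsets(skill, pos=wn.NOUN):
        level, current = 0, synset
        while level < max_levels and current.hypernyms():
            current = current.hypernyms()[0]      # follow the first hypernym only
            parents.append(current.lemma_names()[0])
            level += 1
    return parents

if __name__ == "__main__":
    # Exact output depends on the installed WordNet version; the point is that
    # the returned names are candidate parent classes for the skills ontology.
    print(candidate_parents("photography"))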

Functional requirements

From a software engineering point of view, a function of a system and of its components is considered a functional requirement. In other words, a function is seen as a set of inputs, a custom transformation inside the function, and the outputs of that function. In the same way, functional requirements define the expected results and behaviour of a system.

The functional requirements we aim to fulfil with our system are based on the descriptions in the previous chapter. First of all, the system receives input from an end user who expects a resume in exchange. Once the input data is available, the system should detect with high precision the skills associated with the input data and, if not all skills are detected, provide alternative suggestions.

From a user's point of view, at any time you can change your personal details, save and abort the process, return to the last step, or save the output data in a structured text format.

Figure 1.1 Use Case Diagram

A use case diagram is a representation of user interaction with the system, outlining the relationship between the use cases in which a user is involved and the user itself. Use case diagrams have a huge benefit because they are visual representations, and they are precise and clear in contrast with purely textual documentation. Below you will see a simple representation of the proposed system flow.

Figure 1.2 System Flow

Non-Functional requirements

When we use a certain piece of software, the first things we interact with are the non-functional requirements, which consist of quality characteristics and attributes. We use these metrics to make a general evaluation of the system in use. Non-functional requirements are the details that many stakeholders have an interest in fulfilling as well as possible. In the following paragraphs we present some of the non-functional requirements of our system.

Usability

We want to make this software solution as straightforward as possible. The main requirements that follow from this are: the interface should be very intuitive and descriptive, easy to use, and should constantly provide feedback for the actions the user performs.

Performance

This solution handles large corpora of language elements, such as words or constructions that represent skills. In other words, we are processing large amounts of data, so our system should meet some performance requirements in order to keep the user experience as fluid as possible. Some of these requirements are: response time, quality of the returned data and minimal effort in introducing the data.

Because the stakeholders are not always computer science specialists, the response time is crucial if this solution is to be liked and used. The time the system needs to extract the skills from the resume, to find the right parents and to check against external resources such as the ontology or WordNet is a key element of the design.

Security

The system works with personal and sensitive data, so security is again very important. Because of this, the system provides data integrity and confidentiality by introducing a layer of authentication and authorization. The user can retrieve or access the system profile only by authenticating with the right username and password.
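The implementation chapters detail the concrete mechanism; purely as an illustration of such an authentication layer, the sketch below shows one common way to store and verify credentials using salted password hashing. All names here are hypothetical and the system's real code may differ.

# Illustrative only: salted password hashing for the authentication layer.
# Function names are hypothetical; the system's real mechanism may differ.
import hashlib
import hmac
import os

def hash_password(password, salt=None):
    """Return (salt, digest) using PBKDF2-HMAC-SHA256."""
    salt = salt or os.urandom(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, 100_000)
    return salt, digest

def verify_password(password, salt, stored_digest):
    _, digest = hash_password(password, salt)
    return hmac.compare_digest(digest, stored_digest)  # constant-time comparison

salt, digest = hash_password("s3cret")
assert verify_password("s3cret", salt, digest)
assert not verify_password("wrong", salt, digest)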

Scalability and extendibility

The system is designed for worldwide users, so scalability and the possibility of extension are very important. More users mean more skills and more input data, which results in more resumes saved in the system.

Due to the scalability requirements, each module is independent and can easily be modified, extended or improved without affecting the others.

Bibliographic Research

Different approaches and studies by other specialists helped us understand the importance of ontologies in information extraction. In the next pages we present the works that helped us understand it. Fernando Gutierrez, Dejing Dou, Stephen Fickas, Daya Wimalasuriya and Hui Zong are the authors of "A Hybrid Ontology-based Information Extraction System" [7] and claim that information extraction is not an easy task because of the ambiguity of natural language. Useful in this case is Ontology-Based Information Extraction (OBIE), which can ease the process. Unfortunately, this technique is not widely used because of maintenance problems. OBIE is a subfield of information extraction responsible for the transformation of unstructured text into ontologies, which are its structured representation. The method combines both extraction rules and machine learning to perform the extraction. Pattern rules are not flexible, since they cannot cover a large number of variations, and this leads to incomplete extraction. On the other hand, the machine learning approach can offer too much flexibility, which leads to the extraction of unrelated terms. So the system was designed such that, given a configuration, the best method is chosen. For example, for x + y concepts, the system uses x pattern-rule extractors and y machine-learning-based extractors. If made randomly, the configuration may fail, because the selection could pick the worst extractor for every term. To resolve this, two strategies were proposed: selection and integration. The first is responsible for establishing which extractor gives the highest accuracy, so the method least prone to erroneous results is chosen. The second is a solution for the case in which choosing is difficult because the performances are similar. The authors used stacking in order to implement this strategy; the idea behind this concept is to use the outputs of both methods (pattern rules and machine learning) as input for a classifier, whose output is then used for the final extraction. Moreover, the concept of error detection is introduced, based on a vocabulary: there is an extractor for detecting incorrect statements in the domain. In the experiments section the authors describe and compare the three configurations: extraction rules, machine learning and hybrid.

Another paper, by Darshika N. Koggalahewa and Asoka S. Karunananda, "Ontology Guided Semantic Self Learning Framework", published in June 2015 [8], presents a self-learning system capable of gathering information from texts written in natural language. Developing such a system is a challenge, which is why most systems are still concerned with controlled natural language. Information extraction from unstructured texts can be achieved using an ontology-based method. The process is split into two main tasks: information extraction and representation of the extracted data in the ontology. The input of the system is of two types: main and supportive input. The framework receives the following input forms: paragraphs, documents, indexes and references. All the data is extracted automatically using tokenization, information extraction and semantic tagging, and it is used to create and extend an ontology. The system is decomposed into two main modules: a self-learning module and a question-based learning system. The first module has several responsibilities, one being linguistic processing, in which the text is understood and knowledge is retrieved. The second step is concerned with updating the existing knowledge of the application, or creating a new ontology if the knowledge is new to the system. The component is implemented using the GATE API, since it offers a tokeniser, name resolver, sentence splitter, part-of-speech tagger and semantic mapper. JAPE rules are used to semantically annotate the corpus. The second module is like a query system in which the user asks questions and the system provides answers accompanied by examples. The first step is to process the question and extract the context. Secondly, the ontology is queried and an answer is given. This step offers more than an association of the knowledge with the answer, since the system uses prior participation in order to give the correct answer. The system has proved to be a successful implementation, reaching 60% accuracy in testing the knowledge base.

In "A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging," [9] it is presented the idea of Named-Entity Recognizers or NER shortly, which are used for information extraction. To extract relevant concepts from domain-specific applications we should recognize named entities. The article presents an approach of this recognition in smart health application. The domain focuses on behavior change, lifestyle change due to over eating, lack of exercise, alcohol and drug consumption. There is designed an ontology based on behavioral health, on which the NER is developed. The NER labels words and phrases from sentences with domain-specific labels, for instance unhealthy food, potentially-risky/healthy activity, drug, tobacco and alcoholic beverage. In order to identify behavioral concepts an ontology and a named-entity recognition system are created. Usually a NER is used for finding and tagging proper nouns into some classes. For example the classes are location, organization, person to which we could add date, percentage and money that are temporal and numerical entities. In this case the named-entities could be classified into: unhealthy/ healthy food, potentially risky/healthy place and potentially risky/healthy activity. Moreover the entities could be framed into alcoholic beverages, tobacco products and drugs (narcotics). The created system uses the following labels: (1) healthy and unhealthy food labels for behaviors related with diet; (2) healthy and potentially-risky activity labels for exercise and alcohol consumption, an example of activity could be partying; (3) healthy and potentially-risky place labels for exercise and alcohol consumption, an instance of place could be night club; (4) alcoholic beverage label for alcoholic beverages; (5) drug label for recognition of drugs; (6) tobacco label for tobacco products. In the case that the system cannot find a label that is polarized for example healthy food, healthy activity and potentially risky place, then it will use neutral labels such as food, activity and place. Such systems that are based on ontologies, entity recognition annotation and information extraction are successfully implemented in biological or business intelligence domains. Such an application could be advantageous for the food domain. For example to create a relation type that consists of pairs of food items that could be eaten together. The use of an ontology provides some advantages in such applications compared to gazetteers (list of names of entities) such as making further reasoning and knowledge acquisition for the concepts chosen. There is a disadvantage when trying to label proper nouns because the feature space of this kind of nouns is not very restricted as it is for common nouns. Common words such as gym, apple, whisky, do not have word-level features for example orthographic patterns or information. The approach used in this paper makes use of WordNet, in order to avoid building and maintaining large gazetteers. As a consequence the system could easily be modified because it is ontology-dependent but domain-independent. So the ontology is expanded with WordNet.

WordNet is a lexical database, a dictionary of the English language. For any word in the language there is a definition and a set of synonyms for a specific concept. WordNet can also be seen as an ontology in which relationships (hypernym/hyponym) between nouns and synonym sets (synsets) are represented. The writers introduce the concept of WordNetDistance. It is important to make a distinction between semantic distance, similarity and semantic relatedness. Similarity can be seen as a form of semantic relatedness. Some researchers have tried to emphasize the difference between similarity and relatedness. Budanitsky and Hirst say that: "Similar entities are semantically related by virtue of their similarity (bank-trust-company), but dissimilar entities may also be semantically related by lexical relationships such as metonymy (car-wheel) and antonymy (hot-cold), or just by any kind of functional relationship or frequent association (pencil-paper, penguin-Antarctica, rain-flood)." The semantic distance is the distance in the hypernym/hyponym tree. This approach uses a WordNet library called RiTa to compute the semantic distance. So the distance between two words in the hypernym/hyponym tree is more appropriate to use with NER than relatedness. The idea is to compute the distance between any two senses of the two words. The results are normalized to take values in the interval [0,1] and the POS (Part-Of-Speech) tag is specified; in this case the noun is used as POS tag. The steps performed by the algorithm are: (1) find the common parent of the two words; (2) compute the shortest path (minimum distance) to the common ancestor of the two words; (3) compute the distance from the common parent to the root; (4) normalize the result [10].
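A rough sketch of the distance computation described above is given below, using NLTK's WordNet interface instead of the RiTa library mentioned in the paper, so the exact values will differ; the normalization in step (4) is one possible choice, not necessarily the paper's formula.

# Sketch of the WordNet distance described above (steps 1-4), using NLTK's
# WordNet interface instead of RiTa; the normalization in step (4) is one
# possible choice and the values will not match the paper exactly.
from nltk.corpus import wordnet as wn

def sense_distance(s1, s2):
    common = s1.lowest_common_hypernyms(s2)              # (1) common parent
    if not common:
        return 1.0
    parent = common[0]
    path = s1.shortest_path_distance(parent) + s2.shortest_path_distance(parent)  # (2)
    depth = parent.max_depth() + 1                       # (3) parent-to-root distance
    return path / (path + 2.0 * depth)                   # (4) normalize into [0, 1)

def wordnet_distance(word1, word2):
    """Minimum distance over all noun sense pairs of the two words."""
    pairs = [(a, b) for a in wn.synsets(word1, pos=wn.NOUN)
                    for b in wn.synsets(word2, pos=wn.NOUN)]
    return min((sense_distance(a, b) for a, b in pairs), default=1.0)

print(wordnet_distance("margarita", "martini"))   # small: both are cocktails
print(wordnet_distance("margarita", "alcohol"))   # larger: 'alcohol' sits higher up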

First of all, the Stanford Part-Of-Speech Tagger is used in order to identify the nouns in the sentences. Then the NER locates the nouns and, with the tool from Stanford CoreNLP, they are lemmatized. The next step is to place the nouns into categories: (1) Healthy Food; (2) Unhealthy Food; (3) Healthy Activity; (4) Potentially-risky Activity; (5) Healthy Place; (6) Potentially-risky Place; (7) Drug; (8) Alcoholic Beverage; (9) Tobacco. If the application cannot find a suitable polarized label for a named entity, a neutral label is used. The steps of the algorithm are the following. (1) First, the ontology is queried in order to find a matching class; if the noun is a class in the ontology, it is tagged with the label of its superclass (the higher-level class). (2) If the lemma is not a class, the individuals are queried; if the noun is an individual, it is tagged with the appropriate label of the individual's superclass. (3) If the noun is not part of the ontology, the distance algorithm is used: a component called the ranker finds the minimum distance between the noun and the other classes, and the class with the shortest path from the noun is selected. If the chosen class is positioned at the first level (for instance Alcoholic Beverage, Drug/Narcotic) and the distance is less than a higher threshold, the noun is labelled with that tag. If the chosen class is situated at a lower level (for example Beer, Cannabis) and the distance is less than a threshold value, the noun is labelled with that tag. Consider the following example: if the noun is margarita, the closest class is martini, and the distance between them is short, because martini is not a high-level class. If martini is not part of the ontology, then alcohol is the class returned when applying the shortest path. The distance between margarita and alcohol, which is 0.3, is longer than the distance between margarita and martini, which is 0.1, because alcohol is positioned at a higher level in the ontology. So the threshold values are used for adjusting the coverage of the extension. In addition, a project called Arhinet [12] is an intelligent system able to perform knowledge acquisition from archival documents. The document corpus is analyzed and information such as Event, Institution, Territorial Division and Title is extracted in order to create a core ontology. The purpose of this application is to extract and lexically annotate information for knowledge retrieval. The first step is to create a corpus from a large variety of historical documents from Transylvania. Then, based on the analysis of this corpus, the domain ontology is created. Several concepts are relevant: persons, places, dates and events. The ontology is also used for annotation and semantic querying. The system is split into three modules. The first module is called the Raw Data Acquisition and Representation Layer. This module has the responsibility of gathering information and transforming it into an easier-to-use format, a primary database. The next module is concerned with knowledge acquisition: the technical data is retrieved and the input is lexically annotated. This involves using GATE, which offers JAPE rules for pattern matching and relationship definitions. So, the primary domain ontology is extended with information retrieved from the annotated documents. What is more, the annotated text documents are stored in an XML format. The last module is used for processing and querying. The ontology is queried using SWRL rules. For instance, if two instances of the Person class have the same Person as father, then using an SWRL rule it can be inferred that the two individuals are connected by the areBrothers relationship.

Another work, "SKILL: A System for Skill Identification and Normalization". [13] in the literature describes the concept of named entity recognition (NER) and name entity normalization (NEN). The authors support the fact that there is a gap between the job seekers and the HRs concerning the matching between the right job and the right candidate. There are some aspects that should be taken into account when detecting and extracting information from CV abstracts. The same word could be expressed with different tokens (e. g C# or C sharp). The other feature that could generate ambiguity is the same string of tokens defining multiple notions (e.g Java in Java coffee and Java programming language). The NER concerns with identifying the expression in which a skill form would appear and NEN role is to match that form with an proper individual. The authors define an application called SKILL which detects, extracts and matches a skill with a qualified entity. The system is composed of two main parts called taxonomy of skills formation and the other component is tagging. The taxonomy generation consists of several steps. First of all the authors gather and divide the information concerned with skills. The sentences are split by the stop words. Moreover the further step is to eliminate noise. For this purpose there is a dictionary containing adverbs, adjectives, names and other phrases. After that the next task is to use Wikipedia API for tagging the skill with its corresponding category. The computerization of this task is realized using MediaWiki and rules established on category tags.

"Screener: a System for Extracting Education Related Information From Resumes Using Text Based Information Extraction System “ [14] presents a research project called Screener. This is a tool used to extract information from various profiles of job seekers. There are several characteristics to be extracted from a CV summary, for instance skills, experience and education. After analyzing a large corpus of data, the authors identified 4 sections of the candidate profiles: personal information, education, skills, experience, projects. Another characteristic element is that each section of data from a profile has a label. Having a quite moderate variety of these labels, the authors a specific section from the 4 could be identified. In case there are new combinations of new words used in order to define a segment, the label collection will be enlarged. The resumes relevant information is extracted by means of rule patterns. All the chosen information is recorded into a database and it is linked to a specific user. For the detection and extraction of plain text from different formats like PDF, Word a toolkit named Apache Tika has been used. The output set resulted is then split with the purpose of identifying sections headings and margins. Each identified segment of the summary is inserted the database near its corresponding job seeker. In the next steps the system uses Apache Lucene Framework for searching any inserted word. So, the HR will submit a search for a certain criteria and the system will return the result, applicants possessing certain skills. The task is completed by the Query Parser provided by the Apache framework. To improve the system, in the future the idea is to develop an annotator to analyze the applicant summary and to give as output an xml file having tags corresponding to each segment <skill>, <education>, <experience>. Another improvement would be to combine the system with a component which has the ability to recommend new possible competences.

The authors of the paper “Designing a Multi-Dimensional Space for Hybrid Information Extraction” [15] note that Information Extraction is used for obtaining structured text from an unstructured format. There are several approaches to it: knowledge-based systems or systems that are trained automatically. The advantage of the knowledge-based approach is that it provides high accuracy, but at the cost of a large amount of work, whereas trained applications are highly portable but need a varied data set for learning. The authors describe an approach for developing a hybrid project which combines these two ideas. The pattern rules should have two characteristics: they should be generic, for extracting a large variety of information, and specific, for identifying only the relevant data [16]. As a result, there are some important aspects to consider with machine learning: a large training corpus and good features for learning. The paper contains three contributions: a multi-dimensional space for method choice, a hybrid system and an evaluation of the results on a CV data set. In the first part, the design of the mixed components and the creation of the multi-dimensional space are performed. The design phase takes into account the following: knowledge-based extraction, extension of rules by generating new rules or updating existing ones, and knowledge base extension (which can mean ontologies, thesauri or gazetteer lists). The second part contains the implementation of the system and the third contains the evaluation of the results. The multi-dimensional space consists of the following: NER (Named Entity Recognition), Template Element and Relation construction, and Scenario Template production. The information extraction task performed by the hybrid system is sequential, using rules and knowledge-based information. The project is evaluated and tested on the personal data section of 180 CVs written in English and German, which show differences in detail, length and structure. The system detects information such as structure, named entities (name, job title) and templates (address). The preprocessing and the annotation are performed by machine learning tools (GATE, Mallet and RapidMiner). There are three layers of features to take into account for preprocessing. The lexical and syntactic ones provide a classification into lexical, shallow syntactical and deep syntactical features. The knowledge-based extended features split into structural (paragraphs, phrases, font information), semantic discourse (duration, date, location) and semantic features (birthday, nationality, skills, phone number, email). The third layer of features extends the first two layers, making them more specific (partOfEmail, personalInfoSection, firstWordInParagraph, language). The efficiency of the system is measured using information extraction metrics, comparing different techniques: Perceptron with Uneven Margins (PAUM), Support Vector Machine (SVM), k-Nearest-Neighbour (kNN) and Conditional Random Fields (CRF). CRF offers the best solution, the kNN and SVM results are acceptable, whereas the PAUM results are not as good. The main problem in the results concerns false negatives, which are caused by the varied corpus. The system performs well, but there are several ways of improving it, for example extending the multi-dimensional space, addressing the differences between the samples of the data set, and increasing the number of documents used for training.

In “An Approach to Extract Special Skills to Improve the Performance of Resume Selection” [17], it is said that a resume is layered. The first layer presents the headings of the blocks of the CV (education, skills, experience), whereas the second contains the text corresponding to each block. Each resume is different from the others, so in any section there may exist information that makes one candidate better than another. The paper therefore approaches the problem of selecting the most appropriate product from a set, taking into account that each product has some specialness. It presents the idea of special feature extraction in order to obtain good results in resume selection. The authors provide an approach for the skill identification part which is split into two sections: information identification and organization. The extraction is based on a special function called the degree of specialness. This function takes a feature as input and classifies an object as separate/distinct/unique or special with respect to some other objects in a set. There are three types of features: special, common and common cluster. The features are organized into layers corresponding to each type of feature. The proposed solution works for cases in which the skills are defined as a pair (key, value), where the key is the type of skill. The steps for reaching the goal are the following: preprocessing, identifying the key-value pairs and computing the DS (degree of specialness) value of the features. In the preprocessing phase the text is transformed to plain text, stop words are eliminated, the key-value pairs are expressed using ':', and multi-word skills are concatenated (for instance my sql becomes mysql). The problem of a skill being defined in different ways is solved using a hashtable. The next phase identifies the features, which can belong to the Skill Type Feature Set (STFS) or the Skill Value Feature Set (SVFS). Then the DS value is computed, and based on the output of this function the skills are organized. The system was tested on 100 student resumes and the performance was measured using a performance factor metric.
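As an illustration only, the following small Python sketch mirrors the preprocessing steps described above; the stop-word list and the alias table used to concatenate multi-word skills are invented for the example.

# Illustrative sketch of the preprocessing described above: lowercase the text,
# collapse multi-word skills via an alias table, read the "key: value" pair and
# drop stop words. The alias table and stop-word list are invented examples.
STOP_WORDS = {"and", "or", "in", "with", "of"}
ALIASES = {"my sql": "mysql", "c sharp": "c#", "java script": "javascript"}

def preprocess_skill_line(line):
    text = line.lower().strip()
    for phrase, merged in ALIASES.items():        # concatenate multi-word skills
        text = text.replace(phrase, merged)
    key, _, value = text.partition(":")           # "key: value" skill pair
    values = [w for w in value.replace(",", " ").split() if w not in STOP_WORDS]
    return key.strip(), values

print(preprocess_skill_line("Databases: My SQL, Oracle and PostgreSQL"))
# -> ('databases', ['mysql', 'oracle', 'postgresql'])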

The authors of the paper „Resume Information Extraction with Cascaded Hybrid Model“ [18] elaborated an approach for cascaded information extraction. The idea is to split the information into sections according to their headings. Each section contains specific information, which will be parsed; for instance, name and address belong to the personal information block. So the first step is to obtain general information such as Personal Information, Education and Research Experience, and the second one is to obtain detailed information, such as Educational Detailed Info, which contains the following topics: Graduation School, Degree, Major, Department. For each block the most appropriate extraction method is used. For general information identification and for sections like education, a Hidden Markov Model (HMM) is used; this model performs well when the text follows a certain sequential order. For the personal information section, a Support Vector Machine (SVM) is used. The HMM is used for assigning labels to blocks; the model uses named entities and words for producing the features. The authors used a smoothing method called Good-Turing for probability estimation. The SVM model is used for classifying the personal information section. Usually this is a binary model, but in this system it is used with the purpose of classifying the information into N classes. To obtain this, the same number of classifiers must be used and a One vs. All strategy is implemented, as sketched below: every word is classified by all the built classifiers, and the appropriate type is selected by score comparison. The system was tested on a data set containing 1,200 resumes. Evaluating the metrics precision, recall and F-measure on the cascaded model and a flat model shows that the cascaded model gives better performance. Another important finding is that HMM performs better than SVM when measuring recall. A drawback of this approach is that if the system makes an error, it can propagate to the next pass, because the system's architecture is a pipeline.
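To make the One vs. All idea concrete, here is a toy scikit-learn sketch, not the paper's implementation, in which a word represented by its surrounding context is scored by one classifier per class and the highest score wins; the training examples and labels are invented.

# Toy sketch of the One vs. All idea (scikit-learn), not the paper's model:
# each word, represented here by its surrounding context, is scored by one
# classifier per class and the highest-scoring class wins. Data is invented.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

contexts = ["graduated from the university of cluj",
            "studied at a technical university",
            "holds a bachelor degree",
            "obtained a master degree",
            "major in computer science",
            "major in software engineering"]
labels = ["School", "School", "Degree", "Degree", "Major", "Major"]

vec = CountVectorizer()
X = vec.fit_transform(contexts)
clf = LinearSVC().fit(X, labels)              # one-vs-rest: one classifier per class

test = vec.transform(["received a phd degree from the university"])
scores = clf.decision_function(test)[0]       # one score per class
print(dict(zip(clf.classes_, scores)))
print("predicted:", clf.predict(test)[0])     # class with the highest score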

The work named ”Entity Recognition” [19] states that named entity recognition is an important field in natural language processing. It can be used for search engines and for the automatic generation of resumes. There are several applications that perform named entity recognition. This paper presents a new approach to NER tasks such as person name or organization identification; moreover, the idea can easily be extended to any other named entities. Given the sentence “John and Alice live in Bucharest. They work at Microsoft”, the purpose of the system is to identify that ‘John’ and ‘Alice’ are person names and ‘Microsoft’ is an organization. The approach uses Stanford NLP because, compared to other natural language processing tools, it gives the best results on the precision, recall and F-measure metrics, except for organization identification. For instance, for the sentence “John and Alice live in Bucharest. They work at Microsoft.”, Microsoft is identified, but in case the sentence is “They work at Oracle”, Oracle is not identified as an organization. So, the authors have proposed an NER model based on a naive Bayesian classifier. Generally there are two ways of performing NER identification and classification. The first one is when the system identifies new entities based on existing ones (for instance, when IBM, Microsoft and Google appear in the same context, they can be considered company names). The second method recognizes entities based on the context (for example, from “I work at Google”, ‘Google’ can be identified as a named entity). The application uses three sources for training: manually annotated text, Stanford NER annotated text, and rules (defined in an XML format). An example of such a rule is the following.

Figure 1.3: Example of rule in [19]

The value of the trust is computed using probabilities from the Bayesian classifier. In this case the word is classified as a person name, an organization name or a location. So, when we want to determine the entity type of a word, we compute the probabilities associated with each class; the probability with the greatest value indicates the word's classification. These probabilities are computed using the manually annotated text, the Stanford NER annotated text and the XML rules. In the training data each word is annotated with its part of speech and its named entity.

Figure 1.4: Example of annotated text in [19]

The training data of this system was composed of the following: manually annotated text (150 articles), Stanford NLP annotated data (21,500 articles) and XML rules (20 rules, each with an associated trust probability). Examples of such rules are “work at” and “work for” as context phrases for organizations, “live in” and “travel to” used before locations, and “declares that” used before person names. The system is able to identify the organization names in sentences like “John is going to work for Adobe.” or “Adobe’s stock options increased by 10%.” A limitation of the system is that the training data must be as good as possible. Another issue is that there is no validation of the named entities obtained; for instance, for the sentence “I work for Unknown” the system will classify “Unknown” as an organization.
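The following toy sketch illustrates this context-based classification idea with NLTK's naive Bayes classifier; the tiny training set and the single context feature are invented for the example and are not taken from the paper.

# Toy illustration of context-based naive Bayes classification (NLTK); the tiny
# training set and the single feature are invented, not taken from the paper.
import nltk

def context_features(prev_word):
    return {"prev": prev_word.lower()}

train = [(context_features("at"),   "ORGANIZATION"),   # "work at Microsoft"
         (context_features("for"),  "ORGANIZATION"),   # "work for Oracle"
         (context_features("in"),   "LOCATION"),       # "live in Bucharest"
         (context_features("to"),   "LOCATION"),       # "travel to Paris"
         (context_features("that"), "PERSON")]         # "declares that John ..."

classifier = nltk.NaiveBayesClassifier.train(train)

# "They work at Oracle": the word following "at" is labelled as an organization.
print(classifier.classify(context_features("at")))     # ORGANIZATION
dist = classifier.prob_classify(context_features("at"))
print({label: round(dist.prob(label), 2) for label in dist.samples()})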

The main issue that appears in skill detection is due to polysemy. A term can have a single skill sense (type I) or multiple skill senses (type II). Type I forms are always matched with competences; Python is an example, in which it is considered a programming language, not a snake. Moreover, some acronyms are constantly regarded as a branch of skills; cases in point are the linkage of SVM to Support Vector Machine and ZooKeeper to Apache ZooKeeper. To achieve high precision for the skills and to avoid ambiguity, the authors used a tool called Word2vec. This group of connected models is used for mapping a string of tokens to several entities. It is based on skip-grams or continuous bag-of-words (CBOW), with the purpose of creating combinations of words that resemble each other semantically or syntactically. The authors used three ideas for the system's training. The first one uses raw vectors resulting from the output of seed skill identification. The second is a refined version of the original vectors, which are filtered. The last approach involves training the vectors by the forms; this is a more complex idea, which enlarges applicability but at the same time makes training harder.
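As a minimal illustration of the Word2vec idea (not the authors' setup), the gensim sketch below trains a tiny skip-gram model on an invented corpus and queries word similarities; parameter names follow gensim 4.

# Minimal gensim sketch of the Word2vec idea (skip-gram); the tiny corpus is
# invented, so the neighbours below are only illustrative. Parameter names
# follow gensim 4 (older versions use `size` instead of `vector_size`).
from gensim.models import Word2Vec

sentences = [["python", "java", "c#", "programming", "languages"],
             ["experience", "with", "svm", "support", "vector", "machine"],
             ["apache", "zookeeper", "hadoop", "cluster", "administration"],
             ["java", "spring", "hibernate", "backend", "development"]]

model = Word2Vec(sentences, vector_size=50, window=3, min_count=1, sg=1, epochs=50)

# Words appearing in similar contexts get similar vectors, which is what lets
# 'python' be grouped with programming skills rather than with snakes.
print(model.wv.most_similar("python", topn=3))
print(model.wv.similarity("svm", "machine"))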

Analysis and Theoretical Foundation

This chapter contains an introduction to, and the theoretical foundation of, the concepts used in creating our system. We analyze and present each part individually and later place them in the context of our system.

Problem Analysis

In this chapter we present the flow of events implemented in our system. The flows of events are realized through the product features.

Flow of events

The system's main target is to support users in creating a structured CV from a natural-language self-description. The CV is saved for later use in the system database.

Figure 4.1 Flow of events

Product features

The features included in our system define its entire functionality:

User Creation and Management: new users can be added to the system through registration, and private resources are allocated for each user.

Skill Detection and Management: the system extracts only the relevant information, in our case the skills, and manages them (updating the ontology and obtaining the description of each skill based on WordNet).

Use cases

Our system defines only one type of user, who is the main actor in all the use cases. The user is able to perform the following operations:

Registration;

Username&Password authentication;

Personal details management;

CV Generation;

Save CV;

Download CV;

Use Case: User Sign Up

This use case describes the flow of events that an actor must follow in order to create an account on our system.

Use Case Name: Sign Up

Primary Actor: User

Stakeholders and Interests: Sign Up process should be as easy and straightforward as possible

Preconditions: The user should not be logged in and should not have an account

Post conditions: The user has a functional account that can be used to log into the system

Main Success Scenario:

The actor accesses the sign up section;

The system presents the required fields to be completed;

The actor successfully completes all the fields;

The actor presses the “register” button;

The system checks if all the fields are completed;

The system verifies the eligibility of the inserted data;

The system creates an account ready to be used;

Alternative Flows:

1-3a. Registration process is aborted;

The user closes the web page;

The sign-up process is not finished;

5a. The system detects that some fields are empty;

The sign-up process is not finished;

The system sends error feedback;

The use case resumes from step 3;

6a. The inserted information is not eligible;

The sign-up process is not finished;

The system sends error feedback;

The use case resumes from step 3;

*a. At any time, the system fails.

1. The actor restarts the system;

2. The system rebuilds the prior state;

3. The user restarts the sign-up process;

Use Case: User Login

This use case describes the flow of events that an actor must follow in order to be logged into the system.

Use Case Name: Log In

Primary Actor: User

Stakeholders and Interests: Log In process should be as easy and straightforward as possible

Preconditions: The user should not be logged in and should already have an account

Post conditions: The user is logged into the system and can use all the features

Main Success Scenario:

1. The actor accesses the log in section;

2. The system presents the required fields to be completed;

3. The actor successfully completes all the fields, username and password;

4. The actor presses the “login” button;

5. The system checks if all the fields are completed;

6. The system verifies the eligibility of the inserted data;

7. The user is authenticated and ready to use the system;

Alternative Flows:

1-3a. Login process is aborted;

The user closes the web page;

The log-in process is not finished;

5a. The system detects that some fields are empty;

The log-in process is not finished;

The system sends error feedback;

The use case resumes from step 3;

6a. The inserted information is not eligible;

The log-in process is not finished;

The system sends error feedback;

The use case resumes from step 3;

*a. At any time, the system fails.

1. The actor restarts the system;

2. The system rebuilds the prior state;

3. The user restarts the log-in process;

Use Case: CV Creation

This use case describes the flow of events that an actor must follow in order to obtain a structured CV from a natural-language biography.

Use Case Name: CV Creation

Primary Actor: User

Stakeholders and Interests: CV Creation process should be as easy and straightforward as possible

Preconditions: The user should be logged in and should have set up all the personal details

Post conditions: The user receives a structured CV that can be saved or downloaded

Main Success Scenario:

1. The actor inputs the autobiography;

2. The actor starts the processing;

3. The system returns a list of skills with their parents;

4. The system saves the CV into the internal database;

5. The user presses “view cv”;

6. The system displays the structured CV;

7. The user presses “download cv”;

8. The system starts downloading the structured CV in PDF format;

Alternative Flows:

3a. The recognized skills are not in the ontology but they are found using WordNet;

The user selects the possible parent skills from the list of potential parent skills;

The system updates the CV with the selections chosen before;

3b. The skills are not recognized by the system in any way.

The user manually inserts the skill into the CV.

The system updates the CV with the newly inserted skill;

*a. At any time, the system fails.

1. The actor restarts the system;

2. The system rebuilds the prior state;

3. The user restarts the CV creation process;

Data Modeling

We are now going to discuss some system prerequisites. These prerequisites help us model the data and obtain the desired solution.

Ontologies

The term was initially used in philosophy as the name of a branch of metaphysics in charge of analyzing the types of existing models, highlighting the relations between particular and universal, intrinsic versus extrinsic properties, and essence over existence. In the computer science context, one definition that states very clearly what an ontology is belongs to Tom Gruber, the creator of the Siri intelligent personal assistant: he states that an ontology is a set of definitions of concepts and relationships that represent an agent or a community of agents [20].

The data model that describes a set of concepts, together with the relationships between those concepts, and that is used to bind the objects within that domain, is called an ontology.

Starting from Plato, up until the days of our modern society, ontologies have been built with the intention of describing, explaining and structuring the things that surround us.

In philosophy, but also in computer science, ontologies represent concepts and events bound together by their relationships. This gives the two disciplines somewhat different points of view. In computer science, researchers tend to put the focus on standardization, by defining robust and restrictive vocabularies and relationships. Philosophers are rather involved in the thinking process of how to create the vocabularies or relationships, without actually being involved in the hands-on construction.

From an engineering point of view, ontologies are pure artifacts written in a specific designated language (ontology language) that works with domain models. A set of individual instances together with an ontology constitutes a knowledge base. I reality there is a very clear separation between an ontology and a knowledge base.

Ontologies Components

An ontology is a formal and explicit description of the concepts in a domain of interest (classes, also called concepts), of the properties of each concept that describe various features and attributes of the concept (also called slots, roles or properties), and of the restrictions on those properties (also called role restrictions).

Classes are the core elements of most ontologies. We describe the concepts of a specific domain through classes. For example, the class Wine represents all wines, while instances of this class represent specific wines; the Bordeaux wine Chateau Lafite Rothschild Pauillac is such an instance. All classes can have subclasses that define concepts more specific than the superclass; for example, we can define the class Wine as having three subclasses: red, rose and white. Classes and instances also have properties. For the instance Chateau Lafite Rothschild Pauillac we can define two properties: the slot body, describing the wine itself, and the slot maker, whose value is the Chateau Lafite Rothschild winery. From a class perspective, instances of the class Wine have properties that describe the sugar level, the flavor, the year and many more. All instances of the class Wine with the subclass Pauillac have the maker property, which is an instance of the class Winery. The Winery class has a property called produces that describes all the wines it makes, expressed through is-a and part-of relationships. A descriptive picture can be analyzed below:

Figure 4.2 Some classes, instances and relations between them in the wine domain [21]. Black is used for classes and red for instances. Direct links represent slots and internal links such as instance-of and subclass-of.
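To make the wine example concrete, the hedged sketch below builds a tiny version of such a hierarchy with the Jena ontology API (the same library used later in our implementation). The namespace, class and property names are illustrative only and are not part of our skills ontology.

import com.hp.hpl.jena.ontology.Individual;
import com.hp.hpl.jena.ontology.ObjectProperty;
import com.hp.hpl.jena.ontology.OntClass;
import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class WineOntologySketch {
    private static final String NS = "http://example.org/wine#"; // illustrative namespace

    public static void main(String[] args) {
        OntModel model = ModelFactory.createOntologyModel();

        // Classes and subclasses: Wine with a RedWine specialization, plus Winery
        OntClass wine = model.createClass(NS + "Wine");
        OntClass redWine = model.createClass(NS + "RedWine");
        OntClass winery = model.createClass(NS + "Winery");
        wine.addSubClass(redWine);

        // A slot (property) relating a wine to its maker
        ObjectProperty maker = model.createObjectProperty(NS + "maker");
        maker.addDomain(wine);
        maker.addRange(winery);

        // Instances: a specific wine and the winery that produces it
        Individual pauillac = redWine.createIndividual(NS + "ChateauLafiteRothschildPauillac");
        Individual lafite = winery.createIndividual(NS + "ChateauLafiteRothschildWinery");
        pauillac.addProperty(maker, lafite);

        // Print the resulting ontology in RDF/XML form
        model.write(System.out, "RDF/XML");
    }
}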

Ontologies Classification

The figure below presents a classification of ontologies. The classification is made using the scope of the objects included in the ontologies. As an example, the scope of a local ontology is narrower than the scope of a domain ontology, because a domain is more general than a local/application ontology. General ontologies describe only concepts that are not dedicated to a specific domain; they represent abstract terms.

Figure 4.3 Ontology Classification [21]

Local Application and Task Ontologies

They capture the particularities of an application. They cannot represent a wide knowledge base. A task ontology is specific to a single task and contains only information about that particular case.

Domain Ontologies

They are dedicated to only one domain, sharing the knowledge related to that specific domain. A domain ontology can be connected with applications. An example of a domain ontology is our ontology skills.owl. The ontology is split into two main categories regarding the type of competences: the first one is domain_specific_skills_and_competences and the other one is its negation, non_domain_specific_skills_and_competences. Our ontology has 7,381 classes in the vocabulary and a total of 7,380 relationships. The maximal taxonomy depth is in our case 8.
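As an illustration of how such figures can be obtained, the hedged sketch below loads the ontology with Jena (the library used later in our implementation) and counts the classes and subClassOf relationships; the file path is an assumption.

import com.hp.hpl.jena.ontology.OntModel;
import com.hp.hpl.jena.ontology.OntModelSpec;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.RDFNode;
import com.hp.hpl.jena.vocabulary.RDFS;

import java.io.FileInputStream;

public class SkillsOntologyStats {
    public static void main(String[] args) throws Exception {
        // Assumed location of the skills ontology file
        OntModel model = ModelFactory.createOntologyModel(OntModelSpec.OWL_MEM);
        model.read(new FileInputStream("skills.owl"), null);

        // Number of classes (skills) in the vocabulary
        int classes = model.listClasses().toList().size();

        // Number of subClassOf relationships, i.e. parent-child links in the taxonomy
        int relations = model.listStatements(null, RDFS.subClassOf, (RDFNode) null).toList().size();

        System.out.println("Classes: " + classes + ", subClassOf relations: " + relations);
    }
}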

Core Reference Ontologies

A core reference ontology is the aggregation of several domain ontologies. Each group brings its own viewpoint that is applied to the domain ontology, the result being a core reference ontology. A good example of this type is hydrOntology. When it was presented, hydrOntology gathered information from different sources, chosen based on the needs of the institutions using it.

General Ontologies

As presented before, general ontologies are not specific to any domain; they are an abstraction over many other ontologies. A good example is the OpenCyc ontology, which is backed by a reasoning engine. Released in OWL format, the ontology contains many entities and relations related to human consensus reality.

Foundational/Top Level/Upper Level Ontologies

Foundational ontologies represent the first building block in crafting an ontology. We start with a minimal knowledge base and progress step by step towards a domain ontology. Top and upper level ontologies are a sort of foundational ontology, but consist of only some elements of the domain. There are many examples that cover this category because it is the one used in word ontology creation.

Ontologies Design

We are going to have a close look at the process of how an ontology can be created.

Domain and scope of the ontology

The first step in creating a good ontology is to define its domain and scope. The basic questions to answer are: what domain will the ontology apply to, what is the purpose of the ontology, what questions should the ontology answer, and how and by whom will it be maintained. The answers may influence the design; for example, if our ontology will be used in natural language processing, it may be a good idea to have synonyms and part-of-speech information associated with the concepts in the ontology.

Considering reusing already designed and implemented ontology

In our case, we start from an already designed ontology of skills and our aim is to populate and update it with every new skill inserted. We use this approach because it saves time and resources, and the existing ontology already defines concepts that can be applied to our domain.

Define the important terms in ontology

The terms to be included in our ontology should be considered first, because they provide the tree structure and also confirm that we have defined the right system.

Classes and class hierarchy

As in software development, the class definitions and hierarchy obey similar design patterns. A top-down approach starts with the most general concepts and refines them into more specific ones, while a bottom-up approach starts with the most detailed concepts and groups them into the more generic ones that include them. A third pattern is a combination of the two: we start with some very specific concepts, then define some more general ones, then come back to the specific ones, and so on.

Figure 4.4 Level Examples [21]

Each approach has its own benefits and drawbacks; the most important step is that we have to define classes. From the list of terms above, we transform into classes the elements that are independent, whose existence does not depend on other objects. These elements become classes and are the parents in the tree-structured taxonomy. A tree-structured taxonomy implies an is-a or kind-of relationship.

Properties of classes

Standalone classes will not offer the entire information about our domain model, so we would not be able to answer the questions from the first step. We should pay close attention to the relations inside the ontology. We attach properties to the classes; for example, to a class Object we may attach properties like color, dimension, size, etc. We can categorize the properties: intrinsic properties describe the internal nature of the class, like the flavor of a wine, while extrinsic properties are ones such as name, color and dimension.

Create instances

In the final step before releasing, we add instances of the classes that populate the hierarchy. In order to define an instance we first select the required class, then create an instance of that class and fill in the properties of that instance according to the class definition.

The W3C, the organization responsible for defining and maintaining Web standards, has created guidelines and implementation languages for ontology development.

RDF

RDF is a graph-based model in which resources are represented by nodes and binary relations are represented by edges. RDF defines a semantic network that describes taxonomies where the predominant relationships are parent-child.

RDFS

Starting from RDF, and out of the desire to increase its definition capacity, a new language found its roots. By adding slots to RDF, concept abstraction can be made at different levels. RDFS uses an object-oriented model that facilitates the creation of ontologies with loosely coupled concepts.

DAML

The governmental organization DARPA conducted the creation of the DARPA Agent Markup Language, dedicated to the semantic web. DAML proved excellent results for interrogation and has a strong impact on knowledge base management applied to e-commerce.

OIL

The Ontology Inference Layer is a project created by the European Union that was later united with DAML to create the most powerful and widely used ontology language, OWL.

OWL

The Web Ontology Language is the standard accepted by the W3C. OWL consists of 3 different flavors, which vary depending on the complexity and expressivity of the ontologies created:

OWL-Lite – as the name states, a lite version of the OWL core. It does not use all the concepts and is the lightest ontology creation language, also one of the lightest with respect to classes and relationships.

OWL-DL – offers more than just simple inheritance relations, providing logical constructs like conjunction, negation or disjunction.

OWL-Full – the most powerful OWL variant. Besides being the most complex one, it does not guarantee complete results for interrogation, because in contrast to OWL-DL, where classes and instances are disjoint, in OWL-Full they are not disjoint.

Over time, OWL transformed into a family of ontology languages that supports many implementations and syntaxes. Among these we can still distinguish between high level syntaxes and exchange syntaxes.

High level syntaxes are specific to the OWL structure and semantics. Such a syntax presents an ontology as a sequence of annotations, facts and axioms. Annotations are in charge of representing human and machine metadata, while the axioms carry the information about the classes, properties and individuals that compose the big picture, the ontology.

Exchange syntaxes are combinations of RDF syntaxes and OWL/XML. They have almost the same structure as an XML document and carry precise information defined by label:name pairs.

Ontologies Learning Strategies

Taking into consideration the current context, where we can see a trend of migrating from data processing to concept processing, we can conclude that ontologies capture semantic knowledge by describing concepts and their relations.

Knowledge Discovery is the science in charge of developing techniques for discovering new knowledge. The techniques involved range from human intervention for learning complex data to semi-automated techniques. Knowledge discovery can be realized using structured domains, text, documents or the Web.

From a qualitative point of view it is considered that a fully automated technique for ontology creation is not possible. Technologies have the responsibility to make this knowledge discovery process as efficient as possible, up to the point where human interaction is minimal. Knowledge Discovery holds that ontologies can be described as model classes whose population can be done by following precise tasks: mapping the ontology components in a situation where some components are given and others are not, the true purpose being the insertion of the new elements into the ontology. The basic scenarios through which this goal can be achieved are:

Concept insertion by existing instance.

Relationship insertion based on concepts and associated instances.

Ontology population using an existing instance set unrelated to the concepts.

Generating the ontology from instances and other information.

Ontology extension using new instances and auxiliary information.

We are going to further present some techniques which can be useful in building ontologies.

Unsupervised Learning Techniques

Inspired by machine learning, the unsupervised learning technique starts without any knowledge about the desired output and learns step by step from the previous steps. In unsupervised learning all the observations are assumed to be caused by latent variables; the observations are considered to be at the end of the causal chain. Using unsupervised techniques it is possible to learn from larger and more complex data than in supervised learning. In the unsupervised case, the learning process can evolve from the observations towards ever more abstract levels.

Clustering is an important component of unsupervised learning. For text, clustering is often based on a frequency representation of string data: the input text is considered a vector of words and every word is assigned a value representing its occurrence frequency. This weighting is known as TF-IDF (Term Frequency – Inverse Document Frequency). A good example of a system that uses an unsupervised learning technique is OntoUSP [22], which updates the ontology based on input texts; the method used there combines clustering with natural language processing.
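As a minimal illustration of the frequency representation mentioned above, the sketch below computes TF-IDF weights for a toy corpus in plain Java; it is only a didactic example and not part of our system.

import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

public class TfIdfSketch {
    public static void main(String[] args) {
        List<String[]> corpus = new ArrayList<>();
        corpus.add("i have experience with java and python".split(" "));
        corpus.add("i know java very well".split(" "));

        // Document frequency: in how many documents each word appears
        Map<String, Integer> df = new HashMap<>();
        for (String[] doc : corpus) {
            for (String w : new HashSet<>(Arrays.asList(doc))) {
                df.merge(w, 1, Integer::sum);
            }
        }

        // TF-IDF for the first document: term frequency times inverse document frequency
        String[] doc = corpus.get(0);
        Map<String, Integer> tf = new HashMap<>();
        for (String w : doc) {
            tf.merge(w, 1, Integer::sum);
        }
        for (Map.Entry<String, Integer> e : tf.entrySet()) {
            double tfidf = (e.getValue() / (double) doc.length)
                    * Math.log(corpus.size() / (double) df.get(e.getKey()));
            System.out.println(e.getKey() + " -> " + tfidf);
        }
    }
}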

Supervised and active Learning Techniques

In contrast with the previous method, here we have a technique that starts with a data set and a known output. In other words, we know the correlation between the input data and the corresponding results. Supervised learning problems are categorized into “regression” and “classification” problems. In regression we try to predict the result within a continuous output, meaning that we map input variables to some continuous function. In classification we try to predict the result as a discrete output; in other words, we map the input variables into discrete categories. As we can deduce by now, a completely automated system that can determine knowledge with precision is quite expensive and very hard to implement, so under these conditions human interaction is required. In contrast, a completely human-based system is not very efficient, which is why we tend to prefer a semi-automatic system that, with minimal human effort, can produce output of the same quality. The basic idea behind these learning techniques is that the methods use labels for labeling the data. The labeling process and the data set size determine how much human interaction is required.

Text Learning Techniques

Ontologies can be learned from various sources, from databases up to dedicated taxonomies and dictionaries. The most interesting case is learning from unstructured text. The methods proposed here search for nouns and for is-a concept hierarchies, rather than mapping the entire ontology with all its relations. One assumption that played a game-changing role was Harris' distributional hypothesis (Harris, 1969) [23], which states that words with similar meanings tend to occur in similar contexts, and his computations supported this. Methods and algorithms for detecting is-a and part-of relations were implemented on this basis.

Our solution fits best into this category, proposing a mechanism for updating the ontology based on an algorithm that searches for nouns that could be skills. These searches are made in unstructured text received from the user in natural language.

WordNet

Developed in the Princeton University laboratory under the direct coordination of psychology professor George Miller, WordNet has become one of the most elaborate databases incorporating a semantic lexicon for the English language. WordNet groups words into sets of synonyms called synsets. Synsets provide definitions, short descriptions and usage examples, and relations connect each synset with other words. Widely used in artificial intelligence applications and text analysis tools, in combination with a dictionary and thesaurus it has proved to be a very powerful tool.

WordNet Database

In its latest release, the WordNet database contains 155,287 words grouped into 117,659 synsets, for a total of 206,941 word-sense pairs; in compressed form it takes about 12 megabytes.

As previously explained, words in the same lexical family are grouped into synsets, which also include collocations and various expressions. Based on grammatical function, WordNet distinguishes between different parts of speech such as nouns, verbs, adjectives and adverbs. How are these synsets connected to each other? The answer is simple: by means of semantic relations. In the following lines we are going to see these relations applied to different examples:

Nouns:

Hypernyms: Y is a hypernym of X if every X is a (kind/type of) Y

(vehicle is a hypernym of car)

Hyponyms: Y is a hyponym of X if every Y is a (kind/type of) X

(car is a hyponym of vehicle)

Coordinate terms: Y is a coordinate term of X if X and Y share a hypernym

(car is a coordinate term of truck, truck is a coordinate term of car)

Meronym: Y is a meronym of X if Y is a part of X

(engine is a meronym of car)

Holonym: Y is a holonym of X if X is a part of Y

(car is a holonym of engine)

Verbs:

Hypernym: verb Y is a hypernym of verb X if the activity X is a (kind/type of) Y

Troponym: verb Y is a troponym of the verb X if the activity Y is doing X in some manner

Entailment: verb Y is entailed by X if doing X you must be doing Y

Coordinate terms: the verbs sharing a common hypernym

The semantic relations presented above hold between all the members of the linked synsets.
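The same relations can also be queried programmatically. Our implementation uses the JAWS library (edu.smu.tspell.wordnet); the hedged sketch below shows how the noun synsets of a word and their hypernyms could be listed, assuming the WordNet dictionary directory is available locally at the indicated (illustrative) path.

import edu.smu.tspell.wordnet.NounSynset;
import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;

public class WordNetRelationsSketch {
    public static void main(String[] args) {
        // Assumed path to the local WordNet "dict" directory
        System.setProperty("wordnet.database.dir", "/usr/local/WordNet-3.0/dict");

        WordNetDatabase database = WordNetDatabase.getFileInstance();
        for (Synset synset : database.getSynsets("dog", SynsetType.NOUN)) {
            NounSynset noun = (NounSynset) synset;
            System.out.println("Definition: " + noun.getDefinition());
            for (NounSynset hypernym : noun.getHypernyms()) {
                // The first word form of each hypernym synset, e.g. "canine"
                System.out.println("  hypernym: " + hypernym.getWordForms()[0]);
            }
        }
    }
}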

Entities structure

Both nouns and verbs are organized into hierarchies defined by hypernym (is-a) relationships and are key elements of the tree structure. These hierarchies are structured into 25 beginner “trees” for nouns and 15 for verbs, all linked to a unique root called “entity”. Apart from verbs and nouns we have adjectives, which are a special kind and do not have the same structure: they use a pole-based structure, for example at one pole we have bad and at the other pole we have good, and between them we have the so-called ‘satellite’ synonyms. The hypernym chain for “dog”, for instance, looks as follows:

dog, domestic dog, Canis familiaris

canine, canid

carnivore

placental, placental mammal, eutherian, eutherian mammal

mammal

vertebrate, craniate

chordate

animal, animate being, beast, brute, creature

WordNet as an ontology

Even though it was not designed to be an ontology, the hypernym/hyponym relationships between the noun synsets can be interpreted as special relations among conceptual categories. All of this makes WordNet suitable to be used as a lexical ontology in the computer science domain. However, because it was not designed to be an ontology, some drawbacks are encountered when it is used as one, and these drawbacks have caused WordNet to undergo many transformations and reinterpretations.

GATE

GATE is the biggest open source project tackling natural language processing, with over 15 years of history, and it is in active use for all types of computational tasks [24]. GATE includes functions for diverse language processing tasks like parsing and tagging, and together with its Information Extraction system (ANNIE) this has placed the software among the top tools in the field. ANNIE, as we are going to see later, is widely used together with OWL (metadata).

Some of the fields where GATE has proved to be very efficient:

Computational Linguistics: the science of language that uses computation as an investigation tool;

Natural Language Processing: the computing science that tries to understand the algorithms behind human language processing;

Language Engineering: the science of building language processing systems for which we can estimate the time of delivery and the cost of creating such a system.

One reason GATE has become so powerful and successful is that its core is split into small components along the Java component model. Below you can see the diagram that clearly describes this:

Figure 4.5 – GATE Core Structure [24]

Jape Rules

Annotations were created to label elements. JAPE is a Java Annotation Patterns Engine. As a version of CPSL – the Common Pattern Specification Language – JAPE offers finite state transduction over annotations, based at a low level on regular expressions.

Why JAPE when we have regular expressions? Regular expressions (regex) are applied on a simple and straightforward sequence of items, but we want to apply patterns to a much more complex data structure.

A JAPE grammar consists of a set of phases, each of which consists of pattern/action rules. The phases run sequentially as a cascade of finite state transducers over annotations.

A JAPE rule has two major parts:

The left-hand side (LHS) is an annotation pattern description. This part is responsible for matching the text to be annotated while avoiding undesirable matches.

The right-hand side (RHS) consists of annotation manipulation statements. This part holds the necessary information about the annotations that have to be created or manipulated.

Let us see the following example:

Figure 4.6 JAPE Rule example

In the above example the LHS consists of the lines before the “-->” mark and the RHS of the lines after it. Considering the above explanation about the LHS and RHS, in this example we have a rule entitled ‘CarName1’, which will match text annotated with a ‘Lookup’ annotation whose ‘majorType’ feature is ‘carname’. When this rule matches a sequence of text, the entire matched sequence is bound to the label, in our case ‘carname’. On the RHS, we refer to the matched text through the label given in the LHS, in our case ‘carname’. This text then receives an annotation of type ‘CarName’ with a ‘rule’ feature set to ‘CarName1’.

The non-functional requirements of efficiency and speed directed us towards JAPE rules, due to the time factor and to their performance.

Method based on JAPE Rules and WordNet for Detecting Professional Skills from Resumes Written in Natural Language

We propose a solution for skill detection based on JAPE rules and WordNet. We divide our method into small components that, combined, result in the desired solution.

The following image presents the main steps of the method. First the user introduces a natural language self-description. This input data is pre-processed using tools from GATE. After the text is pre-processed we try to find skills by applying the JAPE-based skill detection module; for all skills we communicate with the ontology. For new skills that are not in the ontology we try to find their parents in order to be able to insert them into the ontology. The last step is to save the CV and to update the ontology.

Figure 5.1 Main Steps of the Method

Crawler

Our implementation requires resumes or letters of application for testing and research purposes. We have designed a crawler, a tool that automatically visits dedicated Web sites and extracts only the desired information, in our case the resumes. The crawler was used to understand the rules by which a skill can be identified and for the testing phase.
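A minimal sketch of such a crawler, assuming the jsoup library and a hypothetical listing page whose resume links and text blocks are selected by illustrative CSS selectors, could look as follows:

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ResumeCrawlerSketch {
    public static void main(String[] args) throws IOException {
        // Hypothetical page listing resumes; the URL and selectors are assumptions
        Document listing = Jsoup.connect("http://example.org/resumes").get();

        int count = 0;
        for (Element link : listing.select("a.resume-link")) {
            Document resumePage = Jsoup.connect(link.absUrl("href")).get();
            // Keep only the textual content of the resume body
            String text = resumePage.select("div.resume-body").text();
            Files.write(Paths.get("resume_" + (count++) + ".txt"), text.getBytes());
        }
        System.out.println("Downloaded " + count + " resumes");
    }
}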

Input data pre-processing and JAPE rule definitions

In order to obtain the desired results, our solution uses corpus processing and natural language processing (NLP). We want to obtain maximum value from our input data analysis, so we have to pre-process the input corpus.

The steps in the pre-processing a corpus are:

Tokenization

Sentence splitting

POS-tagging

Gazetteer

We are going to have a look at each one individually and explain the basic concepts.

Tokenization splits the input text into very simple “tokens” like punctuation marks, words or numbers.

Figure 5.2 Tokenization Effect

Sentence splitting is a finite-state transduction which segments the input text into sentences.

Figure 5.3 Sentence Splitting Effect

Part-of-speech (POS) tagging assigns to each word a tag that represents its part of speech.

Figure 5.4 POS Tagging Effect

Gazetteer: at this step, entity names are identified based on lookup lists (a sketch of the full pre-processing pipeline follows the examples below). For example:

ECU European Currency Units;

NT Dollar New Taiwan dollar;
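In GATE Embedded, a pre-processing pipeline roughly corresponding to these four steps can be assembled by loading the ANNIE processing resources into a SerialAnalyserController. The sketch below follows the standard GATE embedding pattern and is only an approximation of how our annieController is built; paths and initialization details may differ.

import gate.Factory;
import gate.Gate;
import gate.ProcessingResource;
import gate.creole.SerialAnalyserController;

import java.io.File;

public class PreprocessingPipelineSketch {
    public static SerialAnalyserController buildPipeline() throws Exception {
        Gate.init(); // initialise GATE (GATE home is assumed to be configured)
        Gate.getCreoleRegister().registerDirectories(
                new File(Gate.getPluginsHome(), "ANNIE").toURI().toURL());

        SerialAnalyserController controller = (SerialAnalyserController)
                Factory.createResource("gate.creole.SerialAnalyserController");

        // Tokeniser, sentence splitter, POS tagger and gazetteer, in this order
        String[] prs = {
                "gate.creole.tokeniser.DefaultTokeniser",
                "gate.creole.splitter.SentenceSplitter",
                "gate.creole.POSTagger",
                "gate.creole.gazetteer.DefaultGazetteer"
        };
        for (String prName : prs) {
            controller.add((ProcessingResource) Factory.createResource(prName));
        }
        return controller;
    }
}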

We analyzed around 6000 resumes and letters of application. During our research we identified some common patterns. First of all, we identified that skills are most likely to appear as nouns or noun constructions.

JAPE rules

Using the context generator module we found out that the skills are, in almost all cases, nouns or noun constructions. We also defined some rules based on specific constructions.

The first rule we defined is “skill.jape”. This rule helps us identify the nouns in the input data that are also in the ontology. By analyzing the ontology we discovered that each skill is determined by its parent, description and type. Here is a short example to understand this better:

<rdf:Description rdf:about="http://www.semanticweb.org/ontologies/skills-ontology#Python">

<rdfs:subClassOf rdf:resource="http://www.semanticweb.org/ontologies/skills-ontology#programming_languages"/>

<rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Class"/>

</rdf:Description>

From the example above we can see that the skill has a name or URI, its parent (subClassOf) and a type (class). We create the JAPE rule based on the following two constraints: the skill should have a URI and should be a noun. We came up with the following rule:

Phase:nestedpatternphase
Input: Lookup Token
Options: control = appelt
Rule: Skills
Priority: 20
(
{Lookup.URI =~ "http://www.semanticweb.org/ontologies/skills-ontology"}//, Token.category =~ "NN", Token.category !~ "S"}
): label
-->
:label.Skills = {rule = "Skills", skill = :label.Token.string, category = :label.Token.category, URI = :label.Lookup.URI}

This method can successfully discover all the nouns that are in our input text and are also in the ontology.

The next rule we defined is “have_experience.jape”. This rule detects the skills that are included in constructions like “I have experience in/with + NN”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Have_experience
Priority: 20
(

({Token.string == "I"})
({Token.string == "have"})
({Token.string == "experience"})
({Token.string == "in"}|{Token.string == "with"}

)
({Nouns})+
): label
-->
:label.Have_experience = {rule = "Have_experience", skill = :label.Nouns.skill}

The following rule can detect the skills that are included in the following type of constructions: “I know + NN”. The name of the rule is “i_know.jape”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: I_know
Priority: 20
(

({Token.string == "I"})
({Token.string == "know"}

)
({Nouns})+
): label
-->
:label.I_know = {rule = "I_know", skill = :label.Nouns.skill}

The next rule detects the skills that are inside a sentence containing a modal verb. For example, in “I can play football”, the word “football” will be identified as a possible skill. To cover this pattern we defined “modal_verb_skill.jape”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Modal_verb_skill
Priority: 20
(

({Token.string == "I"})
(({Token.string == "can"}|{Token.string == "may"}))
({Token.category == "VB"})
({Nouns})+
): label
-->
:label.Modal_verb_skill = {rule = "Modal_verb_skill", skill = :label.Nouns.skill}

Another pattern that we identified is “I write in + NN”. The rule “write_in.jape” detects the possible skills targeted by this pattern:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Write_in
Priority: 20
(

({Token.string == "I"})
({Token.string == "write"})
({Token.string == "in"})
({Nouns})+
): label
-->
:label.Write_in = {rule = "Write_in", skill = :label.Nouns.skill}

The next rule identifies the high-level capabilities expressed using the phrase “I rock at + NN”. The rule is called “rock_at.jape”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Rock_at
Priority: 20
(

({Token.string == "I"})
({Token.string == "rock"})
({Token.string == "at"})
({Nouns})+
): label
-->
:label.Rock_at = {rule = "Rock_at", skill = :label.Nouns.skill}

Working with a technology, or describing what you have worked with, is captured by another rule; this is why we identified the following construction: “I worked/interact with + NN”, and we denote this rule “worked_with.jape”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Worked_with
Priority: 20
(
({Token.string == "I"})
({Token.string == "worked"}|{Token.string == "interact"})
({Token.string == "with"})
({Nouns})+
): label
-->
:label.Worked_with = {rule = "Worked_with", skill = :label.Nouns.skill}

Another very basic rule that captures skills very well is “good_at.jape”. This rule determines the skill in constructions like “I am good/master/great/perfect/excellent at + NN”:

Phase:nestedpatternphase
Input: Token Nouns
Options: control = appelt
Rule: Good_at
Priority: 20
(
({Token.string == "I"})
({Token.string == "am"}|{Token.string == "'m"})
({Token.string == "good"}|{Token.string == "perfect"}|{Token.string == "master"}|{Token.string == "excellent"}|{Token.string == "grate"})
({Token.string == "at"})
({Nouns})+
): label
-->
:label.Good_at = {rule = "Good_at", skill = :label.Nouns.skill}

These rules help us determine the skills and the possible skills, as well as the parents under which to insert them.

Skill Detection

As presented before, one of the requirements was to detect the skills present in the input data as fast and as efficiently as possible. We use the ontology for storing and retrieving the skills.

Our skill detection algorithm takes as input the following parameters:

User resume – the user's own description presented in natural language. In this description he presents his strengths and weaknesses.

In return, the algorithm generates a structured resume based on the ontology and on the new skills detected using rule recognition and WordNet.

The algorithm consists of an initialization step, where the corpus pre-processing is done, a step in which the skills that are already in the ontology are detected, and a stage where new skills are recognized based on the defined rules.

Skill Detection Algorithm

–––––––––––––––––––––––––

Algorithm 1: Skill Detection

–––––––––––––––––––––––––

Inputs: resume – one of the resumes obtained by the crawler,

ontology – the ontology we are using

Output: resume – the resume in a structured form

Begin

doc = preprocesingResume(resume);

docSkillsSet = Get_All_Skills_Existing_in_Resume_and_Ontology(doc);

nonSkillsSet = Get_All_Skills_in_Negative_form(doc);

resumeSkillsSet = docSkillsSet – nonSkillsSet;

skillsParents = Get_Ontology_Map(ontology);

foreach skill in resumeSkillsSet do

if(skillsParents contains skill)

then Add_Skill_To_Resume_Skills_Map(resume,skillParent,skill)

else Add_Skill_To_Set(potentialSkillSet, skill)

end foreach

patternSkillsSet = Add_To_Set(Get_All_Skills_by_Rules(doc))

patternSkillsSet = patternSkillsSet – resumeSkillsSet

foreach skill in patternSkillsSet do

descriptionSet = Get_Description_WordNet(skill)

foreach element in descriptionSet do

elementParentsSet = Find_Parents(element)

potentialParentsSet = potentialParentsSet + elementParentsSet

end foreach

Add_Skill_To_Resume_PotentialSkills_Map(resume,potentialParentsSet,skill)

end foreach

return resume

END

As presented before, in the first step the algorithm does some pre-processing work: tokenization, sentence splitting, POS-tagging and gazetteer lookup.

In the second step the algorithm detects the skills from the input data that are already in the ontology:

Get_All_Skills_Existing_in_Resume_and_Ontology – using the defined JAPE rule we search for nouns or noun constructions in our input data that are also in the ontology;

Get_All_Skills_in_Negative_form – the first method detects all the nouns even when they are in a negative construction and express a lack of knowledge. Using the JAPE rules for negative expressions, we search for such constructions in our input data;

resumeSkillsSet – the set difference between two sets: the first one, returned by the first method, representing all the skills in the input data, and the second one representing the negated skills in the input data; in the end we have the set of skills that are in the input resume and are not in a negative structure;

Add_Skill_To_Resume_Skills_Map(destinationMap, skillParent, skills) – as output, our resume is formed of two map structures: the first one representing <parent, skills> pairs where the skills are in the ontology; the second one representing <parent, skills> pairs where the skills are not in the ontology and their parents were detected using rules and WordNet;

Add_Skill_To_Set(destinationSet, skill) – adds the skill to destinationSet;

Get_All_Skills_by_Rules – we have defined several JAPE rules to detect the possible skills in different constructions. This method applies those rules and returns the set of detected skills;

patternSkillsSet – the set obtained by removing from patternSkillsSet the elements that are also in resumeSkillsSet. We do this step because some skills detected inside the patterns might also have been detected by the first JAPE rule (see the sketch after this list);

Get_Description_WordNet(skill) – returns the description from WordNet;

Find_Parents(element) – this method searches the ontology in order to find the parents of the given element;

Add_Skill_To_Resume_PotentialSkills_Map(destination, skillParentsSet, skill) – fills in the second map associated with the output resume;
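As a minimal illustration of the set operations used above (the difference resumeSkillsSet = docSkillsSet – nonSkillsSet and the removal of already detected skills from patternSkillsSet), plain java.util sets are enough; the skill names below are only examples.

import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class SkillSetOperations {
    public static void main(String[] args) {
        Set<String> docSkillsSet = new HashSet<>(Arrays.asList("java", "python", "sql"));
        Set<String> nonSkillsSet = new HashSet<>(Arrays.asList("sql"));          // e.g. "I do not know SQL"
        Set<String> patternSkillsSet = new HashSet<>(Arrays.asList("java", "scrum"));

        // resumeSkillsSet = docSkillsSet - nonSkillsSet
        Set<String> resumeSkillsSet = new HashSet<>(docSkillsSet);
        resumeSkillsSet.removeAll(nonSkillsSet);

        // patternSkillsSet = patternSkillsSet - resumeSkillsSet
        patternSkillsSet.removeAll(resumeSkillsSet);

        System.out.println(resumeSkillsSet);   // [java, python]
        System.out.println(patternSkillsSet);  // [scrum]
    }
}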

Detailed Design and Implementation

System Architecture

The scope of the system is to provide support for end users who want to create their resume in a structured form by describing their capabilities, or to provide support to human resources departments and companies by offering a tool that transforms a letter of application into a structured resume describing the candidate's capabilities. This tool can help them filter the candidates based on a set of skills.

Figure 6.1 System Architecture

The picture above presents the system architecture. The main component is the Skill Detection Module, which receives the input text returned by the Text Pre-processing Module and uses the JAPE Rules Definition to identify the existing skills, confirmed against the Skills Ontology, and to identify the new skills, confirmed with WordNet. The Ontology Update Module is responsible for the insertion of the newly discovered skills into the ontology. The Graphical User Interface is the component that makes the system usable by end users.

For the system architecture implementation we considered a multilayered architecture, where each layer has a dedicated and distinct responsibility and all together they create the entire application. The factors that led us to use this type of architecture are:

Responsibilities are clearly separated between components.

The workflow can be easily tested and exposed.

The ability to extend or replace individual layers without affecting the functionality of the entire system.

Figure 6.2 Implementation Architecture

The implementation architecture is presented in the above picture. The Communication Layer, Business Layer, Data Access Layer and Data Layer are implemented as a REST API Web Service.

REST is an architectural style that follows the client-server design pattern. We use HTTP requests to implement the Communication Layer between the presentation layer and the business layer. Here are some examples of HTTP endpoints in our case:

User Registration: POST “/ontology/register”;

Forgot Password: POST “/ontology/api/forgot”;

User Login: POST ”/ontology/api/login”;

Statelessness is one of the key elements of the REST API: no state is recorded on the server. This makes every operation self-contained. Each HTTP request is executed individually on the server, without depending on other requests; in this way we avoid HTTP sessions and cookies, and at the same time we provide a simple solution that can be used both by web applications and mobile applications.
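Because each request is self-contained, any HTTP client can call the service directly. A hedged sketch of how a client could post the login request listed above is shown below; the host, port and JSON field names are assumptions.

import java.io.OutputStream;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class LoginClientSketch {
    public static void main(String[] args) throws Exception {
        URL url = new URL("http://localhost:8080/ontology/api/login"); // assumed host and port
        HttpURLConnection connection = (HttpURLConnection) url.openConnection();
        connection.setRequestMethod("POST");
        connection.setRequestProperty("Content-Type", "application/json");
        connection.setDoOutput(true);

        // Illustrative credentials; the real field names depend on the binding class
        String body = "{\"username\":\"vlad\",\"password\":\"secret\"}";
        try (OutputStream out = connection.getOutputStream()) {
            out.write(body.getBytes(StandardCharsets.UTF_8));
        }

        // The server answers each request independently, no session is created
        System.out.println("HTTP status: " + connection.getResponseCode());
    }
}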

Figure 6.3 Communication Model

Persistent storage and data access layer

The data layer represents the mechanism for storing data in the system. The stored data is retrieved by the access layer, which sends it for processing to the business layer, which in turn sends the processing output to the user interface.

The access layer is the door to the saved data. It exposes an interface and a way of accessing the data without creating dependencies inside the logic layer. Dependencies in the logic layer increase the level of complexity and make even changes that do not affect the entire system very difficult.

Our system uses both the database and the ontology for obtaining the skills, so components for working with databases and with ontologies were implemented. The figure below presents the tables used for storing the data in the database.

Figure 6.4. Database structure

The figure presents all the tables and the relations between them. We are going to take each of them individually and discuss it.

The “user” table stores personal information such as the id, which is the primary key, name, email, password, contact details, the salt – the random value combined with the password before hashing – and also some fields required by Spring Security.

The “authorities” table, used by Spring Security, stores information such as the user id and the different user roles.

The “resume” table, which saves a part of the entire resume, has the id as primary key and the user id as foreign key, and also saves additional information about the user.

The “education” table saves the educational information of a user resume; like “resume”, it has the id as primary key, and the resume id as foreign key.

The “skill” table is the one that, combined with “resume” and “education”, stores the user's resume in the system. It has an id as primary key and the resume id as foreign key. Besides this, it holds information about the skills and their parents according to the processed input.

The “rating” table is used when the possible skills are presented to the user. Based on the parents he previously chose, this table is updated and its counters incremented.

The entire database is built using PostgreSQL and is generated entirely from the code using annotations.

@Entity
@Table(name = "fuser", schema = "public")
public class User {
@Id
@SequenceGenerator(name = "UserSeq", sequenceName = "fuser_id_seq", allocationSize = 1)
@GeneratedValue(strategy = GenerationType.SEQUENCE, generator = "UserSeq")
private Long id;
@NotNull
private String username;
private String fullname;
private String email;
@NotNull
private String password;
private String salt;
private String address;
private String phone;
private Calendar dateOfbirth;
@Transient
private long expires;
@NotNull
private boolean accountExpired;
@NotNull
private boolean accountLocked;
@NotNull
private boolean credentialsExpired;
@NotNull
private boolean accountEnabled;
@OneToMany(cascade = CascadeType.ALL, fetch = FetchType.EAGER, mappedBy = "user", orphanRemoval = true)
private Set<UserAuthority> authorities;
@OneToMany(cascade = CascadeType.ALL, fetch = FetchType.EAGER, mappedBy = "user", orphanRemoval = true)
private Set<Resume> resumes;
@OneToMany(cascade = CascadeType.ALL, fetch = FetchType.LAZY, mappedBy = "user", orphanRemoval = true)
private Set<Rating> ratings;
public String getUsername() {
return username;
}

The data access layer is composed of two main components: the first one manages the database access and the other one the ontology management.

The database management component is designed using two layers: the repository layer, which works directly with the database, and the service layer, which contains all the domain services.

The repository layer is based on five interfaces (EducationRepository, RatingRepository, ResumeRepository, SkillRepository, UserRepository) that, using the previously presented classes mapping the domain model (Education, Rating, Resume, Skill, User), extend and implement the CRUD operations (create, read, update, delete).

Figure 6.5 Repository Structure

We also implement different methods for querying the database based on our needs. Using Spring annotations we can declare these methods inside the interface.

/**
* @author Neste Vlad
*/
@Repository
public interface RatingRepository extends CrudRepository<Rating, Long> {

@Query("select r from ontology.domain.Rating as r where r.user=:user")
List<Rating> findUserRating(@Param("user") User user);
}

Figure 6.6 Database Query Example

The second component of the database management is the Service Layer, which provides to the upper classes the functionality for working with the database. The Service Layer builds on the abstraction introduced by the Repository Layer over the Domain Layer.

The Service Layer is the interface between the low-level database management and the rest of the system.

/**
* @author Neste Vlad
*/
@Service
@Transactional
public class RatingService extends OntoService<Rating> {
@Autowired
private RatingRepository ratingRepository;
@Override
public Rating save(Rating rating) {
return ratingRepository.save(rating);
}
@Override
public void delete(Rating rating) {
ratingRepository.delete(rating);
}
@Override
public Rating find(long id) {
return ratingRepository.findOne(id);
}
public List<Rating> findUserRating(User user){
return ratingRepository.findUserRating(user);
}
public List<Rating> findAll(){
return (List<Rating>) ratingRepository.findAll();
}
}

Figure 6.7 RatingService as a model of the Service Layer implementation

The ontology management component is in charge of all the operations on the ontology. We are working with skills, so we have to interrogate, find, define, add and perform many other data manipulation operations.

The class in charge of all these operations and of the data access is the “OntologyManager” class. “OntologyManager” is implemented using the Singleton design pattern, meaning that only one instance of it is used and the connection with the ontology is made only once, at start-up. The class implements the methods responsible for skill addition, parent interrogation and context retrieval.

private Map<String, String> getSuperClasses(Model model) {
Map<String, String> map = new TreeMap<>();
StmtIterator iter = model.listStatements();
// print out the predicate, subject and object of each statement
while (iter.hasNext()) {
Statement stmt = iter.nextStatement(); // get next statement
com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject(); // get the subject
Property predicate = stmt.getPredicate(); // get the predicate
RDFNode object = stmt.getObject(); // get the object
if (predicate.toString().contains("subClassOf")) {
String concept = getConceptName(subject.toString());
String parent = getConceptName(object.toString());
map.put(concept, parent);
}
}
return map;
}

Figure 6.8 getSuperClasses Method implementation
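The Singleton access pattern mentioned above can be summarized by the following hedged sketch; the real class also holds the Jena model and the skill manipulation methods, which are omitted here.

public class OntologyManagerSkeleton {
    private static OntologyManagerSkeleton instance;

    // The connection to the ontology would be opened here, only once
    private OntologyManagerSkeleton() {
        // e.g. read skills.owl into a Jena model
    }

    public static synchronized OntologyManagerSkeleton getInstance() {
        if (instance == null) {
            instance = new OntologyManagerSkeleton();
        }
        return instance;
    }
}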

The “OntologyManager” is used by the “OntologyService”, which implements the business layer presented in the following sections.

Figure 6.9 OntologyManager Class Diagram

Business Layer

What our system should do is designed and implemented at this layer. According to our System Components Diagram, the main tasks solved at this level are: skill detection, education detection, WordNet interrogation, ontology update, user management and security.

Skill Detection Component

As already presented, our approach for solving this problem is based on JAPE rules. We use GATE tools for the natural language pre-processing steps presented earlier. We have already defined the JAPE rules we are using; now we are going to concentrate on the implementation details.

The class responsible for implementing the web service requests that work with the ontology is OntologyResource. This class implements only the HTTP responses, based on the OntologyService. Before presenting a method implementation from these classes, we should have a look at one feature that was already presented theoretically: the rating set.

We already presented at a theoretical level that, for each user, we work with the rating table in order to provide the so-called context of the resume. The entries in this table represent the parents the user previously chose. Based on this we can suggest more relevant parents for the possible skills.

Returning to the HTTP method implementations, we present the implementation of the skill detection algorithm.

@POST
@Path(value = "/processText")
@Consumes(MediaType.APPLICATION_JSON)
public ResumeBinding processBio(OntologyBinding binding) {
resumeBinding = null;
try {
resumeBinding = ontologyService.processText(binding.getBiography());
for(String s: resumeBinding.getPossibleSkills().keySet()){
Set set = sortSet(s, binding.getUsername(), resumeBinding.getPossibleSkills().get(s));
resumeBinding.getPossibleSkills().put(s, set);
}
} catch (ResourceInstantiationException | ExecutionException e) {
LOG.error(e.toString());
}
return resumeBinding;
}

Analyzing the method we can see that OntologyService is called and the biography is passed to it. We are going to have a close look at the processText method from the OntologyService.

public ResumeBinding processText(String text) throws ResourceInstantiationException, ExecutionException {
Corpus corpus = Factory.newCorpus("myCorpus");
Document doc = Factory.newDocument(text);
corpus.add(doc);
annieController.setCorpus(corpus);
annieController.execute();

ResumeBinding resumeBinding = ServiceHelper.getResume(doc);
Map<String, String> skillsWithParent = getOntologyManager().findAllSkillParent();
for(String s: text.split("\\W+")){
if(skillsWithParent.containsKey(s)){
if(!resumeBinding.getSkills().containsKey(skillsWithParent.get(s))){
Set<String> skill = new HashSet();
skill.add(s);
resumeBinding.getSkills().put(skillsWithParent.get(s), skill);
}
else{
Set<String> skills = resumeBinding.getSkills().get(skillsWithParent.get(s));
skills.add(s);
}
}
}
ResumeBinding nonSkillsRes = ServiceHelper.getNonSkills(doc);
for(String parent: nonSkillsRes.getSkills().keySet()){
for(String skill : nonSkillsRes.getSkills().get(parent)){
resumeBinding.getSkills().get(parent).remove(skill);
}
}
Set<String> nonModelSkills = ServiceHelper.getNonModelSkills(ServiceHelper.getPossibleSkills(doc));
Map<String, Set<String>> nonModelSkillsWithParents = new HashMap<>();
List<String> toRemove = new ArrayList<>();
for(String nonModelSkill: nonModelSkills){
for(String parent: resumeBinding.getSkills().keySet()){
toRemove.addAll(resumeBinding.getSkills().get(parent).stream().filter(skill -> skill.equalsIgnoreCase(nonModelSkill)).map(skill -> nonModelSkill).collect(Collectors.toList()));
}
}
nonModelSkills.removeAll(toRemove);
for(String nonModelSkill: nonModelSkills){
Set potentialParents = findParents(executeWordNet(nonModelSkill));
nonModelSkillsWithParents.put(nonModelSkill, potentialParents);
}
resumeBinding.setPossibleSkills(nonModelSkillsWithParents);
//Clean up
Factory.deleteResource(doc);
Factory.deleteResource(corpus);

return resumeBinding;

}

In this method we analyze the entire biographic input. We defined three steps to accomplish this.

The first block is responsible for initializing the resume with the skill.jape and education.jape rules. These initializations are made in the getResume method from the ServiceHelper.java class.

resumeBinding.setEducations(getEducations(defaultAnnotSet.get("EducationViaEduOrg"), defaultAnnotSet.get("EduOrg")));
resumeBinding.setSkills(getSkills(defaultAnnotSet.get("Skills"),

getNotHaveSkillsAnnotSet(defaultAnnotSet)));
return resumeBinding;

The second block is responsible for initializing the rules that discard the undesired skills, i.e. skills that occur in constructions such as modal verbs or future tenses. The nested for loops compute the difference between the two sets: the set of skills detected at step 1 and the set detected at the second step. The method responsible for initializing the annotation set is:

private static AnnotationSet getNotHaveSkillsAnnotSet(AnnotationSet defaultAnnotSet){
Set<String> annotTypesRequired = new HashSet();
annotTypesRequired.add("Do_not");
annotTypesRequired.add("Would_like_to");
annotTypesRequired.add("Will");
annotTypesRequired.add("Negative_modal");
return defaultAnnotSet.get(annotTypesRequired);
}

The method for detecting the non-skills is called getNonSkills:

static Map<String, Set<String>> getNonSkills(AnnotationSet notHaveSkillsAnnotSet){
Set notHaveSkills = notHaveSkillsAnnotSet.stream()
.filter(annotation -> annotation.getFeatures().get("URI") != null && ((String) annotation.getFeatures().get("URI")).contains("skills-ontology"))
.map(annotation -> {
String str = ((String) annotation.getFeatures().get("URI"));
return str.substring(str.indexOf("#") + 1);
})
.collect(Collectors.toSet());
return OntologyManager.getInstance().findParent(notHaveSkills);
}

We filter the data based on the tags that a skill has in the ontology: the URI feature and the “skills-ontology” marker it contains.

The last step in our skill detection method is obtaining the possible skills. We previously defined methods for selecting skills from the ontology; we also developed rules that detect the skills that are part of certain constructions. The annotation set for this category is built by getPossibleSkills, defined in ServiceHelper.java. What we do there is include all the JAPE rules that are responsible for detecting these patterns:

public static AnnotationSet getPossibleSkills(Document doc){
AnnotationSet defaultAnnotSet = doc.getAnnotations();
Set<String> annotTypesRequired = new HashSet();
annotTypesRequired.add("Good_at");
annotTypesRequired.add("Have_experience");
annotTypesRequired.add("I_know");
annotTypesRequired.add("Write_in");
annotTypesRequired.add("Modal_verb_skill");
annotTypesRequired.add("Rock_at");
annotTypesRequired.add("The_best_at");
annotTypesRequired.add("Worked_with");
return defaultAnnotSet.get(annotTypesRequired);
}

After detecting the list of possible skills, we execute for each skill the function executeWordNet, which interrogates WordNet and returns the description of the skill.

Set potentialParents = findParents(executeWordNet(nonModelSkill));

The method executeWordNet interrogates the WordNet database and searches only for the noun definitions.

public String executeWordNet(String skill){
StringBuilder stringBuilder = new StringBuilder();
WordNetDatabase database = WordNetDatabase.getFileInstance();
Synset[] synsets = database.getSynsets(skill, SynsetType.NOUN);
if (synsets.length > 0) {
for (Synset synset : synsets) {
stringBuilder.append(synset.getDefinition()).append('\n');
}
}
return stringBuilder.toString();
}

For each potential skill we compute the set of potential parents by calling the method findParents.

public Set findParents(String definitionSkills) throws ExecutionException, ResourceInstantiationException {
Set set = new HashSet<>();
if(definitionSkills.length() > 0){
Corpus corpus = Factory.newCorpus("myCorpus");
Document doc = Factory.newDocument(definitionSkills);
corpus.add(doc);
annieController.setCorpus(corpus);
annieController.execute();

set = ServiceHelper.getParents(doc);

//Clean up
Factory.deleteResource(doc);
Factory.deleteResource(corpus);

}
return set;
}

This function interrogates the ontology and returns the set of all the parents that a skill has inside the ontology.

Ontology Update Component

The second main objective of our system requirements was to add new skills to the ontology based on the user input. The business layer is responsible for the implementation of this component as well. We already presented the cases in which the ontology is updated and how these steps are carried out. The methods that implement these features are included in the OntologyResource class, which again relies on the OntologyManager.

Figure 6.9 Ontology Management Components

We are going to look in detail at the method that inserts a new concept (skill) into the ontology. The method responsible for this is addSkillWithParent, defined below:

public synchronized void addSkillWithParent(String skill, String parent, String head) throws FileNotFoundException {
InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH);
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
Statement s = null;
Resource r = null;
Statement s1 = null;
Resource r1 = null;
boolean once = false;
String subEntry = "http://www.semanticweb.org/ontologies/skills-ontology#";
String type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
model.read(in, null);
StmtIterator iter = model.listStatements();
while (iter.hasNext()) {
Statement stmt = iter.nextStatement(); // get next statement
Property predicate = stmt.getPredicate(); // get the predicate
RDFNode object = stmt.getObject(); // get the object

if (object.toString().equals(subEntry + head) && !once) {
r = model.createResource(subEntry + parent);
s = model.createStatement(r, predicate, object);
model.add(s);
r1 = model.createResource(subEntry + skill);
s1 = model.createStatement(r1, predicate, r.toString());
model.add(s1);
} else if (s != null && s1!= null && predicate.toString().equals(type) && !once) {
r.addProperty(predicate, object);
r1.addProperty(predicate, object);
once = true;
File f = new File(ONTOLOGY_FILE_PATH);
FileOutputStream outputStream = new FileOutputStream(f);
model.write(outputStream, null);
try {
outputStream.flush();
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}

In the first lines we take care of the ontology initialization and data file management. The subEntry prefix defined before is part of the URI of every concept in the ontology, and the same holds for the type. Using the Jena methods getPredicate and getObject we can also obtain the superclass of each concept. The head is one of the two categories existing in the ontology; in our case we use “domain_specific_skills_and_competences”. Once we have detected the spot where to create the parent and the leaf, we insert them and rewrite the entire updated ontology.

The API endpoint that handles this call is “/addNewSkill”. Besides its effect on the ontology, the addNewSkill method also updates the rating table associated with each user, as in the following code:

User user = userService.findUserByUserName(binding.getUsername());
List<Rating> userRating = ratingService.findUserRating(user);
boolean wasFound = false;
if (userRating.size() > 0) {
for (Rating r : userRating) {
if (r.getParent().equals(binding.getParent())) {
r.setCount(r.getCount() + 1);
wasFound = true;
ratingService.save(r);
}
}
}
if (!wasFound) {
Rating rating = new Rating();
rating.setUser(user);
rating.setParent(binding.getParent());
rating.setCount(1l);
ratingService.save(rating);
}

We obtain the user resources and retrieve the information from the database through the RatingService class. We analyze the ratings assigned to the user: if the user has already defined some parents, we check whether the same parent already exists and increment the counter associated with it. In this way the list of possible parents will be presented to the user in decreasing order, starting from the most probable. If the user has not defined any rating up to this point, we create a new entry and set its count to 1.

User Management Component

A secondary functionality of our system is user management. We provide login, registration and resume saving on our platform, manipulating the model entities in order to obtain the desired results. The classes responsible for implementing methods such as create user, forgot password, get info and change personal details are ApiResource and UserResource.

Figure 6.10 ApiResource and UserResource Classes

We are going to have a close look at the createUser method, which implements the response to the POST HTTP request addressed to the path "create".

@POST
@Path(value = "create")
@Produces(MediaType.APPLICATION_JSON)
public Response createUser(UserBinding binding) throws UserAlreadyExistsException {
User existUser = userService.findUserByEmail(binding.getEmail());
if(null != existUser) {
LOG.error("Bad Request! User already exists ");
return Response.status(400).build();
}
existUser = userService.findUserByUserName(binding.getUsername());
if(null != existUser) {
LOG.error("Bad Request! User already exists ");
return Response.status(400).build();
}
User user = new User();
user.setUsername(binding.getUsername());
user.setFullname(binding.getFullname());
user.setEmail(binding.getEmail());
user.setPhone(binding.getPhone());
user.setAddress(binding.getAddress());
user.setAccountEnabled(true);
user.setDateOfbirth(binding.getDateOfBirth());
user.grantRole("USER");
String salt = BCrypt.gensalt();
user.setSalt(salt);
String password = binding.getPassword();
user.setPassword(BCrypt.hashpw(password, salt));
binding.setPassword(password);
userService.save(user);
return Response.ok().build();
}

Using the UserService we communicate with the database. We first check whether a user with the same e-mail or username already exists; if so, we return a response with status 400 and the error “Bad Request! User already exists”. We then populate the fields of the new user; for the password we use a salted hash, which is discussed in the System Security section. Another method returns the user details in response to a GET HTTP request whose input is the user id:

@GET
@Path(value = "/{id}")
@Produces(MediaType.APPLICATION_JSON)
public UserBinding getUser(@PathParam("id") long id) {
User user = userService.find(id);
UserBinding userBinding = new UserBinding();
userBinding.setId(user.getId());
userBinding.setUsername(user.getUsername());
userBinding.setFullname(user.getFullname());
userBinding.setEmail(user.getEmail());
userBinding.setAddress(user.getAddress());
userBinding.setPhone(user.getPhone());
return userBinding;
}

Again, we communicate with the UserService, which implements the interface between the database model and the business logic.
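
For reference, a sketch of the UserService operations invoked in the snippets above is given below; the method names are the ones actually called, while the exact interface declaration is an assumption made for illustration.

// Sketch of the UserService operations used by ApiResource and UserResource.
// The method names are the ones invoked in the snippets above; the exact
// interface declaration is an assumption made for illustration.
public interface UserService {
    // look up a user by unique e-mail address, or null if none exists
    User findUserByEmail(String email);
    // look up a user by unique username, or null if none exists
    User findUserByUserName(String username);
    // load a user by primary key
    User find(long id);
    // persist a new or updated user
    void save(User user);
}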

System Security

We are manipulating users’ personal details, so security is an important aspect. The security layer is implemented inside the business layer.

From a REST API perspective, the system rejects unauthorized requests: requests must come from a registered user.

//allow anonymous POSTs to login and change password
.antMatchers(HttpMethod.POST, "/api/login").permitAll()
.antMatchers("/api/create").permitAll()
.antMatchers("/api/forgotPass").permitAll()
//allow user GET and POST only user pages
.antMatchers("/api/user/**").hasRole("USER")
.antMatchers("/api/ontology/**").hasRole("USER")
.anyRequest().authenticated().and()

We permit anonymous POST requests only to the login, account creation and forgot-password endpoints; user and ontology API calls require the USER role, and any other request requires authentication.
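
The matchers shown above are only a fragment of the security configuration. Below is a minimal sketch of how they could sit inside a standard Spring Security configuration class; the class name, the disabled CSRF protection and the stateless session policy are assumptions, and the real configuration continues after authenticated() with the token authentication setup.

import org.springframework.context.annotation.Configuration;
import org.springframework.http.HttpMethod;
import org.springframework.security.config.annotation.web.builders.HttpSecurity;
import org.springframework.security.config.annotation.web.configuration.EnableWebSecurity;
import org.springframework.security.config.annotation.web.configuration.WebSecurityConfigurerAdapter;
import org.springframework.security.config.http.SessionCreationPolicy;
// Sketch of how the matcher rules above could sit inside a Spring Security
// configuration class. The class name, the disabled CSRF protection and the
// stateless session policy are assumptions; the matcher rules are the ones listed above.
@Configuration
@EnableWebSecurity
public class SecurityConfig extends WebSecurityConfigurerAdapter {
    @Override
    protected void configure(HttpSecurity http) throws Exception {
        http.csrf().disable()
            // token based authentication, so no HTTP session is kept on the server
            .sessionManagement().sessionCreationPolicy(SessionCreationPolicy.STATELESS).and()
            .authorizeRequests()
            // allow anonymous POSTs to login, account creation and password reset
            .antMatchers(HttpMethod.POST, "/api/login").permitAll()
            .antMatchers("/api/create").permitAll()
            .antMatchers("/api/forgotPass").permitAll()
            // user pages and the ontology API require the USER role
            .antMatchers("/api/user/**").hasRole("USER")
            .antMatchers("/api/ontology/**").hasRole("USER")
            .anyRequest().authenticated();
        // the real configuration continues here with the token authentication filter
    }
}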

From a user authentication perspective we use tokens. The idea is that once a user has proven his identity, the system hands him a token that acts as an authentication code, proving on subsequent requests that he is who he claims to be. This is a different approach from the traditional one of creating a session on the server and returning a cookie.

We exchange tokens to secure the communication with our server. The main attack against this scheme is a man-in-the-middle attack, in which an attacker intercepts the token and extracts or reuses the information it carries.

To protect against this type of attack we use a keyed-hash message authentication code (HMAC). This type of hashing combines a cryptographic hash function with a secret cryptographic key. The cryptographic strength of HMAC depends upon the strength of the hash function and on the size and quality of the key used. We use SHA-256. As an example:

HMAC_SHA256("key", "The quick brown fox jumps over the lazy dog") = 0xf7bc83f430538424b13298e6aa6fb143ef4d59a14946175997479dbc2d1a3cd8.

The token structure consists of:

User information

Hash value

The final token has the following structure: the Base64 encoding of userBytes, followed by a separator (a “.” character), followed by the Base64 encoding of the HMAC value computed over userBytes.

The classes responsible for this are TokenHandler and TokenAuthenticationService. The TokenHandler class is responsible for creating tokens and for parsing the tokens received from users. Below you can find the method responsible for creating the token for a user:

public String createTokenForUser(OntoUserDetails user) {
byte[] userBytes = toJSON(user);
byte[] hash = createHmac(userBytes);
final StringBuilder sb = new StringBuilder(170);
sb.append(toBase64(userBytes));
sb.append(SEPARATOR);
sb.append(toBase64(hash));
return sb.toString();
}

For HMAC we are using the functions from javax.crypto.
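
As an illustration of those javax.crypto calls, the snippet below reproduces the HMAC-SHA256 example given above, using the same Mac and SecretKeySpec classes as the TokenHandler class in the appendix; the key and message are just the example values.

import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;
import java.nio.charset.StandardCharsets;
// Self-contained example of computing an HMAC-SHA256 value with javax.crypto,
// in the same way the TokenHandler class from the appendix does.
// The key "key" and the message are the example values quoted above.
public class HmacExample {
    public static void main(String[] args) throws Exception {
        byte[] secretKey = "key".getBytes(StandardCharsets.UTF_8);
        byte[] message = "The quick brown fox jumps over the lazy dog".getBytes(StandardCharsets.UTF_8);
        Mac hmac = Mac.getInstance("HmacSHA256");
        hmac.init(new SecretKeySpec(secretKey, "HmacSHA256"));
        byte[] hash = hmac.doFinal(message);
        // prints f7bc83f430538424b13298e6aa6fb143ef4d59a14946175997479dbc2d1a3cd8
        System.out.println(DatatypeConverter.printHexBinary(hash).toLowerCase());
    }
}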

From a database perspective we use salting when saving passwords. Salting means combining the password with randomly generated data before it is processed by a cryptographic hash function; in our case we use jBCrypt, a Java implementation of the OpenBSD bcrypt password hashing scheme, which is built on Bruce Schneier’s Blowfish block cipher.

The salting technique is very useful for defending against dictionary attacks and against pre-computed rainbow table attacks.

We chose this implementation because it is straightforward and simple to understand, and because the standard Java library lacks a ready-made password hashing API.
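
For completeness, below is a minimal sketch of the jBCrypt calls involved: gensalt() and hashpw() are the calls used in createUser above, while checkpw() is the matching verification call offered by the same library; the org.mindrot.jbcrypt package name assumes a recent jBCrypt release.

import org.mindrot.jbcrypt.BCrypt;
// Example of salting and hashing a password with jBCrypt and verifying it later.
// gensalt() and hashpw() are the calls used in createUser above; checkpw() is the
// matching verification call offered by the same library.
public class PasswordHashingExample {
    public static void main(String[] args) {
        String password = "s3cret";
        // generate a random salt and hash the password with it
        String salt = BCrypt.gensalt();
        String hashed = BCrypt.hashpw(password, salt);
        // the stored hash already embeds the salt, so checkpw can verify directly
        boolean matches = BCrypt.checkpw(password, hashed);
        System.out.println("password matches: " + matches);
    }
}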

Presentation Layer

The presentation layer is represented by the user interface, which should expose the main functionalities of the system in a usable and friendly way. We implemented the user interface using AngularJS. Every page has a dedicated controller that manages the entire flow and communicates with the API through HTTP requests. One example is the login function from LoginController.js:

$scope.login = function () {
if($scope.username !== undefined && $scope.password !== undefined){
$http.post('ontology/api/login', { username: $scope.username, password: $scope.password }).success(function (result, status, headers) {
$scope.authenticated = true;
TokenStorage.store(headers('X-AUTH-TOKEN'));

// For display purposes only
$scope.token = JSON.parse(atob(TokenStorage.retrieve().split('.')[0]));

UserService.setUser({username: $scope.username, password: $scope.password, authenticated: $scope.authenticated, token: $scope.token});
$scope.isValidInstall = true;
$location.path('/home');
})
.error(function(result, status, headers){
if(status == 401){
$scope.isValidInstall = false;
$scope.modelDescription = 'Wrong username or password';
}
$scope.authenticated = false;

})
}
};

The function stores and uses the token as the authentication measure. It communicates with the API through an HTTP POST to the path “ontology/api/login”. If the authentication process fails, the server returns a 401 error code and the message ‘Wrong username or password’ is displayed. Every page flow has a controller of this kind that drives it.

Testing and Validation

In this chapter we discuss the accuracy of the system and present the metrics used to measure it. We also provide a comparison between the two methods that solve the same problem: the method based on JAPE rules and WordNet, and the method based on multi-word expression patterns and Wikipedia.

Metrics

We first describe, from a theoretical point of view, the metrics we used to test the system.

Our system can be labeled as a retrieval system. The instances are the skills and the task is to return a set of relevant skills from all the input words.

Precision

The first metric we discuss is precision [25]. Precision is widely used in information extraction applications. It represents the number of relevant skills returned by the system divided by the total number of words the system detected; in other words, precision tells us “how useful the results are” [VX]. In our context, a system with high precision returns mostly relevant words that can be considered skills, rather than irrelevant ones that are only ordinary words. Precision is also called positive predictive value. To make it clearer, consider a simple example: a system that detects cars, applied to a scene containing 9 cars and some bicycles. The system reports 7 cars; 4 of them are correct, while the other 3 are actually bicycles, so the precision of the system is 4/7. In our case: if the system detects 30 skills and only 20 of them are relevant, in a resume that contains 40 skills, the precision is 20/30, because 20 is the number of relevant words retrieved and 30 is the number of detected words. From a mathematical point of view, the computation formula for precision is:
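
Expressed with the counts used later in the Experimental Results section (true positives TP and false positives FP), the formula can be written as:

\[ \text{Precision} = \frac{TP}{TP + FP} \]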

Recall

The second metric we computed is recall [25]. Recall is also used in information extraction applications and represents the number of relevant items retrieved by the system divided by the total number of relevant items that exist; in other words, “how complete are the results?” [VX]. In our context, a system with high recall successfully returns all the relevant skills. Recall is also called sensitivity. To make it clearer we reuse the car detection example: in the same scene the precision was 4/7, while the recall was 4/9, because only 4 of the 9 cars present were found. In our case, with the same setup as before, the recall is 20/40, because 20 is the number of relevant skills detected and 40 is the total number of skills present in the resume (the 20 detected plus the 20 that were missed). From a mathematical point of view, the computation formula for recall is:
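
Expressed with the same counts (true positives TP and false negatives FN), the formula can be written as:

\[ \text{Recall} = \frac{TP}{TP + FN} \]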

A relevant graphic that shows very clearly what precision and recall are, and what they are made of, is the following:

Figure 7.1 Precision and recall from [25]

In our case:

true positives (TP) – the word is a skill and the system detected it as a skill;

false negatives (FN) – the word is a skill and the system did not detect it as a skill;

false positives (FP) – the word is not a skill but the system detected it as a skill;

true negatives (TN) – the word is not a skill and the system did not detect it as a skill.

F-Measure and Matthews Correlation Coefficient

Another metric we considered for our system is the F-measure [25]. This metric combines the previous two metrics, precision and recall, through their harmonic mean, and measures the overall effectiveness of retrieval. The mathematical formula for the F-measure is:
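
Expressed in terms of the two previous metrics, the formula can be written as:

\[ F = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} \]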

The Matthews Correlation Coefficient (MCC) is the next metric we take into account. It is used in machine learning for measuring the quality of binary classifications and is computed from the true positives, true negatives, false positives and false negatives. The MCC is a correlation between the observed and the predicted values [25]. The mathematical formula for the MCC is:
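
Expressed with the four counts (TP, TN, FP, FN), and matching the computation used in the Experimental Results section, the formula can be written as:

\[ \text{MCC} = \frac{TP \cdot TN - FP \cdot FN}{\sqrt{(TP+FP)\,(TP+FN)\,(TN+FP)\,(TN+FN)}} \]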

System validity

Before presenting the results extracted from the resumes, we had to make sure that our system fulfills the requirements.

The system should detect new skills that are not in the ontology and, based on the WordNet descriptions, should find the possible parents under which each skill could be added. In order to test this functionality we had to analyze the responses from WordNet and check whether the parents proposed by the system are the right ones. After the system returns the possible parents, it should add the skill to the ontology in the right place. To illustrate these features we analyze the following example: the user inputs the text “I know to script”; the system detects “script”, a word that is not in our ontology. The word is categorized as a possible skill and WordNet is queried for its description. The response from WordNet can be seen below:

Figure 7.2 WordNet response

The system receives these responses from WordNet and proposes to the user the following possible parents:

Figure 7.3 System Possible Parents List

From the WordNet response we can easily make the analogy:

composition “dramatic composition”;

writing “handwriting”

performance “used in preparing for a performance”

prepare “preparing”

We can see that the system successfully extracted the possible parent skills from the WordNet representation. Next we had to check that the newly detected skill is added to the ontology in the right place. Considering that in the previous example we picked “writing” as the parent of “script”, below you can see how the ontology was updated.

Figure 7.4 Before script insertion Figure 7.5 After script insertion

As the figures show, the system successfully inserted the word in the right place.
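
The same flow can be reproduced programmatically with the service calls listed in the appendix. The sketch below assumes, as in the /addNewSkill flow described earlier, that the chosen parent “writing” is created under the “domain_specific_skills_and_components” head together with the new leaf.

import java.util.Set;
import ontology.service.OntologyService;
// Sketch of the validation scenario above, using the calls from the appendix:
// query WordNet for the definitions of a candidate skill, extract the possible
// parents from those definitions, then insert the skill under the chosen parent.
public class ScriptInsertionExample {
    public static void main(String[] args) throws Exception {
        OntologyService service = OntologyService.getInstance();
        // definitions returned by WordNet for the noun "script"
        String definitions = service.executeWordNet("script");
        // possible parents extracted from the definitions by the JAPE pipeline
        Set parents = service.findParents(definitions);
        System.out.println("Possible parents: " + parents);
        // the user picked "writing": create it under the head category and
        // attach "script" as its leaf, as the /addNewSkill flow does
        service.getOntologyManager().addSkillWithParent("script", "writing",
                "domain_specific_skills_and_components");
    }
}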

Experimental Results

Testing and tracking the system proved to be a difficult task, because we could not automate the testing process. We had to compute manually the true positives, false negatives, false positives and true negatives, along with the word counts.

The purpose was to see what performance our system achieves and how it adapts to natural language input.

Below is an example of how we tracked the progress of the resumes tested.

Figure 7.6 Experimental results table

In the above example we tracked the following fields:

CV – represents the CV id from the list of 6000 CVs extracted using the crawler;

TP – represents the true positives that the system detected during each individual test;

FN – represents the false negatives that the system detected during each individual test;

FP – represents the false positives that the system detected during each individual test;

TN – represents the true negatives the system detected during each individual test;

Total Words – represents the total number of words the resume had;

Skill – represents the total number of skills the resume had;

Non-skill – represents the total number of non-skill words, computed as Total Words – Skill;

Precision – was computed as: TP/(TP+FP);

Recall – was computed as: TP/(TP+FN);

F-Measure – was computed as: 2*(Precision*Recall)/(Precision+Recall);

MCC – was computed as:

((TP*TN)-(FP*FN))/sqrt((TP+FP)*(TP+FN)*(TN+FP)*(TN+FN));
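
For clarity, the four computations listed above can be restated as a small standalone helper; this is only a compact restatement of the formulas, not a component of the delivered system.

// Compact restatement of the metric computations listed above;
// it is a standalone sketch, not a component of the delivered system.
public class RetrievalMetrics {
    public static double precision(long tp, long fp) {
        return (double) tp / (tp + fp);
    }
    public static double recall(long tp, long fn) {
        return (double) tp / (tp + fn);
    }
    public static double fMeasure(long tp, long fp, long fn) {
        double p = precision(tp, fp);
        double r = recall(tp, fn);
        return 2 * (p * r) / (p + r);
    }
    public static double mcc(long tp, long tn, long fp, long fn) {
        double numerator = (double) tp * tn - (double) fp * fn;
        double denominator = Math.sqrt((double) (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn));
        return numerator / denominator;
    }
    public static void main(String[] args) {
        // the toy example from the Precision and Recall sections:
        // 20 relevant skills detected, 10 extra words flagged, 20 skills missed
        long tp = 20, fp = 10, fn = 20;
        System.out.printf("precision=%.2f recall=%.2f f-measure=%.2f%n",
                precision(tp, fp), recall(tp, fn), fMeasure(tp, fp, fn));
    }
}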

We first ran our system on resumes that contain only skills that are also in the ontology. By performing this test we expected to see a high value of recall, because almost all the skills should be discovered.

We ran the tests on 500 resumes that contain only skills extracted from the ontology; the results are shown below:

Figure 7.7 Experimental Results on Ontology based resumes

As the above graphic shows, we can see, as expected, a large increase in the recall metric. This increase occurs because only skills that are in the ontology are used, and based on our defined rules we should discover almost all of them. The precision is only medium, because the system also detected other words that are not necessarily skills, although this is partly a subjective aspect of our interpretation.

Below you can see an individual graphic for each metric we computed: precision, recall, F-measure and MCC.

Figure 7.8 Precision experimental results for Ontology based resumes

The above graphic shows the individual precision values for each resume. The spikes appear where the system also detected some words that are not skills.

Figure 7.9 Recall experimental results for Ontology based resumes

The above graphic shows the individual recall values for each resume. As expected, the values are high, because these skills are easily recognized.

Figure 7.10 F-Measure for Ontology based resumes Figure 7.11 MCC for Ontology based CVs

The second series of tests was run on resumes extracted by the crawler. The crawler extracted resumes from two different sites; these resumes are written in completely natural language, without any prerequisites or other modifications. The first graphic we present is an average over all 500 resumes on which we ran the tests. As expected, we did not obtain the same results as in the case where all the skills came from the ontology and obeyed the predefined rules. The results were still quite good, with averages above 0.5, proving the approach to be a reliable solution.

Figure 7.12 Average results for random resumes

We can still see a high value for recall, which demonstrates that the system is able to detect skills that are not in the ontology, to check them against WordNet and to update the ontology. The precision is also high, which is very good for our system. The F-measure tells us that our system is both robust (it does not miss many skills) and precise (it detects mostly useful data). As in the case of the ontology-based resumes, we present each metric individually.

Figure 7.13 Precision For 500 Random CVs

We can see that our system proved to be precise both in the case of ontology-based resumes and in the case of random CVs.

Figure 7.14 Recall for 500 Random CVs

The high value of recall proves that the system is capable of detecting skills and of adding new skills to the ontology. Based on these two metrics we compute the next two: F-measure and MCC.

Figure 7.15 F-Measure/MCC for 500 Random CVs

Methods comparison

We are going to present a comparison between the two methods of implementation:

Method based on JAPE rules and WordNet

Method based on multi-word expression patterns and Wikipedia

We compared the precision, recall, f-measure and MCC between the two methods.

Figure 7.16 Precision Comparison

Figure 7.17 Recall Comparison

Figure 7.18 F-Measure Comparison

Figure 7.19 MCC Comparison

The set of information retrieval metrics presented above leads us to the following remarks.

The method based on multi-word expressions and Wikipedia performs slightly better than the method based on JAPE rules and WordNet. The main factor that gives the first method an advantage is that its set of patterns is extended based on a corpus of training data, whereas in the JAPE-based method the rules do not evolve: the rules that were designed initially do not expand further. Another reason the first method comes out ahead is Wikipedia, which contains more domain-specific terms than WordNet; we tested only on a dedicated domain, software development, which involves many specialized terms.

The observations above are confirmed on the whole data set as well. We computed an average over all the resumes tested; the result is shown below:

Figure 7.20 Average Comparison

Even though the multi-word method proved to be better, the method based on JAPE rules also produced very good results in all tests.

In conclusion, we have created two methods that solve the skill detection problem with good results, as shown by the computed metrics.

User’s manual

In this chapter we present the steps required to install the system and explain how to use it.

System installation

The system was designed as a web application, so in order to deploy it you need a dedicated server with the following requirements:

Ubuntu Server

NodeJS installed

Gulp installed

PostgreSQL installed

Java (latest version)

FTP client

Tomcat

In order to install the system, the following steps should be followed:

Set NodeJS

Set Bower

Install Gulp

Install PostgreSQL

Download project

Replace the database settings

Download the libraries using Bower

Build the frontend using Gulp

Build the project using maven

Specify the tomcat files for starting the server.

Run the server

Detailed instructions for running the system locally, either to use it or to develop it further:

We recommend IntelliJ IDEA Ultimate and JDK 1.8. Using pgAdmin or any other PostgreSQL client, create a new database and call it “ontology”; after this step you should have an empty database with this name. The steps are presented in the following figures:

Figure 8.1 Step 1 installation process

Figure 8.2 Step 2 installation process

Figure 8.3 Step 3 installation process

The next step is to set the database username and password in webservice/src/main/resources/config/application.properties.

Next, go via the IDEA terminal to the ui folder and download the libraries with the command: bower install.

Next, in the ui folder run the command: gulp build.

Next, adjust the create, drop and recreate properties in the dbupdater configuration file.

Next, go to the main folder and run the command mvn clean install -P create. You should see the following terminal output:

[INFO] ontology ……………………………………. SUCCESS [ 0.577 s]

[INFO] ontology-lib ………………………………… SUCCESS [ 4.807 s]

[INFO] webservice ………………………………….. SUCCESS [ 17.919 s]

[INFO] dbupdater …………………………………… SUCCESS [ 4.286 s]

[INFO] ui …………………………………………. SUCCESS [ 34.794 s]

[INFO] ––––––––––––––––––––––––

The next step is to add a new Run Configuration in IDEA of type Tomcat Server -> Local and to add the two war files to Tomcat: one from the webservice folder and one from the ui folder. Change the ontology war path from / to /ontology. Run Tomcat and the application should open on localhost.

System usage

The target users of the developed system range from casual end users to developers. The interface was designed to be very easy to use and understand.

Figure 8.4 LogIn/Register Screen

When the system opens, the first screen you encounter is Login/Register; here the user can log in using his credentials or can go to the Register screen by clicking the “Register” button.

Figure 8.5 Register Screen

The Register screen consists of the personal details that will later be used in the CV creation. Each field must be filled in with correct information.

Figure 8.6 User home page

After the user logs in, the main page appears. This screen contains one text input field, designated for the user’s autobiography, and three buttons: one for changing the personal details, one for logout and the most important one, “Process autobiography”, which starts the resume processing.

Figure 8.7 Processed Text Screen

All the skills which exist in the “Skills” ontology are recognized. All the nouns which fit the patterns are also recognized; they are displayed in the “Possible skills” section. The user can select a category for the skill from the dedicated pull-down, or write the category himself. The list of categories is displayed according to the user’s rating (see the next figures): if the user has already inserted some of the categories (parents) before and they appear in the definitions, these categories are displayed at the top of the list of potential parents.

Figure 8.8 Possible Skill Screen

The user also has the opportunity to add a skill to his CV by entering the skill in the field on the right. After clicking the “Add” button, a pull-down appears with the list of categories that were found from the definitions of the word via WordNet.

Figure 8.9 Add New Skill

The user can also add additional information, which will also be included in the CV.

Figure 8.10 Additional information

After all corrections have been made, the user can look at the CV preview by clicking the appropriate button.

Figure 8.11 View CV Screen

By clicking the “Save CV” button, the CV is saved to the computer in PDF format.

Figure 8.12 PDF CV

The case in which some of the user’s personal information has changed is also handled by the application. By clicking the “Change your info” button on the user home page, the “Account Information” page appears, where all the fields are editable. After making the changes, the user clicks the “Change” button and the personal information is updated.

Figure 8.13 Change Personal Details

The “Change the password” functionality (next figure) is available from the “Login or Register” page by clicking the “Forgot Password” button.

Figure 8.14 Change Password Screen

As seen above, to change the password successfully it is necessary to enter the username and e-mail of a registered account. If these data are incorrect, the appropriate alerts appear.

Figure 8.15 Password errors

The example presented above processed the following text:

This is an example of user’s autobiography:

My name is Ben Carlisle. I was born on May 30, 1989 in the town of Neston, UK.

June 2005: University of Edinburgh, PhD in Economics. September 1999 – July 2001: University of Edinburgh, Economic Cybernetics, Master’s Degree. September 1995 – June 1999: The University of Derby, Strategic Management, Bachelor degree.

In 1994 I graduated Meru School K.C.S.E. During 1987-1991 I was studying at Ntamichiu Primary School K.C.P.E. In 1994 I graduated Meru School K.C.S.E. I know German and Russian languages. I want to be a programmer. Now I learnt java, HTML, css. I don’t know JavaScript. In nearest time I will learn Python. I am good at football. I rock at tennis. I have experience in AngularJS.

Additional Information: I want to take the position of Java-middle developer. I can start to work from April, 2016.

Conclusions

Contributions and achievements

The experimental prototype presented in this paper is a solution to the general objectives established in the first chapters. Our work addresses the resume creation problem: as stated before, creating a resume is not an easy task, and we tend to make mistakes when trying to express our skills in a precise and elaborate manner.

We have managed to create an experimental prototype that extracts only the necessary information, such as skills and education, from unstructured text. In order to obtain this information we divided the problem into smaller tasks.

First of all, we had to create a module for skill extraction. This module is based on the JAPE rules that we defined and on the ontology, which is used to detect the skills already present in it. We also had to take care of the skills that should not appear in the CV, so we created rules for detecting and eliminating those skills too.

The second task was to create a mechanism for obtaining information about a new skill that is not in the ontology. We need this mechanism to accomplish one of the secondary tasks: updating the ontology with new skills. In order to update the ontology we need to know where each skill should be inserted, that is, its parent in the tree structure. We use WordNet to find the skill’s definition and, based on that definition, we find the parent to which the skill is assigned. WordNet is a large online lexical database that provides concepts and their connections to other concepts.

The third component is the ontology update component, which uses the results of the first two steps to insert a skill in the right place: it traverses the ontology and adds the skill.

The last component is the testing process. We had to create a crawler for extracting resumes and letters of application from two sites. This crawler automatically extracts the resumes and provides them to us in a usable form. Based on these CVs we enlarged the JAPE rule set for detecting new skills. For testing the system we computed the following information extraction metrics: precision, recall, F-measure and MCC.

Results

To test the system we computed several information extraction metrics. These metrics show that the system performs reasonably well, with small imperfections. We obtained a precision of 52%, which means that our system detected the necessary information but also detected some information that should not be in the resumes. Recall, which represents the completeness of the system, is 53%, meaning that the system detected on average 53% of the skills in a CV.

As a conclusion, the results are favorable; the drawbacks come from the fixed number of JAPE rules, which makes the system skip some important information.

Further Development

A system that offers great flexibility often pays for it in accuracy, and this is also our case; however, there are several improvements that can make our system considerably better.

Enlarging the rule set is the most important step in improving the system’s performance. With more rules the system will be able to detect more skills, so the performance will be better.

We are considering creating an intelligent component that can detect new constructions and automatically generate JAPE rules from them. In the same way, an intelligent component could detect the context the user is talking about. By implementing this, we would be able to use more than one ontology in the same project and thus enlarge the knowledge base of the system: we could use domain ontologies and build a management system for them.

Another feature that can be added is a set of rules for detecting other kinds of skills, such as hobbies, or rules for detecting the projects the user worked on.

The most important feature that should be added is an admin area for companies. Here the companies interested in candidates could log in and search for specific skills. In this way we would also serve the HR companies that are in constant search of new employees; we would add value to their services by filtering the candidates and providing them with exact and precise resumes, saving them time and resources. A subscription mechanism for companies could be implemented, through which they receive updates each time a new resume is uploaded. For the user, a notification mechanism could be implemented for when somebody searches for candidates with his abilities, and another for when somebody views his CV.

Another feature that could be implemented is an evaluation form through which the system can determine the level of each user ability. In this way a more precise resume could be generated.

Regarding execution time, we can improve it by parallelizing some actions, such as the API calls, the database access and the ontology access. Regarding the dictionaries, more than one could be used instead of WordNet alone. The reason we consider it important to use more than one is that in some situations WordNet does not return any result and our system cannot find any possible parent; by using several dictionaries we can find at least one description.

In conclusion, the system presented in this paper is a proof of concept of the proposed method. The project should be regarded as a Minimum Viable Product that satisfies the basic requirements while offering a usable solution. The performance is decent and we succeeded in covering all the initial requirements and objectives.

Bibliography

Amy Gallo. “How to Write a Resume That Stands Out”, 19.12.2014. [ONLINE] Available at: https://hbr.org/2014/12/how-to-write-a-resume-that-stands-out

UK Government. “The Employers Skill Survey follows up the 1999”. [ONLINE] Available at: www.skillsbase.dfes.gov.uk

Haifa Al-Buainain. 2015. [ONLINE] Available at: http://faculty.qu.edu.qa/drhaifa

Helena Pozniak. “Get your CV in tip-top shape”. In The Telegraph, 2014. [ONLINE] Available at: http://www.telegraph.co.uk/sponsored/finance/your-bank/11230744/write-perfect-cv.html

Mildred Talabi. “How to grab an employer’s attention in 30 seconds”. [ONLINE] Available at: http://www.theguardian.com/careers/careers-blog/cv-advice-grab-employers-attention

Sherwyn P. Morreale, Michael M. Osborn and Judy C. Pearson. “Why Communication is Important: A Rationale for the Centrality of the Study of Communication”. In Journal of the Association for Communication Administration, Volume 29, 2000.

Fernando Gutierrez, Dejing Dou, Stephen Fickas, Daya Wimalasuriya and Hui Zong, "A Hybrid Ontology-based Information Extraction System", Journal of Information Science, September 2015.

Darshika N. Koggalahewa and Asoka S. Karunananda, "Ontology Guided Semantic Self Learning Framework", International Journal of Knowledge Engineering, Vol. 1, No. 1, June 2015.

A. Chambolle and T. Pock, "A First-Order Primal-Dual Algorithm for Convex Problems with Applications to Imaging," Journal of Mathematical Imaging and Vision, vol. 40, pp. 120-145, 2011.

WordNet. [ONLINE] Available at: https://wordnet.princeton.edu

Arunasish Sen, Anannya Das, Kuntal Ghosh and Soham Ghosh, "Screener: a System for Extracting Education Related Information From Resumes Using Text Based Information Extraction System", 2012 International Conference on Computer and Software Modeling (ICCSM 2012), IPCSIT vol. 54 (2012), IACSIT Press, Singapore. DOI: 10.7763/IPCSIT.2012.V54.06

Mihaela Dinsoreanu, Cornel Rat, Sorin Suciu, Tudor Cioara, Ionut Anghel, Ioan Dragan, Livia Ardelean, Ion Carja – Hierarchical Data Models for the ArhiNet Ontology Representation and Processing – Automation, Computers, Applied Mathematics Journal, Volume 18, Number 2, 2009

Meng Zhao, Faizan Javed, Ferosh Jacob, Matt McNair, "SKILL: A System for Skill Identification and Normalization".,5550-A Peachtree Parkway, Norcross, GA 30092, USA

Screener 2

Christina Feilmayr, Klaudija Vojinovic and Birgit Proll. “Design a Multi-Dimensional Space for Hybrid Information Extraction”. In BRIDGE Bruckenschlagprogramm-2, September 2012.

D.E. Appelt and D.J. Israel, "Introduction to Information Extraction", AI Communications, Vol.12, No.3, pp.161-172, IOS Press,Amsterdam, The Netherlands, 1999.

Sumit Maheshwari, Abhishek Sainani, P Krishna Reddy, “An Approach to Extract Special Skills to Improve the Performance of Resume Selection”, 6th International Workshop on Databases in Networked Information System, (DNIS2010).

Kun Yu, Gang Guan, Ming Zhou, "Resume Information Extraction with Cascaded Hybrid Model".

Liviu Sebastian Matei, Ștefan Trăușan Matu,”Named Entity Recognition”, Universitatea Politehnica din București, Splaiul Independenței 313, 060042, București, Institutul de Cercetări în Inteligența Artificială,Calea 13 Septembrie 13, 050711, București.

T. R. Gruber,”Toward principles for the design of ontologies used for knowledge sharing”,. International Journal of Human-Computer Studies, Vol. 43, Issues 4-5, November 1995, pp. 907-928

Natalya F. Noy and Deborah L. McGuinness. “Ontology Development 101: A Guide to Creating Your First Ontology”.

Hoifung Poon and Pedro Domingos. “Unsupervised Ontology Induction from text”.

Chris Biemann. “Ontology Learning from Text: A Survey of Methods”.

GATE [ONLINE]. Available at: https://gate.ac.uk/sale/tao/splitch6.html#chap:annie

Information Retrieval Metrics. [ONLINE]. Available at: https://en.wikipedia.org/wiki/Precision_and_recall

Skill Ontology (skill.owl).

Appendix 1

Ontology-lib

Ontology Manager

package ontology.service;
import com.hp.hpl.jena.rdf.model.*;
import com.hp.hpl.jena.util.FileManager;
import org.slf4j.LoggerFactory;
import java.io.*;
import java.util.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
public class OntologyManager {
private static final org.slf4j.Logger LOGGER = LoggerFactory.getLogger(OntologyManager.class);
private static volatile OntologyManager instance;
static final String ONTOLOGY_FILE_PATH = String.format("%sskills.owl", ServiceHelper.RESOURCE_PATH);
private Model model;
private OntologyManager() {
model = ModelFactory.createDefaultModel();
}
public static OntologyManager getInstance() {
if (instance == null) {
synchronized (OntologyManager.class) {
if (instance == null) {
instance = new OntologyManager();
}
}
}
return instance;
}
public synchronized Map<String, Set<String>> findParent(Set<String> skills) {
try (InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH)) {
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
model.read(in, null);
Map<String, String> map = getSuperClasses(model);
Set<String> parents = new HashSet<>();
Map<String, Set<String>> skillsWithParent = new HashMap<>();
for (String skill : skills) {
String parent = map.get(skill);
parents.add(parent);
}

for (String parent : parents) {
Set<String> subSkills = new HashSet<>();
for (String skill : skills) {
if (parent.equals(map.get(skill))) {
subSkills.add(skill);
}
}
skillsWithParent.put(parent, subSkills);
}
return skillsWithParent;
} catch (IOException e) {
LOGGER.error(e.toString());
}
return null;
}
public synchronized Map<String, String> findAllSkillParent() {
try (InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH)) {
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
model.read(in, null);
Map<String, String> map = getSuperClasses(model);
return map;
} catch (IOException e) {
LOGGER.error(e.toString());
}
return null;
}
public synchronized Map<String, String> findSkillParent(String skill) {
try (InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH)) {
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
model.read(in, null);
Map<String, String> map = getSuperClasses(model);
Map<String, String> skillsWithParent = new HashMap<>();
String parent = map.get(skill);
skillsWithParent.put(skill, parent);
return skillsWithParent;
} catch (IOException e) {
LOGGER.error(e.toString());
}

return null;
}
private Map<String, String> getSuperClasses(Model model) {
Map<String, String> map = new TreeMap<>();
StmtIterator iter = model.listStatements();
// print out the predicate, subject and object of each statement
while (iter.hasNext()) {
Statement stmt = iter.nextStatement(); // get next statement
com.hp.hpl.jena.rdf.model.Resource subject = stmt.getSubject(); // get the subject
Property predicate = stmt.getPredicate(); // get the predicate
RDFNode object = stmt.getObject(); // get the object
object.getClass().getSuperclass();
if (predicate.toString().contains("subClassOf")) {
String concept = getConceptName(subject.toString());
String parent = getConceptName(object.toString());
map.put(concept, parent);
}
}
return map;
}
private String getConceptName(String concept) {
String regex = "#.*";
Matcher m;
Pattern pattern = Pattern.compile(regex);
m = pattern.matcher(concept);
if (m.find())
return m.group(0).substring(1, m.group(0).length());
return null;
}

public synchronized boolean addSkill(String skill, String parent) throws FileNotFoundException {
InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH);
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
Statement s = null;
Resource r = null;
boolean once = false;
boolean written = false;
String subEntry = "http://www.semanticweb.org/ontologies/skills-ontology#";
String type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
model.read(in, null);
StmtIterator iter = model.listStatements();
while (iter.hasNext()) {
Statement stmt = iter.nextStatement(); // get next statement
Property predicate = stmt.getPredicate(); // get the predicate
RDFNode object = stmt.getObject(); // get the object
object.getClass().getSuperclass();

if (object.toString().equals(subEntry + parent)) {
r = model.createResource(subEntry + skill);
s = model.createStatement(r, predicate, object);
model.add(s);
} else if (s != null && predicate.toString().equals(type) && !once) {
r.addProperty(predicate, object);
once = true;

File f = new File(ONTOLOGY_FILE_PATH);
FileOutputStream outputStream = new FileOutputStream(f);
model.write(outputStream, null);
try {
outputStream.flush();
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
written = true;
}
}
return written;
}
public synchronized void addSkillWithParent(String skill, String parent, String head) throws FileNotFoundException {
InputStream in = FileManager.get().open(ONTOLOGY_FILE_PATH);
if (in == null) {
throw new IllegalArgumentException("File: not found");
}
Statement s = null;
Resource r = null;
Statement s1 = null;
Resource r1 = null;
boolean once = false;
String subEntry = "http://www.semanticweb.org/ontologies/skills-ontology#";
String type = "http://www.w3.org/1999/02/22-rdf-syntax-ns#type";
model.read(in, null);
StmtIterator iter = model.listStatements();
while (iter.hasNext()) {
Statement stmt = iter.nextStatement(); // get next statement
Property predicate = stmt.getPredicate(); // get the predicate
RDFNode object = stmt.getObject(); // get the object
object.getClass().getSuperclass();

if (object.toString().equals(subEntry + head) && !once) {
r = model.createResource(subEntry + parent);
s = model.createStatement(r, predicate, object);
model.add(s);
r1 = model.createResource(subEntry + skill);
s1 = model.createStatement(r1, predicate, r.toString());
model.add(s1);
} else if (s != null && s1!= null && predicate.toString().equals(type) && !once) {
r.addProperty(predicate, object);
r1.addProperty(predicate, object);
once = true;
File f = new File(ONTOLOGY_FILE_PATH);
FileOutputStream outputStream = new FileOutputStream(f);
model.write(outputStream, null);
try {
outputStream.flush();
outputStream.close();
} catch (IOException e) {
e.printStackTrace();
}
}
}
}
}

Ontology Service

package ontology.service;
import edu.smu.tspell.wordnet.Synset;
import edu.smu.tspell.wordnet.SynsetType;
import edu.smu.tspell.wordnet.WordNetDatabase;
import gate.Corpus;
import gate.Document;
import gate.Factory;
import gate.FeatureMap;
import gate.creole.ExecutionException;
import gate.creole.ResourceInstantiationException;
import gate.creole.SerialAnalyserController;
import gate.util.GateException;
import ontology.bindings.ResumeBinding;
import ontology.service.annie.*;
import ontology.service.lang.LanguageIdentifier;
import org.slf4j.LoggerFactory;
import java.util.*;
import java.util.stream.Collectors;
public class OntologyService {
private static final org.slf4j.Logger LOGGER = LoggerFactory.getLogger(OntologyService.class);
private static volatile OntologyService instance;
private SerialAnalyserController annieController;
private OntologyService() throws GateException {
FeatureMap features = Factory.newFeatureMap();
ServiceHelper.initGate();
ServiceHelper.loadAnnie();
ServiceHelper.initWordNet();
annieController = ServiceHelper.initAnnie();
features.clear();
annieController.add(DocumentReset.PR(features));
features.clear();
annieController.add(LanguageIdentifier.PR(features));
features.clear();
annieController.add(Tokenizer.PR(features));
features.clear();
annieController.add(SentenceSplitter.PR(features));
features.clear();
annieController.add(POSTagger.PR(features));
features.clear();
annieController.add(Morpher.PR(features));
features.clear();
annieController.add(DefaultGazetteer.PR(features));
features.put("listPR", annieController.getPRs());
features.put("ontologyFilePath", OntologyManager.ONTOLOGY_FILE_PATH);
annieController.add(Gazetteer.PR(features));
features.clear();
features.put("grammarURL", "resources/NE/main.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/skills.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/nouns.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/do_not.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/would_like_to.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/will.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/negative_modal.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/edu_org.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/degree.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/good_at.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/have_experience.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/i_know.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/modal_verb_skill.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/write_in.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/rock_at.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/the_best_at.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/worked_with.jape");
annieController.add(Transducer.PR(features));
features.clear();
features.put("grammarURL", "jape_rules/education_via_edu_org.jape");
annieController.add(Transducer.PR(features));
features.clear();
annieController.add(AliasMatcher.PR(features));
features.clear();

annieController.add(PronominalCoreference.PR(features));
/* Testing that all PRs have the required parameters to run */
System.out.println("Testing PRs");
annieController.getOffendingPocessingResources();
}
public static OntologyService getInstance() {
if (instance == null) {
synchronized (OntologyService.class) {
if (instance == null) {
try {
instance = new OntologyService();
} catch (GateException e) {
LOGGER.error(e.toString());
throw new RuntimeException(e);
}
}
}
}
return instance;
}
public ResumeBinding processText(String text) throws ResourceInstantiationException, ExecutionException {
Corpus corpus = Factory.newCorpus("myCorpus");
Document doc = Factory.newDocument(text);
corpus.add(doc);
annieController.setCorpus(corpus);
annieController.execute();
ResumeBinding resumeBinding = ServiceHelper.getResume(doc);
Map<String, String> skillsWithParent = getOntologyManager().findAllSkillParent();
for(String s: text.split("\\W+")){
if(skillsWithParent.containsKey(s)){
if(!resumeBinding.getSkills().containsKey(skillsWithParent.get(s))){
Set<String> skill = new HashSet();
skill.add(s);
resumeBinding.getSkills().put(skillsWithParent.get(s), skill);
}
else{
Set<String> skills = resumeBinding.getSkills().get(skillsWithParent.get(s));
skills.add(s);
}
}
}
ResumeBinding nonSkillsRes = ServiceHelper.getNonSkills(doc);
for(String parent: nonSkillsRes.getSkills().keySet()){
for(String skill : nonSkillsRes.getSkills().get(parent)){
resumeBinding.getSkills().get(parent).remove(skill);
}
}
Set<String> nonModelSkills = ServiceHelper.getNonModelSkills(ServiceHelper.getPossibleSkills(doc));
Map<String, Set<String>> nonModelSkillsWithParents = new HashMap<>();
List<String> toRemove = new ArrayList<>();
for(String nonModelSkill: nonModelSkills){
for(String parent: resumeBinding.getSkills().keySet()){
toRemove.addAll(resumeBinding.getSkills().get(parent).stream().filter(skill -> skill.equalsIgnoreCase(nonModelSkill)).map(skill -> nonModelSkill).collect(Collectors.toList()));
}
}
nonModelSkills.removeAll(toRemove);
for(String nonModelSkill: nonModelSkills){
Set potentialParents = findParents(executeWordNet(nonModelSkill));
nonModelSkillsWithParents.put(nonModelSkill, potentialParents);
}
resumeBinding.setPossibleSkills(nonModelSkillsWithParents);
//Clean up
Factory.deleteResource(doc);
Factory.deleteResource(corpus);
return resumeBinding;
}
public String executeWordNet(String skill){
StringBuilder stringBuilder = new StringBuilder();
WordNetDatabase database = WordNetDatabase.getFileInstance();
Synset[] synsets = database.getSynsets(skill, SynsetType.NOUN);
if (synsets.length > 0) {
for (Synset synset : synsets) {
stringBuilder.append(synset.getDefinition()).append('\n');
}
}
return stringBuilder.toString();
}
public Set findParents(String definitionSkills) throws ExecutionException, ResourceInstantiationException {
Set set = new HashSet<>();
if(definitionSkills.length() > 0){
Corpus corpus = Factory.newCorpus("myCorpus");
Document doc = Factory.newDocument(definitionSkills);
corpus.add(doc);
annieController.setCorpus(corpus);
annieController.execute();
set = ServiceHelper.getParents(doc);
//Clean up
Factory.deleteResource(doc);
Factory.deleteResource(corpus);
}
return set;
}
public OntologyManager getOntologyManager() {
return OntologyManager.getInstance();
}
public void destroy() {
synchronized (OntologyService.class) {
Factory.deleteResource(annieController);
annieController = null;
instance = null;
}
}
}

TokenHandler

package ontology.config.security;
import com.fasterxml.jackson.core.JsonProcessingException;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import javax.xml.bind.DatatypeConverter;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.security.InvalidKeyException;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;
import java.util.Date;
/**
* @author Neste Vlad
*/
public class TokenHandler {
private static final String HMAC_ALGO = "HmacSHA256";
private static final String SEPARATOR = ".";
private static final String SEPARATOR_SPLITTER = "\\.";
private static final Logger LOG = LoggerFactory.getLogger(TokenHandler.class);
private final Mac hmac;
public TokenHandler(byte[] secretKey) {
try {
hmac = Mac.getInstance(HMAC_ALGO);
hmac.init(new SecretKeySpec(secretKey, HMAC_ALGO));
} catch (NoSuchAlgorithmException | InvalidKeyException e) {
throw new IllegalStateException("failed to initialize HMAC: " + e.getMessage(), e);
}
}
public OntoUserDetails parseUserFromToken(String token) {
final String[] parts = token.split(SEPARATOR_SPLITTER);
if (parts.length != 2 || parts[0].length() <= 0 || parts[1].length() <= 0) {
LOG.error("Token parts are not valid");
return null;
}
try {
final byte[] userBytes = fromBase64(parts[0]);
final byte[] hash = fromBase64(parts[1]);
byte[] hmac1 = createHmac(userBytes);
boolean hashValid = Arrays.equals(hmac1, hash);
if (!hashValid) {
LOG.error("Hash is not valid");
return null;
}
final OntoUserDetails user = fromJSON(userBytes);
if (new Date().getTime() < user.getExpires()) {
return user;
}
} catch (IllegalArgumentException e) {
LOG.error("Illegal argument", e);
}
return null;
}
public String createTokenForUser(OntoUserDetails user) {
byte[] userBytes = toJSON(user);
byte[] hash = createHmac(userBytes);
final StringBuilder sb = new StringBuilder(170);
sb.append(toBase64(userBytes));
sb.append(SEPARATOR);
sb.append(toBase64(hash));
return sb.toString();
}
private OntoUserDetails fromJSON(final byte[] userBytes) {
try {
return new ObjectMapper().readValue(new ByteArrayInputStream(userBytes), OntoUserDetails.class);
} catch (IOException e) {
throw new IllegalStateException(e);
}
}
private byte[] toJSON(OntoUserDetails user) {
try {
return new ObjectMapper().writeValueAsBytes(user);
} catch (JsonProcessingException e) {
throw new IllegalStateException(e);
}
}
private String toBase64(byte[] content) {
return DatatypeConverter.printBase64Binary(content);
}
private byte[] fromBase64(String content) {
return DatatypeConverter.parseBase64Binary(content);
}
// synchronized to guard internal hmac object
private synchronized byte[] createHmac(byte[] content) {
return hmac.doFinal(content);
}
}
