BABEȘ-BOLYAI UNIVERSITY
FACULTY OF MATHEMATICS AND COMPUTER SCIENCE
Learning Web Content Extraction from
DOM
BSc Thesis
Student: [anonimizat]
Supervisors: Asist. dr. Vlad-Sebastian Ionescu
Prof. dr. Gabriela Czibula
2018
UNIVERSITATEA BABEȘ-BOLYAI CLUJ-NAPOCA
FACULTATEA DE MATEMATICĂ ȘI INFORMATICĂ
Învățare Automată a Extracției de Conținut Web din DOM
Lucrare de Diplomă
Student: [anonimizat]
Coordonatori Științifici: Asist. dr. Vlad-Sebastian Ionescu
Prof. dr. Gabriela Czibula
2018
Nichita Uțiu: Learning Web Content Extraction from DOM
Contents

1 Introduction
2 Content Extraction and Machine Learning in Literature
  2.1 AI and Machine Learning
    2.1.1 History
    2.1.2 Machine Learning
    2.1.3 Artificial Neural Networks
  2.2 Content Extraction
    2.2.1 Introduction to Web Content Extraction
    2.2.2 Motivation
    2.2.3 Existing Methods
  2.3 Overview
3 Proposed Machine Learning Approach to Content Extraction
  3.1 Method
    3.1.1 Features
    3.1.2 Models
  3.2 Experiments
    3.2.1 Datasets
    3.2.2 Experimental Design
  3.3 Results
  3.4 Conclusions and Potential Improvements
4 Web Content Extraction Application
  4.1 Usecases
    4.1.1 Users
    4.1.2 Admins
  4.2 System Overview
    4.2.1 Frontend
    4.2.2 Backend
    4.2.3 SQL and Redis
    4.2.4 LearnHtml Library
  4.3 System Design
    4.3.1 Interactions
    4.3.2 Data Model
  4.4 Technologies Used and Implementation
    4.4.1 React
    4.4.2 Django REST Framework
    4.4.3 Scikit-Learn
    4.4.4 Tensorflow and Keras
  4.5 User Manual
    4.5.1 Web
    4.5.2 CLI
5 Conclusions and Future Work
Bibliography
Chapter 1
Introduction
In this thesis, we approach the problem of web content extraction. Web content
extraction is a process which aims to take web documents and extract the semantically
significant portions of them. This portion of a page is referred to as content. The topic
is a longstanding problem in computer science, with works dating back to the
90s [28]. However, as some authors note, improvements in performance are still
sought after to this day [33].
We aim to present our original approach employing information from the DOM tree,
which achieves state-of-the-art results on the Dragnet [45] dataset. Based on this method,
we implement a simple web application for classifying web content. The rest of this
thesis is structured as follows:
Chapter 2 consists of a gentle introduction to the field of artificial intelligence (AI) and
machine learning (ML). We proceed to present the problem of content extraction and the
various related works which attempt to solve it. We primarily focus on the more recent
ones, which make use of machine learning techniques and hold the current state-of-the-art
results.
We follow up with Chapter 3 which consists of a detailed presentation of our method.
We begin by describing the process we use and the rationale behind its design. We then
put forth a set of experiments meant to assess the performance of different ML models
on two well-known datasets. Finally, we show the numerical results of our experiments
and compare them with existing methods. We note that we test on the Cleaneval [3]
and Dragnet [45] datasets in terms of F1 scores. With our best performing models, we
obtain competitive results on the former and outperform the state of the art on the
latter with a score of 0.96.
Building on the results obtained in the previous chapter, in Chapter 4 we propose
a web application for content extraction using this approach. First, we present
the application from a specification and design standpoint. Afterwards, we discuss
implementation details and technologies used to develop the end product.
Finally, in Chapter 5 we summarize the content of this thesis. We discuss the
implications of both its theoretical and practical aspects whilst putting forward potential
improvements.
The original theoretical aspects and results presented in Chapter 3 are part of the
following paper:
N. Uțiu and V.-S. Ionescu, “Learning Web Content Extraction with DOM
Features”, 2018 IEEE 14th International Conference on Intelligent Computer
Communication and Processing (ICCP 2018), Cluj-Napoca, Romania, September
6–8, 2018 (submitted)
Acknowledgement
We gratefully acknowledge the support of NVIDIA Corporation with the donation
of the GTX Titan X GPU used for this research.
Chapter 2
Content Extraction and Machine
Learning in Literature
Web content extraction is the process that aims to separate the content of interest of
webpages (e.g., article text, comments, etc.) from semantically irrelevant elements such
as decorations, navigation elements and others [33].
It is a topic that has been present in literature at least since the 90s [28].
Over the years, it has remained an active domain of interest, as the corpus of web
content has grown significantly. This growth has brought an influx of non-machine-readable
information online, information which content extraction seeks to tap into. With the advance
of machine learning models and the increased availability of better hardware, web content
extraction has begun making use of such methods [14].
In the following sections we present a short introduction to both topics and their
relationship.
2.1 AI and Machine Learning
Artificial Intelligence (AI) is too broad a topic to be thoroughly discussed in this section;
however, we briefly present the historical context and some general aspects and
particularities that concern our work. Most of this section is inspired by Russell and
Norvig’s seminal work Artificial Intelligence: A Modern Approach [48].
2.1.1 History
AI has a very long history, even prior to its inception as a field of computer science.
It is an idea that has been popularized through philosophy since the 17th century. Some
pioneering proponents of the idea of “artificial thoughts” were figures such as Leibniz,
Descartes and Hobbes.
The three were adherents of the theory of rationalism that began emerging contemporaneously
with advancements in mathematics [48, p. 6], [36, p. 36–42]. Hobbes, in particular,
argued that all living organisms can be reduced to a series of reproducible processes [26,
ch. 5]. These theories remained mostly based on the formal logic of mathematics until the
20th century.
Before the Second World War, one of the most prominent figures in the field, Alan
Turing, published several articles on the theory of computation, including [13],
demonstrating the computability of any problem by a sufficiently capable automaton. This
kind of automaton would later be named the Turing Machine.
This proof represented a big turning point in the field, as it raised the possibility
that the human mind could one day be abstracted this way. The end of the Second
World War, however, marked one of the biggest leaps in AI theory. In 1956, the
conference of the same name took place at Dartmouth College. This event is considered to be
the birth of AI as a branch of computer science [36, ch. 5]. Here the term Artificial
Intelligence was officially chosen for the field. This came several years after Alan Turing
famously proposed a formal test for determining the intelligence of a machine, the
Turing Test [60].
Despite several lulls during the second half of the 20th century, artificial intelligence
has remained a very active field of research and a common theme of human
fascination.
2.1.2 Machine Learning
Machine Learning (ML) is a field of computer science which has historically been
associated with AI, yet AI is not necessarily synonymous with ML. ML deals with algorithms
capable of “learning” and making predictions from given data. The first mention of
the term appears in Samuel’s work from 1959, in which the author implements a checkers-playing
algorithm [49]. However, one of the most notable definitions of the process of
machine learning is that of Mitchell in 1997 [39, p. 2]:
A computer program is said to learn from experience E with respect to some
class of tasks T and performance measure P , if its performance at tasks in T,
as measured by P , improves with experience E.
Machine learning can refer to a large class of problems, but we are mainly concerned
with algorithms which learn from samples. Russell and Norvig offer a
classification of such problems [48, p. 694–696]:
• Supervised — This is the case in which agents learn from samples to map a function
from input to output. Based on the nature of the codomain of the function,
supervised learning can be further divided into:
  – Regression — If the function is a continuous one (e.g., price prediction).
  – Classification — When the codomain is discrete (i.e., a set of classes; see Figure 2.1).
• Unsupervised — In this case the agent learns to recognize patterns in data without
any prior associations given.
• Reinforcement — This consists of tasks such as games, where a certain reward is
given based on the action output by the agent. The agent is responsible for choosing
the next action based on these rewards.
Figure 2.1: An example of a classification task¹. Given the set of samples with two classes (circles and triangles), a
classification task would involve finding a separation boundary such as L1, L2 or L3.
¹Credits: Diagram by user DJGulp3 of commons.wikimedia.org, released into the public domain
Even though we could go into detail about each type of machine learning task
enumerated above, we focus only on classification. Our entire method, described in Section 3.1,
consists of such a task; therefore, in the following paragraphs we go over some
popular ML classification models.
Logistic Regression
Logistic regression is a famous statistical model, in use since 1958 [66]. It is a simple
model that learns the coefficients of a linear combination, which is then passed
through the sigmoid function (see Equation (2.1)). The aim is to maximize the log-likelihood
between this learned distribution and the distribution of probabilities inferred
from the training samples.

σ(x) = 1 / (1 + e^(−x))    (2.1)

Given a sample of the data X ∈ ℝⁿ, a vector of coefficients θ ∈ ℝⁿ and a bias term
β ∈ ℝ, the probability associated to the sample X is

p(X | θ) = 1 / (1 + e^(−(β + θ·X)))    (2.2)
Its simplicity and low training cost often make it a good default
for many ML applications. Its main downside, though, is that the linear class boundary
it generates cannot approximate more complex functions such as XOR [22, p. 170–172].
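As a quick illustration of Equations (2.1) and (2.2), the decision function can be written in a few lines of NumPy. The coefficient values below are arbitrary placeholders rather than learned parameters, which in practice would be fit by maximizing the log-likelihood.

```python
import numpy as np

def sigmoid(x):
    # Equation (2.1): squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

def predict_proba(X, theta, beta):
    # Equation (2.2): probability that sample X belongs to the positive class
    return sigmoid(beta + X @ theta)

# Toy example with arbitrary (not learned) coefficients
theta = np.array([0.5, -0.25])
beta = 0.1
X = np.array([1.0, 2.0])
p = predict_proba(X, theta, beta)  # a probability strictly between 0 and 1
```

Classification then amounts to thresholding this probability, typically at 0.5, which is exactly where the linear class boundary of the model lies.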
SVM
Support vector machines (SVMs) [15] are a class of machine learning algorithms very
similar to logistic regression in the sense that they, too, aim to learn a linear boundary
between the classes. One of the main differences, however, is that SVMs usually first
map the data to a higher-dimensional space before finding said boundary (as seen in
Figure 2.2).
Figure 2.2: Example of the Gaussian kernel trick used together with an SVM classifier¹. We can see how the non-linear
class boundary on the left actually corresponds to a linear boundary in the higher-dimensional space on the right.
¹Credits: Diagram by user Zirguezi of commons.wikimedia.org, distributed under a CC BY-SA 4.0 license
The other difference is the fact that SVMs optimize an entirely different loss
function than the log-likelihood of logistic regression. What SVMs try to achieve
is to maximize the margin. This margin is the distance between the two dotted
parallel hyperplanes from Figure 2.2. These two hyperplanes themselves are described
by the following equations:

w · x − b = 1
w · x − b = −1    (2.3)

In Equation (2.3), w and b are the learnable parameters of the model, while x is
a vector corresponding to a sample of data. What we try to achieve is finding
the values of these parameters for which all samples of one class are on one side
of one hyperplane and the other class is on the opposite side of the other hyperplane.
Moreover, we try to do this while finding a w (the normal vector of the hyperplanes)
of minimum norm (to maximize the margin).
The above example, also called hard-margin separation, is only practical in the case
of linearly separable classes. SVMs have a more general formulation based on the hinge loss,
which is better suited in most cases but is beyond the scope of this section.
To conclude, we also mention that the mapping applied to the samples can vary.
If no mapping is applied, the SVM is commonly referred to as a linear SVM.
Several other common mappings exist, the most famous being the Gaussian kernel.
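To make the geometry concrete, a small NumPy sketch shows how the decision rule and the margin width follow from Equation (2.3). The hyperplane here is hand-picked for illustration, not trained; the standard result that the distance between the two bounding hyperplanes equals 2/||w|| is what motivates minimizing the norm of w.

```python
import numpy as np

# Illustrative (not learned) separating hyperplane w·x - b = 0
w = np.array([2.0, 0.0])
b = 0.0

def classify(x):
    # The sign of w·x - b decides the class; the two dotted hyperplanes
    # of Equation (2.3) sit at w·x - b = +1 and w·x - b = -1
    return 1 if w @ x - b >= 0 else -1

# The margin between the two hyperplanes is 2 / ||w||,
# which is why minimizing ||w|| maximizes the margin
margin = 2.0 / np.linalg.norm(w)
```

With w = (2, 0) the margin comes out to 1.0; halving the norm of w would double it, at the cost of making the constraints in Equation (2.3) harder to satisfy.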
Decision Trees/Random Forests
Decision trees are a non-parametric model which is often used for classification, yet
usable for regression as well [39, p. 52–78]. As we can see in Figure 2.3, they behave as
a binary flowchart, with each non-leaf node acting as a predicate dependent on one
variable of the input.
Figure 2.3: Classical example of a decision tree used to predict the survival of passengers on the RMS Titanic.
Evaluating a sample consists of starting from the root node and traversing the tree until reaching a leaf.
Unfortunately, the biggest limitation of this model in practice is its tendency to
overfit, as noted in [47, p. 587]. In response, random forests were first developed in
1995 [25] to address this issue. Although several improvements have been
made to the model since then, the usage of decision trees as weak classifiers remains
the same.
Random forests are an ensemble method, meaning they output a prediction by
aggregating those of a set of weak learners (in this case decision trees). Even though
implementations vary, a textbook example which we can present is tree bagging. In this
configuration, a set of decision trees is trained on several subsamples, drawn with replacement,
of the training data. Predictions are made based on the average prediction of the trees.
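The tree-bagging scheme described above can be sketched in plain Python. Real implementations train full decision trees on each bootstrap sample; here we substitute trivial one-feature threshold stumps to keep the example self-contained, and aggregate with a majority vote, which for binary labels is equivalent to thresholding the average prediction.

```python
import random

def bootstrap(data, rng):
    # Subsample with replacement, same size as the original training set
    return [rng.choice(data) for _ in data]

def train_stump(sample):
    # Stand-in for a decision tree: threshold the single feature
    # at the midpoint between the two class means
    mean0 = sum(x for x, y in sample if y == 0) / max(1, sum(1 for _, y in sample if y == 0))
    mean1 = sum(x for x, y in sample if y == 1) / max(1, sum(1 for _, y in sample if y == 1))
    threshold = (mean0 + mean1) / 2
    return lambda x: 1 if x >= threshold else 0

rng = random.Random(0)
data = [(0.1, 0), (0.2, 0), (0.3, 0), (0.8, 1), (0.9, 1), (1.0, 1)]
forest = [train_stump(bootstrap(data, rng)) for _ in range(25)]

def predict(x):
    # Aggregate: majority vote over the weak learners
    votes = sum(tree(x) for tree in forest)
    return 1 if votes > len(forest) / 2 else 0
```

Each stump alone is a poor classifier, but averaging over many stumps trained on different resamples reduces the variance of the ensemble, which is precisely the effect bagging exploits against overfitting.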
(a) Perceptron¹ (b) Multilayer Perceptron¹
Figure 2.4: Basic architecture of a multilayer perceptron (MLP). On the left we can see the basic unit of an MLP, the
Perceptron. The perceptron in and of itself models a linear combination fed through an activation function; however,
when more are coupled together we obtain an MLP (each node on the right represents a perceptron). This is the most
basic example of an ANN.
¹Credits: Diagrams by user MartinThoma of commons.wikimedia.org, distributed under a CC0 license
Artificial Neural Networks (ANNs) are a parametric ML model very commonly used
in present literature. Their popularity stems from their ability to easily represent
complex functions [22, p. 5], something which models such as linear regression and SVMs
struggle with. ANNs are frequently referred to as deep learning in literature, the name
deriving from their use of function composition.
In Figure 2.4 we can see an example of the most basic ANN architecture, a multilayer
perceptron (MLP). Its most basic unit, the perceptron, acts as a simple logistic regression
if the activation function φ is a sigmoid function. Every cell of the MLP
consists of a perceptron, and thus the entire network acts as a deep composition of
linear combinations and activation functions. The biggest advantage of this kind of
architecture is its large representation space. Additionally, if the activation function
is easily differentiable (e.g., sigmoid, ReLU), the model is readily trainable through gradient
methods [46].
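The "deep composition of linear combinations and activation functions" can be made concrete with a NumPy forward pass. The layer sizes and the random, untrained weights below are purely illustrative; training would adjust them via the gradient methods mentioned above.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(42)

# Two hidden layers of illustrative sizes; weights are random, not trained
W1, b1 = rng.normal(size=(4, 8)), np.zeros(8)
W2, b2 = rng.normal(size=(8, 3)), np.zeros(3)
W3, b3 = rng.normal(size=(3, 1)), np.zeros(1)

def forward(x):
    # Deep composition of linear combinations and activation functions
    h1 = relu(x @ W1 + b1)
    h2 = relu(h1 @ W2 + b2)
    return sigmoid(h2 @ W3 + b3)  # output in (0, 1), usable as a class probability

out = forward(np.ones(4))
```

Note how each hidden layer is exactly a batch of perceptrons: a matrix of linear combinations followed by an elementwise activation.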
2.2 Content Extraction
We dedicate this section to elaborating on what content extraction is and what methods
are present in literature. We mainly focus on the works which are most relevant to
the field of content extraction at present, and those which most heavily influenced our
approach.
2.2.1 Introduction to Web Content Extraction
As stated in the introduction of this chapter, content extraction is a topic in
literature that has enjoyed popularity since the 90s. However, recent reviews such as
Kreuzer’s [33] highlight that performance improvements are still sought after. Inspired
by such reviews, we explore the problem in the following passages.
Even with recent standards such as HTML5 [64] trying to standardize web content
and introduce tags with semantic weight (e.g., article, aside, content, etc.), we can
see that there still is interest in the topic. We can therefore presume that the problem
has not yet been trivialized from an algorithmic standpoint.
For example, in Figure 2.5 we see what the target content of the
extraction process might look like. From a human point of view, we can agree that such
a task seems fairly intuitive; however, the openness of the problem makes us suspect it
is not easily solvable by a machine.
Figure 2.5: In this figure we see an example of web content extraction from wikipedia.org. The content which is
considered semantically relevant is highlighted with darker grey regions.
In the problem of content extraction, the task can take many forms. It ranges from
extraction of the textual content of pages (as seen in [3]) to finding a visually cohesive
segmentation of the page [8] (also called web page segmentation or region
extraction). In this thesis, we focus on the former, and more specifically on classifying
HTML tags to isolate those which contain relevant textual content.
Another point worth mentioning is that approaches differ in the nature of
the information they use to extract data. From the reviews presented in [33], [53], we
identify three main classes of information:
• DOM tree — These approaches take into account the overall structure of the DOM
tree and its tags.
• Visual — Methods falling into this category make use of the visual structure of
pages, usually inferred from HTML.
• Textual — These methods usually employ natural language processing (NLP) on the
textual content of the webpage.
2.2.2 Motivation
One reason for the difficulty of the problem is the constantly changing nature of
web content. Employment statistics highlight a trend of rising popularity in the field
of frontend web development [6]. This, in turn, indicates an active interest in web
development and, consequently, constant change in the content available.
This also means that it is difficult to propose a web content extraction method that
can withstand the test of time. A proof of this volatility is the decline in popularity of
visual methods. For example, many early-2000s methods such as the VIPS algorithm [8]
tried to extract a visual model of a page based on hints from HTML. Such a strong
assumption might not be reflective of modern web standards, as “visually-oriented”
tags such as hr (used in the VIPS paper) have been repurposed or deprecated by the
HTML5 standard [64].
Another factor adding to the difficulty is the fact that datasets have not yet adapted
to the current standards of the web. We can see this with one of the most popular
datasets used in literature: Cleaneval [3]. Although widely popular, it does not include
any information about its pages (e.g., CSS, JavaScript) other than pure HTML.
From a practical perspective, web content extraction has a few obvious applications.
These include search engine optimization, web crawling augmentation
and text corpus extraction, to name a few. In general, any task that can benefit from
extracting information from the web can benefit from content extraction.
2.2.3 Existing Methods
There are multiple approaches to web content extraction present in literature. As
noted in Section 2.2.1, one classification is based on the type of information used for
extraction (i.e., DOM, textual or visual). In the following paragraphs, we enumerate a few
works from the field with this taxonomy in mind.
Visual methods such as the VIPS algorithm [8] and its derivatives [35] try to deduce a
visual tree structure of pages based on HTML cues. One limitation of such methods is
that they make strong assumptions about the structure of webpages, such as expecting
<hr> tags to be reflective of the visual structure of the page (see Section 2.2.2). Similarly,
Burget and Rudolfova [7] build upon the VIPS algorithm and add CSS-derived features.
More recent approaches, such as [32], [42], [45], [63], [67], tend to use textual and
DOM information. Weninger, Hsu, and Han [67], for example, base their approach on
the density of tags in the text content of a page on a line-by-line basis. Their method
performs well, while also trying to remain agnostic about the semantic content of these
lines. Pasternack and Roth [42] also use a heuristic method based on DOM tags, while
Wu, Li, Hu, et al. [69] derive information from root-to-node paths.
In [32] we can see a textual model that avoids making strong assumptions about the
semantics of the analyzed text by using shallow text features. The authors employ a quantitative
linguistic approach, but also take into account HTML-specific metrics such as link
density. These kinds of features were later adopted by other well-performing papers such
as [63]. Works such as [45], [70] extract text token information, but from HTML attributes
such as id and class rather than the actual content, arguing that developer-written
hints such as these convey semantic information about their corresponding tags.
Many of these works model the problem as a classification one using machine
learning. SVMs are a commonly used model [32], [70], as is linear regression [45]. Decision
trees are also used in some papers [32], while others even leverage deep learning models
in their pipelines [63].
As far as our method is concerned, we can classify it as one using DOM and, to a
lesser extent, textual information, to which we apply a machine learning approach
for the classification. This makes our method similar to [63] and [45], from which we
draw most of our inspiration.
2.3 Overview
In the previous sections, we presented general aspects of machine learning and provided
a short introduction to web content extraction and its present state in literature.
We mentioned current approaches in the field and how they integrate machine learning
into their methodology.
Taking these aspects into consideration, we can see the utility of tackling the problem
both from an academic and a practical standpoint. Moreover, we presented a background
motivating several of the design decisions presented in Chapter 3.
Chapter 3
Proposed Machine Learning Approach
to Content Extraction
In this chapter we introduce our novel approach to web content extraction using DOM
and textual features together with machine learning. We design a few experiments
to highlight our method’s performance and compare the results to those of existing
methods.
3.1 Method
We approach the content extraction problem as a binary classification problem on
the HTML tags of the documents. The two classes represent whether or not a tag contains
text that is regarded as content by the creators of the datasets. These tags correspond
to what is referred to as text blocks in literature [33].
3.1.1 Features
For each tag of an HTML document we have a corresponding sample (data point).
A sample contains the information extracted from its corresponding node, its ancestors
and aggregate information from its descendants (see Figure 3.1).
Figure 3.1: Feature extraction from a node (ul). Typical DOM tree structure (white: the node corresponding to the
current data point; lined: ancestors of that node; checkered: descendants of that node). The feature classes on the right
are computed using information from the corresponding patterned nodes on the left.
Node features are a set of 7 numerical features extracted from the node, containing
information about: depth, position among siblings, number of children, text length,
class attribute length, number of classes and id attribute length, plus 1 categorical feature
containing the tag type. The text lengths and number of classes provide information
similar to the shallow text features [32] used in related works.
Ancestor features are identical to the node features, but are extracted from the
ancestors of the node. For our experiments, we use up to 5 ancestors. In case the node
has a depth of less than that, we pad with the uppermost ancestor. This adds another
35 numerical features and 5 categorical ones (7 numerical and 1 categorical from each
ancestor). As other works point out, the root-to-node path can convey meaningful
information [61], [69]. Through our method we include a truncated version of it in the
form of the ancestors’ tag types. We also append the ancestor features in a manner similar to
Vogels, Ganea, and Eickhoff [63].
Descendant features consist of aggregate values computed from the node features of
the nodes in the subtree of the current one. The node features (with the exception of depth
and position among siblings) are added for nodes on every level of the subtree and
averaged. These values, together with the number of nodes on each level, are extracted
for 5 levels of descendants. If there are fewer levels, the values are padded with the
lowermost layer extracted, as before. This results in another 25 numeric features plus
another 5 sparse numeric ones describing the frequency of each type of tag on every level.
This encoding is similar to the unnormalized tag ratios pioneered by CETR [67], although
at a block, not line, level.
Textual features are extracted from the class and id attributes of a node. Some
authors argue that these attributes are usually used as human-readable descriptors [45], [70],
so they contain relevant information. We perform a heuristic splitting of the text into
tokens on punctuation and internal capitalization, as we observed that many samples fit
this pattern (e.g., NavBar, nav-bar, navbar). We then use these tokens with a tf-idf
representation. The length of the n-grams, whether they are word or character level, and
the use of idf have been left as tunable hyperparameters of the model.
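A possible regular-expression approximation of this splitting heuristic is shown below; the exact rules in our implementation may differ, but the idea is to split first on punctuation and then on internal capitalization so that the common naming styles collapse to the same tokens.

```python
import re

def tokenize(attr_value):
    # First split on anything that is not a letter or digit (punctuation,
    # whitespace), then split each part on internal capitalization so that
    # NavBar and nav-bar both yield the tokens ["nav", "bar"]
    parts = re.split(r"[^0-9A-Za-z]+", attr_value)
    tokens = []
    for part in parts:
        tokens.extend(re.findall(r"[A-Z]?[a-z]+|[A-Z]+(?![a-z])|[0-9]+", part))
    return [t.lower() for t in tokens if t]
```

Note that a fully lowercase run such as navbar stays a single token, since there is no punctuation or capitalization boundary to split on; the tf-idf representation is then built over these tokens.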
Based on their nature, we group features into two subsets: the node, ancestor and
descendant features form what we call numeric features, while the textual features are used
as their own subset. We measure the performance impact of combining these groups. For
a detailed overview of the structure of a data point, see Table 3.2.
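As an illustration, the node features can be computed with a short tree traversal. The sketch below uses Python's standard library on a well-formed XHTML snippet (our actual implementation may parse real-world HTML differently), and the feature names mirror the identifiers of Table 3.2.

```python
import xml.etree.ElementTree as ET

def node_features(node, depth, pos_sib):
    cls = node.get("class", "")
    return {
        "depth": depth,
        "pos_sib": pos_sib,                        # position among siblings
        "nb_child": len(node),                     # number of children
        "text_len": len("".join(node.itertext())), # length of the inner text
        "cls_len": len(cls),                       # class attribute length
        "nb_cls": len(cls.split()),                # number of classes
        "id_len": len(node.get("id", "")),         # id attribute length
        "tag": node.tag,                           # the categorical feature
    }

def walk(node, depth=0, pos_sib=0):
    # Yield one feature dictionary per tag, in document order
    yield node_features(node, depth, pos_sib)
    for i, child in enumerate(node):
        yield from walk(child, depth + 1, i)

html = '<div id="main" class="content wide"><ul><li>a</li><li>bb</li></ul></div>'
features = list(walk(ET.fromstring(html)))
```

The ancestor and descendant features are then built from these same per-node dictionaries, by looking up to 5 levels upward and aggregating up to 5 levels downward, as described above.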
3.1.2 Models
We use several models for the classification task so we can compare their respective
performances. In terms of linear models we use a simple logistic regression and a linear
SVM, as both have been used before in literature [33], [45].
Because they are also employed by other authors [7], [32] and are known to provide
good baselines for most machine learning tasks, we also include decision trees and
random forests in our experiments.
Finally, we consider an artificial neural network model. As the experimental section
shows, similarly to random forests, this model is able to perform well in the absence of
thorough feature engineering [24].
In terms of hyperparameters, we tune a select few for each model, chosen according to
what we considered might have an impact on classification performance. Furthermore, we
also tune parameters of the feature extraction phase, such as the use of idf and the n-gram
size. More importantly, for every classifier we can specify whether it should use normal
or balanced cost, to compensate for the class imbalance [30].
With respect to preprocessing, before classification we perform a feature selection
step based on the χ² test scores. We do this to try to improve the performance of models
such as the ANN and the tree-based ones [31]. The threshold percentage is left as a
tunable parameter of the model as well.
Finally, we also apply maximum absolute value scaling to the features, as they
are a combination of continuous and one-hot values with differing scales
(e.g., {0, 1} for one-hot features and [0, 1] for the frequency of tags in descendants).
We choose this type of scaling because it does not center the values. Although centering
is usually performed with other types of scaling, avoiding it preserves the sparsity of our
highly dimensional data, reducing memory consumption. For a full overview of the
pipeline, see Table 3.1 and Figure 3.2.
Figure 3.2: Data flow of the machine learning pipeline. Input enters as raw HTML and we output the content
extracted from the corresponding page. The steps correspond to those presented in Table 3.1.
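To make the scaling step concrete, here is a minimal NumPy sketch of maximum absolute value scaling. The actual pipeline uses a library implementation; this only illustrates why zero entries, and therefore sparsity, are preserved.

```python
import numpy as np

def max_abs_scale(X):
    # Divide each column by its maximum absolute value; unlike
    # standardization, this never shifts values, so zero entries
    # (the bulk of a sparse one-hot matrix) stay exactly zero
    scale = np.abs(X).max(axis=0)
    scale[scale == 0.0] = 1.0  # avoid division by zero for all-zero columns
    return X / scale

X = np.array([[0.0, 10.0],
              [1.0, -20.0],
              [0.0, 5.0]])
Xs = max_abs_scale(X)
```

Every column ends up in the range [-1, 1], while a centered scaler such as standardization would have turned the zeros of the first column into non-zero values and destroyed the sparse structure.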
Step                        Description                                       Parameter                    Possible values

Numeric feature extraction  Extracts the node, ancestor and                   ancestors                    5
                            descendant features.                              descendants                  5
Text feature extraction     Given the text from class and id,                 ngram size                   {1, 3}
                            extract the tf/tf-idf features.                   ngram level                  {character, word}
                                                                              use tf-idf?                  {yes, no}
Feature selection           Reduce the number of features, selecting          % of top features to keep    [10, 100]
                            only the top based on the χ² test.
Feature scaling             Maximum absolute value scaling.
Classification              Parameters for all classifiers                    cost weight                  {normal, balanced}
  Logistic regression                                                         regularization penalty type  {L1, L2}
                                                                              regularization factor (C)    [10^-1, 10^4]
  Linear SVM                                                                  regularization factor (C)    [10^-1, 10^4]
  Decision tree                                                               maximum features             {log2, sqrt}
  Random forest                                                               maximum features             {log2, sqrt}
  ANN                                                                         activation                   {relu, selu, sigmoid, tanh}
                                                                              optimizer                    {adagrad, adam, rmsprop}
                                                                              dropout                      [0, 0.4]
                                                                              architecture                 {1000, 1000-500, 1000-500-100, 1000-500-100-100}

Table 3.1: Table showing all the steps in our pipeline, their respective parameters and the values used for tuning.
Node features:
    depth: depth of the node from root
    pos_sib: position of the node among its siblings
    nb_child: number of children
    text_len: length of the inner text
    cls_len: length of the class attribute text
    nb_cls: number of classes (number of space-delimited words in class)
    id_len: length of the id attribute text
    tag_<type>: one-hot encoding of the type of the node
Ancestor features (<anc_nb> represents the height of the ancestor above the node corresponding to the data point):
    anc<anc_nb>_depth: depth of the ancestor from root
    anc<anc_nb>_pos_sib: position of the ancestor among its siblings
    anc<anc_nb>_nb_child: number of children
    anc<anc_nb>_text_len: length of the inner text
    anc<anc_nb>_cls_len: length of the class attribute text
    anc<anc_nb>_nb_cls: number of classes
    anc<anc_nb>_id_len: length of the id attribute text
    anc<anc_nb>_tag_<type>: one-hot encoding of the type of the ancestor
Descendant features (<desc_nb> represents the depth, within the node's subtree, of the layer from which the information is extracted):
    desc<desc_nb>_total_nodes: total number of nodes on the layer
    desc<desc_nb>_avg_nb_child: average number of children
    desc<desc_nb>_avg_text_len: average length of the inner text
    desc<desc_nb>_avg_cls_len: average length of the class attribute text
    desc<desc_nb>_avg_nb_cls: average number of classes
    desc<desc_nb>_avg_id_len: average length of the id attribute text
    desc<desc_nb>_freq_tag_<type>: frequency of tag <type> among the nodes
Text features (tf-idf information; depending on the hyperparameters of the pipeline, the terms can be either n-grams or words, and whether idf is used is also left as a hyperparameter):
    id_freq_<term>: frequency of term <term> in the id attribute
    cls_freq_<term>: frequency of term <term> in the class attribute
Table 3.2: All the features of a data point and their identifiers. After converting the categorical and frequency features to a dense representation, the total number of features per sample is 4541 for the Cleaneval dataset and 6787 for Dragnet.
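To make the per-node features concrete, here is a small illustrative sketch that computes a few of them with Python's standard-library ElementTree. The sample page and identifiers are invented; this is not the thesis's extraction code:

```python
import xml.etree.ElementTree as ET

HTML = ('<html><body><div class="post main">'
        '<p id="intro">Hello world</p></div></body></html>')
root = ET.fromstring(HTML)
# ElementTree has no parent pointers, so build a child -> parent map
parents = {child: parent for parent in root.iter() for child in parent}

def depth(node):
    """Number of ancestors between the node and the root."""
    d = 0
    while node in parents:
        node, d = parents[node], d + 1
    return d

def node_features(node):
    """A few of the per-node features described above (illustrative)."""
    return {
        "depth": depth(node),
        "pos_sib": list(parents[node]).index(node) if node in parents else 0,
        "nb_child": len(node),
        "text_len": len("".join(node.itertext())),
        "cls_len": len(node.get("class") or ""),
        "nb_cls": len((node.get("class") or "").split()),
        "id_len": len(node.get("id") or ""),
        "tag": node.tag,
    }

features = node_features(root.find(".//p"))
print(features)  # e.g. depth=3, text_len=11, id_len=5, tag='p'
```

Real HTML would of course require a forgiving parser rather than strict XML parsing.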
3.2 Experiments
Our experiments consist of evaluating the performance of our models on both the
Cleaneval and Dragnet datasets. We test all possible combinations of models and feature
classes and compare them in terms of F1 score. This way we can assess the impact of both
the model and the feature subset on the overall performance of our method. We use the F1
metric because it compensates for the class imbalance between content and non-content and
because it is very commonly used in the literature [42], [45], [55]–[57], [63], [70]. For
hyperparameter tuning, we employ a randomized search [5].
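As a reminder, F1 is the harmonic mean of precision and recall, which is why a classifier cannot score well by simply predicting the majority (non-content) class. A quick check with scikit-learn on toy labels:

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0, 0, 1]  # 1 = content, 0 = non-content
y_pred = [1, 0, 0, 1, 0, 1, 0, 1]

# precision = 3/4, recall = 3/4  ->  F1 = 2 * (3/4 * 3/4) / (3/4 + 3/4)
print(f1_score(y_true, y_pred))  # 0.75
```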
3.2.1 Datasets
Some works use text blocks [32], [42], [45], [57] as their samples, while others opt
for the more fine-grained approach of treating DOM tree leaves as blocks [63]. For our
methods, we increase the granularity even further by treating every tag as a sample.
Although this adds complexity to the problem, it lets us train our models to identify
content blocks within the entire set of tags, not only within the subset of blocks.
Additionally, we make no assumptions about the structure of content during the feature
extraction step.
Both datasets we use, however, have their gold standards (ground truths) defined as
the blocks of text that are to be extracted from the HTML documents. Because there
is no objective mapping from this text back to the tags, as others note [33], we decided
to use the algorithm based on the Longest Common Subsequence (LCS) employed by
Peters and Lecocq in the implementation of their method [44].
We use this algorithm to identify the text blocks that correspond to the gold standard
and then map them back to HTML tags. This way, even though our mapping is
not objectively accurate, we ensure it is the same as the one used by other authors.
The DOM tree is traversed and the entire textual content of the page is extracted
with the tag mapping preserved. This text and the gold standard are tokenized into
tuples of consecutive words (l_doc and l_gold):

    l_doc = (w_doc,1, w_doc,2, ...)
    l_gold = (w_gold,1, w_gold,2, ...)
    l_content = LCS(l_doc, l_gold)                                    (3.1)
Each block b of text from the HTML document for which more than 0.1 of its tokens
belong to l_content is considered to be content. With this information, labeling the
HTML tags becomes trivial, and in this way we mimic the process used in [44].
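A simplified sketch of this labeling scheme, using Python's difflib as an LCS-style aligner. This approximates the idea rather than reproducing the implementation of [44]:

```python
from difflib import SequenceMatcher

def matched_token_mask(doc_tokens, gold_tokens):
    """Mark document tokens that fall in a longest-common-subsequence-style
    alignment with the gold standard (approximated with SequenceMatcher)."""
    mask = [False] * len(doc_tokens)
    sm = SequenceMatcher(None, doc_tokens, gold_tokens, autojunk=False)
    for block in sm.get_matching_blocks():
        for i in range(block.a, block.a + block.size):
            mask[i] = True
    return mask

def label_blocks(blocks, gold_tokens, threshold=0.1):
    """Label each block (a list of tokens) as content if more than
    `threshold` of its tokens align with the gold standard."""
    doc_tokens = [tok for block in blocks for tok in block]
    mask = matched_token_mask(doc_tokens, gold_tokens)
    labels, pos = [], 0
    for block in blocks:
        hits = sum(mask[pos:pos + len(block)])
        labels.append(len(block) > 0 and hits / len(block) > threshold)
        pos += len(block)
    return labels

blocks = [["breaking", "news", "story"], ["click", "here", "to", "subscribe"]]
gold = ["breaking", "news", "story"]
print(label_blocks(blocks, gold))  # [True, False]
```

With per-block labels in hand, the tags producing each block inherit the label.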
We use an existing method from another paper's code because it has the advantage
of ensuring comparability of the results. Even if our representation differs, we ensure
there is a bijective relationship between our data points and theirs. Also, in the Cleaneval
dataset, for example, the authors included HTML tag hints (albeit limited to h1, p and li)
to aid machine learning efforts [3]. Our process can be seen as an extension of that
initial intention, adding features from the DOM.
Dataset      Samples    Blocks    Content blocks    Sample imbalance    Block imbalance
Cleaneval    323439     43556     29719             1:9.88              1:0.46
Dragnet      1085498    126254    51199             1:20.20             1:1.46
Table 3.3: Number of samples compared to how many of them correspond to text blocks [45] and to content blocks. The
last two columns show the class imbalance ratio when using the entire dataset as opposed to just the text blocks.
Moreover, we can also store which tags directly correspond to text blocks. With
this information we can filter and use only these tags in our first experiment, to ensure
consistency with the experiments in [45] and [63]. By omitting the filtering step we
evaluate the performance on the entire set of HTML tags.
Due to encoding issues and invalid HTML structure, we remove a few pages from
each dataset before performing the conversion (26/714 from Cleaneval and 4/1381 from
Dragnet). We do this out of convenience, as our method cannot accommodate the
malformed HTML present in some of the pages. Other authors also perform this step
due to these shortcomings of the datasets [45].
3.2.2 Experimental Design
For our experiments, we evaluate the performance in terms of the F1 measure for every
combination of model and feature set. Testing is done on a hold-out group of 10% of
the total pages in each dataset (71/715 for Cleaneval and 138/1378 for Dragnet). In
order to prevent overfitting during the model selection phase, we perform a nested
cross-validation [9], using a 5-fold cross-validation over the pages for the internal loop.
However, due to the high number of splits and tunable parameters, an exhaustive
grid search would be too computationally expensive, so we opt for a randomized
search [5] of 100 iterations. Not only is this less expensive, but it also allows us to better
explore continuous hyperparameters such as the regularization factor for linear models
and the dropout rate for ANNs (see Table 3.1 for an overview of all hyperparameters).
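The combination of an inner randomized search and an outer evaluation loop can be sketched with scikit-learn. The dataset is synthetic and the iteration counts are scaled down for illustration; the parameter ranges follow the logistic regression entries of Table 3.1:

```python
from scipy.stats import loguniform
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# synthetic stand-in for the block features
X, y = make_classification(n_samples=300, n_features=20, random_state=0)

param_distributions = {
    "C": loguniform(1e-1, 1e4),          # regularization factor
    "penalty": ["l1", "l2"],             # regularization penalty type
    "class_weight": [None, "balanced"],  # cost weight
}

search = RandomizedSearchCV(
    LogisticRegression(solver="liblinear", max_iter=1000),
    param_distributions,
    n_iter=20,      # far fewer candidates than an exhaustive grid
    cv=5,           # inner 5-fold loop for model selection
    scoring="f1",
    random_state=0,
)

# outer loop: an unbiased estimate of the tuned model's F1 score
outer_scores = cross_val_score(search, X, y, cv=3, scoring="f1")
print(outer_scores.mean())
```

Sampling `C` log-uniformly is what makes the randomized search explore the continuous range more evenly than a coarse grid would.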
For the first experiment we use only the samples that correspond to text
blocks (as extracted when converting the dataset). The full datasets are significantly
larger and have a much greater class imbalance (see Section 3.2.1), which would lead
to greater errors without compensation [30]. In addition, we run a second
experiment to measure the performance on the set of all HTML tags.
A 5-fold cross-validation in the internal model selection amounts to 500 training runs,
which would be too computationally expensive in some cases. Therefore, when training
the deep learning models and when using the entire set of tags, we only use a 10%
hold-out validation set to select the models and perform 50 iterations of randomized
parameter search. This reduces computation cost at the expense of worse
parameter tuning and potentially more pessimistic performance estimates in the
external performance evaluation [4], [9].
3.3 Results
Table 3.4 shows the results in terms of the F1 measure, comparing our different
model/feature combinations for both experiments.
For the first experiment (using only text blocks), we note that the best performing model
on the Cleaneval dataset offers comparable, albeit lower, performance than the state of
the art [63]. However, the best performing models surpass the state of the art by a large
margin on the Dragnet dataset [45], [70].
For the Cleaneval benchmark, performances are similar between the different models.
We notice a slight increase with the addition of textual features. The best performing
models remain, however, Web2Text [63] and that of Peters and Lecocq [45].
For Dragnet we see a noticeable advantage of decision trees, random forests and
ANNs over logistic regression and the SVM. Similarly to the Cleaneval run, textual
features provide only a slight benefit across the board. Regardless, all tested models
perform better than the current state of the art when using either only numeric features
or both numeric and textual ones.
We notice that, for the second experiment, our models perform slightly worse on
Cleaneval. On all three combinations of feature sets, logistic regression outperforms
all other models. There is no significant increase when adding textual features, and text
features alone perform noticeably worse than in the first experiment. Performance is
also lower on the Dragnet dataset, though some models continue to outperform the
state of the art.
Dataset               Features    LR      SVM     DT      RF      ANN     W2T†    DO‡
Cleaneval (blocks)    Num         84.58   82.68   82.91   83.26   84.31   –       –
                      Text        80.34   79.51   81.07   80.78   79.57   –       –
                      Num+Text    85.48   85.49   73.37   82.83   86.63   –       –
                      Other       –       –       –       –       –       88.00   90.70
Dragnet (blocks)      Num         90.40   90.11   94.09   96.89*  96.67   –       –
                      Text        72.13   72.05   72.10   72.16   72.07   –       –
                      Num+Text    93.29   91.83   94.64   96.80   96.50   –       –
                      Other       –       –       –       –       –       –       83.60
Cleaneval (all tags)  Num         81.14   76.06   61.54   78.88   80.44   –       –
                      Text        12.90   12.06   6.35    5.97    5.66    –       –
                      Num+Text    80.85   80.40   77.52   79.90   80.26   –       –
Dragnet (all tags)    Num         84.14   77.50   90.17   95.90   92.37   –       –
                      Text        27.92   38.54   29.78   30.04   12.50   –       –
                      Num+Text    90.05   86.10   87.85   95.44   95.63   –       –
Table 3.4: F1 scores (%) for both experiments for different combinations of features and models: logistic regression (LR),
linear SVM (SVM), decision tree (DT), random forest (RF), artificial neural network (ANN).
† Web2Text [63]
‡ Dragnet original [45]
* state-of-the-art result
3.4 Conclusions and Potential Improvements
In this chapter we detailed our original machine learning model for web content
extraction, proposed a series of experiments to evaluate its performance in terms of the
F1 score, and assessed their results. After evaluation, we note that we achieved a
competitive performance of 86.63 on the Cleaneval dataset with our ANN model and a
state-of-the-art result of 96.89 on Dragnet with a random forest classifier.
Additionally, our models performed slightly worse when trained on the entire set
of tags, yet they still outperformed current methods on Dragnet. We can also state
that, overall, ANNs and random forests were the best performing models, with logistic
regression performing the best on the all-tags Cleaneval benchmark.
One potential limitation of our methodology is the use of a single hold-out group
for evaluation. We could improve this by evaluating with a more rigorous cross-validation
method such as K-fold. This way we would get more statistically sound results at the
cost of a potential decrease in average performance.
Chapter 4
Web Content Extraction Application
In the following sections, we discuss the practical applications of the methods
presented above. We implement a web application that lets users apply our trained
models to arbitrary web pages. After applying the model, we present the user with a
visual overview of the classification results on the actual content of the page. On the
server side of the application, admins should be able to train models on a given dataset
and specify the models users can use. These models are saved and used with the pages
submitted by the users.
Based on these specifications, we implement a full-stack web application in Django
REST Framework (DRF) [11] and React [18]. In the following subsections, starting from
our requirements, we design our application in a top-down manner. First, we describe our
desired usecases and intended user-application interactions. Afterwards, based on these
considerations, we design our application from an architectural perspective.
The figures throughout this chapter use notations specific to the UML 2.5
standard [41].
4.1 Usecases
In this section we discuss the intended behaviour of our application. To do so, we
present a diagram of all possible usecases (see Figure 4.1). Considering these usecases,
we decide to split the actors into two classes (i.e., Users and Admins).
Users represent the clients who are able to request the classification of the HTML of a
given website. They do this by submitting a URL which is processed by the server.
The contents are downloaded, preprocessed and classified on the server side without
user intervention. Users should be able to view all the previous classification jobs on
the site. Overall, the process should be as opaque as possible, exposing the least amount
of internal information.
The other class of clients is Admins. These clients are those who have direct access to
the application through a command-line interface (CLI). This kind of access
comes with privileges to manage the models available to the users. They are able to
launch the training of models and persist them. They can then select from the
command-line utility which models will be shown to Users to choose from when queuing
classification jobs.
We implement access permissions through the separation of the user and admin interfaces.
In order to perform model management tasks, users must have direct or ssh access
to the CLI. This mechanism provides security through the fact that the interface is
unavailable to anyone without access to the server's network.
Figure 4.1: Diagram of the designed usecases. The entire application consists of two possible interfaces: a web interface
for regular users queuing classification tasks, and a CLI for the admin to train and manage saved models.
4.1.1 Users
In Figure 4.2 we explore the interactions available to the User actors. As a primary
feature, users should be allowed to classify whatever webpage they desire, provided
they supply a valid URL. They should only be required to input this URL and select
one of the classifiers available to use with the content. The process should be entirely
opaque to them, as the application should take care of steps such as downloading the
HTML content of the page. Furthermore, as this process can be time-consuming (up to
a minute in our experience), users should be informed that their results are pending
in a non-blocking fashion.
The same behaviour should be used when accessing a result page whose classification
task hasn't finished executing. After the task has finished, the user should be able
to see the results. In case the task failed (e.g., the page couldn't be downloaded, a server
error occurred, etc.), the application should label it accordingly.
Figure 4.2: Activity diagram detailing the usecases for the User class of actors presented in Figure 4.1.
However, in case of success, the user will see the content of the downloaded page in
an embedded view. Every piece of content that was extracted by the learning algorithm
will be highlighted within this view.
For the landing page of the website, we opted for a list view of all successfully
finished classification jobs. Clicking on any of these jobs redirects the user to the
corresponding results page. Although this should be the default page, the user
should still be able to opt to view the lists of pending and failed jobs.
Each of these lists will be paged and sorted in descending order of finish
time (unfinished jobs come first and are sorted by their start time). All jobs will be
accompanied by their corresponding URL, and finished jobs will show their end time as
well.
4.1.2 Admins
Although admins have the same rights as users regarding submitting classification
tasks, they can also manage and train models. We decided that regular users should
not have access to such operations, as these are more performance-intensive and,
consequently, represent a security threat that could lead to denial of service. Access to
these actions is provided through a CLI application available on the same machine running
the backend. We can see an overview of these activities in Figure 4.3.
Figure 4.3: Activity diagram detailing the usecases for the Admin class of actors presented in Figure 4.1.
We can clearly see that the model saving usecase includes model training. This separation
is made because we decided that the admins should be able to train a model and
evaluate its performance without necessarily wanting to use that model in production.
Evidently, in order to save the model, they are still required to train it beforehand.
The resulting serialized models can later be uploaded to the server under a name
of the admin's choosing. We designed this separation of training and publishing tasks to
facilitate access to serialized model files. This way, admins can import such models into
external applications without having to implement fetching from the server. This has
the added benefit of easier debugging of models if needed.
4.2 System Overview
Having thoroughly defined the desired behaviour of our application in Section 4.1,
we follow up with the system architecture. We choose a classic full-stack web application
architecture (see Figure 4.4) with a clear separation between frontend and backend [62]. In
this section we discuss the system and the design motivations for each of its components.
These components represent a high-level architectural abstraction of the elements of
our application.
Figure 4.4: High-level component diagram presenting an overview of the system architecture. The backend, frontend,
worker and library are implemented by us, whereas the Redis and SQL servers are third-party service dependencies.
4.2.1 Frontend
The frontend consists of a Javascript web application implemented in React [18].
This application runs in the browser and provides a basic GUI. Communication
with the server is done through AJAX requests in the usual Javascript fashion. For
this purpose, on the server side we expose a RESTful API [19] for clients to interact with.
4.2.2 Backend
In order to futureproof our application against the services it depends on (i.e., Redis
and the SQL server), we decided to design our backend (the web service and workers)
adhering to the 12-factor principles [68]. Apart from the implementation-specific
considerations stated in these guidelines, we expressly follow those stating that we
should treat the components we depend on as services. By doing so, we ensure that our
backend is agnostic of our deployment setup, needing only access to an arbitrary SQL
server and a Redis [50] queue to function properly.
Moreover, employing this platform-as-a-service (PaaS) [34] approach allows us to easily
deploy the application to cloud platforms such as Heroku or AWS in the future. It eases
configuration as well, which in turn makes horizontal scaling easier. Furthermore, by
using the WSGI [29] protocol with our REST service, we can easily change web
servers (e.g., Apache, Nginx, etc.) if need be.
Not needing a specific setup, together with the ease of configuration of this approach,
not only makes deployment faster but also aids us during the development process. If
dedicated servers are not available for SQL or the Redis queue, we can virtualize them
trivially with Docker [38].
Regarding the worker process, we made the conscious decision to have it share
a codebase with the REST service. We do this because these are the components
with the highest cohesion within the system, sharing dependencies and requiring the
same libraries and services. This cohesion does not, however, imply inseparability.
Although implemented in the same codebase, both run in their own processes,
allowing horizontal scaling of both the REST service and the job workers.
4.2.3 SQL and Redis
Through our use of DRF for our web service we are able to use Django's ORM [16].
By leveraging an ORM we are able to build our service without tailoring it to a specific
SQL implementation. Because of this, the SQL Server component in Figure 4.4 can
represent any implementation of the service. In spite of not using any of its specific
features, we opted for PostgreSQL [58] for our local deployment, as it provides a free and
open-source solution for development.
As far as our choice of Redis as a task queue goes, we made the decision for the sake
of simplicity. Worker abstraction solutions such as Celery [54], which can easily
swap messaging protocols, exist within the Python ecosystem. These would allow us
to abstract away the broker service used for enqueuing jobs, but we chose Redis as a
lightweight and fast [23], [50] default for our rather small-scale system.
4.2.4 LearnHtml Library
For future reusability, we opted to implement the ML pipeline for web content
extraction separately from the backend (as seen in Figure 4.4), as a standalone Python
library. This library consists of a Python package implementing the algorithm described
in Chapter 3. The worker code implements classification by using the library's public API.
To facilitate redistribution and later use, we use the setuptools library as a build tool.
This way we can bundle our library into a binary wheel [27] package and install it on our
backend. Moreover, the library becomes trivial to use in future projects.
4.3 System Design
In Section 4.2, we discussed how we structure our system overall, how we separate
the code of each component and what third-party services these depend on. In this
section we delve deeper into the low-level aspects of these components, continuing our
top-down approach. We describe component interactions more thoroughly and discuss
the design of each component in isolation.
4.3.1 Interactions
Figure 4.5: Sequence diagram showing the interaction between components when the user submits a classification
request from the web frontend. This sequence represents one of the more critical scenarios of our application and
illustrates most of the interactions between the REST service and the worker daemon through the SQL and Redis services.
We begin by analyzing the interactions over time between the components presented
in Section 4.2. To do this, we dissect the sequence of events which occurs during the
happy flow of the Queue Classification Result usecase (see Figure 4.1). We still do not
concern ourselves with implementation details, but we do illustrate how our choice
of inter-component communication protocols (i.e., AJAX/REST, SQL, Redis) affects
behaviour behind the scenes.
From this figure, we can clearly see how the Redis queue is used as a lightweight
message passing interface between the web service and the worker daemon. The
in-memory storage used by Redis makes it a good candidate for this kind of task [23].
On the other hand, due to this volatility, we will not use it to pass back our large
classification results. We use SQL as a persistence layer instead, because of its robustness
to failure, as opposed to Redis, which is not designed with persistence in mind [50].
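The message flow just described can be illustrated with in-memory stand-ins for the Redis list and the SQL table. This is a sketch of the pattern only, not the application's code; all names are invented:

```python
import json
from collections import deque

redis_queue = deque()   # stands in for the Redis list used as a task queue
sql_results = {}        # stands in for the SQL persistence layer

def enqueue_classification(job_id, url):
    """Web service side: push a small job message onto the queue.
    The large classification result will NOT travel back through it."""
    redis_queue.append(json.dumps({"job_id": job_id, "url": url}))

def worker_step(classify):
    """Worker side: pop a job, classify, persist the result via SQL."""
    job = json.loads(redis_queue.popleft())
    sql_results[job["job_id"]] = {"url": job["url"],
                                  "content_xpaths": classify(job["url"])}

enqueue_classification(1, "https://example.com")
worker_step(lambda url: ["/html/body/div[1]/p[2]"])
print(sql_results[1]["content_xpaths"])  # ['/html/body/div[1]/p[2]']
```

In the real system the deque is a Redis list shared between processes and the dict is a database table, but the division of labour is the same: small messages through the queue, large results through the persistence layer.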
4.3.2 Data Model
In this subsection we approach another important design aspect of our application.
We present the conceptual model of the data on the backend side of our system (see
Figure 4.6). Because our application is designed with no functionality other
than web page extraction in mind, our entities are kept to a minimum. In the following
paragraphs we present some noteworthy design decisions.
Figure 4.6: Entity relationship diagram illustrating the conceptual data model used for our backend. Crow’s foot
notation is used to represent multiplicity.
First, we opted to keep page downloads in the database, as their content is actively
passed on to the workers and should be sent back to the user. We also keep page download
entities rather than pages, as a page (URL) should be able to be queued for classification
more than once, and between these classifications the content may have changed. This
way, we are essentially keeping timestamped revisions of the page.
Classifiers are defined only by their identifier and the date they were added. All the
relevant data (i.e., the model, its learned parameters and hyperparameters) are stored
under the serialized model field. The storage format remains an implementation
detail we do not concern ourselves with.
Classification jobs and classification results represent the persisted output of a finished
classification task. The job entity holds information about the success status of the task
and its start and end times. Because these tasks essentially amount to binary classification,
we made the conscious decision to only include the XPaths (unique identifiers of tags in
an XML document) [65] of the positive results. We do this because, as we saw in
Section 3.2.1, negative samples greatly outnumber the positives; thus, we remove
redundancy and reduce memory consumption.
4.4 Technologies Used and Implementation
Following up on the model presented in the previous section, we further detail
the technologies used. For each piece of software, we give a short presentation and
elaborate on how and why it is used in our system.
4.4.1 React
React is a Javascript framework developed by Facebook [18]. It is one of the most
active Javascript projects on GitHub; in 2017 it was one of the most forked projects
overall [20]. Some of its core features include its easy-to-use component architecture
and the inclusion of JSX, an extension of Javascript allowing an XML-like syntax to
instantiate said components.
The main reasons why we chose it for our frontend interface are its easy setup, high
modularity and relatively gentle learning curve.
4.4.2 Django REST Framework
Django REST Framework (DRF) [11] is a framework extending the popular Django [16]
framework. DRF builds upon the MVC architecture of Django and adds facilities for
building REST [19] web services. DRF takes advantage of Django's existing features (e.g.,
the ORM, view abstractions, user management, etc.) [17] and adds REST-specific features
such as serializers [12].
In the following passages, we enumerate some of the core concepts and features of
DRF and illustrate how we use them in our application.
ORM
One of the features with the biggest impact on our implementation, and an argument
for choosing DRF as a framework, is Django's Object Relational Mapper (ORM) [2]. The
ORM approach allows us to easily transpose the conceptual model presented in
Figure 4.6 into Python code. The process becomes as easy as defining our entities in code;
the framework then takes care of DB abstraction and migrations for us.
1  from django.db import models
2  from django.core import validators
3
4  class PageDownload(models.Model):
5      url = models.TextField(validators=[validators.URLValidator()],
6                             null=False)
7      content = models.TextField(blank=False, null=True)
8      date_downloaded = models.DateTimeField(auto_now=True, null=True)
Listing 1: Implementation of the PageDownload entity. Using the ORM, we simply define the fields in terms of their
names and abstract types. Typical constraints can be passed as arguments as well.
As we can see in Listing 1, the implementation of the PageDownload entity is trivial.
We can see complex features such as validation and automatic timestamping (implemented
through the auto_now parameter on line 8) which wouldn't have been as straightforward
in pure SQL. Moreover, these definitions provide a strong abstraction between the code
and the different SQL dialects, in which such table definitions would have vastly varying
syntaxes.
Serializers
Another crowning feature of DRF, and the backbone of its request handling process,
are serializers. Serializers serve as an intermediate layer between Django's models and
views. In a traditional MVC sense, they would correspond to the controller. The same
paradigm is used here as well, but the serializers' role is more specialized. Serializers
provide a mapping between the CRUD operations of REST [19] and ORM operations.
They ensure that the transformation between REST payloads and internal objects is done
properly and that CRUD operations preserve the logical consistency of our data.
To illustrate the use of such serializers, see Listing 2. Here, we showcase
two serializers for the same model presented in Listing 1. Through the use of
ModelSerializer we instruct DRF to automatically create the mappings between
ORM model fields and REST payload fields. In addition to this, we see on lines 12 –
15 that we can also implement custom logic using SerializerMethodField. This
mechanism is just one of the many DRF exposes for customizing serializers [12, see
Serializers].
Views
The topmost layer of a DRF application consists of its views. A view class works
with URL parameters and HTTP methods (GET, POST, PUT, etc.). Its job is to hide these
1   from rest_framework import serializers
2   from .models import PageDownload
3
4   class PageDetailSerializer(serializers.ModelSerializer):
5       """Serializer used for the detail view of a page download object."""
6       class Meta:
7           model = PageDownload
8           fields = ('id', 'url', 'content', 'date_downloaded')
9
10  class PageListSerializer(serializers.ModelSerializer):
11      """List serializer for pages. Does not include content."""
12      is_failed = serializers.SerializerMethodField()
13
14      def get_is_failed(self, obj):
15          return obj.content is None
16
17      class Meta:
18          model = PageDownload
19          fields = ('id', 'url', 'is_failed', 'date_downloaded')
Listing 2: Implementation of two serializers corresponding to the PageDownload model presented in Listing 1. Both
serialize the same type of object, yet exhibit different logic and are used in different contexts (a list view and a detail
view, respectively).
low-level details from the serializers, so that they needn't be concerned with anything
other than ORM object instances.
In Listing 3 we observe a practical example from our application related to the
PageDownload resource. By inheriting from ReadOnlyModelViewSet we obtain
readily implemented handlers for the list and detail views.
DRF provides many hooks to override default behaviour. We can see two such
examples here. One is present on line 6 and refers to the queryset field, through which
we specify the set of ORM objects the list and detail views should fetch from. The other
is our custom serializer getter method on lines 8 – 14. This method selects the
desired serializer based on whether a list or a detail (equivalent to retrieve) view is desired.
Conclusion
Based on the simplicity of configuration and customizability of DRF we decided to
use it for our backend implementation. Although our application is small-scale, DRF
provides a framework to extend it in the future if need be.
4.4.3 Scikit-Learn
Scikit-learn (also known as sklearn ) [43] is an open-source machine learning toolkit
implemented in Python. It is a popular and actively-maintained project; at the time of
writing its latest stable release was in late 2017 [52].
 1  from rest_framework import viewsets
 2  from .serializers import PageListSerializer, PageDetailSerializer
 3
 4  class PageViewSet(viewsets.ReadOnlyModelViewSet):
 5      # list all PageDownload objects from DB
 6      queryset = PageDownload.objects.all()
 7
 8      def get_serializer_class(self):
 9          """Conditional serializer based on action."""
10          if self.action == 'list':
11              return PageListSerializer
12          if self.action == 'retrieve':
13              return PageDetailSerializer
14          return PageListSerializer
Listing 3: ViewSet object implementing the interface between URL endpoints and serializers for the PageDownload resource.
In addition to these advantages, it also provides a solid and easy-to-use implementation of many popular classification algorithms, such as logistic regression, decision trees
and random forests [51]. For these reasons, we elected to use it as the main ML
toolkit in our standalone classification library.
Figure 4.7: Class diagram of the core interfaces in sklearn [51]. In this diagram we present a simplified overview of
the functionalities they expose.
Note: These classes are a logical simplification, as formal interfaces are not usual in Python, nor are they
present in the sklearn documentation.
Classifiers and Transformers
Classifiers and transformers are the two main components of our ML pipeline. As
both extend the Estimator interface, they internally hold a set of hyperparameters
which can be manipulated and introspected through the get_params and
set_params methods. This is of utmost importance, as it is the mechanism which
allows easy hyperparameter tuning.
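As a brief illustration (our own sketch, not code from the application; LogisticRegression and its regularization parameter C are only example choices), the hyperparameter API of any sklearn estimator can be exercised like this:

```python
# Illustrative example of the Estimator hyperparameter API described above.
# LogisticRegression and its parameter C stand in for any estimator/parameter.
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression()
params = clf.get_params()   # dict of all hyperparameters of the estimator
clf.set_params(C=0.5)       # modify one hyperparameter in place
assert clf.get_params()['C'] == 0.5
```

This uniform dict-based access is exactly what lets generic tuning utilities manipulate any estimator without knowing its concrete type.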
The fit method is common to both classifiers and transformers. It represents the
training step, which receives a labeled dataset used by prediction models (i.e. regression, classification) and data transformation processes (e.g. PCA, feature selection)
alike. It expects the data to come in a tabular format, with every row of X representing a
sample and y being a list of target values. Many of the models also accept pandas [37]
and numpy [40] tables.
Classifier provides an abstract interface to any kind of ML model used for classification. After training such a model, for example a logistic regression, the API client
has access to the predict method. This method accepts any tabular data with a row
format consistent with the training data. The output is a list of predicted labels
corresponding to each input sample.
Transformers implement a very similar interface, which expects fitting before calling
the transform method. This fitting step can be anything from determining the principal
components of X in the case of PCA to computing correlation coefficients for feature
selection. The result of transform is a matrix with the same number of rows as X. Transformers also implement a convenience function intuitively called fit_transform.
\[
\text{fit}: A^{q \times n} \times B^{q} \mapsto \varnothing \;\Longrightarrow\;
\begin{cases}
\text{transform}: A^{m \times n} \mapsto C^{m \times p} \\
\text{predict}: A^{m \times n} \mapsto B^{m}
\end{cases}
\qquad \text{where } m, n, p, q \in \mathbb{N} \tag{4.1}
\]
In Equation (4.1) we can see, in mathematical notation, what calling fit with a matrix
of type A, of size q×n, and a target variable of type B entails. It implies that transform
will output a transformed matrix of an arbitrary type with a corresponding number of rows,
and that predict will output a vector with the same number of elements as rows in the
input. Moreover, the type of the predicted values will be the same as the one seen during the
fitting process.
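To make this interface concrete, here is a small sketch of ours (with synthetic random data; the model and transformer choices are illustrative) exercising fit/predict on a classifier and fit/transform on a transformer:

```python
# Our illustrative sketch of the fit/predict and fit/transform contracts;
# data is synthetic and LogisticRegression/PCA are example estimators.
import numpy as np
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(100, 4)              # 100 samples, 4 features (tabular format)
y = rng.randint(0, 2, size=100)   # one binary target label per sample

clf = LogisticRegression().fit(X, y)   # training step on the labeled dataset
predictions = clf.predict(X)           # one predicted label per input row
assert predictions.shape == (100,)

pca = PCA(n_components=2).fit(X)       # learn the principal components of X
X_reduced = pca.transform(X)           # same number of rows, fewer columns
assert X_reduced.shape == (100, 2)
```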
Pipelines and Feature Unions
While most classification use cases can be implemented using only independent
transformers and classifiers, it is often more convenient and descriptive to view the
entire process as a data pipeline. To facilitate this, sklearn implements a special kind of
classifier, aptly named Pipeline.
A pipeline contains a list of estimators called steps. Its job is to chain the output of one
estimator to the next. This way we can achieve complex transformations, all exposed
through the simple interface of a classifier.
All the steps but the last must implement the transformer interface. As for the last
one, depending on its nature, the interface of the pipeline changes as well: if the last
estimator is a classifier, the pipeline behaves as a classifier, whereas if it’s a transformer,
it acts accordingly.
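The behaviour described above can be sketched as follows (our own example with synthetic data; the particular steps are illustrative). Because the last step is a classifier, the whole pipeline behaves as one:

```python
# Our sketch of a Pipeline: two transformer steps followed by a classifier,
# so the composite object itself exposes the classifier interface.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 6)
y = rng.randint(0, 2, size=50)

pipe = Pipeline(steps=[
    ('scale', StandardScaler()),     # transformer step
    ('pca', PCA(n_components=3)),    # transformer step
    ('clf', LogisticRegression()),   # final step: a classifier
])
pipe.fit(X, y)                        # fits each step, chaining outputs
assert pipe.predict(X).shape == (50,)
```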
As seen in Figure 4.7, another utility class is the FeatureUnion. As the name
suggests, this class implements the transform method so that it returns the concatenated
result of all its internal transformers.
This is particularly useful when we want to combine data from several different
sources or perform different transformations on the same data (e.g. concatenate the
3-dimensional PCA of the data to the data itself). In our case (as seen in Figure 3.2 on
page 19) we can use a FeatureUnion transformer to combine our textual and numeric
features.
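The concatenation example mentioned in the paragraph above can be sketched like this (our own illustration with synthetic data; FunctionTransformer with no arguments acts as an identity pass-through):

```python
# Our sketch of a FeatureUnion: concatenate the original columns of X
# with a 3-dimensional PCA of the same data.
import numpy as np
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer
from sklearn.decomposition import PCA

rng = np.random.RandomState(0)
X = rng.rand(40, 5)

union = FeatureUnion([
    ('identity', FunctionTransformer()),  # passes the data through unchanged
    ('pca', PCA(n_components=3)),         # 3-dimensional PCA of the same data
])
X_combined = union.fit_transform(X)
assert X_combined.shape == (40, 5 + 3)    # original columns + PCA columns
```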
Moreover, when combined with sklearn's Pipeline, we can implement our entire
ML pipeline as a single object. Not only is this convenient, but it also allows us to tune our
hyperparameters with standard sklearn utilities [51, see Hyper-parameter optimizers].
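As a sketch of what such tuning can look like (our own example; the steps, parameter names and value grids are illustrative, not the thesis configuration), a pipeline's hyperparameters are addressed with the stepname__parameter convention and searched with RandomizedSearchCV:

```python
# Our illustrative sketch: randomized hyperparameter search over a pipeline.
# The step__parameter names work because the pipeline exposes its steps'
# hyperparameters through the get_params/set_params mechanism.
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.RandomState(0)
X = rng.rand(60, 6)
y = rng.randint(0, 2, size=60)

pipe = Pipeline([('pca', PCA()), ('clf', LogisticRegression())])
search = RandomizedSearchCV(pipe, {
    'pca__n_components': [2, 3, 4],   # tunes the PCA step
    'clf__C': [0.1, 1.0, 10.0],       # tunes the classifier step
}, n_iter=5, cv=3, random_state=0)
search.fit(X, y)
assert 'pca__n_components' in search.best_params_
```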
Conclusion
From the passages above, we can see how sklearn offers a solution to cleanly implement our model and fine-tune it. Because of this, its popularity and its active maintenance,
we decided to use it heavily in our web extraction library. We note that we used it
to implement the pipeline illustrated in Figure 3.2, as well as to test it with the logistic
regression, SVM, decision tree and random forest classifiers seen in Table 3.4.
4.4.4 Tensorflow and Keras
Although we are not working directly with Tensorflow [1] we are using it indirectly
through Keras [10], so we discuss the latter in more detail.
Tensorflow is a library developed by Google for numerical computation using data
flow graphs. It is one of the most popular deep learning frameworks at the time of
writing and achieves comparable or better performance than its competitors with GPU
execution [21]. It also enjoys active development from the Google Brain team, which
maintains it and uses it for numerous internal Google projects.
It is worth mentioning that, at the end of 2017, Tensorflow was the most forked
repository and among the top 10 repositories on GitHub, both by number of contributors and by number of reviews, making it the most popular ML tool on the platform [20].
Keras is a library that offers a high-level API for creating deep learning models. Even
though the library can operate on various computational backends, Tensorflow is the
default. Not only this, but Keras was adopted by the Tensorflow project in 2017 and currently
shares a codebase with it [59]. With this in mind, using them together seems the most
sensible choice.
Keras also has the upside of being compatible with the sklearn estimator API, which
means we can easily integrate it in our pipeline together with the preprocessing and
feature extraction steps. In Listing 4 we can see just how easy this process is.
 1  from keras import layers
 2  from keras import models
 3  from keras.wrappers.scikit_learn import KerasClassifier
 4
 5  SAMPLE_SIZE = 25  # arbitrary size, depends on the data
 6
 7  def model():
 8      model = models.Sequential([
 9          layers.Dense(100, input_dim=SAMPLE_SIZE, activation='relu'),
10          layers.Dropout(0.5),
11          layers.Dense(1, activation='sigmoid')
12      ])
13      model.compile(loss='binary_crossentropy',
14                    optimizer='rmsprop', metrics=['accuracy'])
15      return model
16
17  classifier = KerasClassifier(build_fn=model, nb_epoch=10, batch_size=128)
Listing 4: In this code snippet, we construct a Keras deep learning model and wrap it in an sklearn compatibility class. The resulting classifier can be easily dropped into classic sklearn code.
In Listing 4 we implemented a moderately complex deep learning model in just
a few lines of code. Such a model would have been much harder to implement in
plain Tensorflow, not to mention that using this model in sklearn code is made possible
simply through the adapter on line 17. We should also note that the
model function can be parameterized, and its parameters can be manipulated through
the get_params/set_params interface described in Section 4.4.3, allowing them to be
tuned as hyperparameters.
Conclusion
Based on the good performance of Tensorflow and the interoperability Keras has
with our technology stack, namely sklearn, we decided to use these two within our classification library. More precisely, we employed Keras to implement the deep learning
model described in Table 3.1, which achieved some of the best performances among our
models (as seen in Table 3.4).
4.5 User Manual
Our application implements two user interfaces. One of them is the web interface
displayed to users who access our web frontend. It allows queuing classification
jobs and viewing their results. The other interface consists of a few CLI scripts exposed
through Django's manage.py utility. In this section we go over both interfaces and
briefly describe their functionalities.
4.5.1 Web
Figure 4.8: Web frontend interface. Acts as the landing page of our website. UI elements: 1: result type selector, 2: classifier selector, 3: URL input field, 4: queue classification button, 5: list of results.
In Figure 4.8 we observe the UI elements of the web interface for regular users.
Through this screen, users can access a few functionalities:
• By clicking on any of the 3 type selector buttons (1), the user can prompt the list to
show either pending, failed or done classification tasks.
• After choosing a classifier (2) and inputting a valid URL (3), the user can press (4)
to queue a classification job for that webpage. They are then redirected to the
corresponding results page.
• By clicking any item in the list (5), the user can access the result page of the
corresponding classification job.
4.5.2 CLI
The command line interface is accessible to the admin of the application on the same
machine as the REST service. It consists of two scripts added to the manage.py utility.
They are exposed under the names train and publish and, intuitively, allow the
training and publishing of models. To exemplify and document their usage, we present
their UNIX-style help messages in Listing 5.
# ./manage.py train --help
Usage: train [OPTIONS] DATASET_FILES

  Trains a model over a dataset, given a set of values of parameters to use
  for the CV. Parameters used:

Options:
  -v, --verbosity LVL             Either CRITICAL, ERROR, WARNING, INFO or
                                  DEBUG
  --score-files OUTPUT_PATTERN    A string format for the score files.
                                  {suffix} is replaced by "scores" and "cv"
                                  respectively.
  -j, --param-file PARAM_FILE     A json file from which to read parameters
  -p, --param KEY VALUE           A value for a parameter given as "key
                                  value". Values are given as json values
                                  (so quotations count). Can be passed
                                  multiple times.
  --external-folds N_FOLDS TOTAL_FOLDS
                                  The number of folds to use and the total
                                  folds for the external loop (default 10
                                  10). These are used for training as well
                                  on the entire dataset.
  --internal-folds N_FOLDS TOTAL_FOLDS
                                  The number of folds to use and the total
                                  folds for the internal loop (default 10
                                  10)
  --n-iter N_ITER                 The number of iterations for the internal
                                  randomized search (default 20)
  --n-jobs N_JOBS                 The number of jobs to start in parallel
                                  (default -1)
  --random-seed RANDOM_SEED       The random seed to use
  --model-file MODEL_FILE         The file in which to save the pickled
                                  model trained over the entire dataset.
  --shuffle / --no-shuffle        Whether to shuffle the dataset beforehand
  --help                          Show this message and exit.

# ./manage.py publish --help
Usage: publish [OPTIONS] MODEL_FILE NAME

  Publish the pickled model in MODEL_FILE under the name NAME.

Options:
  -v, --verbosity LVL  Either CRITICAL, ERROR, WARNING, INFO or DEBUG
Listing 5: Help messages for the train and publish commands. The first one allows training on a given CSV dataset and saving the trained model with the help of the --model-file argument. The second one can take the serialized model file and push it as a classifier to the DB.
Chapter 5
Conclusions and Future Work
We began this thesis by briefly presenting the problem of web content extraction and
the fields of machine learning and AI. Additionally, we glanced over how the two topics
intersect and how this influenced our approach.
After mentioning existing methods, the current state of the problem in the literature
and its history, we continued by discussing the method we propose. We designed a method
which uses information from the DOM tree to perform a classification task on HTML
tags. To test it, we introduced a series of experiments assessing its performance in terms of F1 score on two popular datasets: Cleaneval and Dragnet.
From the experimental results we note that our method achieves a score of 0.86
on Cleaneval and 0.96 on Dragnet. On the former, the performance is comparable to
the current state of the art, albeit lower, while on the latter it achieves state-of-the-art
performance.
In Chapter 4, based on the algorithm discussed previously, we modeled a web application which acts as a frontend to our algorithm, allowing users to extract content
from real webpages. Next, we proposed an architecture to sustain our model. In the
end, we elaborated on implementation details and gave a brief overview of our tech stack.
As far as improvements are concerned, as discussed in Section 3.4, our method
could benefit from more extensive testing to provide more statistically sound results.
Regarding the application, we consider that it could be extended to accommodate different
use cases, such as serving as a tool for dataset creation.
Bibliography
[1] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker, V. Vasudevan, P. Warden, M. Wicke, Y. Yu, X. Zheng, and G. Brain, “TensorFlow: A system for large-scale machine learning”, in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’16), 2016, pp. 265–284, ISBN: 978-1-931971-33-1. arXiv: 1605.08695.
[2] S. W. Ambler, “Making your objects persistent—object-orientation and databases”, in Building Object Applications that Work: Your Step-by-Step Handbook for Developing Robust Systems with Object Technology, ser. SIGS: Managing Object Technology. Cambridge University Press, 1997, pp. 291–342. DOI: 10.1017/CBO9780511584947.012.
[3] M. Baroni, F. Chantree, A. Kilgarriff, and S. Sharoff, “Cleaneval: a Competition for Cleaning Web Pages”, LREC, pp. 638–643, 2008.
[4] Y. Bengio and Y. Grandvalet, “No unbiased estimator of the variance of k-fold cross-validation”, Journal of Machine Learning Research, vol. 5, no. Sep, pp. 1089–1105, 2004, ISSN: 1532-4435.
[5] J. Bergstra and Y. Bengio, “Random Search for Hyper-Parameter Optimization”, Journal of Machine Learning Research, vol. 13, pp. 281–305, 2012, ISSN: 1532-4435.
[6] Bureau of Labor Statistics. (2017). Web Developers: Occupational Outlook Handbook: U.S. Bureau of Labor Statistics, [Online]. Available: https://www.bls.gov/ooh/computer-and-information-technology/web-developers.htm#tab-6 (visited on 06/21/2018).
[7] R. Burget and I. Rudolfova, “Web Page Element Classification Based on Visual Features”, 2009 First Asian Conference on Intelligent Information and Database Systems, pp. 67–72, 2009. DOI: 10.1109/ACIIDS.2009.71.
[8] D. Cai, S. Yu, J.-R. Wen, and W.-Y. Ma, “Extracting Content Structure for Web Pages Based on Visual Representation”, Proceedings of the 5th Asia-Pacific Web Conference on Web Technologies and Applications, pp. 406–417, 2003, ISSN: 03029743. DOI: 10.1007/3-540-36901-5_42.
[9] G. C. Cawley and N. L. C. Talbot, “On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation”, Journal of Machine Learning Research, vol. 11, pp. 2079–2107, 2010, ISSN: 1532-4435.
[10] F. Chollet. (2015). Keras Documentation, [Online]. Available: https://keras.io/.
[11] T. Christie. (2017). Django REST framework. version 3.8, [Online]. Available: http://www.django-rest-framework.org/ (visited on 06/19/2018).
[12] ——, (2017). Django REST framework — API Documentation. version 3.8, [Online]. Available: http://www.django-rest-framework.org/#api-guide (visited on 06/20/2018).
[13] A. Church and A. M. Turing, “On computable numbers, with an application to the Entscheidungsproblem”, The Journal of Symbolic Logic, vol. 2, no. 1, p. 42, Mar. 1937. DOI: 10.2307/2268810.
[14] F. Ciravegna, S. Chapman, A. Dingli, and Y. Wilks, “Learning to harvest information for the semantic web”, Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3053, pp. 312–326, 2004, ISSN: 03029743 16113349. DOI: 10.1007/978-3-540-25956-5_22.
[15] C. Cortes and V. Vapnik, “Support-vector networks”, Machine Learning, vol. 20, no. 3, pp. 273–297, Sep. 1995, ISSN: 1573-0565. DOI: 10.1007/BF00994018.
[16] Django Software Foundation. (2017). Django. version 2.0, [Online]. Available: https://www.djangoproject.com/ (visited on 06/19/2018).
[17] ——, (2018). Django documentation, [Online]. Available: https://docs.djangoproject.com/en/2.0/ (visited on 06/20/2018).
[18] Facebook Inc. (2018). A JavaScript library for building user interfaces — React. version 16.3, [Online]. Available: https://reactjs.org/ (visited on 06/19/2018).
[19] R. T. Fielding, “Architectural Styles and the Design of Network-based Software Architectures”, Dissertation, University of California, Irvine, 2000, p. 162, ISBN: 0599871180.
[20] GitHub Inc. (2017). The State of the Octoverse 2017, [Online]. Available: https://octoverse.github.com/ (visited on 06/21/2018).
[21] P. Goldsborough, “A Tour of TensorFlow”. arXiv: 1701.07852.
[22] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www.deeplearningbook.org, ISBN: 9780262035613.
[23] J. Han, E. Haihong, G. Le, and J. Du, “Survey on NoSQL database”, in Proceedings – 2011 6th International Conference on Pervasive Computing and Applications, ICPCA 2011, 2011, pp. 363–366, ISBN: 9781457702082. DOI: 10.1109/ICPCA.2011.6106531.
[24] J. Heaton, “An empirical analysis of feature engineering for predictive modeling”, SoutheastCon 2016, pp. 1–6, 2016. DOI: 10.1109/SECON.2016.7506650.
[25] T. K. Ho, “Random decision forests”, in Proceedings of 3rd International Conference on Document Analysis and Recognition, IEEE Comput. Soc. Press. DOI: 10.1109/icdar.1995.598994.
[26] T. Hobbes, Thomas Hobbes: Leviathan (Longman Library of Primary Sources in Philosophy). Routledge, 2016, ISBN: 9781315507606.
[27] D. Holth. (2012). PEP 427 – The Wheel Binary Package Format 1.0, [Online]. Available: https://www.python.org/dev/peps/pep-0427/ (visited on 06/19/2018).
[28] C. N. Hsu and M. T. Dung, “Generating finite-state transducers for semi-structured data extraction from the Web”, Information Systems, vol. 23, no. 8, pp. 521–538, 1998, ISSN: 03064379. DOI: 10.1016/S0306-4379(98)00027-1.
[29] P. J. Eby. (2003). PEP 333 – Python Web Server Gateway Interface v1.0, [Online]. Available: https://www.python.org/dev/peps/pep-0333/ (visited on 06/19/2018).
[30] N. Japkowicz and S. Stephen, “The class imbalance problem: A systematic study”, Intelligent Data Analysis, vol. 6, no. 5, pp. 429–449, 2002, ISSN: 1088467X.
[31] E. M. Karabulut, S. A. Özel, and T. İbrikçi, “A comparative study on the effect of feature selection on classification accuracy”, Procedia Technology, vol. 1, pp. 323–327, 2012, ISSN: 22120173. DOI: 10.1016/j.protcy.2012.02.068.
[32] C. Kohlschütter, P. Fankhauser, and W. Nejdl, “Boilerplate detection using shallow text features”, in Proceedings of the Third ACM International Conference on Web Search and Data Mining – WSDM ’10, 2010, p. 441, ISBN: 9781605588896. DOI: 10.1145/1718487.1718542.
[33] R. Kreuzer, J. Hage, and A. Feelders, “A quantitative comparison of semantic web page segmentation approaches”, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 9114, 2015, pp. 374–391, ISBN: 9783319198897. DOI: 10.1007/978-3-319-19890-3_24.
[34] G. Lawton, “Developing software online with platform-as-a-service technology”, Computer, vol. 41, no. 6, pp. 13–15, Jul. 2008, ISSN: 00189162. DOI: 10.1109/MC.2008.185.
[35] W. Liu, X. Meng, and W. Meng, “ViDE: A vision-based approach for deep web data extraction”, IEEE Transactions on Knowledge and Data Engineering, vol. 22, no. 3, pp. 447–460, 2010, ISSN: 10414347. DOI: 10.1109/TKDE.2009.109.
[36] P. McCorduck, Machines Who Think: A Personal Inquiry into the History and Prospects of Artificial Intelligence. AK Peters/CRC Press, 2004, ISBN: 1568812051.
[37] W. McKinney and P. D. Team. (2018). Pandas – Powerful Python Data Analysis Toolkit. version 0.23.1, [Online]. Available: https://pandas.pydata.org/pandas-docs/stable/ (visited on 06/20/2018).
[38] D. Merkel, “Docker: Lightweight linux containers for consistent development and deployment”, Linux Journal, vol. 2014, no. 239, Mar. 2014, ISSN: 1075-3583.
[39] T. M. Mitchell et al., “Machine learning. 1997”, Burr Ridge, IL: McGraw Hill, vol. 45, no. 37, pp. 870–877, 1997.
[40] NumPy Developers. (2017). NumPy. version 1.14.5, [Online]. Available: http://www.numpy.org/ (visited on 06/20/2018).
[41] Object Management Group. (2017). About the Unified Modeling Language Specification Version 2.5.1, [Online]. Available: https://www.omg.org/spec/UML/2.5 (visited on 06/19/2018).
[42] J. Pasternack and D. Roth, “Extracting article text from the web with maximum subsequence segmentation”, in Proceedings of the 18th International Conference on World Wide Web – WWW ’09, 2009, p. 971, ISBN: 9781605584874. DOI: 10.1145/1526709.1526840.
[43] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and É. Duchesnay, “Scikit-learn: Machine Learning in Python”, Journal of Machine Learning Research, vol. 12, pp. 2825–2830, 2012, ISSN: 1532-4435. arXiv: 1201.0490.
[44] M. E. Peters and D. Lecocq. (2012). Dragnet. version 2.0.3, [Online]. Available: https://github.com/seomoz/dragnet/tree/v2.0.3 (visited on 06/22/2018).
[45] ——, “Content extraction using diverse feature sets”, in Proceedings of the 22nd International Conference on World Wide Web Companion, 2013, pp. 89–90, ISBN: 9781450320382.
[46] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation”, Tech. Rep., Sep. 1985. DOI: 10.21236/ada164453.
[47] D. Ruppert, “The Elements of Statistical Learning: Data Mining, Inference, and Prediction”, Journal of the American Statistical Association, vol. 99, no. 466, pp. 567–567, 2004, ISSN: 0162-1459. DOI: 10.1198/jasa.2004.s339.
[48] S. J. Russell and P. Norvig, Artificial Intelligence: A Modern Approach. Malaysia; Pearson Education Limited, 2016, ISBN: 0136042597.
[49] A. L. Samuel, “Some Studies in Machine Learning Using the Game of Checkers. I”, in Computer Games I, New York, NY: Springer New York, 1988, pp. 335–365, ISBN: 9781461387183. DOI: 10.1007/978-1-4613-8716-9_14.
[50] S. Sanfilippo. (2018). Redis. version 4.0, [Online]. Available: https://redis.io/ (visited on 06/19/2018).
[51] Scikit-Learn Developers. (2017). Scikit-Learn — API Reference. version 0.19.1, [Online]. Available: http://scikit-learn.org/stable/modules/classes.html (visited on 06/20/2018).
[52] ——, (2017). Scikit-Learn — Release History. version 0.19.1, [Online]. Available: http://scikit-learn.org/stable/whats_new.html (visited on 06/20/2018).
[53] H. A. Sleiman and R. Corchuelo, “A survey on region extractors from web documents”, IEEE Transactions on Knowledge and Data Engineering, vol. 25, no. 9, pp. 1960–1981, 2013, ISSN: 10414347. DOI: 10.1109/TKDE.2012.135.
[54] A. Solem. (2018). Celery: Distributed Task Queue. version 4.2, [Online]. Available: http://www.celeryproject.org/ (visited on 06/19/2018).
[55] D. Song, F. Sun, and L. Liao, “A hybrid approach for content extraction with text density and visual importance of DOM nodes”, Knowledge and Information Systems, vol. 42, no. 1, pp. 75–96, 2015, ISSN: 02193116. DOI: 10.1007/s10115-013-0687-x.
[56] R. Song, H. Liu, J.-R. Wen, and W.-Y. Ma, “Learning block importance models for web pages”, Proceedings of the 13th Conference on World Wide Web – WWW ’04, vol. 49, p. 203, 2004, ISBN: 158113844X. DOI: 10.1145/988672.988700.
[57] F. Sun, D. Song, and L. Liao, “DOM based content extraction via text density”, in Proceedings of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval – SIGIR ’11, 2011, p. 245, ISBN: 9781450307574. DOI: 10.1145/2009916.2009952.
[58] The PostgreSQL Global Development Group. (2018). PostgreSQL: The world’s most advanced open source database. version 9.2, [Online]. Available: https://www.postgresql.org/ (visited on 06/19/2018).
[59] R. Thomas. (2017). Big deep learning news: Google Tensorflow chooses Keras, [Online]. Available: http://www.fast.ai/2017/01/03/keras/ (visited on 06/20/2018).
[60] A. M. Turing, “Computing machinery and intelligence”, Mind, vol. LIX, no. 236, pp. 433–460, 1950. DOI: 10.1093/mind/lix.236.433.
[61] S. Vadrevu, F. Gelgi, and H. Davulcu, “Semantic partitioning of web pages”, in Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics), vol. 3806 LNCS, 2005, pp. 107–118, ISBN: 3540300171. DOI: 10.1007/11581062_9.
[62] J. Vainikka, “Full-stack web development using Django REST framework and React”, Bachelor Thesis, Metropolia University of Applied Sciences, 2018.
[63] T. Vogels, O.-E. Ganea, and C. Eickhoff, “Web2Text: Deep Structured Boilerplate Removal”, in European Conference on Information Retrieval, 2018, ISBN: 9783319769400. DOI: 10.1007/978-3-319-76941-7_13. arXiv: 1801.02607.
[64] W3C. (2016). HTML 5.2, [Online]. Available: https://www.w3.org/TR/2017/REC-html52-20171214/ (visited on 04/28/2018).
[65] ——, (2017). XML Path Language (XPath) 3.1, [Online]. Available: https://www.w3.org/TR/2017/REC-xpath-31-20170321/ (visited on 06/20/2018).
[66] S. H. Walker and D. B. Duncan, “Estimation of the probability of an event as a function of several independent variables”, Biometrika, vol. 54, no. 1/2, pp. 167–179, 1967, ISSN: 00063444.
[67] T. Weninger, W. H. Hsu, and J. Han, “CETR: content extraction via tag ratios”, Proceedings of the 19th International Conference on World Wide Web, pp. 971–980, 2010. DOI: 10.1145/1772690.1772789.
[68] A. Wiggins. (2012). The Twelve-Factor App, [Online]. Available: https://12factor.net/ (visited on 06/19/2018).
[69] G. Wu, L. Li, X. Hu, and X. Wu, “Web news extraction via path ratios”, in Proceedings of the 22nd ACM International Conference on Information & Knowledge Management – CIKM ’13, New York, New York, USA: ACM Press, 2013, pp. 2059–2068, ISBN: 9781450322638. DOI: 10.1145/2505515.2505558.
[70] J. Yao and X. Zuo, “A Machine Learning Approach to Webpage Content Extraction”, 2013.