End-to-End Relation Extraction using LSTMs
on Sequences and Tree Structures
Makoto Miwa
Toyota Technological Institute
Nagoya, 468-8511, Japan
[anonimizat] Bansal
Toyota Technological Institute at Chicago
Chicago, IL, 60637, USA
[anonimizat]
Abstract
We present a novel end-to-end neural
model to extract entities and relations be-
tween them. Our recurrent neural net-
work based model captures both word se-
quence and dependency tree substructure
information by stacking bidirectional tree-
structured LSTM-RNNs on bidirectional
sequential LSTM-RNNs. This allows our
model to jointly represent both entities and
relations with shared parameters in a sin-
gle model. We further encourage detec-
tion of entities during training and use of
entity information in relation extraction
via entity pretraining and scheduled sam-
pling. Our model improves over the state-
of-the-art feature-based model on end-to-
end relation extraction, achieving 12.1%
and 5.7% relative error reductions in F1-
score on ACE2005 and ACE2004, respec-
tively. We also show that our LSTM-
RNN based model compares favorably to
the state-of-the-art CNN based model (in
F1-score) on nominal relation classifica-
tion (SemEval-2010 Task 8). Finally, we
present an extensive ablation analysis of
several model components.
1 Introduction
Extracting semantic relations between entities in
text is an important and well-studied task in in-
formation extraction and natural language pro-
cessing (NLP). Traditional systems treat this task
as a pipeline of two separated tasks, i.e., named
entity recognition (NER) (Nadeau and Sekine,
2007; Ratinov and Roth, 2009) and relation
extraction (Zelenko et al., 2003; Zhou et al.,
2005), but recent studies show that end-to-end (joint) modeling of entities and relations is important
for high performance (Li and Ji, 2014; Miwa
and Sasaki, 2014) since relations interact closely
with entity information. For instance, to learn
that Toefting and Bolton have an Organization-
Affiliation (ORG-AFF) relation in the sentence
Toefting transferred to Bolton, the entity information
that Toefting and Bolton are Person and Organization
entities is important. Extraction of these
entities is in turn encouraged by the presence of
the context words transferred to, which indicate an
employment relation. Previous joint models have
employed feature-based structured learning. An
alternative approach to this end-to-end relation ex-
traction task is to employ automatic feature learn-
ing via neural network (NN) based models.
There are two ways to represent relations be-
tween entities using neural networks: recur-
rent/recursive neural networks (RNNs) and convo-
lutional neural networks (CNNs). Among these,
RNNs can directly represent essential linguis-
tic structures, i.e., word sequences (Hammerton,
2001) and constituent/dependency trees (Tai et
al., 2015). Despite this representation ability,
for relation classification tasks, the previously re-
ported performance using long short-term memory
(LSTM) based RNNs (Xu et al., 2015b; Li et al.,
2015) is worse than one using CNNs (dos Santos
et al., 2015). These previous LSTM-based sys-
tems mostly include limited linguistic structures
and neural architectures, and do not model entities
and relations jointly. We are able to achieve im-
provements over state-of-the-art models via end-
to-end modeling of entities and relations based on
richer LSTM-RNN architectures that incorporate
complementary linguistic structures.
arXiv:1601.00770v3 [cs.CL] 8 Jun 2016
Word sequence and tree structure are known to
be complementary information for extracting relations.
For instance, dependencies between words
are not enough to predict that source and U.S.
have an ORG-AFF relation in the sentence “This
is …”, one U.S. source said, and the context word
said is required for this prediction. Many tradi-
tional, feature-based relation classification mod-
els extract features from both sequences and parse
trees (Zhou et al., 2005). However, previous RNN-
based models focus on only one of these linguistic
structures (Socher et al., 2012).
We present a novel end-to-end model to extract
relations between entities on both word sequence
and dependency tree structures. Our model allows
joint modeling of entities and relations in a sin-
gle model by using both bidirectional sequential
(left-to-right and right-to-left) and bidirectional
tree-structured (bottom-up and top-down) LSTM-
RNNs. Our model first detects entities and then
extracts relations between the detected entities us-
ing a single incrementally-decoded NN structure,
and the NN parameters are jointly updated using
both entity and relation labels. Unlike traditional
incremental end-to-end relation extraction models,
our model further incorporates two enhancements
into training: entity pretraining, which pretrains
the entity model, and scheduled sampling (Bengio
et al., 2015), which replaces (unreliable) predicted
labels with gold labels with a certain probability.
These enhancements alleviate the problem of
low-performance entity detection in early stages
of training, as well as allow entity information to
further help downstream relation classification.
On end-to-end relation extraction, we improve
over the state-of-the-art feature-based model, with
12.1% (ACE2005) and 5.7% (ACE2004) relative
error reductions in F1-score. On nominal relation
classification (SemEval-2010 Task 8), our model
compares favorably to the state-of-the-art CNN-
based model in F1-score. Finally, we also ab-
late and compare our various model components,
which leads to some key findings (both positive
and negative) about the contribution and effec-
tiveness of different RNN structures, input depen-
dency relation structures, different parsing mod-
els, external resources, and joint learning settings.
2 Related Work
LSTM-RNNs have been widely used for sequen-
tial labeling, such as clause identification (Ham-
merton, 2001), phonetic labeling (Graves and
Schmidhuber, 2005), and NER (Hammerton,
2003). Recently, Huang et al. (2015) showed that building a conditional random field (CRF) layer on
top of bidirectional LSTM-RNNs performs com-
parably to the state-of-the-art methods in the part-
of-speech (POS) tagging, chunking, and NER.
For relation classification, in addition to tra-
ditional feature/kernel-based approaches (Zelenko
et al., 2003; Bunescu and Mooney, 2005), sev-
eral neural models have been proposed in the
SemEval-2010 Task 8 (Hendrickx et al., 2010),
including embedding-based models (Hashimoto
et al., 2015), CNN-based models (dos Santos et
al., 2015), and RNN-based models (Socher et al.,
2012). Recently, Xu et al. (2015a) and Xu et
al. (2015b) showed that the shortest dependency
paths between relation arguments, which were
used in feature/kernel-based systems (Bunescu
and Mooney, 2005), are also useful in NN-based
models. Xu et al. (2015b) also showed that LSTM-
RNNs are useful for relation classification, but the
performance was worse than CNN-based models.
Li et al. (2015) compared separate sequence-based
and tree-structured LSTM-RNNs on relation clas-
sification, using basic RNN model structures.
Research on tree-structured LSTM-RNNs (Tai
et al., 2015) fixes the direction of information
propagation from bottom to top, and also cannot
handle an arbitrary number of typed children as in
a typed dependency tree. Furthermore, no RNN-
based relation classification model simultaneously
uses word sequence and dependency tree informa-
tion. We propose several such novel model struc-
tures and training settings, investigating the simul-
taneous use of bidirectional sequential and bidi-
rectional tree-structured LSTM-RNNs to jointly
capture linear and dependency context for end-to-
end extraction of relations between entities.
As for end-to-end (joint) extraction of relations
between entities, all existing models are feature-
based systems (and no NN-based model has been
proposed). Such models include structured pre-
diction (Li and Ji, 2014; Miwa and Sasaki,
2014), integer linear programming (Roth and Yih,
2007; Yang and Cardie, 2013), card-pyramid pars-
ing (Kate and Mooney, 2010), and global prob-
abilistic graphical models (Yu and Lam, 2010;
Singh et al., 2013). Among these, structured pre-
diction methods are state-of-the-art on several cor-
pora. We present an improved, NN-based alterna-
tive for the end-to-end relation extraction.

[Figure 1 omitted. Example sentence: "In 1909, Sidney Yates was born in Chicago." The sequence (entity) part stacks word/POS embeddings, a Bi-LSTM, a tanh hidden layer, and a softmax that outputs entity tags such as B-PER and L-PER, with label embeddings fed to the next step. The dependency (relation) part runs a Bi-TreeLSTM over the path Yates, born, in, Chicago (nsubjpass, prep, pobj) with dependency embeddings, followed by a tanh hidden layer and a softmax that outputs PHYS. Dropout is applied as marked in the figure.]
Fig. 1: Our incrementally-decoded end-to-end relation extraction model, with bidirectional sequential
and bidirectional tree-structured LSTM-RNNs.
3 Model
We design our model with LSTM-RNNs that rep-
resent both word sequences and dependency tree
structures, and perform end-to-end extraction of
relations between entities on top of these RNNs.
Fig. 1 illustrates the overview of the model. The
model mainly consists of three representation lay-
ers: a word embeddings layer (embedding layer),
a word sequence based LSTM-RNN layer (se-
quence layer), and finally a dependency subtree
based LSTM-RNN layer (dependency layer). Dur-
ing decoding, we build greedy, left-to-right entity
detection on the sequence layer and realize rela-
tion classification on the dependency layers, where
each subtree based LSTM-RNN corresponds to
a relation candidate between two detected enti-
ties. After decoding the entire model structure, we
update the parameters simultaneously via back-
propagation through time (BPTT) (Werbos, 1990).
The dependency layers are stacked on the se-
quence layer, so the embedding and sequence lay-
ers are shared by both entity detection and rela-
tion classification, and the shared parameters are
affected by both entity and relation labels.
3.1 Embedding Layer
The embedding layer handles embedding representations.
$n_w$-, $n_p$-, $n_d$-, and $n_e$-dimensional vectors
$v^{(w)}$, $v^{(p)}$, $v^{(d)}$, and $v^{(e)}$ are embedded to words,
part-of-speech (POS) tags, dependency types, and
entity labels, respectively.
3.2 Sequence Layer
The sequence layer represents words in a linear se-
quence using the representations from the embed-
ding layer. This layer represents sentential con-
text information and maintains entities, as shown
in the bottom-left part of Fig. 1.
We represent the word sequence in a sentence
with bidirectional LSTM-RNNs (Graves et al.,
2013). The LSTM unit at the $t$-th word consists of
a collection of $n_{l_s}$-dimensional vectors: an input
gate $i_t$, a forget gate $f_t$, an output gate $o_t$, a memory
cell $c_t$, and a hidden state $h_t$. The unit receives
an $n$-dimensional input vector $x_t$, the previous
hidden state $h_{t-1}$, and the memory cell $c_{t-1}$,
and calculates the new vectors using the following
equations:
\begin{align}
i_t &= \sigma\left(W^{(i)} x_t + U^{(i)} h_{t-1} + b^{(i)}\right), \tag{1}\\
f_t &= \sigma\left(W^{(f)} x_t + U^{(f)} h_{t-1} + b^{(f)}\right), \notag\\
o_t &= \sigma\left(W^{(o)} x_t + U^{(o)} h_{t-1} + b^{(o)}\right), \notag\\
u_t &= \tanh\left(W^{(u)} x_t + U^{(u)} h_{t-1} + b^{(u)}\right), \notag\\
c_t &= i_t \odot u_t + f_t \odot c_{t-1}, \notag\\
h_t &= o_t \odot \tanh(c_t), \notag
\end{align}
where $\sigma$ denotes the logistic function, $\odot$ denotes
element-wise multiplication, $W$ and $U$ are weight
matrices, and $b$ are bias vectors. The LSTM unit
at the $t$-th word receives the concatenation of word
and POS embeddings as its input vector: $x_t = \left[v^{(w)}_t; v^{(p)}_t\right]$.
We also concatenate the hidden state
vectors of the two directions’ LSTM units corresponding
to each word (denoted as $\overrightarrow{h_t}$ and $\overleftarrow{h_t}$) as
its output vector, $s_t = \left[\overrightarrow{h_t}; \overleftarrow{h_t}\right]$, and pass it to the
subsequent layers.
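As an illustration of Eq. (1) and of the concatenation $s_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$, the following is a minimal pure-Python sketch, not the authors' implementation. The function names, the `params` dictionary layout, and the zero initial states are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    # Plain matrix-vector product on nested lists.
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM update following Eq. (1).

    params maps each of "i", "f", "o", "u" to a (W, U, b) triple
    (hypothetical layout chosen for this sketch).
    """
    def pre(gate):
        W, U, b = params[gate]
        return [wx + uh + bb
                for wx, uh, bb in zip(matvec(W, x_t), matvec(U, h_prev), b)]
    i_t = [sigmoid(a) for a in pre("i")]
    f_t = [sigmoid(a) for a in pre("f")]
    o_t = [sigmoid(a) for a in pre("o")]
    u_t = [math.tanh(a) for a in pre("u")]
    c_t = [i * u + f * c for i, u, f, c in zip(i_t, u_t, f_t, c_prev)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t

def bilstm(xs, fwd_params, bwd_params, n_hidden):
    """Return s_t = [h_forward_t; h_backward_t] for every position t."""
    zeros = [0.0] * n_hidden  # assumed initial hidden state and memory cell
    forward, h, c = [], zeros, zeros
    for x in xs:
        h, c = lstm_step(x, h, c, fwd_params)
        forward.append(h)
    backward, h, c = [], zeros, zeros
    for x in reversed(xs):
        h, c = lstm_step(x, h, c, bwd_params)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]
```

Each output vector has twice the hidden size because the two directions are concatenated, not summed.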
3.3 Entity Detection
We treat entity detection as a sequence labeling
task. We assign an entity tag to each word us-
ing a commonly used encoding scheme BILOU
(Begin, Inside, Last, Outside, Unit) (Ratinov and
Roth, 2009), where each entity tag represents the
entity type and the position of a word in the entity.
For example, in Fig. 1, we assign B-PER and L-PER
(which denote the beginning and last words
of a person entity type, respectively) to each word
in Sidney Yates to represent this phrase as a PER
(person) entity type.
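The BILOU encoding can be sketched as follows. This is a minimal illustration, not the authors' preprocessing code; the function name and the `(start, end, type)` span format with inclusive token indices are assumptions made for this example.

```python
def bilou_tags(n_words, entities):
    """Encode entity spans as BILOU tags.

    entities: list of (start, end, type) with inclusive token indices
    (a hypothetical span format for this sketch).
    """
    tags = ["O"] * n_words
    for start, end, etype in entities:
        if start == end:
            tags[start] = "U-" + etype          # single-word entity
        else:
            tags[start] = "B-" + etype          # first word
            for k in range(start + 1, end):
                tags[k] = "I-" + etype          # inside words
            tags[end] = "L-" + etype            # last word
    return tags

words = ["In", "1909", ",", "Sidney", "Yates", "was", "born", "in", "Chicago", "."]
tags = bilou_tags(len(words), [(3, 4, "PER"), (8, 8, "LOC")])
```

Here Sidney Yates receives B-PER and L-PER, and the single-word entity Chicago receives U-LOC.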
We perform entity detection on top of the sequence
layer. We employ a two-layered NN with
an $n_{h_e}$-dimensional hidden layer $h^{(e)}$ and a softmax
output layer for entity detection.
\begin{align}
h^{(e)}_t &= \tanh\left(W^{(e_h)} \left[s_t; v^{(e)}_{t-1}\right] + b^{(e_h)}\right) \tag{2}\\
y_t &= \mathrm{softmax}\left(W^{(e_y)} h^{(e)}_t + b^{(e_y)}\right) \tag{3}
\end{align}
Here, $W$ are weight matrices and $b$ are bias vectors.
We assign entity labels to words in a greedy,
left-to-right manner.¹ During this decoding, we
use the predicted label of a word to predict the
label of the next word so as to take label depen-
dencies into account. The NN above receives the
concatenation of its corresponding outputs in the
sequence layer and the label embedding for its pre-
vious word (Fig. 1).
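The greedy, left-to-right decoding can be sketched as below. This is only an illustration under stated assumptions: `score_fn` stands in for the two-layered NN of Eqs. (2) and (3), `toy_score` is a hypothetical scorer invented for the example, and using "O" as the label that precedes the first word is an assumption.

```python
def greedy_decode(seq_outputs, score_fn, start_label="O"):
    """Assign one label per word, feeding each prediction to the next step.

    score_fn(s_t, prev_label) returns a dict mapping labels to scores; it
    is a stand-in for the entity detection network.
    """
    labels, prev = [], start_label
    for s_t in seq_outputs:
        scores = score_fn(s_t, prev)
        prev = max(scores, key=scores.get)  # keep only the best label
        labels.append(prev)
    return labels

def toy_score(s_t, prev_label):
    # Hypothetical scorer: base scores plus a bonus for L-PER after B-PER,
    # mimicking a learned label dependency.
    scores = dict(s_t)
    if prev_label == "B-PER":
        scores["L-PER"] = scores.get("L-PER", 0.0) + 1.0
    return scores

seq = [{"B-PER": 1.0, "O": 0.5}, {"O": 0.6, "L-PER": 0.0}]
decoded = greedy_decode(seq, toy_score)
```

With the label dependency, the second word is tagged L-PER; a scorer that ignores the previous label would tag it O.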
3.4 Dependency Layer
The dependency layer represents a relation be-
tween a pair of two target words (corresponding
to a relation candidate in relation classification) in
the dependency tree, and is in charge of relation-specific
representations, as shown in the top-right
part of Fig. 1. This layer mainly focuses on the
shortest path between a pair of target words in the
dependency tree (i.e., the path between the lowest
common ancestor and the two target words) since
these paths are shown to be effective in relation
classification (Xu et al., 2015a). For example, we
show the shortest path between Yates and Chicago
in the bottom of Fig. 1, and this path captures well
the key phrase of their relation, i.e., born in.
¹ We also tried beam search, but this did not show improvements in initial experiments.
We employ bidirectional tree-structured LSTM-
RNNs (i.e., bottom-up and top-down) to represent
a relation candidate by capturing the dependency
structure around the target word pair. This bidirec-
tional structure propagates to each node not only
the information from the leaves but also informa-
tion from the root. This is especially important
for relation classification, which makes use of ar-
gument nodes near the bottom of the tree, and our
top-down LSTM-RNN sends information from the
top of the tree to such near-leaf nodes (unlike in
standard bottom-up LSTM-RNNs).² Note that the
two variants of tree-structured LSTM-RNNs by
Tai et al. (2015) are not able to represent our tar-
get structures which have a variable number of
typed children: the Child-Sum Tree-LSTM does
not deal with types and the N-ary Tree assumes
a fixed number of children. We thus propose a
new variant of tree-structured LSTM-RNN that
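shares weight matrices $U$s for same-type children
and also allows a variable number of children. For
this variant, we calculate $n_{l_t}$-dimensional vectors
in the LSTM unit at the $t$-th node with $C(t)$ children
using the following equations:
\begin{align}
i_t &= \sigma\left(W^{(i)} x_t + \sum_{l \in C(t)} U^{(i)}_{m(l)} h_{tl} + b^{(i)}\right), \tag{4}\\
f_{tk} &= \sigma\left(W^{(f)} x_t + \sum_{l \in C(t)} U^{(f)}_{m(k)m(l)} h_{tl} + b^{(f)}\right), \notag\\
o_t &= \sigma\left(W^{(o)} x_t + \sum_{l \in C(t)} U^{(o)}_{m(l)} h_{tl} + b^{(o)}\right), \notag\\
u_t &= \tanh\left(W^{(u)} x_t + \sum_{l \in C(t)} U^{(u)}_{m(l)} h_{tl} + b^{(u)}\right), \notag\\
c_t &= i_t \odot u_t + \sum_{l \in C(t)} f_{tl} \odot c_{tl}, \notag\\
h_t &= o_t \odot \tanh(c_t), \notag
\end{align}
where $m(\cdot)$ is a type mapping function.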
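The node update in Eq. (4) can be sketched in pure Python as follows. This is an illustrative reading of the equations, not the authors' code; the `params` layout, the helper names, and the representation of children as `(type, h, c)` triples are assumptions made for this example.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(M, v):
    return [sum(m * x for m, x in zip(row, v)) for row in M]

def add(a, b):
    return [x + y for x, y in zip(a, b)]

def tree_lstm_node(x_t, children, params):
    """Update one tree node following Eq. (4).

    children: list of (child_type, h_child, c_child) triples, any length.
    params["W"][g], params["b"][g]: input weights and bias for gate g.
    params["U"][g][type]: child weights for gates "i", "o", "u", keyed by m(l).
    params["U"]["f"][(type_k, type_l)]: forget weights keyed by (m(k), m(l)).
    (This layout is hypothetical and chosen for the sketch.)
    """
    def pre(gate, key_of):
        total = add(matvec(params["W"][gate], x_t), params["b"][gate])
        for m_l, h_l, _ in children:
            total = add(total, matvec(params["U"][gate][key_of(m_l)], h_l))
        return total

    i_t = [sigmoid(a) for a in pre("i", lambda m_l: m_l)]
    o_t = [sigmoid(a) for a in pre("o", lambda m_l: m_l)]
    u_t = [math.tanh(a) for a in pre("u", lambda m_l: m_l)]
    c_t = [i * u for i, u in zip(i_t, u_t)]
    for m_k, _, c_k in children:
        # One forget gate per child k; weights are selected by the type pair.
        f_tk = [sigmoid(a) for a in pre("f", lambda m_l, m_k=m_k: (m_k, m_l))]
        c_t = [c + f * ck for c, f, ck in zip(c_t, f_tk, c_k)]
    h_t = [o * math.tanh(c) for o, c in zip(o_t, c_t)]
    return h_t, c_t
```

Because the weights are looked up by child type and the sums run over whatever children are present, the same function handles leaves, unary nodes, and nodes with several typed children.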
² We also tried to use one LSTM-RNN by connecting the root (Paulus et al., 2014), but preparing two LSTM-RNNs showed slightly better performance in our initial experiments.
To investigate appropriate structures to represent
relations between a target word pair, we
experiment with three structure options. We primarily
employ the shortest path structure (SPTree),
which captures the core dependency path
between a target word pair and is widely used in
relation classification models, e.g., (Bunescu and
Mooney, 2005; Xu et al., 2015a). We also try two
other dependency structures: SubTree and FullTree.
SubTree is the subtree under the lowest
common ancestor of the target word pair. This pro-
vides additional modifier information to the path
and the word pair in SPTree. FullTree is the full
dependency tree. This captures context from the
entire sentence. While we use one node type for
SPTree, we define two node types for SubTree and
FullTree, i.e., one for nodes on shortest paths and
one for all other nodes. We use the type mapping
function $m(\cdot)$ to distinguish these two node types.
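The three structure options can be illustrated on a dependency tree stored as a head array. This is a sketch under assumptions: the head indices below are hand-written for the example sentence of Fig. 1 (only the path Yates, born, in, Chicago is taken from the figure), and the function names and the 0/1 encoding of node types are hypothetical.

```python
def path_to_root(heads, i):
    """Nodes from i up to the root; heads[root] is -1."""
    path = [i]
    while heads[path[-1]] != -1:
        path.append(heads[path[-1]])
    return path

def lowest_common_ancestor(heads, a, b):
    ancestors_a = set(path_to_root(heads, a))
    for node in path_to_root(heads, b):
        if node in ancestors_a:
            return node
    raise ValueError("nodes are not in the same tree")

def sp_tree(heads, a, b):
    """SPTree: nodes on the shortest path between a and b."""
    lca = lowest_common_ancestor(heads, a, b)
    nodes = set()
    for start in (a, b):
        for node in path_to_root(heads, start):
            nodes.add(node)
            if node == lca:
                break
    return lca, nodes

def sub_tree(heads, a, b):
    """SubTree: every node under the lowest common ancestor (inclusive)."""
    lca = lowest_common_ancestor(heads, a, b)
    return {i for i in range(len(heads)) if lca in path_to_root(heads, i)}

def node_types(heads, a, b, nodes):
    """Type mapping m(): 0 for nodes on the shortest path, 1 otherwise."""
    _, on_path = sp_tree(heads, a, b)
    return {n: (0 if n in on_path else 1) for n in nodes}

# In 1909 , Sidney Yates was born in Chicago .   (assumed head indices)
heads = [6, 0, 6, 4, 6, 6, -1, 6, 7, 6]
lca, sp_nodes = sp_tree(heads, 4, 8)          # Yates (4) and Chicago (8)
full_tree = set(range(len(heads)))            # FullTree: all nodes
types = node_types(heads, 4, 8, full_tree)
```

For this pair the lowest common ancestor is the root born, so SubTree and FullTree coincide; for a pair such as Sidney and Yates, SubTree contains only those two words.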
3.5 Stacking Sequence and Dependency
Layers
We stack the dependency layers (corresponding to
relation candidates) on top of the sequence layer to
incorporate both word sequence and dependency
tree structure information into the output. The
dependency-layer LSTM unit at the $t$-th word receives
as input $x_t = \left[s_t; v^{(d)}_t; v^{(e)}_t\right]$, i.e., the concatenation
of its corresponding hidden state vectors
$s_t$ in the sequence layer, the dependency type
embedding $v^{(d)}_t$ (which denotes the type of dependency
to the parent³), and the label embedding $v^{(e)}_t$ (which corresponds
to the predicted entity label).
3.6 Relation Classification
We incrementally build relation candidates using
all possible combinations of the last words of de-
tected entities, i.e., words with L or U labels in
the BILOU scheme, during decoding. For instance,
in Fig. 1, we build a relation candidate using
Yates with an L-PER label and Chicago with
a U-LOC label. For each relation candidate, we
realize the dependency layer $d_p$ (described above)
corresponding to the path between the word pair
$p$ in the relation candidate, and the NN receives a
relation candidate vector constructed from the out-
put of the dependency tree layer, and predicts its
relation label. We treat a pair as a negative relation
when the detected entities are wrong or when the
pair has no relation. We represent relation labels
by type and direction, except for negative relations
that have no direction.
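Candidate generation from the predicted tags can be sketched as follows. This is a minimal illustration, not the authors' decoder; the function name is hypothetical, and each returned pair would still have to be classified by the relation NN.

```python
from itertools import combinations

def relation_candidates(tags):
    """Pair up the last words of detected entities (L- or U- tags)."""
    last_words = [i for i, tag in enumerate(tags)
                  if tag.startswith(("L-", "U-"))]
    return list(combinations(last_words, 2))

tags = ["O", "O", "O", "B-PER", "L-PER", "O", "O", "O", "U-LOC", "O"]
candidates = relation_candidates(tags)
```

For the example sentence the only candidate pairs Yates (index 4) with Chicago (index 8). The number of candidates grows quadratically with the number of detected entities.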
The relation candidate vector is constructed as
the concatenation $d_p = \left[{\uparrow}h_{p_A}; {\downarrow}h_{p_1}; {\downarrow}h_{p_2}\right]$, where
${\uparrow}h_{p_A}$ is the hidden state vector of the top LSTM
unit in the bottom-up LSTM-RNN (representing
the lowest common ancestor of the target word
pair $p$), and ${\downarrow}h_{p_1}$, ${\downarrow}h_{p_2}$ are the hidden state vectors
of the two LSTM units representing the first
and second target words in the top-down LSTM-RNN.⁴
All the corresponding arrows are shown in
Fig. 1.
³ We use the dependency to the parent since the number of children varies. Dependency types can also be incorporated into $m(\cdot)$, but this did not help in initial experiments.
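Similarly to the entity detection, we employ a
two-layered NN with an $n_{h_r}$-dimensional hidden
layer $h^{(r)}$ and a softmax output layer (with weight
matrices $W$ and bias vectors $b$).
\begin{align}
h^{(r)}_p &= \tanh\left(W^{(r_h)} d_p + b^{(r_h)}\right) \tag{5}\\
y_p &= \mathrm{softmax}\left(W^{(r_y)} h^{(r)}_p + b^{(r_y)}\right) \tag{6}
\end{align}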
We construct the input $d_p$ for relation classification
from tree-structured LSTM-RNNs stacked
on sequential LSTM-RNNs, so the contribution
of the sequence layer to the input is indirect. Furthermore,
our model uses words for representing
entities, so it cannot fully use the entity information.
To alleviate these problems, we directly
concatenate the average of hidden state vectors
for each entity from the sequence layer to
the input $d_p$ to relation classification, i.e.,
$d'_p = \left[d_p; \frac{1}{|I_{p_1}|}\sum_{i \in I_{p_1}} s_i; \frac{1}{|I_{p_2}|}\sum_{i \in I_{p_2}} s_i\right]$ (Pair), where
$I_{p_1}$ and $I_{p_2}$ represent sets of word indices in the
first and second entities.⁵
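The Pair construction of $d'_p$ amounts to appending two averaged sequence-layer vectors to $d_p$. A minimal sketch, with hypothetical function names and toy vectors:

```python
def mean_vector(vectors):
    """Element-wise average of equally sized vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def pair_input(d_p, seq_outputs, entity1_indices, entity2_indices):
    """Build d'_p = [d_p; mean of s_i over entity 1; mean of s_i over entity 2]."""
    avg1 = mean_vector([seq_outputs[i] for i in entity1_indices])
    avg2 = mean_vector([seq_outputs[i] for i in entity2_indices])
    return d_p + avg1 + avg2

# Toy sequence-layer outputs s_0..s_3 and a one-dimensional d_p.
s = [[0.0, 0.0], [2.0, 4.0], [4.0, 8.0], [1.0, 1.0]]
d_prime = pair_input([1.0], s, [1, 2], [3])
```

Averaging over all words of an entity gives the classifier access to the whole mention, whereas the dependency layer itself is anchored on the last word only.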
Also, we assign two labels to each word pair in
prediction since we consider both left-to-right and
right-to-left directions. When the predicted labels
are inconsistent, we select the positive and more
confident label, similar to Xu et al. (2015a).
3.7 Training
We update the model parameters including
weights, biases, and embeddings by BPTT and
Adam (Kingma and Ba, 2015) with gradient clip-
ping, parameter averaging, and L2-regularization
(we regularize weights $W$ and $U$, not the bias
terms $b$). We also apply dropout (Srivastava et al.,
2014) to the embedding layer and to the final hid-
den layers for entity detection and relation classi-
fication.
⁴ Note that the order of the target words corresponds to the direction of the relation, not the positions in the sentence.
⁵ Note that we do not show this Pair in Fig. 1 for simplicity.
We employ two enhancements, scheduled sampling
(Bengio et al., 2015) and entity pretraining,
to alleviate the problem of unreliable prediction
of entities in the early stage of training,
and to encourage building positive relation instances
from the detected entities. In scheduled
sampling, we use gold labels as prediction with
probability $\epsilon_i$, which depends on the number of
epochs $i$ during training, if the gold labels are legal.
As for $\epsilon_i$, we choose the inverse sigmoid decay
$\epsilon_i = k/(k + \exp(i/k))$, where $k$ ($\geq 1$) is a
hyper-parameter that adjusts how often we use the
gold labels as prediction. Entity pretraining is inspired
by (Pentina et al., 2015), and we pretrain
the entity detection model using the training data
before training the entire model parameters.
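The inverse sigmoid decay and its use in choosing between gold and predicted labels can be sketched as follows. This is an illustration only: the function names, the `gold_is_legal` flag, and counting epochs from 0 are assumptions made for this example.

```python
import math
import random

def gold_probability(epoch, k):
    """Inverse sigmoid decay: epsilon_i = k / (k + exp(i / k))."""
    return k / (k + math.exp(epoch / k))

def choose_label(gold, predicted, epoch, k, rng=random, gold_is_legal=True):
    """Use the gold label with probability epsilon_i, else the prediction."""
    if gold_is_legal and rng.random() < gold_probability(epoch, k):
        return gold
    return predicted

# With k = 10, the chance of using gold labels falls as training proceeds.
schedule = [round(gold_probability(i, 10), 3) for i in (0, 10, 50, 100)]
```

A larger `k` keeps the probability of using gold labels high for more epochs before it decays toward zero.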
4 Results and Discussion
4.1 Data and Task Settings
We evaluate on three datasets: ACE05 and ACE04
for end-to-end relation extraction, and SemEval-
2010 Task 8 for relation classification. We use the
first two datasets as our primary target, and use
the last one to thoroughly analyze and ablate the
relation classification part of our model.
ACE05 defines 7 coarse-grained entity types
and 6 coarse-grained relation types between enti-
ties. We use the same data splits, preprocessing,
and task settings as Li and Ji (2014). We report
the primary micro F1-scores as well as micro pre-
cision and recall on both entity and relation extrac-
tion to better explain model performance. We treat
an entity as correct when its type and the region of
its head are correct. We treat a relation as correct
when its type and argument entities are correct; we
thus treat all non-negative relations on wrong en-
tities as false positives.
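The micro-averaged scores can be computed from pooled sets of gold and predicted items. The sketch below is an assumption-laden illustration, not the official scorer: the tuple format (sentence id, relation type, argument spans) and the function name are hypothetical.

```python
def micro_prf(gold, predicted):
    """Micro precision, recall, and F1 over pooled sets of items."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0.0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

# A relation counts as correct only if its type and both arguments match.
gold = {("s1", "PHYS", (3, 4), (8, 8)), ("s2", "ORG-AFF", (0, 0), (3, 3))}
pred = {("s1", "PHYS", (3, 4), (8, 8)), ("s2", "PHYS", (0, 0), (3, 3))}
p, r, f1 = micro_prf(gold, pred)
```

In this toy case the second prediction has the right arguments but the wrong type, so it counts as a false positive and the gold relation as a miss.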
ACE04 defines the same 7 coarse-grained en-
tity types as ACE05 (Doddington et al., 2004), but
defines 7 coarse-grained relation types. We fol-
low the cross-validation setting of Chan and Roth
(2011) and Li and Ji (2014), and the preprocessing
and evaluation metrics of ACE05.
SemEval-2010 Task 8 defines 9 relation types
between nominals and a tenth type Other when
two nouns have none of these relations (Hendrickx
et al., 2010). We treat this Other type as a nega-
tive relation type, and no direction is considered.
The dataset consists of 8,000 training and 2,717
test sentences, and each sentence is annotated with
a relation between two given nominals. We ran-
domly selected 800 sentences from the training set
as our development set. We followed the official
task setting, and report the official macro-averaged
F1-score (Macro-F1) on the 9 relation types.
For more details of the data and task settings,
please refer to the supplementary material.
4.2 Experimental Settings
We implemented our model using the cnn library.⁶
We parsed the texts using the Stanford neural dependency
parser⁷ (Chen and Manning, 2014) with
the original Stanford Dependencies. Based on preliminary
tuning, we fixed embedding dimensions
$n_w$ to 200, $n_p$, $n_d$, $n_e$ to 25, and dimensions of
intermediate layers ($n_{l_s}$, $n_{l_t}$ of LSTM-RNNs and
$n_{h_e}$, $n_{h_r}$ of hidden layers) to 100. We initialized
word vectors via word2vec (Mikolov et al., 2013)
trained on Wikipedia⁸ and randomly initialized all
other parameters. We tuned hyper-parameters using
development sets for ACE05 and SemEval-2010
Task 8 to achieve high primary (Micro- and
Macro-) F1-scores.⁹ For ACE04, we directly employed
the best parameters for ACE05. The hyper-parameter
settings are shown in the supplementary
material. For SemEval-2010 Task 8, we also omit-
ted the entity detection and label embeddings since
only target nominals are annotated and the task de-
fines no entity types. Our statistical significance
results are based on the Approximate Randomiza-
tion (AR) test (Noreen, 1989).
4.3 End-to-end Relation Extraction Results
Table 1 compares our model with the state-of-the-art
feature-based model of Li and Ji (2014)¹⁰ on
final test sets, and shows that our model performs
better than the state-of-the-art model.
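The relative error reductions quoted in the abstract can be checked against the relation F1-scores in Table 1, assuming the usual definition in which the error is 1 minus F1. That definition is an assumption of this sketch, as is the function name.

```python
def relative_error_reduction(baseline_f1, new_f1):
    """Share of the baseline's F1 error (1 - F1) removed by the new model."""
    return (new_f1 - baseline_f1) / (1.0 - baseline_f1)

# Relation F1: Li and Ji (2014) versus our model (SPTree), from Table 1.
ace05 = relative_error_reduction(0.495, 0.556)
ace04 = relative_error_reduction(0.453, 0.484)
```

Under this definition the values round to 12.1% for ACE05 and 5.7% for ACE04, matching the reported figures.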
To analyze the contributions and effects of the
various components of our end-to-end relation ex-
traction model, we perform ablation tests on the
ACE05 development set (Table 2). The perfor-
mance slightly degraded without scheduled sam-
pling, and the performance significantly degraded
when we removed entity pretraining or removed
both (p<0.05). This is reasonable because the
model can only create relation instances when
both of the entities are found and, without these
enhancements, it may be too late to find some relations.
Removing label embeddings did not affect
the entity detection performance, but this degraded
the recall in relation classification. This indicates
that entity label information is helpful in detecting
relations.
⁶ https://github.com/clab/cnn
⁷ http://nlp.stanford.edu/software/stanford-corenlp-full-2015-04-20.zip
⁸ https://dumps.wikimedia.org/enwiki/20150901/
⁹ We did not tune the precision-recall trade-offs, but doing so can specifically improve precision further.
¹⁰ Other work on ACE is not comparable or performs worse than the model by Li and Ji (2014).

| Corpus | Settings | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1 |
|---|---|---|---|---|---|---|---|
| ACE05 | Our Model (SPTree) | 0.829 | 0.839 | 0.834 | 0.572 | 0.540 | 0.556 |
| ACE05 | Li and Ji (2014) | 0.852 | 0.769 | 0.808 | 0.654 | 0.398 | 0.495 |
| ACE04 | Our Model (SPTree) | 0.808 | 0.829 | 0.818 | 0.487 | 0.481 | 0.484 |
| ACE04 | Li and Ji (2014) | 0.835 | 0.762 | 0.797 | 0.608 | 0.361 | 0.453 |

Table 1: Comparison with the state-of-the-art on the ACE05 test set and ACE04 dataset.

| Settings | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1 |
|---|---|---|---|---|---|---|
| Our Model (SPTree) | 0.815 | 0.821 | 0.818 | 0.506 | 0.529 | 0.518 |
| -Entity pretraining (EP) | 0.793 | 0.798 | 0.796 | 0.494 | 0.491 | 0.492* |
| -Scheduled sampling (SS) | 0.812 | 0.818 | 0.815 | 0.522 | 0.490 | 0.505 |
| -Label embeddings (LE) | 0.811 | 0.821 | 0.816 | 0.512 | 0.499 | 0.505 |
| -Shared parameters (Shared) | 0.796 | 0.820 | 0.808 | 0.541 | 0.482 | 0.510 |
| -EP, SS | 0.781 | 0.804 | 0.792 | 0.509 | 0.479 | 0.494* |
| -EP, SS, LE, Shared | 0.800 | 0.815 | 0.807 | 0.520 | 0.452 | 0.484** |

Table 2: Ablation tests on the ACE05 development dataset. * denotes significance at p<0.05, ** denotes
p<0.01.

| Settings | Entity P | Entity R | Entity F1 | Relation P | Relation R | Relation F1 |
|---|---|---|---|---|---|---|
| SPTree | 0.815 | 0.821 | 0.818 | 0.506 | 0.529 | 0.518 |
| SubTree | 0.812 | 0.818 | 0.815 | 0.525 | 0.506 | 0.515 |
| FullTree | 0.806 | 0.816 | 0.811 | 0.536 | 0.507 | 0.521 |
| SubTree (-SP) | 0.803 | 0.816 | 0.810 | 0.533 | 0.495 | 0.514 |
| FullTree (-SP) | 0.804 | 0.817 | 0.811 | 0.517 | 0.470 | 0.492* |
| Child-Sum | 0.806 | 0.819 | 0.8122 | 0.514 | 0.499 | 0.506 |
| SPSeq | 0.801 | 0.813 | 0.807 | 0.500 | 0.523 | 0.511 |
| SPXu | 0.809 | 0.818 | 0.813 | 0.494 | 0.522 | 0.508 |

Table 3: Comparison of LSTM-RNN structures on the ACE05 development dataset.
We also show the performance without sharing
parameters, i.e., embedding and sequence layers,
for detecting entities and relations (Shared
parameters); we first train the entity detection
model, detect entities with the model, and build
a separate relation extraction model using the
detected entities, i.e., without entity detection.
This setting can be regarded as a pipeline model
since two separate models are trained sequentially.
Without the shared parameters, the performance in both
entity detection and relation classification
drops slightly, although the differences are
not significant. When we removed all the enhancements,
i.e., scheduled sampling, entity pretraining,
label embedding, and shared parameters,
the performance is significantly worse than SPTree
(p<0.01), showing that these enhancements
provide complementary benefits to end-to-end re-
lation extraction.
Next, we show the performance with differ-
ent LSTM-RNN structures in Table 3. We first
compare the three input dependency structures
(SPTree, SubTree, FullTree) for tree-structured
LSTM-RNNs. Performance on these three structures
is almost the same when we distinguish the
nodes in the shortest paths from other nodes,
but when we do not distinguish them (-SP), the
information outside of the shortest path, i.e.,

FullTree (-SP), significantly hurts performance
(p<0.05). We then compare our tree-structured
LSTM-RNN (SPTree) with the Child-Sum tree-
structured LSTM-RNN on the shortest path of Tai
et al. (2015). Child-Sum performs worse than our
SPTree model, although the decrease is smaller
than above. This may be because the difference in
the models appears only on nodes that have multiple
children, and all the nodes except for the lowest
common ancestor have one child.
We finally show results with two counterparts
of sequence-based LSTM-RNNs using the short-
est path (last two rows in Table 3). SPSeq is a bidi-
rectional LSTM-RNN on the shortest path. The
LSTM unit receives input from the sequence layer
concatenated with embeddings for the surround-
ing dependency types and directions. We concate-
nate the outputs of the two RNNs for the relation
candidate. SPXu is our adaptation of the shortest
path LSTM-RNN proposed by Xu et al. (2015b)
to match our sequence-layer based model.¹¹ This
has two LSTM-RNNs for the left and right sub-
paths of the shortest path. We first calculate the
max pooling of the LSTM units for each of these
two RNNs, and then concatenate the outputs of the
pooling for the relation candidate. The compar-
ison with these sequence-based LSTM-RNNs in-
dicates that a tree-structured LSTM-RNN is com-
parable to sequence-based ones in representing
shortest paths.
Overall, the performance comparison of the
LSTM-RNN structures in Table 3 shows that for
end-to-end relation extraction, selecting the ap-
propriate tree structure representation of the input
(i.e., the shortest path) is more important than the
choice of the LSTM-RNN structure on that input
(i.e., sequential versus tree-based).
4.4 Relation Classification Analysis Results
To thoroughly analyze the relation classification
part alone, e.g., comparing different LSTM struc-
tures, architecture components such as hidden lay-
ers and input information, and classification task
settings, we use the SemEval-2010 Task 8. This
dataset, often used to evaluate NN models for rela-
tion classification, annotates only relation-related
nominals (unlike ACE datasets), so we can focus
cleanly on the relation classification part.
¹¹ This is different from the original one in that we use the sequence layer and we concatenate the embeddings for the input, while the original one prepared individual LSTM-RNNs for different inputs and concatenated their outputs.

| Settings | Macro-F1 |
|---|---|
| No External Knowledge Resources | |
| Our Model (SPTree) | 0.844 |
| dos Santos et al. (2015) | 0.841 |
| Xu et al. (2015a) | 0.840 |
| +WordNet | |
| Our Model (SPTree + WordNet) | 0.855 |
| Xu et al. (2015a) | 0.856 |
| Xu et al. (2015b) | 0.837 |

Table 4: Comparison with state-of-the-art models
on SemEval-2010 Task 8 test-set.

| Settings | Macro-F1 |
|---|---|
| SPTree | 0.851 |
| SubTree | 0.839 |
| FullTree | 0.829 |
| SubTree (-SP) | 0.840 |
| FullTree (-SP) | 0.828 |
| Child-Sum | 0.838 |
| SPSeq | 0.844 |
| SPXu | 0.847 |

Table 5: Comparison of LSTM-RNN structures on
SemEval-2010 Task 8 development set.
We first report official test set results in Ta-
ble 4. Our novel LSTM-RNN model is compara-
ble to both the state-of-the-art CNN-based models
on this task with or without external sources, i.e.,
WordNet, unlike the previous best LSTM-RNN
model (Xu et al., 2015b).¹²
¹² When incorporating WordNet information into our model, we prepared embeddings for WordNet hypernyms extracted by SuperSenseTagger (Ciaramita and Altun, 2006) and concatenated the embeddings to the input vector (the concatenation of word and POS embeddings) of the sequence LSTM. We tuned the dimension of the WordNet embeddings and set it to 15 using the development dataset.
Settings                            Macro-F1
SPTree                              0.851
-Hidden layer                       0.839
-Sequence layer                     0.840
-Pair                               0.844
-Pair, Sequence layer               0.827
Stanford PCFG                       0.844
+WordNet                            0.854
Left-to-right candidates            0.843
Neg. sampling (Xu et al., 2015a)    0.848
Table 6: Model setting ablations on SemEval-2010 development set.
Next, we compare different LSTM-RNN structures in Table 5. Among the three input dependency structures (SPTree, SubTree, FullTree), FullTree performs significantly worse than the other structures, regardless of whether we distinguish the nodes on the shortest paths from the other nodes, which hints that the information outside of the shortest path significantly hurts performance (p < 0.05). We also compare our tree-structured LSTM-RNN (SPTree) with sequence-based LSTM-RNNs (SPSeq and SPXu) and tree-structured LSTM-RNNs (Child-Sum). All these LSTM-RNNs perform slightly worse than our SPTree model, but the differences are small.
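The significance claims (p < 0.05) in this section compare two systems on the same instances. This passage does not restate which test was used; the sketch below assumes a paired approximate randomization test in the style of Noreen (1989), which appears in the reference list. The function names, the default `accuracy` metric (a stand-in for Macro-F1), and the trial count are illustrative assumptions, not the authors' code.

```python
import random
from typing import Callable, Sequence


def accuracy(gold: Sequence[str], pred: Sequence[str]) -> float:
    """Fraction of instances whose predicted label equals the gold label."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def approximate_randomization(
    gold: Sequence[str],
    pred_a: Sequence[str],
    pred_b: Sequence[str],
    metric: Callable[[Sequence[str], Sequence[str]], float] = accuracy,
    trials: int = 1000,
    seed: int = 0,
) -> float:
    """Estimate a two-sided p-value for the score gap between systems A and B."""
    rng = random.Random(seed)
    observed = abs(metric(gold, pred_a) - metric(gold, pred_b))
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(pred_a, pred_b):
            # Swap the two systems' outputs for this instance with probability 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        gap = abs(metric(gold, shuffled_a) - metric(gold, shuffled_b))
        if gap >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly positive.
    return (at_least_as_extreme + 1) / (trials + 1)
```

Passing a Macro-F1 function as `metric` would bring the sketch closer to the evaluation reported in the tables; a returned value below 0.05 would correspond to the "significant" wording used in the text.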
Overall, for relation classification, although
the performance comparison of the LSTM-RNN
structures in Table 5 produces different results on
FullTree as compared to the results on ACE05 in
Table 3, the trend still holds that selecting the ap-
propriate tree structure representation of the input
is more important than the choice of the LSTM-
RNN structure on that input.
Finally, Table 6 summarizes the contribution of several model components and training settings on SemEval relation classification. We first removed the hidden layer by directly connecting the LSTM-RNN layers to the softmax layers, and found that this slightly degraded performance, but the difference was small. We then skipped the sequence layer and directly used the word and POS embeddings for the dependency layer. Removing the sequence layer¹³ or the entity-related information from the sequence layer (Pair) slightly degraded performance, and removing both dropped performance significantly (p < 0.05). This indicates that the sequence layer is necessary, but that the last words of the nominals are almost enough for expressing the relations in this task.
¹³ Note that this setting still uses some sequence layer information since it uses the entity-related information (Pair).
When we replaced the Stanford neural dependency parser with the Stanford lexicalized PCFG parser (Stanford PCFG), performance dropped slightly, but the difference was small. This indicates that the selection of parsing models is not critical. We also included WordNet, and this slightly improved performance (+WordNet), but the difference was small. Lastly, for the generation of relation candidates, generating only left-to-right candidates slightly degraded performance, but the difference was small, and hence the creation of right-to-left candidates was not critical. Treating the inverse relation candidate as a negative instance (Negative sampling) also performed comparably to other generation methods in our model (unlike Xu et al. (2015a), which showed a significant improvement over generating only left-to-right candidates).
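The three candidate generation settings compared above can be sketched as follows. This is only an illustration under stated assumptions: the label encoding (`(e1,e2)` / `(e2,e1)` suffixes relative to the candidate's own source and target), the scheme names, and the helper functions are hypothetical and are not taken from the authors' implementation.

```python
from typing import List, Tuple

NEGATIVE = "Other"  # assumed name of the negative relation type

# A candidate is (source nominal, target nominal, label).
Candidate = Tuple[str, str, str]


def directed(relation: str, forward: bool) -> str:
    """Attach a direction suffix; the negative type carries no direction."""
    if relation == NEGATIVE:
        return NEGATIVE
    return f"{relation}({'e1,e2' if forward else 'e2,e1'})"


def make_candidates(first: str, second: str, relation: str,
                    gold_left_to_right: bool, scheme: str) -> List[Candidate]:
    """Build training candidates for one annotated nominal pair.

    `first` and `second` are the nominals in sentence order, and
    `gold_left_to_right` says whether the gold relation runs from `first`
    to `second`. In each label, e1 is the candidate's source nominal.
    """
    candidates = [(first, second, directed(relation, gold_left_to_right))]
    if scheme == "left_to_right":
        return candidates
    if scheme == "both":
        # The mirrored candidate keeps the relation with the direction flipped.
        candidates.append((second, first, directed(relation, not gold_left_to_right)))
    elif scheme == "negative_sampling":
        # The inverse candidate is treated as a negative instance.
        candidates.append((second, first, NEGATIVE))
    else:
        raise ValueError(f"unknown scheme: {scheme}")
    return candidates
```

For a pair annotated with a right-to-left relation, `make_candidates(a, b, rel, False, "both")` yields one candidate per direction, whereas `"negative_sampling"` labels the reversed candidate with the negative type.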
5 Conclusion
We presented a novel end-to-end relation extraction model that represents both word sequence and dependency tree structures by using bidirectional sequential and bidirectional tree-structured LSTM-RNNs. This allowed us to represent both entities and relations in a single model, achieving gains over the state-of-the-art feature-based system on end-to-end relation extraction (ACE04 and ACE05) and comparing favorably with recent state-of-the-art CNN-based models on nominal relation classification (SemEval-2010 Task 8).
Our evaluation and ablation led to three key
findings. First, the use of both word sequence
and dependency tree structures is effective. Sec-
ond, training with the shared parameters improves
relation extraction accuracy, especially when em-
ployed with entity pretraining, scheduled sam-
pling, and label embeddings. Finally, the shortest
path, which has been widely used in relation clas-
sification, is also appropriate for representing tree
structures in neural LSTM models.
Acknowledgments
We thank Qi Li, Kevin Gimpel, and the anony-
mous reviewers for dataset details and helpful dis-
cussions.
References
[Bengio et al.2015] Samy Bengio, Oriol Vinyals, Navdeep Jaitly, and Noam Shazeer. 2015. Scheduled sampling for sequence prediction with recurrent neural networks. arXiv preprint arXiv:1506.03099.
[Bunescu and Mooney2005] Razvan C Bunescu and Raymond Mooney. 2005. A shortest path dependency kernel for relation extraction. In Proceedings of the conference on Human Language Technology and Empirical Methods in Natural Language Processing, pages 724–731. ACL.
[Chan and Roth2011] Yee Seng Chan and Dan Roth. 2011. Exploiting syntactico-semantic structures for relation extraction. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, pages 551–560, Portland, Oregon, USA, June. ACL.
[Chen and Manning2014] Danqi Chen and Christopher Manning. 2014. A fast and accurate dependency parser using neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 740–750, Doha, Qatar, October. ACL.
[Ciaramita and Altun2006] Massimiliano Ciaramita and Yasemin Altun. 2006. Broad-coverage sense disambiguation and information extraction with a supersense sequence tagger. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 594–602, Sydney, Australia, July. ACL.
[Doddington et al.2004] George Doddington, Alexis Mitchell, Mark Przybocki, Lance Ramshaw, Stephanie Strassel, and Ralph Weischedel. 2004. The automatic content extraction (ACE) program – tasks, data, and evaluation. In Proceedings of the Fourth International Conference on Language Resources and Evaluation (LREC-2004), Lisbon, Portugal, May. European Language Resources Association (ELRA).
[dos Santos et al.2015] Cicero dos Santos, Bing Xiang, and Bowen Zhou. 2015. Classifying relations by ranking with convolutional neural networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 626–634, Beijing, China, July. ACL.
[Graves and Schmidhuber2005] Alex Graves and Jürgen Schmidhuber. 2005. Framewise phoneme classification with bidirectional LSTM and other neural network architectures. Neural Networks, 18(5):602–610.
[Graves et al.2013] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. 2013. Speech recognition with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, pages 6645–6649. IEEE.
[Hammerton2001] James Hammerton. 2001. Clause identification with long short-term memory. In Proceedings of the 2001 workshop on Computational Natural Language Learning-Volume 7, page 22. ACL.
[Hammerton2003] James Hammerton. 2003. Named entity recognition with long short-term memory. In Walter Daelemans and Miles Osborne, editors, Proceedings of the Seventh Conference on Natural Language Learning at HLT-NAACL 2003, pages 172–175. ACL.
[Hashimoto et al.2015] Kazuma Hashimoto, Pontus Stenetorp, Makoto Miwa, and Yoshimasa Tsuruoka. 2015. Task-oriented learning of word embeddings for semantic relation classification. In Proceedings of the Nineteenth Conference on Computational Natural Language Learning, pages 268–278, Beijing, China, July. ACL.
[Hendrickx et al.2010] Iris Hendrickx, Su Nam Kim, Zornitsa Kozareva, Preslav Nakov, Diarmuid Ó Séaghdha, Sebastian Padó, Marco Pennacchiotti, Lorenza Romano, and Stan Szpakowicz. 2010. SemEval-2010 task 8: Multi-way classification of semantic relations between pairs of nominals. In Proceedings of the 5th International Workshop on Semantic Evaluation, pages 33–38, Uppsala, Sweden, July. ACL.
[Huang et al.2015] Zhiheng Huang, Wei Xu, and Kai Yu. 2015. Bidirectional LSTM-CRF models for sequence tagging. arXiv preprint arXiv:1508.01991.
[Kate and Mooney2010] Rohit J. Kate and Raymond Mooney. 2010. Joint entity and relation extraction using card-pyramid parsing. In Proceedings of the Fourteenth Conference on Computational Natural Language Learning, pages 203–212, Uppsala, Sweden, July. ACL.
[Kingma and Ba2015] Diederik Kingma and Jimmy Ba. 2015. Adam: A method for stochastic optimization. In ICLR 2015, San Diego, CA, May.
[Li and Ji2014] Qi Li and Heng Ji. 2014. Incremental joint extraction of entity mentions and relations. In Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 402–412, Baltimore, Maryland, June. ACL.
[Li et al.2015] Jiwei Li, Thang Luong, Dan Jurafsky, and Eduard Hovy. 2015. When are tree structures necessary for deep learning of representations? In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 2304–2314, Lisbon, Portugal, September. ACL.
[Lu and Roth2015] Wei Lu and Dan Roth. 2015. Joint mention extraction and classification with mention hypergraphs. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 857–867, Lisbon, Portugal, September. ACL.
[Mikolov et al.2013] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S Corrado, and Jeff Dean. 2013. Distributed representations of words and phrases and their compositionality. In Advances in neural information processing systems, pages 3111–3119.
[Miwa and Sasaki2014] Makoto Miwa and Yutaka Sasaki. 2014. Modeling joint entity and relation extraction with table representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1858–1869, Doha, Qatar, October. ACL.
[Nadeau and Sekine2007] David Nadeau and Satoshi Sekine. 2007. A survey of named entity recognition and classification. Lingvisticae Investigationes, 30(1):3–26.
[Noreen1989] Eric W. Noreen. 1989. Computer-Intensive Methods for Testing Hypotheses: An Introduction. Wiley-Interscience, April.
[Paulus et al.2014] Romain Paulus, Richard Socher, and Christopher D Manning. 2014. Global belief recursive neural networks. In Z. Ghahramani, M. Welling, C. Cortes, N.D. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2888–2896. Curran Associates, Inc.
[Pentina et al.2015] Anastasia Pentina, Viktoriia Sharmanska, and Christoph H. Lampert. 2015. Curriculum learning of multiple tasks. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5492–5500, Boston, MA, USA, June.
[Ratinov and Roth2009] Lev Ratinov and Dan Roth. 2009. Design challenges and misconceptions in named entity recognition. In Proceedings of the Thirteenth Conference on Computational Natural Language Learning (CoNLL-2009), pages 147–155, Boulder, Colorado, June. ACL.
[Roth and Yih2007] Dan Roth and Wen-Tau Yih. 2007. Global Inference for Entity and Relation Identification via a Linear Programming Formulation. MIT Press.
[Singh et al.2013] Sameer Singh, Sebastian Riedel, Brian Martin, Jiaping Zheng, and Andrew McCallum. 2013. Joint inference of entities, relations, and coreference. In Proceedings of the 2013 workshop on Automated knowledge base construction, pages 1–6. ACM.
[Socher et al.2012] Richard Socher, Brody Huval, Christopher D. Manning, and Andrew Y. Ng. 2012. Semantic compositionality through recursive matrix-vector spaces. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1201–1211, Jeju Island, Korea, July. ACL.
[Srivastava et al.2014] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. 2014. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958.
[Tai et al.2015] Kai Sheng Tai, Richard Socher, and Christopher D. Manning. 2015. Improved semantic representations from tree-structured long short-term memory networks. In Proceedings of the 53rd Annual Meeting of the Association for Computational Linguistics and the 7th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 1556–1566, Beijing, China, July. ACL.
[Werbos1990] Paul J Werbos. 1990. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE, 78(10):1550–1560.
[Xu et al.2015a] Kun Xu, Yansong Feng, Songfang Huang, and Dongyan Zhao. 2015a. Semantic relation classification via convolutional neural networks with simple negative sampling. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 536–540, Lisbon, Portugal, September. ACL.
[Xu et al.2015b] Yan Xu, Lili Mou, Ge Li, Yunchuan Chen, Hao Peng, and Zhi Jin. 2015b. Classifying relations via long short term memory networks along shortest dependency paths. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 1785–1794, Lisbon, Portugal, September. ACL.
[Yang and Cardie2013] Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1640–1649, Sofia, Bulgaria, August. ACL.
[Yu and Lam2010] Xiaofeng Yu and Wai Lam. 2010. Jointly identifying entities and extracting relations in encyclopedia text via a graphical model approach. In Coling 2010: Posters, pages 1399–1407, Beijing, China, August. Coling 2010 Organizing Committee.
[Zelenko et al.2003] Dmitry Zelenko, Chinatsu Aone, and Anthony Richardella. 2003. Kernel methods for relation extraction. The Journal of Machine Learning Research, 3:1083–1106.
[Zhou et al.2005] GuoDong Zhou, Jian Su, Jie Zhang, and Min Zhang. 2005. Exploring various knowledge in relation extraction. In Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL'05), pages 427–434, Ann Arbor, Michigan, June. ACL.
A Supplemental Material
A.1 Data and Task Settings
ACE05 defines 7 coarse-grained entity types: Facility (FAC), Geo-Political Entities (GPE), Location (LOC), Organization (ORG), Person (PER), Vehicle (VEH) and Weapon (WEA), and 6 coarse-grained relation types between entities: Artifact (ART), Gen-Affiliation (GEN-AFF), Org-Affiliation (ORG-AFF), Part-Whole (PART-WHOLE), Person-Social (PER-SOC) and Physical (PHYS). We removed the cts and un subsets, and used a 351/80/80 train/dev/test split. We removed duplicated entities and relations, and resolved nested entities. We used head spans for entities. We follow the settings of Li and Ji (2014), and, unlike Lu and Roth (2015), we did not use the full mention boundary. We use entities and relations to refer to entity mentions and relation mentions in ACE for brevity.
ACE04 defines the same 7 coarse-grained entity types as ACE05 (Doddington et al., 2004), but defines 7 coarse-grained relation types: PHYS, PER-SOC, Employment / Membership / Subsidiary (EMP-ORG), ART, PER/ORG affiliation (Other-AFF), GPE affiliation (GPE-AFF), and Discourse (DISC). We follow the cross-validation setting of Chan and Roth (2011) and Li and Ji (2014). We removed DISC and did 5-fold CV on the bnews and nwire subsets (348 documents). We use the same preprocessing and evaluation metrics as for ACE05.
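A document-level 5-fold split of the kind described for ACE04 can be sketched as below. The actual folds follow Chan and Roth (2011) and Li and Ji (2014) and are not reproduced here; the shuffling, the seed, and the function name `document_folds` are assumptions made only for illustration.

```python
import random
from typing import List, Sequence, Tuple


def document_folds(doc_ids: Sequence[str], folds: int = 5,
                   seed: int = 0) -> List[Tuple[List[str], List[str]]]:
    """Return (train, test) document lists, one pair per fold."""
    ids = list(doc_ids)
    random.Random(seed).shuffle(ids)
    # Deal documents round-robin so fold sizes differ by at most one.
    buckets = [ids[i::folds] for i in range(folds)]
    splits = []
    for held_out in range(folds):
        test = buckets[held_out]
        train = [d for i, bucket in enumerate(buckets) if i != held_out
                 for d in bucket]
        splits.append((train, test))
    return splits
```

Splitting by document rather than by sentence keeps all mentions from one document in the same fold, so no document contributes to both training and test data within a fold.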
SemEval-2010 Task 8 defines 9 relation types between nominals (Cause-Effect, Instrument-Agency, Product-Producer, Content-Container, Entity-Origin, Entity-Destination, Component-Whole, Member-Collection and Message-Topic), and a tenth type Other when two nouns have none of these relations (Hendrickx et al., 2010). We treat this Other type as a negative relation type, and no direction is considered. The dataset consists of 8,000 training and 2,717 test sentences, and each sentence is annotated with a relation between two given nominals. We randomly selected 800 sentences from the training set as our development set. We followed the official task setting, and report the official macro-averaged F1-score (Macro-F1) on the 9 relation types.
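As an illustration of macro-averaging over the 9 relation types with Other treated as the negative type, a minimal sketch follows. It is not the official scorer: labels are compared as plain strings without direction, so its output is not guaranteed to match the official Macro-F1. The names `RELATION_TYPES` and `macro_f1` are assumptions.

```python
from collections import Counter
from typing import Sequence

# The nine relation types listed above; "Other" is deliberately excluded.
RELATION_TYPES = (
    "Cause-Effect", "Instrument-Agency", "Product-Producer",
    "Content-Container", "Entity-Origin", "Entity-Destination",
    "Component-Whole", "Member-Collection", "Message-Topic",
)


def macro_f1(gold: Sequence[str], pred: Sequence[str],
             relation_types: Sequence[str] = RELATION_TYPES) -> float:
    """Average per-type F1 over `relation_types` only."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # the predicted type gains a false positive
            fn[g] += 1  # the gold type gains a false negative
    scores = []
    for rel in relation_types:
        predicted = tp[rel] + fp[rel]
        actual = tp[rel] + fn[rel]
        precision = tp[rel] / predicted if predicted else 0.0
        recall = tp[rel] / actual if actual else 0.0
        total = precision + recall
        scores.append(2 * precision * recall / total if total else 0.0)
    return sum(scores) / len(scores)
```

Errors involving Other still count against the nine types (as false positives or false negatives), but Other itself contributes no F1 term to the average.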
A.2 Hyper-parameter Settings
Here we show the hyper-parameters, with the range of values tried for each in parentheses. The hyper-parameters include the initial learning rate (5e-3, 2e-3, 1e-3, 5e-4, 2e-4, 1e-4), the regularization parameter (1e-4, 1e-5, 1e-6, 1e-7), dropout probabilities (0.0, 0.1, 0.2, 0.3, 0.4, 0.5), the size of gradient clipping (1, 5, 10, 50, 100), the scheduled sampling parameter k (1, 5, 10, 50, 100), the number of epochs for training and entity pretraining (≤ 100), and the embedding dimension of WordNet hypernyms (5, 10, 15, 20, 25, 30).
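The listed ranges can be written down as a search space, as in the sketch below. The key names, the use of a single dropout value for all layers, and random sampling as the search strategy are assumptions; the text does not say how the space was explored. The epoch setting is left out because it is not given as a list of values.

```python
import math
import random
from typing import Any, Dict, List

# Candidate values copied from the ranges listed above.
SEARCH_SPACE: Dict[str, List[Any]] = {
    "learning_rate": [5e-3, 2e-3, 1e-3, 5e-4, 2e-4, 1e-4],
    "regularization": [1e-4, 1e-5, 1e-6, 1e-7],
    "dropout": [0.0, 0.1, 0.2, 0.3, 0.4, 0.5],
    "gradient_clipping": [1, 5, 10, 50, 100],
    "scheduled_sampling_k": [1, 5, 10, 50, 100],
    "wordnet_embedding_dim": [5, 10, 15, 20, 25, 30],
}


def grid_size(space: Dict[str, List[Any]]) -> int:
    """Number of configurations in the full Cartesian grid."""
    return math.prod(len(values) for values in space.values())


def sample_config(rng: random.Random,
                  space: Dict[str, List[Any]] = SEARCH_SPACE) -> Dict[str, Any]:
    """Draw one configuration uniformly at random from the grid."""
    return {name: rng.choice(values) for name, values in space.items()}
```

Under these assumptions the full grid is large enough that exhaustive search would be costly, which is why the sketch samples configurations instead of enumerating them.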
