5 Collocations
A COLLOCATION is an expression consisting of two or more words that correspond to some conventional way of saying things. Or in the words of Firth (1957: 181): "Collocations of a given word are statements of the habitual or customary places of that word." Collocations include noun phrases like strong tea and weapons of mass destruction, phrasal verbs like to make up, and other stock phrases like the rich and powerful. Particularly interesting are the subtle and not-easily-explainable patterns of word usage that native speakers all know: why we say a stiff breeze but not ??a stiff wind (while either a strong breeze or a strong wind is okay), or why we speak of broad daylight (but not ?bright daylight or ??narrow darkness).
Collocations are characterized by limited compositionality. We call a natural language expression compositional if the meaning of the expression can be predicted from the meanings of its parts. Collocations are not fully compositional in that there is usually an element of meaning added to the combination. In the case of strong tea, strong has acquired the meaning rich in some active agent, which is closely related to, but slightly different from, the basic sense having great physical strength. Idioms are the most extreme examples of non-compositionality. Idioms like to kick the bucket or to hear it through the grapevine only have an indirect historical relationship to the meanings of the parts of the expression. We are not talking about buckets or grapevines literally when we use these idioms. Most collocations exhibit milder forms of non-compositionality, like the expression international best practice that we used as an example earlier in this book. It is very nearly a systematic composition of its parts, but still has an element of added meaning. It usually refers to administrative efficiency and would, for example, not be used to describe a cooking technique, although that meaning would be compatible with its literal meaning.
There is considerable overlap between the concept of collocation and notions like term, technical term, and terminological phrase. As these names suggest, the latter three are commonly used when collocations are extracted from technical domains (in a process called terminology extraction). The reader should be warned, though, that the word term has a different meaning in information retrieval. There, it refers to both words and phrases, so it subsumes the narrower meaning that we will use in this chapter.
Collocations are important for a number of applications: natural language generation (to make sure that the output sounds natural and mistakes like powerful tea or to take a decision are avoided), computational lexicography (to automatically identify the important collocations to be listed in a dictionary entry), parsing (so that preference can be given to parses with natural collocations), and corpus linguistic research (for instance, the study of social phenomena like the reinforcement of cultural stereotypes through language (Stubbs 1996)).
There is much interest in collocations partly because this is an area that has been neglected in structural linguistic traditions that follow Saussure and Chomsky. There is, however, a tradition in British linguistics, associated with the names of Firth, Halliday, and Sinclair, which pays close attention to phenomena like collocations. Structural linguistics concentrates on general abstractions about the properties of phrases and sentences. In contrast, Firth's Contextual Theory of Meaning emphasizes the importance of context: the context of the social setting (as opposed to the idealized speaker), the context of spoken and textual discourse (as opposed to the isolated sentence), and, important for collocations, the context of surrounding words (hence Firth's famous dictum that a word is characterized by the company it keeps). These contextual features easily get lost in the abstract treatment that is typical of structural linguistics.
A good example of the type of problem that is seen as important in this contextual view of language is Halliday's example of strong vs. powerful tea (Halliday 1966: 150). It is a convention in English to talk about strong tea, not powerful tea, although any speaker of English would also understand the latter unconventional expression. Arguably, there are no interesting structural properties of English that can be gleaned from this contrast. However, the contrast may tell us something interesting about attitudes towards different types of substances in our culture (why do we use powerful for drugs like heroin, but not for cigarettes, tea and coffee?) and it is obviously important to teach this contrast to students who want to learn idiomatically correct English. Social implications of language use and language teaching are just the type of problem that British linguists following a Firthian approach are interested in.
In this chapter, we will introduce the principal approaches to finding collocations: selection of collocations by frequency, selection based on mean and variance of the distance between focal word and collocating word, hypothesis testing, and mutual information. We will then return to the question of what a collocation is and discuss in more depth different definitions that have been proposed and tests for deciding whether a phrase is a collocation or not. The chapter concludes with further readings and pointers to some of the literature that we were not able to include.

The reference corpus we will use in examples in this chapter consists of four months of the New York Times newswire: from August through November of 1990. This corpus has about 115 megabytes of text and roughly 14 million words. Each approach will be applied to this corpus to make comparison easier. For most of the chapter, the New York Times examples will only be drawn from fixed two-word phrases (or bigrams). It is important to keep in mind, however, that we chose this pool for convenience only. In general, both fixed and variable word combinations can be collocations. Indeed, the section on mean and variance looks at the more loosely connected type.
5.1 Frequency
Surely the simplest method for finding collocations in a text corpus is counting. If two words occur together a lot, then that is evidence that they have a special function that is not simply explained as the function that results from their combination.

Predictably, just selecting the most frequently occurring bigrams is not very interesting, as is shown in Table 5.1. The table shows the bigrams (sequences of two adjacent words) that are most frequent in the corpus and their frequency. Except for New York, all the bigrams are pairs of function words.
There is, however, a very simple heuristic that improves these results a lot (Justeson and Katz 1995b): pass the candidate phrases through a part-of-speech filter which only lets through those patterns that are likely to be "phrases".¹ Justeson and Katz (1995b: 17) suggest the patterns in Table 5.2. Each is followed by an example from the text that they use as a test set. In these patterns A refers to an adjective, P to a preposition, and N to a noun.

Table 5.3 shows the most highly ranked phrases after applying the filter. The results are surprisingly good. There are only 3 bigrams that we would not regard as non-compositional phrases: last year, last week, and first time.
1. Similar ideas can be found in (Ross and Tukey 1975) and (Kupiec et al. 1995).
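The heuristic is simple enough to sketch in code. The fragment below is not Justeson and Katz's implementation, just a minimal illustration of the idea; it assumes the input is already a list of (word, tag) pairs using the A/N/P labels of Table 5.2 and handles only the two bigram patterns.

```python
from collections import Counter

# Only the two bigram patterns of Table 5.2; longer patterns (A A N, N P N, ...)
# would be handled analogously by scanning trigrams.
PATTERNS = {("A", "N"), ("N", "N")}

def candidate_bigrams(tagged_corpus, patterns=PATTERNS):
    """Count adjacent word pairs whose tag sequence matches a licensed pattern."""
    counts = Counter()
    for (w1, t1), (w2, t2) in zip(tagged_corpus, tagged_corpus[1:]):
        if (t1, t2) in patterns:
            counts[(w1, w2)] += 1
    return counts

# Usage: rank candidates by raw frequency, as in Table 5.3.
tagged = [("New", "A"), ("York", "N"), ("prices", "N"), ("fell", "V"),
          ("in", "P"), ("New", "A"), ("York", "N")]
for pair, c in candidate_bigrams(tagged).most_common():
    print(c, " ".join(pair))
```

Note that this bigram-only version can produce artefacts such as York prices for the input above; searching for the longest matching tag sequence instead avoids them (compare the discussion of York City below).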
C(w1 w2)   w1 w2
80871      of the
58841      in the
26430      to the
21842      on the
21839      for the
18568      and the
16121      that the
15630      at the
15494      to be
13899      in a
13689      of a
13361      by the
13183      with the
12622      from the
11428      New York
10007      he said
 9775      as a
 9231      is a
 8753      has been
 8573      for a

Table 5.1 Finding collocations: raw frequency. C(·) is the frequency of something in the corpus.
Tag Pattern Example
AN linear function
NN regression coefficients
AAN Gaussian random variable
ANN cumulative distribution function
NAN mean squared error
NNN class probability function
NPN degrees of freedom
Table 5.2 Part of speech tag patterns for collocation filtering. These patterns were used by Justeson and Katz to identify likely collocations among frequently occurring word sequences.
C(w1 w2)   w1 w2              tag pattern
11487      New York           A N
 7261      United States      A N
 5412      Los Angeles        N N
 3301      last year          A N
 3191      Saudi Arabia       N N
 2699      last week          A N
 2514      vice president     A N
 2378      Persian Gulf       A N
 2161      San Francisco      N N
 2106      President Bush     N N
 2001      Middle East        A N
 1942      Saddam Hussein     N N
 1867      Soviet Union       A N
 1850      White House        A N
 1633      United Nations     A N
 1337      York City          N N
 1328      oil prices         N N
 1210      next year          A N
 1074      chief executive    A N
 1073      real estate        A N

Table 5.3 Finding collocations: Justeson and Katz' part-of-speech filter.
York City is an artefact of the way we have implemented the Justeson and Katz filter. The full implementation would search for the longest sequence that fits one of the part-of-speech patterns and would thus find the longer phrase New York City, which contains York City.

The twenty highest ranking phrases containing strong and powerful all have the form A N (where A is either strong or powerful). We have listed them in Table 5.4.

Again, given the simplicity of the method, these results are surprisingly accurate. For example, they give evidence that strong challenge and powerful computers are correct whereas powerful challenge and strong computers are not. However, we can also see the limits of a frequency-based method. The nouns man and force are used with both adjectives (strong force occurs further down the list with a frequency of 4). A more sophisticated analysis is necessary in such cases.

Neither strong tea nor powerful tea occurs in our New York Times corpus.
w             C(strong w)      w            C(powerful w)
support          50            force            13
safety           22            computers        10
sales            21            position          8
opposition       19            men               8
showing          18            computer          8
sense            18            man               7
message          15            symbol            6
defense          14            military          6
gains            13            machines          6
evidence         13            country           6
criticism        13            weapons           5
possibility      11            post              5
feelings         11            people            5
demand           11            nation            5
challenges       11            forces            5
challenge        11            chip              5
case             11            Germany           5
supporter        10            senators          4
signal            9            neighbor          4
man               9            magnet            4

Table 5.4 The nouns w occurring most often in the patterns "strong w" and "powerful w".
However, searching the larger corpus of the World Wide Web we find 799 examples of strong tea and 17 examples of powerful tea (the latter mostly in the computational linguistics literature on collocations), which indicates that the correct phrase is strong tea.²

Justeson and Katz' method of collocation discovery is instructive in that it demonstrates an important point. A simple quantitative technique (the frequency filter in this case) combined with a small amount of linguistic knowledge (the importance of parts of speech) goes a long way. In the rest of this chapter, we will use a stop list that excludes words whose most frequent tag is not a verb, noun or adjective.
Exercise 5-1
Add part-of-speech patterns useful for collocation discovery to Table 5.2, including
patterns longer than two tags.
2. This search was performed on AltaVista on March 28, 1998.
Sentence: Stocks crash as rescue plan teeters
Bigrams:
stocks crash stocks as stocks rescue
crash as crash rescue crash plan
as rescue as plan as teeters
rescue plan rescue teeters
plan teeters
Figure 5.1 Using a three word collocational window to capture bigrams at a distance.
Exercise 5-2
Pick a document in which your name occurs (an email, a university transcript or a
letter). Does Justeson and Katz’s filter identify your name as a collocation?
Exercise 5-3
We used the World Wide Web as an auxiliary corpus above because neither strong tea nor powerful tea occurred in the New York Times. Modify Justeson and Katz's method so that it uses the World Wide Web as a resource of last resort.
5.2 Mean and Variance
Frequency-based search works well for fixed phrases. But many collocations consist of two words that stand in a more flexible relationship to one another. Consider the verb knock and one of its most frequent arguments, door. Here are some examples of knocking on or at a door from our corpus:
(5.1) a. she knocked on his door
b. they knocked at the door
c. 100 women knocked on Donaldson’s door
d. a man knocked on the metal front door
The words that appear between knocked and door vary and the distance between the two words is not constant, so a fixed phrase approach would not work here. But there is enough regularity in the patterns to allow us to determine that knock is the right verb to use in English for this situation, not hit, beat or rap.
A short note is in order here on collocations that occur as a fixed phrase versus those that are more variable. To simplify matters we only look at fixed phrase collocations in most of this chapter, and usually at just bigrams. But it is easy to see how to extend techniques applicable to bigrams to bigrams at a distance. We define a collocational window (usually a window of 3 to 4 words on each side of a word), and we enter every word pair in there as a collocational bigram, as in Figure 5.1. We then proceed to do our calculations as usual on this larger pool of bigrams.
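As a small illustration, here is one way the window-based extraction of Figure 5.1 might be coded; the window size of three and the whitespace tokenization are assumptions made for the example.

```python
def window_bigrams(tokens, window=3):
    """Pair each word with every word up to `window` positions to its right,
    as in the three-word collocational window of Figure 5.1."""
    pairs = []
    for i, w1 in enumerate(tokens):
        for w2 in tokens[i + 1 : i + 1 + window]:
            pairs.append((w1, w2))
    return pairs

# Produces the word pairs listed in Figure 5.1 (modulo lowercasing).
print(window_bigrams("Stocks crash as rescue plan teeters".split()))
```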
However, the mean and variance based methods described in this section by definition look at the pattern of varying distance between two words. If that pattern of distances is relatively predictable, then we have evidence for a collocation like knock . . . door that is not necessarily a fixed phrase. We will return to this point and a more in-depth discussion of what a collocation is towards the end of this chapter.
One way of discovering the relationship between knocked and door is to compute the mean and variance of the offsets (signed distances) between the two words in the corpus. The mean is simply the average offset. For the examples in (5.1), we compute the mean offset between knocked and door as follows:

\frac{1}{4}(3 + 3 + 5 + 5) = 4.0

(This assumes a tokenization of Donaldson's as three words Donaldson, apostrophe, and s, which is what we actually did.) If there was an occurrence of door before knocked, then it would be entered as a negative number. For example, −3 for the door that she knocked on. We restrict our analysis to positions in a window of size 9 around the focal word knocked.
The variance measures how much the individual offsets deviate from the mean. We estimate it as follows:

\sigma^2 = \frac{\sum_{i=1}^{n} (d_i - \mu)^2}{n - 1}    (5.2)

where n is the number of times the two words co-occur, d_i is the offset for co-occurrence i, and μ is the mean. If the offset is the same in all cases, then the variance is zero. If the offsets are randomly distributed (which will be the case for two words which occur together by chance, but not in a particular relationship), then the variance will be high. As is customary, we use the standard deviation σ = √σ², the square root of the variance, to assess how variable the offset between two words is. The standard deviation for the four examples of knocked / door in the above case is 1.15:

\sigma = \sqrt{\frac{1}{3}\left[(3 - 4.0)^2 + (3 - 4.0)^2 + (5 - 4.0)^2 + (5 - 4.0)^2\right]} \approx 1.15
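A minimal sketch of this computation, assuming we have already collected the signed offsets between the two words (here the four knocked / door offsets from the examples in (5.1)):

```python
from math import sqrt

def offset_stats(offsets):
    """Sample mean and standard deviation of the signed offsets, as in (5.2)."""
    n = len(offsets)
    mean = sum(offsets) / n
    variance = sum((d - mean) ** 2 for d in offsets) / (n - 1)
    return mean, sqrt(variance)

print(offset_stats([3, 3, 5, 5]))  # (4.0, 1.1547...), i.e. mean 4.0 and sd of about 1.15
```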
The mean and standard deviation characterize the distribution of distances between two words in a corpus. We can use this information to discover collocations by looking for pairs with low standard deviation. A low standard deviation means that the two words usually occur at about the same distance. Zero standard deviation means that the two words always occur at exactly the same distance.
We can also explain the information that variance gets at in terms of peaks in the distribution of one word with respect to another. Figure 5.2 shows the three cases we are interested in. The distribution of strong with respect to opposition has one clear peak at position −1 (corresponding to the phrase strong opposition). Therefore the variance of strong with respect to opposition is small (σ = 0.67). The mean of −1.15 indicates that strong usually occurs at position −1 (disregarding the noise introduced by one occurrence at −4).

We have restricted positions under consideration to a window of size 9 centered around the word of interest. This is because collocations are essentially a local phenomenon. Note also that we always get a count of 0 at position 0 when we look at the relationship between two different words. This is because, for example, strong cannot appear in position 0 in contexts in which that position is already occupied by opposition.
Moving on to the second diagram in Figure 5.2, the distribution of strong with respect to support is drawn out, with several negative positions having large counts. For example, the count of approximately 20 at position −2 is due to uses like strong leftist support and strong business support. Because of this greater variability we get a higher σ (1.07) and a mean that is between positions −1 and −2 (−1.45).

Finally, the occurrences of strong with respect to for are more evenly distributed. There is a tendency for strong to occur before for (hence the negative mean of −1.12), but it can pretty much occur anywhere around for. The high standard deviation of σ = 2.15 indicates this randomness. This indicates that for and strong don't form interesting collocations.
The word pairs in Table 5.5 indicate the types of collocations that can be found by this approach. If the mean is close to 1.0 and the standard deviation low, as is the case for New York, then we have the type of phrase that Justeson and Katz' frequency-based approach will also discover. If the mean is much greater than 1.0, then a low standard deviation indicates an interesting phrase. The pair previous / games (distance 2) corresponds to phrases like in the previous 10 games or in the previous 15 games; minus / points corresponds to phrases like minus 2 percentage points, minus 3 percentage points etc.; hundreds / dollars corresponds to hundreds of billions of dollars and hundreds of millions of dollars.

High standard deviation indicates that the two words of the pair stand in no interesting relationship, as demonstrated by the four high-variance examples in Table 5.5.
Figure 5.2 Histograms of the position of strong relative to three words: opposition (μ = −1.15, σ = 0.67), support (μ = −1.45, σ = 1.07), and for (μ = −1.12, σ = 2.15). Each histogram plots the frequency of strong at positions −4 through 4 relative to the other word.
σ      μ      Count   Word 1        Word 2
0.43   0.97   11657   New           York
0.48   1.83      24   previous      games
0.15   2.98      46   minus         points
0.49   3.87     131   hundreds      dollars
4.03   0.44      36   editorial     Atlanta
4.03   0.00      78   ring          New
3.96   0.19     119   point         hundredth
3.96   0.29     106   subscribers   by
1.07   1.45      80   strong        support
1.13   2.57       7   powerful      organizations
1.01   2.00     112   Richard       Nixon
1.05   0.00      10   Garrison      said

Table 5.5 Finding collocations based on mean and variance. Standard deviation σ and mean μ of the distances between 12 word pairs.
Note that means tend to be close to zero here, as one would expect for a uniform distribution. More interesting are the cases in between, word pairs that have large counts for several distances in their collocational distribution. We already saw the example of strong { business } support in Figure 5.2. The alternations captured in the other three medium-variance examples are powerful { lobbying } organizations, Richard { M. } Nixon, and Garrison said / said Garrison (remember that we tokenize Richard M. Nixon as four tokens: Richard, M, ., Nixon).
The method of variance-based collocation discovery that we have introduced in this section is due to Smadja. We have simplified things somewhat. In particular, Smadja (1993) uses an additional constraint that filters out "flat" peaks in the position histogram, that is, peaks that are not surrounded by deep valleys (an example is at −2 for the combination strong / for in Figure 5.2). Smadja (1993) shows that the method is quite successful at terminological extraction (with an estimated accuracy of 80%) and at determining appropriate phrases for natural language generation (Smadja and McKeown 1990).

Smadja's notion of collocation is less strict than many others'. The combination knocked / door is probably not a collocation we want to classify as terminology – although it may be very useful to identify for the purpose of text generation. Variance-based collocation discovery is the appropriate method if we want to find this type of word combination: combinations of words that are in a looser relationship than fixed phrases and that are variable with respect to intervening material and relative position.
5.3 Hypothesis Testing
One difficulty that we have glossed over so far is that high frequency and low variance can be accidental. If the two constituent words of a frequent bigram like new companies are frequently occurring words (as new and companies are), then we expect the two words to co-occur a lot just by chance, even if they do not form a collocation.

What we really want to know is whether two words occur together more often than chance. Assessing whether or not something is a chance event is one of the classical problems of statistics. It is usually couched in terms of hypothesis testing. We formulate a null hypothesis H0 that there is no association between the words beyond chance occurrences, compute the probability p that the event would occur if H0 were true, and then reject H0 if p is too low (typically if beneath a significance level of p < 0.05, 0.01, 0.005, or 0.001) and retain H0 as possible otherwise.³
It is important to note that this is a mode of data analysis where we look at two things at the same time. As before, we are looking for particular patterns in the data. But we are also taking into account how much data we have seen. Even if there is a remarkable pattern, we will discount it if we haven't seen enough data to be certain that it couldn't be due to chance.
How can we apply the methodology of hypothesis testing to the problem of finding collocations? We first need to formulate a null hypothesis which states what should be true if two words do not form a collocation. For such a free combination of two words we will assume that each of the words w1 and w2 is generated completely independently of the other, and so their chance of coming together is simply given by:

P(w_1 w_2) = P(w_1) P(w_2)

The model implies that the probability of co-occurrence is just the product of the probabilities of the individual words. As we discuss at the end of this section, this is a rather simplistic model, and not empirically accurate, but for now we adopt independence as our null hypothesis.
3. Significance at a level of 0.05 is the weakest evidence that is normally accepted in the experimental sciences. The large amounts of data commonly available for Statistical NLP tasks mean that we can often expect to achieve greater levels of significance.
5.3.1 The t test
Next we need a statistical test that tells us how probable or improbable it is that a certain constellation will occur. A test that has been widely used for collocation discovery is the t test. The t test looks at the mean and variance of a sample of measurements, where the null hypothesis is that the sample is drawn from a distribution with mean μ. The test looks at the difference between the observed and expected means, scaled by the variance of the data, and tells us how likely one is to get a sample of that mean and variance (or a more extreme mean and variance) assuming that the sample is drawn from a normal distribution with mean μ. To determine the probability of getting our sample (or a more extreme sample), we compute the t statistic:

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}}    (5.3)

where x̄ is the sample mean, s² is the sample variance, N is the sample size, and μ is the mean of the distribution. If the t statistic is large enough we can reject the null hypothesis. We can find out exactly how large it has to be by looking up the table of the t distribution we have compiled in the appendix (or by using the better tables in a statistical reference book, or by using appropriate computer software).
Here’s an example of applying thettest. Our null hypothesis is that
the mean height of a population of men is 158cm. We are given a sampleof 200 men with/#16 x /= /1/6/9and s
/2/= /2/6/0/0 and want to know whether this
sample is from the general population (the null hypothesis) or whether it
is from a different population of smaller men. This gives us the followingtaccording to the above formula:t /=
/1/6/9 /, /1/5/8q/2/6/0/0/2/0/0
/#19 /3 /: /0/5
If you look up the value of tthat corresponds to a confidence level of/#0B /= /0 /: /0/0/5, you will find /2 /: /5/7/6.4Since the twe got is larger than /2 /: /5/7/6,
we can reject the null hypothesis with 99.5% confidence. So we can saythat the sample is not drawn from a population with mean 158cm, and ourprobability of error is less than 0.5%.
To see how to use thettest for finding collocations, let us compute thetvalue for new companies . What is the sample that we are measuring the
4. A sample of 200 means 199 degress of freedom, which corresponds to about the same tas/1degrees of freedom. This is the row of the table where we looked up /2 /: /5/7/6.
154 5 Collocations
mean and variance of? There is a standard way of extending the ttest
for use with proportions or counts. We think of the text corpus as a longsequence ofNbigrams, and the samples are then indicator random vari-
ables that take on the value 1 when the bigram of interest occurs, and are 0otherwise.
Using maximum likelihood estimates, we can compute the probabilities of new and companies as follows. In our corpus, new occurs 15,828 times, companies 4,675 times, and there are 14,307,668 tokens overall.

P(new) = \frac{15828}{14307668} \qquad P(companies) = \frac{4675}{14307668}

The null hypothesis is that occurrences of new and companies are independent:

H_0: \quad P(new\ companies) = P(new)\, P(companies) = \frac{15828}{14307668} \times \frac{4675}{14307668} \approx 3.615 \times 10^{-7}

If the null hypothesis is true, then the process of randomly generating bigrams of words and assigning 1 to the outcome new companies and 0 to any other outcome is in effect a Bernoulli trial with p = 3.615 × 10⁻⁷ for the probability of new companies turning up. The mean for this distribution is μ = 3.615 × 10⁻⁷ and the variance is σ² = p(1 − p) (see Section 2.1.9), which is approximately p. The approximation σ² = p(1 − p) ≈ p holds since for most bigrams p is small.

It turns out that there are actually 8 occurrences of new companies among the 14,307,668 bigrams in our corpus. So, for the sample, the sample mean is x̄ = 8/14307668 ≈ 5.591 × 10⁻⁷. Now we have everything we need to apply the t test:

t = \frac{\bar{x} - \mu}{\sqrt{s^2 / N}} \approx \frac{5.591 \times 10^{-7} - 3.615 \times 10^{-7}}{\sqrt{5.591 \times 10^{-7} / 14307668}} \approx 0.999932

This t value of 0.999932 is not larger than 2.576, the critical value for α = 0.005. So we cannot reject the null hypothesis that new and companies occur independently and do not form a collocation. That seems the right result here: the phrase new companies is completely compositional and there is no element of added meaning that would justify elevating it to the status of collocation. (The t value is suspiciously close to 1.0, but that is a coincidence; see Exercise 5-5.)
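The whole calculation for new companies fits in a few lines. The sketch below simply restates equation (5.3) with the approximation s² ≈ x̄ used above; the function and its argument order are our own.

```python
from math import sqrt

def t_score(c1, c2, c12, N):
    """t statistic for the bigram w1 w2, where H0 is P(w1 w2) = P(w1) P(w2)
    and the sample variance is approximated by the sample mean."""
    mu = (c1 / N) * (c2 / N)   # expected bigram probability under independence
    x_bar = c12 / N            # observed bigram probability (MLE)
    return (x_bar - mu) / sqrt(x_bar / N)

# new companies: C(new) = 15828, C(companies) = 4675, C(new companies) = 8
print(t_score(15828, 4675, 8, 14307668))  # about 0.9999, as computed above
```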
t        C(w1)   C(w2)   C(w1 w2)   w1 w2
4.4721      42      20      20      Ayatollah Ruhollah
4.4721      41      27      20      Bette Midler
4.4720      30     117      20      Agatha Christie
4.4720      77      59      20      videocassette recorder
4.4720      24     320      20      unsalted butter
2.3714   14907    9017      20      first made
2.2446   13484   10570      20      over many
1.3685   14734   13478      20      into them
1.2176   14093   14776      20      like people
0.8036   15019   15629      20      time last

Table 5.6 Finding collocations: The t test applied to 10 bigrams that occur with frequency 20.
Table 5.6 shows t values for ten bigrams that occur exactly 20 times in the corpus. For the top five bigrams, we can reject the null hypothesis that the component words occur independently for α = 0.005, so these are good candidates for collocations. The bottom five bigrams fail the test for significance, so we will not regard them as good candidates for collocations.
Note that a frequency-based method would not be able to rank the ten bigrams since they occur with exactly the same frequency. Looking at the counts in Table 5.6, we can see that the t test takes into account the number of co-occurrences of the bigram (C(w1 w2)) relative to the frequencies of the component words. If a high proportion of the occurrences of both words (Ayatollah Ruhollah, videocassette recorder) or at least a very high proportion of the occurrences of one of the words (unsalted) occurs in the bigram, then its t value is high. This criterion makes intuitive sense.
Unlike most of this chapter, the analysis in Table 5.6 includes some stop words – without stop words, it is actually hard to find examples that fail significance. It turns out that most bigrams attested in a corpus occur significantly more often than chance. For 824 out of the 831 bigrams that occurred 20 times in our corpus the null hypothesis of independence can be rejected. But we would only classify a fraction as true collocations. The reason for this surprisingly high proportion of possibly dependent bigrams (824/831 ≈ 0.99) is that language – if compared with a random word generator – is very regular, so that few completely unpredictable events happen. Indeed, this is the basis of our ability to perform tasks like word sense disambiguation and probabilistic parsing that we discuss in other chapters.
The t test and other statistical tests are most useful as a method for ranking collocations. The level of significance itself is less useful. In fact, in most publications that we cite in this chapter, the level of significance is never looked at. All that is used is the scores and the resulting ranking.
5.3.2 Hypothesis testing of differences
The t test can also be used for a slightly different collocation discovery problem: to find words whose co-occurrence patterns best distinguish between two words. For example, in computational lexicography we may want to find the words that best differentiate the meanings of strong and powerful. This use of the t test was suggested by Church and Hanks (1989). Table 5.7 shows the ten words that occur most significantly more often with powerful than with strong (first ten words) and most significantly more often with strong than with powerful (second set of ten words).
The t scores are computed using the following extension of the t test to the comparison of the means of two normal populations:

t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (5.4)

Here the null hypothesis is that the average difference is 0 (μ = 0), so we have x̄ − μ = x̄ = (1/N) Σ (x_{1i} − x_{2i}) = x̄_1 − x̄_2. In the denominator we add the variances of the two populations since the variance of the difference of two random variables is the sum of their individual variances.

Now we can explain Table 5.7. The t values in the table were computed assuming a Bernoulli distribution (as we did for the basic version of the t test that we introduced first). If w is the collocate of interest (e.g., computers or symbol) and v1 and v2 are the words we are comparing (e.g., powerful and strong), then we have x̄_1 = s_1² = P(v1 w) and x̄_2 = s_2² = P(v2 w). We again use the approximation s² = p − p² ≈ p:

t \approx \frac{P(v_1 w) - P(v_2 w)}{\sqrt{\frac{P(v_1 w) + P(v_2 w)}{N}}}

We can simplify this as follows:

t \approx \frac{\frac{C(v_1 w)}{N} - \frac{C(v_2 w)}{N}}{\sqrt{\frac{C(v_1 w) + C(v_2 w)}{N^2}}} = \frac{C(v_1 w) - C(v_2 w)}{\sqrt{C(v_1 w) + C(v_2 w)}}    (5.5)

where C(x) is the number of times x occurs in the corpus.
t        C(w)   C(strong w)   C(powerful w)   word
3.1622    933        0             10         computers
2.8284   2337        0              8         computer
2.4494    289        0              6         symbol
2.4494    588        0              6         machines
2.2360   2266        0              5         Germany
2.2360   3745        0              5         nation
2.2360    395        0              5         chip
2.1828   3418        4             13         force
2.0000   1403        0              4         friends
2.0000    267        0              4         neighbor
7.0710   3685       50              0         support
6.3257   3616       58              7         enough
4.6904    986       22              0         safety
4.5825   3741       21              0         sales
4.0249   1093       19              1         opposition
3.9000    802       18              1         showing
3.9000   1641       18              1         sense
3.7416   2501       14              0         defense
3.6055    851       13              0         gains
3.6055    832       13              0         criticism

Table 5.7 Words that occur significantly more often with powerful (the first ten words) and strong (the last ten words).
The application suggested by Church and Hanks (1989) for this form of the t test was lexicography. The data in Table 5.7 are useful to a lexicographer who wants to write precise dictionary entries that bring out the difference between strong and powerful. Based on significant collocates, Church and Hanks analyze the difference as a matter of intrinsic vs. extrinsic quality. For example, strong support from a demographic group means that the group is very committed to the cause in question, but the group may not have any power. So strong describes an intrinsic quality. Conversely, a powerful supporter is somebody who actually has the power to move things. Many of the collocates we found in our corpus support Church and Hanks' analysis. But there is more complexity to the difference in meaning between the two words since what is extrinsic and intrinsic can depend on subtle matters like cultural attitudes.
                  w1 = new                 w1 ≠ new
w2 = companies        8                      4667
                  (new companies)          (e.g., old companies)
w2 ≠ companies    15820                    14287181
                  (e.g., new machines)     (e.g., old machines)

Table 5.8 A 2-by-2 table showing the dependence of occurrences of new and companies. There are 8 occurrences of new companies in the corpus, 4667 bigrams where the second word is companies, but the first word is not new, 15,820 bigrams with the first word new and a second word different from companies, and 14,287,181 bigrams that contain neither word in the appropriate position.
For example, we talk about strong tea on the one hand and powerful drugs on the other, a difference that tells us more about our attitude towards tea and drugs than about the semantics of the two adjectives (Church et al. 1991: 133).
5.3.3 Pearson’s chi-square test
Use of the t test has been criticized because it assumes that probabilities are approximately normally distributed, which is not true in general (Church and Mercer 1993: 20). An alternative test for dependence which does not assume normally distributed probabilities is the χ² test (pronounced "chi-square test"). In the simplest case, the χ² test is applied to 2-by-2 tables like Table 5.8. The essence of the test is to compare the observed frequencies in the table with the frequencies expected for independence. If the difference between observed and expected frequencies is large, then we can reject the null hypothesis of independence.
Table 5.8 shows the distribution of new and companies in the reference corpus that we introduced earlier. Recall that C(new) = 15,828, C(companies) = 4,675, C(new companies) = 8, and that there are 14,307,668 tokens in the corpus. That means that the number of bigrams w_i w_{i+1} with the first token not being new and the second token being companies is 4667 = 4675 − 8. The two cells in the bottom row are computed in a similar way.
The χ² statistic sums the differences between observed and expected values in all squares of the table, scaled by the magnitude of the expected values, as follows:

X^2 = \sum_{i,j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}}    (5.6)

where i ranges over rows of the table, j ranges over columns, O_{ij} is the observed value for cell (i, j) and E_{ij} is the expected value.
One can show that the quantity X² is asymptotically χ² distributed. In other words, if the numbers are large, then X² has a χ² distribution. We will return to the issue of how good this approximation is later.

The expected frequencies E_{ij} are computed from the marginal probabilities, that is, from the totals of the rows and columns converted into proportions. For example, the expected frequency for cell (1,1) (new companies) would be the marginal probability of new occurring as the first part of a bigram times the marginal probability of companies occurring as the second part of a bigram (multiplied by the number of bigrams in the corpus):

\frac{8 + 4667}{N} \times \frac{8 + 15820}{N} \times N \approx 5.2

That is, if new and companies occurred completely independently of each other we would expect 5.2 occurrences of new companies on average for a text of the size of our corpus.
The χ² test can be applied to tables of any size, but it has a simpler form for 2-by-2 tables (see Exercise 5-9):

\chi^2 = \frac{N (O_{11} O_{22} - O_{12} O_{21})^2}{(O_{11} + O_{12})(O_{11} + O_{21})(O_{12} + O_{22})(O_{21} + O_{22})}    (5.7)

This formula gives the following χ² value for Table 5.8:

\frac{14307668\, (8 \times 14287181 - 4667 \times 15820)^2}{(8 + 4667)(8 + 15820)(4667 + 14287181)(15820 + 14287181)} \approx 1.55

Looking up the χ² distribution in the appendix, we find that at a probability level of α = 0.05 the critical value is χ² = 3.841 (the statistic has one degree of freedom for a 2-by-2 table). So we cannot reject the null hypothesis that new and companies occur independently of each other. Thus new companies is not a good candidate for a collocation.
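The 2-by-2 form (5.7) can be computed directly from the four observed cell counts. A sketch, checked against the counts of Table 5.8:

```python
def chi_square_2x2(o11, o12, o21, o22):
    """Pearson's chi-square statistic for a 2-by-2 contingency table, equation (5.7)."""
    n = o11 + o12 + o21 + o22
    numerator = n * (o11 * o22 - o12 * o21) ** 2
    denominator = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return numerator / denominator

# Table 5.8: O11 = 8 (new companies), O12 = 4667, O21 = 15820, O22 = 14287181
print(chi_square_2x2(8, 4667, 15820, 14287181))  # about 1.55, below the critical value 3.841
```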
This result is the same as we got with the t statistic. In general, for the problem of finding collocations, the differences between the t statistic and the χ² statistic do not seem to be large. For example, the 20 bigrams with the highest t scores in our corpus are also the 20 bigrams with the highest χ² scores.

However, the χ² test is also appropriate for large probabilities, for which the normality assumption of the t test fails. This is perhaps the reason that the χ² test has been applied to a wider range of problems in collocation discovery.
          cow      ¬cow
vache      59         6
¬vache      8    570934

Table 5.9 Correspondence of vache and cow in an aligned corpus. By applying the χ² test to this table one can determine whether vache and cow are translations of each other.
         corpus 1   corpus 2
word 1       60          9
word 2      500         76
word 3      124         20
...

Table 5.10 Testing for the independence of words in different corpora using χ². This test can be used as a metric for corpus similarity.
One of the early uses of the χ² test in Statistical NLP was the identification of translation pairs in aligned corpora (Church and Gale 1991b).⁵ The data in Table 5.9 (from a hypothetical aligned corpus) strongly suggest that vache is the French translation of English cow. Here, 59 is the number of aligned sentence pairs which have cow in the English sentence and vache in the French sentence, etc. The χ² value is very high here: χ² = 456400. So we can reject the null hypothesis that cow and vache occur independently of each other with high confidence. This pair is a good candidate for a translation pair.
An interesting application of χ² is as a metric for corpus similarity (Kilgarriff and Rose 1998). Here we compile an n-by-two table for a large n, for example n = 500. The two columns correspond to the two corpora. Each row corresponds to a particular word. This is schematically shown in Table 5.10. If the ratios of the counts are about the same (as is the case in Table 5.10: each word occurs roughly 6 times more often in corpus 1 than in corpus 2), then we cannot reject the null hypothesis that both corpora are drawn from the same underlying source. We can interpret this as a high degree of similarity. On the other hand, if the ratios vary wildly, then the X² score will be high and we have evidence for a high degree of dissimilarity.

5. They actually use a measure they call φ², which is X² multiplied by N. They do this since they are only interested in ranking translation pairs, so that assessment of significance is not important.
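A sketch of this kind of comparison, applying equation (5.6) to an n-by-2 table of word counts; the tiny three-word table below is made-up illustration data, not the 500-word table that Kilgarriff and Rose use.

```python
def chi_square(table):
    """General X^2 statistic of equation (5.6) for a contingency table given as a list of rows."""
    n = sum(sum(row) for row in table)
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    x2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            x2 += (observed - expected) ** 2 / expected
    return x2

# One row per word, one column per corpus (cf. Table 5.10).
similar_corpora = [[60, 9], [500, 76], [124, 20]]
print(chi_square(similar_corpora))  # small value: no evidence that the corpora differ
```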
                                             H1                         H2
P(w2 | w1)                                   p = c2 / N                 p1 = c12 / c1
P(w2 | ¬w1)                                  p = c2 / N                 p2 = (c2 − c12) / (N − c1)
c12 out of c1 bigrams are w1 w2              b(c12; c1, p)              b(c12; c1, p1)
c2 − c12 out of N − c1 bigrams are ¬w1 w2    b(c2 − c12; N − c1, p)     b(c2 − c12; N − c1, p2)

Table 5.11 How to compute Dunning's likelihood ratio test. For example, the likelihood of hypothesis H2 is the product of the last two lines in the rightmost column.
Just as application of the t test is problematic because of the underlying normality assumption, so is application of χ² in cases where the numbers in the 2-by-2 table are small. Snedecor and Cochran (1989: 127) advise against using χ² if the total sample size is smaller than 20 or if it is between 20 and 40 and the expected value in any of the cells is 5 or less. In general, the test as described here can be inaccurate if expected cell values are small (Read and Cressie 1988), a problem we will return to below.
5.3.4 Likelihood Ratios
Likelihood ratios are another approach to hypothesis testing. We will see below that they are more appropriate for sparse data than the χ² test. But they also have the advantage that the statistic we are computing, a likelihood ratio, is more interpretable than the X² statistic. It is simply a number that tells us how much more likely one hypothesis is than the other.
In applying the likelihood ratio test to collocation discovery, we examine the following two alternative explanations for the occurrence frequency of a bigram w1 w2 (Dunning 1993):

• Hypothesis 1. P(w2 | w1) = p = P(w2 | ¬w1)
• Hypothesis 2. P(w2 | w1) = p1 ≠ p2 = P(w2 | ¬w1)

Hypothesis 1 is a formalization of independence (the occurrence of w2 is independent of the previous occurrence of w1); Hypothesis 2 is a formalization of dependence, which is good evidence for an interesting collocation.⁶

We use the usual maximum likelihood estimates for p, p1 and p2 and write c1, c2, and c12 for the number of occurrences of w1, w2 and w1 w2 in the corpus:

p = \frac{c_2}{N} \qquad p_1 = \frac{c_{12}}{c_1} \qquad p_2 = \frac{c_2 - c_{12}}{N - c_1}    (5.8)

6. We assume that p1 > p2 if Hypothesis 2 is true. The case p1 < p2 is rare and we will ignore it here.
Assuming a binomial distribution:

b(k; n, x) = \binom{n}{k} x^{k} (1 - x)^{n - k}    (5.9)

the likelihood of getting the counts for w1, w2 and w1 w2 that we actually observed is then L(H1) = b(c12; c1, p) b(c2 − c12; N − c1, p) for Hypothesis 1 and L(H2) = b(c12; c1, p1) b(c2 − c12; N − c1, p2) for Hypothesis 2. Table 5.11 summarizes this discussion. One obtains the likelihoods L(H1) and L(H2) just given by multiplying the last two lines, the likelihoods of the specified number of occurrences of w1 w2 and ¬w1 w2, respectively.

The log of the likelihood ratio λ is then as follows:

\log \lambda = \log \frac{L(H_1)}{L(H_2)}    (5.10)
             = \log \frac{b(c_{12}; c_1, p)\, b(c_2 - c_{12}; N - c_1, p)}{b(c_{12}; c_1, p_1)\, b(c_2 - c_{12}; N - c_1, p_2)}
             = \log L(c_{12}; c_1, p) + \log L(c_2 - c_{12}; N - c_1, p) - \log L(c_{12}; c_1, p_1) - \log L(c_2 - c_{12}; N - c_1, p_2)

where L(k; n, x) = x^k (1 − x)^{n − k}.
Table 5.12 shows the twenty bigrams of powerful which are highest ranked according to the likelihood ratio when the test is applied to the New York Times corpus. We will explain below why we show the quantity −2 log λ instead of λ. We consider all occurring bigrams here, including rare ones that occur less than six times, since this test works well for rare bigrams. For example, powerful cudgels, which occurs 2 times, is identified as a possible collocation.

One advantage of likelihood ratios is that they have a clear intuitive interpretation. For example, the bigram powerful computers is e^{0.5 × 82.96} ≈ 1.3 × 10^{18} times more likely under the hypothesis that computers is more likely to follow powerful than its base rate of occurrence would suggest. This number is easier to interpret than the scores of the t test or the χ² test, which we have to look up in a table.

But the likelihood ratio test also has the advantage that it can be more appropriate for sparse data than the χ² test. How do we use the likelihood ratio for hypothesis testing? If λ is a likelihood ratio of a particular form, then the quantity −2 log λ is asymptotically χ² distributed (Mood et al. 1974: 440).
−2 log λ   C(w1)   C(w2)   C(w1 w2)   w1 w2
1291.42    12593     932     150      most powerful
  99.31      379     932      10      politically powerful
  82.96      932     934      10      powerful computers
  80.39      932    3424      13      powerful force
  57.27      932     291       6      powerful symbol
  51.66      932      40       4      powerful lobbies
  51.52      171     932       5      economically powerful
  51.05      932      43       4      powerful magnet
  50.83     4458     932      10      less powerful
  50.75     6252     932      11      very powerful
  49.36      932    2064       8      powerful position
  48.78      932     591       6      powerful machines
  47.42      932    2339       8      powerful computer
  43.23      932      16       3      powerful magnets
  43.10      932     396       5      powerful chip
  40.45      932    3694       8      powerful men
  36.36      932      47       3      powerful 486
  36.15      932     268       4      powerful neighbor
  35.24      932    5245       8      powerful political
  34.15      932       3       2      powerful cudgels

Table 5.12 Bigrams of powerful with the highest scores according to Dunning's likelihood ratio test.
So we can use the values in Table 5.12 to test the null hypothesis H1 against the alternative hypothesis H2. For example, we can look up the value of 34.15 for powerful cudgels in the table and reject H1 for this bigram at a confidence level of α = 0.005. (The critical value for one degree of freedom is 7.88; see the table of the χ² distribution in the appendix.)

The particular form of the likelihood ratio that is required here is that of a ratio between the maximum likelihood estimate over a subpart of the parameter space and the maximum likelihood estimate over the entire parameter space. For the likelihood ratio in (5.10), this space is the space of pairs (p1, p2) for the probability of w2 occurring when w1 preceded (p1) and of w2 occurring when a different word preceded (p2). We get the maximum likelihood for the data we observed if we assume the maximum likelihood estimates that we computed in (5.8). The subspace is the subset of cases for which p1 = p2. Again, the estimate in (5.8) gives us the maximum likelihood over the subspace given the data we observed.
ratio     1990   1989   w1 w2
0.0241       2     68   Karim Obeid
0.0372       2     44   East Berliners
0.0372       2     44   Miss Manners
0.0399       2     41   17 earthquake
0.0409       2     40   HUD officials
0.0482       2     34   EAST GERMANS
0.0496       2     33   Muslim cleric
0.0496       2     33   John Le
0.0512       2     32   Prague Spring
0.0529       2     31   Among individual

Table 5.13 Damerau's frequency ratio test. Ten bigrams that occurred twice in the 1990 New York Times corpus, ranked according to the (inverted) ratio of relative frequencies in 1989 and 1990.
It can be shown that if λ is a ratio of two likelihoods of this type (one being the maximum likelihood over the subspace, the other over the entire space), then −2 log λ is asymptotically χ² distributed. "Asymptotically" roughly means "if the numbers are large enough." Whether or not the numbers are large enough in a particular case is hard to determine, but Dunning has shown that for small counts the approximation to χ² is better for the likelihood ratio in (5.10) than, for example, for the X² statistic in (5.6). Therefore, the likelihood ratio test is in general more appropriate than Pearson's χ² test for collocation discovery.⁷

7. However, even −2 log λ is not approximated well by χ² if the expected values in the 2-by-2 contingency table are less than 1.0 (Read and Cressie 1988; Pedersen 1996).
Relative frequency ratios

So far we have looked at evidence for collocations within one corpus. Ratios of relative frequencies between two or more different corpora can be used to discover collocations that are characteristic of a corpus when compared to other corpora (Damerau 1993). Although ratios of relative frequencies do not fit well into the hypothesis testing paradigm, we treat them here since they can be interpreted as likelihood ratios.

Table 5.13 shows ten bigrams that occur exactly twice in our reference corpus (the 1990 New York Times corpus). The bigrams are ranked according to the ratio of their relative frequencies in our 1990 reference corpus versus their frequencies in a 1989 corpus (again drawn from the months August through November). For example, Karim Obeid occurs 68 times in the 1989 corpus. So the relative frequency ratio r is:

r = \frac{2 / 14307668}{68 / 11731564} \approx 0.024116
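The ratio itself is a one-liner; the 1989 corpus size of 11,731,564 words is the figure used in the calculation above.

```python
def relative_frequency_ratio(count_a, size_a, count_b, size_b):
    """Ratio of the relative frequencies of a phrase in corpus A and corpus B (Damerau 1993)."""
    return (count_a / size_a) / (count_b / size_b)

# Karim Obeid: 2 occurrences in the 1990 corpus, 68 in the 1989 corpus
print(relative_frequency_ratio(2, 14307668, 68, 11731564))  # about 0.0241
```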
The bigrams in the table are mostly associated with news items that were more prevalent in 1989 than in 1990: the Muslim cleric Sheik Abdul Karim Obeid (who was abducted in 1989), the disintegration of communist Eastern Europe (East Berliners, EAST GERMANS, Prague Spring), the novel The Russia House by John Le Carre, a scandal in the Department of Housing and Urban Development (HUD), and the October 17 earthquake in the San Francisco Bay Area. But we also find artefacts like Miss Manners (whose column the New York Times News Wire stopped carrying in 1990) and Among individual. The reporter Phillip H. Wiggins liked to use the latter phrase for his stock market reports (Among individual Big Board issues . . . ), but he stopped writing for the Times in 1990.

The examples show that frequency ratios are mainly useful to find subject-specific collocations. The application proposed by Damerau is to compare a general text with a subject-specific text. Those words and phrases that on a relative basis occur most often in the subject-specific text are likely to be part of the vocabulary that is specific to the domain.
Exercise 5-4
Identify the most significantly non-independent bigrams according to the t test in a corpus of your choice.

Exercise 5-5
It is a coincidence that the t value for new companies is close to 1.0. Show this by computing the t value of new companies for a corpus with the following counts: C(new) = 30,000, C(companies) = 9,000, C(new companies) = 20, and corpus size N = 15,000,000.

Exercise 5-6
We can also improve on the method in the previous section (Section 5.2) by taking into account variance. In fact, Smadja does this and the algorithm described in (Smadja 1993) therefore bears some similarity to the t test.
Compute the t statistic in equation (5.3) for possible collocations by substituting mean and variance as computed in Section 5.2 for x̄ and s², a) assuming μ = 0, and b) assuming μ = round(x̄), that is, the closest integer. Note that we are not testing for bigrams here, but for collocations of word pairs that occur at any fixed small distance.
Exercise 5-7
As we pointed out above, almost all bigrams occur significantly more often than chance if a stop list is used for prefiltering. Verify that there is a large proportion of bigrams that occur less often than chance if we do not filter out function words.
Exercise 5-8
Apply the t test of differences to a corpus of your choice. Work with the following word pairs or with word pairs that are appropriate for your corpus: man / woman, blue / green, lawyer / doctor.
Exercise 5-9
Derive (5.7) from (5.6).
Exercise 5-10
Find terms that distinguish best between first and second part of a corpus of your
choice.
Exercise 5-11
Repeat the above exercise with random selection. Now you should find that fewer
terms are significant. But some still are. Why? Shouldn’t there be no differences
between corpora drawn from the same source? Do this exercise for different signif-
icance levels.
Exercise 5-12
Compute a measure of corpus similarity between two corpora of your choice.
Exercise 5-13
Kilgarriff and Rose’s corpus similarity measure can also be used for assessing cor-
pus homogeneity. This is done by constructing a series of random divisions of the
corpus into a pair of subcorpora. The test is then applied to each pair. If most of
the tests indicated similarity, then it is a homogeneous corpus. Apply this test to a
corpus of your choice.
5.4 Mutual Information
An information-theoretically motivated measure for discovering interesting collocations is pointwise mutual information (Church et al. 1991; Church and Hanks 1989; Hindle 1990). Fano (1961: 27–28) originally defined mutual information between particular events x′ and y′, in our case the occurrence of particular words, as follows:

I(x', y') = \log_2 \frac{P(x' y')}{P(x') P(y')}    (5.11)
          = \log_2 \frac{P(x' \mid y')}{P(x')}    (5.12)
          = \log_2 \frac{P(y' \mid x')}{P(y')}    (5.13)
I(w1, w2)   C(w1)   C(w2)   C(w1 w2)   w1 w2
18.38          42      20      20      Ayatollah Ruhollah
17.98          41      27      20      Bette Midler
16.31          30     117      20      Agatha Christie
15.94          77      59      20      videocassette recorder
15.19          24     320      20      unsalted butter
 1.09       14907    9017      20      first made
 1.01       13484   10570      20      over many
 0.53       14734   13478      20      into them
 0.46       14093   14776      20      like people
 0.29       15019   15629      20      time last

Table 5.14 Finding collocations: Ten bigrams that occur with frequency 20, ranked according to mutual information.
This type of mutual information, which we introduced in Section 2.2.3, is roughly a measure of how much one word tells us about the other, a notion that we will make more precise shortly.

In information theory, mutual information is more often defined as holding between random variables, not values of random variables as we have defined it here (see the standard definition in Section 2.2.3). We will see below that these two types of mutual information are quite different creatures.
When we apply this definition to the 10 collocations from Table 5.6, we get the same ranking as with the t test (see Table 5.14). As usual, we use maximum likelihood estimates to compute the probabilities, for example:

I(Ayatollah, Ruhollah) = \log_2 \frac{\frac{20}{14307668}}{\frac{42}{14307668} \times \frac{20}{14307668}} \approx 18.38
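Pointwise mutual information is again a one-liner over maximum likelihood estimates. A sketch, checked against the Ayatollah Ruhollah entry:

```python
from math import log2

def pmi(c1, c2, c12, N):
    """Pointwise mutual information I(w1, w2) of equation (5.11), with MLE probabilities."""
    return log2((c12 / N) / ((c1 / N) * (c2 / N)))

# Ayatollah Ruhollah: C(w1) = 42, C(w2) = 20, C(w1 w2) = 20
print(pmi(42, 20, 20, 14307668))  # about 18.38, as in Table 5.14
```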
So what exactly is (pointwise) mutual information, I(x′, y′), a measure of? Fano writes about definition (5.12):

    The amount of information provided by the occurrence of the event represented by [y′] about the occurrence of the event represented by [x′] is defined as [(5.12)].

For example, the mutual information measure tells us that the amount of information we have about the occurrence of Ayatollah at position i in the corpus increases by 18.38 bits if we are told that Ruhollah occurs at position i + 1. Or, since (5.12) and (5.13) are equivalent, it also tells us that the amount of information we have about the occurrence of Ruhollah at position i + 1 in the corpus increases by 18.38 bits if we are told that Ayatollah occurs at position i.
            chambre   ¬chambre       MI       χ²
house        31,950     12,004      4.1    553610
¬house        4,793    848,330

            communes  ¬communes     MI       χ²
house         4,974     38,980      4.2     88405
¬house          441    852,682

Table 5.15 Correspondence of chambre and house and of communes and house in the aligned Hansard corpus. Mutual information gives a higher score to (communes, house), while the χ² test gives a higher score to the correct translation pair (chambre, house).
occurs at position i. We could also say that our uncertainty is reduced by
18.38 bits. In other words, we can be much more certain that Ruhollah will
occur next if we are told that Ayatollah is the current word.
Unfortunately, this measure of “increased information” is in many cases not a good measure of what an interesting correspondence between two events is, as has been pointed out by many authors. (We base our discussion here mainly on (Church and Gale 1991b) and (Maxwell III 1992).) Consider the two examples in Table 5.15 of counts of word correspondences between French and English sentences in the Hansard corpus, an aligned corpus of debates of the Canadian parliament (the table is similar to Table 5.9). The reason that house frequently appears in translations of French sentences containing chambre and communes is that the most common use of house in the Hansard is the phrase House of Commons, which corresponds to Chambre de communes in French. But it is easy to see that communes is a worse match for house than chambre since most occurrences of house occur without communes on the French side. As shown in the table, the χ² test is able to infer the correct correspondence whereas mutual information gives preference to the incorrect pair (communes, house).
We can explain the difference between the two measures easily if we look at definition (5.12) of mutual information and compare I(chambre, house) and I(communes, house):

\begin{align*}
\log \frac{P(\textit{house} \mid \textit{chambre})}{P(\textit{house})}
  &= \log \frac{31950/(31950+4793)}{P(\textit{house})}
   \approx \log \frac{0.87}{P(\textit{house})} \\
  &< \log \frac{0.92}{P(\textit{house})}
   \approx \log \frac{4974/(4974+441)}{P(\textit{house})}
   = \log \frac{P(\textit{house} \mid \textit{communes})}{P(\textit{house})}
\end{align*}
The word communes in the French makes it more likely that house occurred in the English than chambre does. The higher mutual information value for communes reflects the fact that communes causes a larger decrease in uncertainty here. But as the example shows, decrease in uncertainty does not correspond well to what we want to measure. In contrast, the χ² test is a direct test of probabilistic dependence, which in this context we can interpret as the degree of association between two words and hence as a measure of their quality as translation pairs and collocations.
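To make the contrast concrete, both scores can be computed directly from the 2-by-2 counts in Table 5.15. The sketch below is illustrative only; it uses the shortcut χ² formula for a 2-by-2 contingency table, and the helper names are ours.

import math

def pmi_from_table(o11, o12, o21, o22):
    # Pointwise MI (base 2) from a 2x2 contingency table in which
    # o11 = count(w1, w2) and the margins give the unigram counts.
    n = o11 + o12 + o21 + o22
    p_joint = o11 / n
    p_row = (o11 + o12) / n   # e.g. P(house)
    p_col = (o11 + o21) / n   # e.g. P(chambre)
    return math.log2(p_joint / (p_row * p_col))

def chi_square_from_table(o11, o12, o21, o22):
    # Shortcut chi-square formula for a 2x2 contingency table.
    n = o11 + o12 + o21 + o22
    num = n * (o11 * o22 - o12 * o21) ** 2
    den = (o11 + o12) * (o11 + o21) * (o12 + o22) * (o21 + o22)
    return num / den

# Counts from Table 5.15:
print(pmi_from_table(31950, 12004, 4793, 848330))         # chambre:  about 4.1
print(pmi_from_table(4974, 38980, 441, 852682))           # communes: about 4.2
print(chi_square_from_table(31950, 12004, 4793, 848330))  # chambre:  about 553610
print(chi_square_from_table(4974, 38980, 441, 852682))    # communes: about 88405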
Table 5.16 shows a second problem with using mutual information for finding collocations. We show ten bigrams that occur exactly once in the first 1000 documents of the reference corpus and their mutual information score based on the 1000 documents. The right half of the table shows the mutual information score based on the entire reference corpus (about 23,000 documents).

I₁₀₀₀   C(w¹)  C(w²)  C(w¹ w²)  bigram              I₂₃₀₀₀   C(w¹)  C(w²)  C(w¹ w²)  bigram
16.95       5      1         1  Schwartz eschews     14.46      106      6         1  Schwartz eschews
15.02       1     19         1  fewest visits        13.06       76     22         1  FIND GARDEN
13.78       5      9         1  FIND GARDEN          11.25       22    267         1  fewest visits
12.00       5     31         1  Indonesian pieces     8.97       43    663         1  Indonesian pieces
 9.82      26     27         1  Reds survived         8.04      170   1917         6  marijuana growing
 9.21      13     82         1  marijuana growing     5.73    15828     51         3  new converts
 7.37      24    159         1  doubt whether         5.26      680   3846         7  doubt whether
 6.68     687      9         1  new converts          4.76      739    713         1  Reds survived
 6.00     661     15         1  like offensive        1.95     3549   6276         6  must think
 3.81     159    283         1  must think            0.41    14093    762         1  like offensive

Table 5.16  Problems for mutual information from data sparseness. The table shows ten bigrams that occurred once in the first 1000 documents in the reference corpus, ranked according to mutual information score in the first 1000 documents (left half of the table) and ranked according to mutual information score in the entire corpus (right half of the table). These examples illustrate that a large proportion of bigrams are not well characterized by corpus data (even for large corpora) and that mutual information is particularly sensitive to estimates that are inaccurate due to sparseness.
The larger corpus of 23,000 documents makes some better estimates possible, which in turn leads to a slightly better ranking. The bigrams marijuana growing and new converts (arguably collocations) have moved up and Reds survived (definitely not a collocation) has moved down. However, what is striking is that even after going to a corpus 10 times larger, 6 of the bigrams still only occur once and, as a consequence, have inaccurate maximum likelihood estimates and artificially inflated mutual information scores. None of these 6 is a collocation, and we would prefer a measure which ranks them accordingly.
None of the measures we have seen works very well for low-frequency events. But there is evidence that sparseness is a particularly difficult problem for mutual information. To see why, notice that mutual information is a log likelihood ratio of the probability of the bigram P(w¹w²) and the product of the probabilities of the individual words P(w¹)P(w²). Consider two extreme cases: perfect dependence of the occurrences of the two words (they only occur together) and perfect independence (the occurrence of one does not give us any information about the occurrence of the other). For perfect dependence we have:

\[
I(x, y) = \log \frac{P(xy)}{P(x)\,P(y)} = \log \frac{P(x)}{P(x)\,P(y)} = \log \frac{1}{P(y)}
\]

That is, among perfectly dependent bigrams, as they get rarer, their mutual information increases.

For perfect independence we have:

\[
I(x, y) = \log \frac{P(xy)}{P(x)\,P(y)} = \log \frac{P(x)\,P(y)}{P(x)\,P(y)} = \log 1 = 0
\]

We can say that mutual information is a good measure of independence. Values close to 0 indicate independence (independent of frequency). But it is a bad measure of dependence because for dependence the score depends on the frequency of the individual words. Other things being equal, bigrams composed of low-frequency words will receive a higher score than bigrams composed of high-frequency words. That is the opposite of what we would want a good measure to do since higher frequency means more evidence and we would prefer a higher rank for bigrams for whose interestingness we have more evidence. One solution that has been proposed for this is to use a cutoff and to only look at words with a frequency of at least 3. However, such a move does not solve the underlying problem, but only ameliorates its effects.
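A quick numerical illustration (with made-up counts, but the corpus size N = 14,307,668 from the earlier example): under maximum likelihood estimates, a perfectly dependent pair has I = log₂(N / C(y)), so a pair seen 10 times scores about twice as many bits as a pair seen 10,000 times, even though we have a thousand times more evidence for the second pair:

\[
\log_2 \frac{14307668}{10} \approx 20.4
\qquad
\log_2 \frac{14307668}{10000} \approx 10.5
\]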
Since pointwise mutual information does not capture the intuitive notion of an interesting collocation very well, it is often not used when it is made available in practical applications (Fontenelle et al. 1994: 81), or it is redefined as C(w¹w²) I(w¹, w²) to compensate for the bias of the original definition in favor of low-frequency events (Fontenelle et al. 1994: 72; Hodges et al. 1996).
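A minimal sketch of this rescored variant, assuming the same counts as before (the function name weighted_pmi is ours, not from Fontenelle et al.):

import math

def weighted_pmi(c_bigram, c_w1, c_w2, n):
    # C(w1 w2) * I(w1, w2): pointwise mutual information rescaled by the
    # bigram count to counteract the bias toward low-frequency events.
    i = math.log2((c_bigram / n) / ((c_w1 / n) * (c_w2 / n)))
    return c_bigram * i

Under this rescoring a bigram that occurs only once can score at most log₂ N ≈ 23.8 bits, so the hapaxes of Table 5.16 can no longer outrank a well-attested pair like Ayatollah Ruhollah, which scores 20 × 18.38 ≈ 368.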
As we mentioned earlier, the definition of mutual information used here is common in corpus linguistic studies, but is less common in Information Theory. Mutual information in Information Theory refers to the expectation of the quantity that we have used in this section:
EXPECTATION

\[
I(X; Y) = E_{p(x,y)}\!\left[ \log \frac{p(X, Y)}{p(X)\,p(Y)} \right]
\]

The definition we have used in this chapter is an older one, termed pointwise mutual information (see Section 2.2.3, Fano 1961: 28, and Gallager 1968). Table 5.17 summarizes the older and newer naming conventions. One quantity is the expectation of the other, so the two types of mutual information are quite different.

symbol     definition                        current terminology             Fano's terminology
I(x, y)    log [p(x,y) / (p(x) p(y))]        pointwise mutual information    mutual information
I(X; Y)    E[log p(X,Y) / (p(X) p(Y))]       mutual information              average MI / expectation of MI

Table 5.17  Different definitions of mutual information in (Cover and Thomas 1991) and (Fano 1961).
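The difference between the two quantities can be made concrete with a small computation over a toy joint distribution of two binary random variables ("w¹ occurs at position i" and "w² occurs at position i+1"). This is a generic sketch, not code from the sources cited above:

import math

def pointwise_mi(p_xy, p_x, p_y):
    # Pointwise mutual information of one particular pair of outcomes.
    return math.log2(p_xy / (p_x * p_y))

def mutual_information(joint):
    # I(X; Y): the expectation of pointwise MI over the joint distribution.
    # joint[x][y] holds P(X = x, Y = y).
    p_x = {x: sum(row.values()) for x, row in joint.items()}
    p_y = {}
    for row in joint.values():
        for y, p in row.items():
            p_y[y] = p_y.get(y, 0.0) + p
    mi = 0.0
    for x, row in joint.items():
        for y, p_xy in row.items():
            if p_xy > 0:
                mi += p_xy * pointwise_mi(p_xy, p_x[x], p_y[y])
    return mi

# A rare, perfectly dependent word pair: the two events co-occur with
# probability 0.0001 and never occur separately.
joint = {1: {1: 0.0001, 0: 0.0}, 0: {1: 0.0, 0: 0.9999}}
print(pointwise_mi(0.0001, 0.0001, 0.0001))  # about 13.3 bits for the pair (1, 1)
print(mutual_information(joint))             # about 0.0015 bits in expectation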
The example of mutual information demonstrates what should be self-evident: it is important to check what a mathematical concept is a formalization of. The notion of pointwise mutual information that we have used here, $\log \frac{p(w^1 w^2)}{p(w^1)\,p(w^2)}$, measures the reduction of uncertainty about the occurrence of one word when we are told about the occurrence of the other. As we have seen, such a measure is of limited utility for acquiring the types of linguistic properties we have looked at in this section.
Exercise 5-14
Justeson and Katz’s part-of-speech filter in Section 5.1 can be applied to any of the
other methods of collocation discovery in this chapter. Pick one and modify it to
incorporate a part-of-speech filter. What advantages does the modified method
have?
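One possible starting point for Exercise 5-14 is to post-filter the ranked output of any scoring method with the bigram patterns of the part-of-speech filter (adjective–noun and noun–noun). The data representation below (a dict from word to tag, a list of scored bigrams) is our own choice:

# Bigram patterns of the part-of-speech filter: adjective-noun and noun-noun.
ALLOWED_PATTERNS = {("A", "N"), ("N", "N")}

def pos_filter(scored_bigrams, tags):
    # scored_bigrams: list of (score, w1, w2) triples from any of the methods
    # in this chapter; tags: dict mapping each word to its part-of-speech tag.
    return [(score, w1, w2) for score, w1, w2 in scored_bigrams
            if (tags.get(w1), tags.get(w2)) in ALLOWED_PATTERNS]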
Exercise 5-15
Design and implement a collocation discovery tool for a translator’s workbench.
Pick either one method or a combination of methods that the translator can choose
from.
Exercise 5-16
Design and implement a collocation discovery tool for a lexicographer’s work-
bench. Pick either one method or a combination of methods that the lexicographer
can choose from.
Exercise 5-17
Many news services tag references to companies in their news stories. For example, all references to the General Electric Company would be tagged with the same tag regardless of which variant of the name is used (e.g., GE, General Electric, or General Electric Company). Design and implement a collocation discovery tool for finding company names. How could one partially automate the process of identifying variants?
5.5 The Notion of Collocation
The notion of collocation may be confusing to readers without a background in linguistics. We will devote this section to discussing in more detail what a collocation is.
There are actually different definitions of the notion of collocation. Some
authors in the computational and statistical literature define a collocation as two or more consecutive words with a special behavior, for example Choueka (1988):

    [A collocation is defined as] a sequence of two or more consecutive words, that has characteristics of a syntactic and semantic unit, and whose exact and unambiguous meaning or connotation cannot be derived directly from the meaning or connotation of its components.
Most of the examples we have presented in this chapter also assumed adjacency of words. But in most linguistically oriented research, a phrase can be a collocation even if it is not consecutive (as in the example knock . . . door). The following criteria are typical of linguistic treatments of collocations (see for example Benson (1989) and Brundage et al. (1992)), non-compositionality being the main one we have relied on here.

• Non-compositionality. The meaning of a collocation is not a straightforward composition of the meanings of its parts. Either the meaning is completely different from the free combination (as in the case of idioms like kick the bucket) or there is a connotation or added element of meaning that cannot be predicted from the parts. For example, white wine, white hair and white woman all refer to slightly different colors, so we can regard them as collocations.

• Non-substitutability. We cannot substitute near-synonyms for the components of a collocation. For example, we can't say yellow wine instead of white wine even though yellow is as good a description of the color of white wine as white is (it is kind of a yellowish white).
• Non-modifiability. Many collocations cannot be freely modified with additional lexical material or through grammatical transformations. This is especially true for frozen expressions like idioms. For example, we can't modify frog in to get a frog in one's throat into to get an ugly frog in one's throat, although usually nouns like frog can be modified by adjectives like ugly. Similarly, going from singular to plural can make an idiom ill-formed, for example in people as poor as church mice.
A nice way to test whether a combination is a collocation is to translate it into another language. If we cannot translate the combination word by word, then that is evidence that we are dealing with a collocation. For example, translating make a decision into French one word at a time we get faire une décision, which is incorrect. In French we have to say prendre une décision. So that is evidence that make a decision is a collocation in English.
Some authors have generalized the notion of collocation even further and included cases of words that are strongly associated with each other, but do not necessarily occur in a common grammatical unit and with a particular order, cases like doctor–nurse or plane–airport. It is probably best to restrict collocations to the narrower sense of grammatically bound elements that occur in a particular order and use the terms association and co-occurrence for the more general phenomenon of words that are likely to be used in the same context.
ASSOCIATION
CO-OCCURRENCE
It is instructive to look at the types of collocations that a purely linguistic analysis of text will discover if plenty of time and person power is available, so that the limitations of statistical analysis and computer technology need be of no concern. An example of such a purely linguistic analysis is the BBI Combinatory Dictionary of English (Benson et al. 1993). In Table 5.18, we show some of the collocations (or combinations, as the dictionary prefers to call them) of strength and power that the dictionary lists.⁸ We can see immediately that a wider variety of grammatical patterns is considered here (in particular patterns involving prepositions and particles). Naturally, the quality of the collocations is also higher than in computer-generated lists – as we would expect from a manually produced compilation.

8. We cannot show collocations of strong and powerful because these adjectives are not listed as entries in the dictionary.

strength                               power
to build up ∼                          to assume ∼
to find ∼                              emergency ∼
to save ∼                              discretionary ∼
to sap somebody's ∼                    ∼ over [several provinces]
brute ∼                                supernatural ∼
tensile ∼                              to turn off the ∼
the ∼ to [do X]                        the ∼ to [do X]
[our staff was] at full ∼              the balance of ∼
on the ∼ of [your recommendation]      fire ∼

Table 5.18  Collocations in the BBI Combinatory Dictionary of English. The symbol ∼ stands for the headword of the entry (strength or power).
We conclude our discussion of the concept of collocation by going through some subclasses of collocations that deserve special mention.

Verbs with little semantic content like make, take and do are called light verbs in collocations like make a decision or do a favor.
LIGHT VERBS
There is hardly anything about the meaning of make, take or do that would explain why we have to say make a decision instead of take a decision and do a favor instead of make a favor, but for many computational purposes the correct light verb for combination with a particular noun must be determined and thus acquired from corpora if this information is not available in machine-readable dictionaries. Dras and Johnson (1996) examine one approach to this problem.
Verb particle constructions or phrasal verbs are an especially important part of the lexicon of English.
VERB PARTICLE CONSTRUCTIONS
PHRASAL VERBS
Many verbs in English like to tell off and to go down consist of a combination of a main verb and a particle. These verbs often correspond to a single lexeme in other languages (réprimander, descendre in French). This type of construction is a good example of a collocation with often non-adjacent words.
Proper nouns (also called proper names) are usually included in the category of collocations in computational work although they are quite different from lexical collocations.
PROPER NOUNS
PROPER NAMES
They are most amenable to approaches that look for fixed phrases that reappear in exactly the same form throughout a text.
Terminological expressions refer to concepts and objects in technical domains.
TERMINOLOGICAL EXPRESSIONS
Although they are often fairly compositional (e.g., hydraulic oil filter), it is still important to identify them to make sure that they are treated consistently throughout a technical text. For example, when translating a manual, we have to make sure that all instances of hydraulic oil filter are translated by the same term. If two different translations are used (even if they have the same meaning in some sense), the reader of the translated manual would get confused and think that two different entities are being described.
As a final example of the wide range of phenomena that the term collocation is applied to, let us point to the many different degrees of invariability that a collocation can show. At one extreme of the spectrum we have usage notes in dictionaries that describe subtle differences in usage between near-synonyms like answer and reply (diplomatic answer vs. stinging reply). This type of collocation is important for generating text that sounds natural, but getting a collocation wrong here is less likely to lead to a fatal error. At the other extreme are completely frozen expressions like proper names and idioms. Here there is just one way of saying things and any deviation will completely change the meaning of what is said. Luckily, the less compositional and the more important a collocation, the easier it often is to acquire it automatically.
5.6 Further Reading
See (Stubbs 1996) for an in-depth discussion of the British tradition of “empiricist” linguistics.
The t test is covered in most general statistics books. Standard references are (Snedecor and Cochran 1989: 53) and (Moore and McCabe 1989: 541). Weinberg and Goldberg (1990: 306) and Ramsey and Schafer (1997) are more accessible for students with less mathematical background. These books also cover the χ² test, but not some of the other more specialized tests that we discuss here.
One of the first publications on the discovery of collocations was (Church and Hanks 1989), later expanded to (Church et al. 1991). The authors drew attention to an emerging type of corpus-based dictionary (Sinclair 1995) and developed a program of computational lexicography that combines corpus evidence, computational methods and human judgement to build more comprehensive dictionaries that better reflect actual language use.

There are a number of ways lexicographers can benefit from automated processing of corpus data. A lexicographer writes a dictionary entry after looking at a potentially large number of examples of a word. If the examples are automatically presorted according to collocations and other criteria (for example, the topic of the text), then this process can be made much more efficient. For example, phrasal verbs are sometimes neglected in dictionaries because they are not separate words. A corpus-based approach will make their importance evident to the lexicographer. In addition, a balanced corpus will reveal which of the uses are most frequent and hence most important for the likely user of a dictionary. Difference tests like the t test are useful for writing usage notes and for writing accurate definitions that reflect differences in usage between words. Some of these techniques are being used for the next generation of dictionaries (Fontenelle et al. 1994).
Eventually, a new form of dictionary could emerge from this work, a kind
of dictionary-cum-corpus in which dictionary entry and corpus evidence support each other and are organized in a coherent whole. The COBUILD dictionary already has some of these characteristics (Sinclair 1995). Since space is less of an issue with electronic dictionaries, plenty of corpus examples can be integrated into a dictionary entry for the interested user.
What we have said about the value of statistical corpus analysis for
monolingual dictionaries applies equally to bilingual dictionaries, at least if an aligned corpus is available (Smadja et al. 1996).
Another important application of collocations is Information Retrieval (IR). Accuracy of retrieval can be improved if the similarity between a user query and a document is determined based on common collocations (or phrases) instead of common words (Fagan 1989; Evans et al. 1991; Strzalkowski 1995; Mitra et al. 1997). See Lewis and Jones (1996) and Krovetz (1991) for further discussion of the question of using collocation discovery and NLP in Information Retrieval, and Nevill-Manning et al. (1997) for an alternative non-statistical approach to using phrases in IR. Steier and Belew (1993) present an interesting study of how the treatment of phrases (for example, for phrase weighting) should change as we move from a subdomain to a general domain. For example, invasive procedure is completely compositional and a less interesting collocation in the subdomain of medical articles, but becomes interesting and non-compositional when “exported” to a general collection that is a mixture of many specialized domains.
Two other important applications of collocations, which we will just
mention, are natural language generation (Smadja 1993) and cross-language information retrieval (Hull and Grefenstette 1998).
An important area that we haven't been able to cover is the discovery of proper nouns, which can be regarded as a kind of collocation. Proper nouns cannot be exhaustively covered in dictionaries since new people, places, and other entities come into existence and are named all the time. Proper nouns also present their own set of challenges: co-reference (How can we tell that IBM and International Business Machines refer to the same entity?), disambiguation (When does AMEX refer to the American Exchange, when to American Express?), and classification (Is this new entity that the text refers to the name of a person, a location or a company?). One of the earliest studies on this topic is (Coates-Stephens 1993). McDonald (1995) focuses on lexicosemantic patterns that can be used as cues for proper noun detection and classification. Mani and MacMillan (1995) and Paik et al. (1995) propose ways of classifying proper nouns according to type.
One frequently used measure for interestingness of collocations that we did not cover is the z score, a close relative of the t test.
z SCORE
It is used in several software packages and workbenches for text analysis (Fontenelle et al. 1994; Hawthorne 1994). The z score should only be applied when the variance is known, which arguably is not the case in most Statistical NLP applications.
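Packages differ in the exact formula they use; one common textbook form treats bigram occurrence as a proportion and takes the variance from the null hypothesis of independence, in which case the statistic looks as follows (a sketch, not the formula of any particular package):

import math

def z_score(c_bigram, c_w1, c_w2, n):
    # One-sample z statistic for the observed bigram proportion against the
    # independence hypothesis p = P(w1) P(w2), with the variance taken from
    # the null hypothesis (i.e. treated as known rather than estimated).
    p_null = (c_w1 / n) * (c_w2 / n)
    p_obs = c_bigram / n
    return (p_obs - p_null) / math.sqrt(p_null * (1 - p_null) / n)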
Fisher’s exact test is another statistical test that can be used for judging
how unexpected a set of observations is. In contrast to the ttest and the/#1F
/2test, it is appropriate even for very small counts. However, it is hard
to compute, and it is not clear whether the results obtained in practice are
much different from, for example, the /#1F
/2test (Pedersen 1996).
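For readers who want to experiment, SciPy provides an implementation of Fisher's exact test for 2-by-2 tables; here it is applied, purely as an illustration, to the (chambre, house) counts from Table 5.15:

from scipy.stats import fisher_exact

# 2x2 table for (chambre, house) from Table 5.15:
#               chambre   not chambre
#   house        31,950        12,004
#   not house     4,793       848,330
res = fisher_exact([[31950, 12004], [4793, 848330]], alternative="greater")
print(res)   # odds ratio and p-value; with counts this large the p-value is
             # vanishingly small -- the exact test matters most for small counts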
Yet another approach to discovering collocations is to search for points
in the word stream with either low or high uncertainty as to what the next (or previous) word will be. Points with high uncertainty are likely to be phrase boundaries, which in turn are candidates for points where a collocation may start or end, whereas points with low uncertainty are likely to be located within a collocation. See (Evans and Zhai 1996) and (Shimohata et al. 1997) for two approaches that use this type of information for finding phrases and collocations.
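A rough sketch of this idea, assuming nothing more than a tokenized corpus: estimate, for each word, the entropy of the distribution of the following word, and flag high-entropy positions as candidate phrase boundaries. The threshold and the unsmoothed maximum likelihood estimates are placeholders, not values from the cited papers:

import math
from collections import Counter, defaultdict

def next_word_entropy(tokens):
    # For each word type, the entropy (in bits) of the distribution of the
    # word that follows it, estimated by maximum likelihood from the corpus.
    followers = defaultdict(Counter)
    for w, nxt in zip(tokens, tokens[1:]):
        followers[w][nxt] += 1
    entropy = {}
    for w, dist in followers.items():
        total = sum(dist.values())
        entropy[w] = -sum((c / total) * math.log2(c / total)
                          for c in dist.values())
    return entropy

def boundary_candidates(tokens, threshold=4.0):
    # Positions where uncertainty about the next word is high -- likely phrase
    # boundaries; low-entropy positions tend to fall inside collocations.
    h = next_word_entropy(tokens)
    return [i for i, w in enumerate(tokens[:-1]) if h[w] > threshold]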