Firth once said “You shall know a word by the company it keeps”, and he seems to be right. The words that usually appear with another word tell a lot about it. They may explain what the word means, or what the word means for specific users.
Those lexical associations may help us learn how words behave, and they can disambiguate potentially ambiguous words. In the following, I’ll try to answer the question: “What is the difference between the two English adjectives powerful and strong?” Here is what I did:
- I scraped an American news website and extracted the text from it. I ended up with 38 million words of text. I won’t reveal the website name for now. I don’t think it’s illegal, since I’m not distributing the text, but just in case.
- I used a POS tagger to tag the grammatical categories of words. This lets us know whether a word is a noun, an adjective, a verb, and so on.
- For every word in this corpus, I extracted the preceding 5 words and the following 5 words.
- I calculated the pointwise mutual information (PMI) score between the focus word and every word in that 10-word window. PMI is known to favour rare bigrams (two-word groups) over frequent ones.
- I dumped this info in a SQLite database. It’s only one table, but the db enables us to easily get different rankings. The db has info on the frequency of the bigram, its PMI score, and a pre-computed PMI * frequency score. I have found the PMI * frequency score to yield the best results.
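The counting and scoring steps above can be sketched roughly as follows. This is a minimal illustration, not the script I actually ran: the function names, the way probabilities are normalised, and the toy sentence are all assumptions made for the example, and POS tagging is left out.

```python
import math
from collections import Counter


def cooccurrence_counts(tokens, window=5):
    """Count (focus, collocate) pairs within `window` tokens on each side."""
    word_counts = Counter(tokens)
    pair_counts = Counter()
    for i, focus in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:  # skip the focus word itself
                pair_counts[(focus, tokens[j])] += 1
    return word_counts, pair_counts


def pmi_scores(word_counts, pair_counts):
    """Return {(focus, collocate): (freq, pmi, pmi * freq)}.

    Assumption: pair probabilities are normalised by the total number of
    windowed pairs, word probabilities by the total number of tokens.
    """
    n_words = sum(word_counts.values())
    n_pairs = sum(pair_counts.values())
    scores = {}
    for (focus, colloc), freq in pair_counts.items():
        p_pair = freq / n_pairs
        p_focus = word_counts[focus] / n_words
        p_colloc = word_counts[colloc] / n_words
        pmi = math.log2(p_pair / (p_focus * p_colloc))
        scores[(focus, colloc)] = (freq, pmi, pmi * freq)
    return scores


# Toy input, purely for illustration.
tokens = "strong tea strong tea powerful engine".split()
word_counts, pair_counts = cooccurrence_counts(tokens, window=1)
scores = pmi_scores(word_counts, pair_counts)
```

Multiplying PMI by the raw frequency is one simple way to stop very rare pairs from dominating the ranking, which is why the combined score is stored next to the other two.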
To make things easier, I present the top 100 collocates of powerful and strong as word clouds. The font size shows how important a collocate is: the bigger the font, the more important the word. The words have been ranked using the PMI * frequency score.
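For the curious, a ranking like this can be pulled from a single-table SQLite database with one query. The sketch below is only an assumption about how such a table might look: the table name, column names, and the handful of rows are made-up placeholders, not values from my corpus.

```python
import sqlite3

# In-memory database standing in for the real file.
conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE collocations (
           focus TEXT, collocate TEXT,
           freq INTEGER, pmi REAL, pmi_freq REAL)"""
)

# Placeholder rows: (focus, collocate, frequency, PMI). Not real results.
toy_rows = [
    ("strong", "tea", 40, 6.0),
    ("strong", "coffee", 25, 5.0),
    ("strong", "the", 900, 0.1),
    ("powerful", "engine", 30, 6.5),
    ("powerful", "the", 800, 0.1),
]
conn.executemany(
    "INSERT INTO collocations VALUES (?, ?, ?, ?, ?)",
    [(f, c, n, pmi, pmi * n) for f, c, n, pmi in toy_rows],
)


def top_collocates(conn, focus, limit=100):
    """Return the `limit` best collocates of `focus`, ranked by PMI * frequency."""
    cur = conn.execute(
        "SELECT collocate, pmi_freq FROM collocations "
        "WHERE focus = ? ORDER BY pmi_freq DESC LIMIT ?",
        (focus, limit),
    )
    return cur.fetchall()


best_for_strong = top_collocates(conn, "strong", limit=2)
```

Swapping `pmi_freq` for `pmi` or `freq` in the `ORDER BY` clause gives the other two rankings, which is the main convenience of keeping everything in one table.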
First, here are the lexical associations of powerful:
And here is the same for the rival adjective, strong:
The difference is now obvious, isn’t it?