This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
How do you identify what a text is about by looking at its words?
One way is to count up all the occurrences of each possible word in a given text. If a word is used often in a text, that word might be meaningful in figuring out what the text is about. Let’s try this with Anne of Green Gables:
|., ,, the, “, I, and, to, a, of, it, you, in, was, her, that, she, n’t, Anne, ’s, be, had, with, as, Marilla, for, said, is, at, so, have, ?, do, on, me, all, but, up, not, Diana, would, he, It, ;, Mrs., if, out, But, my, when, ’m, Matthew, did, just, over, She, were, there, think, ’ll, about, could, like, little, them, never, one, your, very, been, know, they, this, are, what, !, Oh, from, by, The, going, we, You, And, go, an, ’ve, ’d, good, down, into, much, his, ’, now, such, any, can, ever, time, see|
Most frequent ‘words’ in Anne of Green Gables
The words here are referred to as terms, and counting how often they occur in a text is referred to as Term Frequency (TF). The text is referred to as a document. A document is your coherent unit of text. It can be a book, a chapter, a tweet, a web page: whatever is most appropriate for what you’re working with.
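As a minimal sketch of what counting term frequency looks like in practice, the snippet below tallies terms in a short sample string. The tokenizer is a crude assumption (runs of letters and apostrophes, or single punctuation marks) and the function name `term_frequencies` is only illustrative; it is not the tokenization used for the lists in this post.

```python
import re
from collections import Counter


def term_frequencies(text):
    """Count how often each term occurs in one document (case-sensitive)."""
    # Crude tokenizer (an assumption): a token is either a run of letters and
    # apostrophes, or a single non-space punctuation character.
    tokens = re.findall(r"[A-Za-z']+|[^\sA-Za-z']", text)
    return Counter(tokens)


sample = "I do not like them, Sam-I-am. I do not like green eggs and ham."
tf = term_frequencies(sample)

# Terms ordered from most to least frequent
for term, count in tf.most_common(5):
    print(term, count)
```

Punctuation is counted as a term here, which is why marks like ‘.’ and ‘,’ can top a frequency list.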
However, there’s a problem with using term frequency to represent a text: the most frequent terms are boring! In most English language texts, the most popular terms are utility words with little indication of aboutness, such as “the”, “and”, or “as”.
Let’s look at the distribution of terms in another way, in descending order of frequency. Click the image for an interactive version, to see the words at various parts of the distribution.
While our earlier intuition – that frequently occurring words are more interesting than rarely-used words – still seems to hold, it’s apparent that we don’t want the most popular words. There seems to be a sweet spot, of words less common than ‘and’, ‘or’, ‘as’, but more frequent than you’d normally see in the English language.
One approach we can use to address this is a stop list: a previously compiled list of ‘uninteresting’ words that we filter out.
|. , the “ I and to a of it you in was her that she n’t Anne ‘s be had with as Marilla for said is at so have ? do on me all but up not Diana would he It ; Mrs. if out But my when ‘m Matthew did just over She were there think ‘ll about could like little them never one your very been know they this are what ! Oh from by The going we You And go an ‘ve ‘d good down into much his ‘ now such any can ever time see or who him girl no back too more must say came went thought Lynde will things home Well eyes Miss because after than get before Mr. Barry then has says well There school only ‘re got made right tell really through thing night come some hair look Gilbert told how make which other long always their felt here should Manila away feel last child That When old off white girls something Jane Avonlea ca Green suppose anything might Gables face take herself He day people head Rachel want until heart looked put let wo Ruby We pretty course again where|
Frequency of terms in Anne of Green Gables, with stop list matches marked in red
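Filtering with a stop list can be sketched in a few lines. The stop list and the counts below are made up for illustration; real stop lists usually run to hundreds of entries, and matching on the lowercased term is a design choice assumed here, not something the lists above necessarily did.

```python
from collections import Counter

# A tiny, hypothetical stop list for illustration only.
STOP_WORDS = {"the", "and", "to", "a", "of", "it", "i", "not", "do"}


def remove_stop_words(term_counts, stop_words=STOP_WORDS):
    """Return a new Counter without any term whose lowercase form is a stop word."""
    return Counter(
        {term: count for term, count in term_counts.items()
         if term.lower() not in stop_words}
    )


# Made-up counts, not the real figures for the book
counts = Counter({"the": 120, "Anne": 40, "and": 95, "Marilla": 25, "Not": 3})
filtered = remove_stop_words(counts)
print(filtered.most_common())
```

Every occurrence of a listed word disappears, no matter how often or how unusually it is used.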
Stop lists are simple and easy to use, making them useful in many cases. However, they are a blunt instrument, losing some nuance that we may want to keep. Not all words are alike: while ‘the’ might be especially uninformative, other words might be useful when used often. As a simple example, consider the word ‘not’ in the children’s book Green Eggs and Ham, which is used disproportionately to show the speaker’s reluctance to try the titular meal.
|. I not them a like , in ! do you eat will Not with would ? and Sam-I-am And eggs the could green ham train there here Would mouse or anywhere You house box dark fox car tree on Sam A see goat rain may me be Say Could Try boat let so In Eat That am say good So are Here Thank try Let They Do they that spam … If|
Most Frequent Terms in Green Eggs and Ham, ordered from highest TF to lowest, with stoplist terms in red.
Stop lists also require some hard choices on what to remove and what to keep, choices that may not apply in all cases.
The alternative is a simple, intuitive adjustment to treating words quantitatively: term weighting.
The central concept of term weighting is that not all words are alike. If somebody uses the word ‘be’, it’s not as informative as the word ‘armadillo’, so we can try to weigh the latter to be a more valuable indicator of what a text is about. Rather than removing the word ‘not’ from Green Eggs and Ham, we can weigh it down: each individual occurrence won’t be as significant, but the unusually high count of occurrences will mean that it will still factor into our picture of what the story is about. Here is a list of words ordered by weight using the process we’ll be discussing, where the highest weighted words are at the start:
|ham, mouse, eggs, fox, eat, goat, anywhere, Would, spam, green, Sam, train, Not, box, car, Eat, Try, tree, rain, Say, Could, dark, boat, them, like, Thank, house, will, do, !, here, not, I, And, Here, may, You, would, you, could, let, ., ?, A, a, there, or, …, So, in, ,, see, with, try, Let, That, In, am, say, good, and, me, on, the, Do, so, be, are, If, They, they, that|
Notable words in Green Eggs and Ham
Term weighting was popularized in the early 1970s by Karen Spärck Jones with the idea of Inverse Document Frequency, or IDF (though term weighting was not a new concept, as Spärck Jones herself noted). This method tries to weigh down very common words in the language, while promoting rare words. Paired with our earlier concept of Term Frequency, we can then judge a document’s terms with TF-IDF: the frequency of a term adjusted by its rarity. TF-IDF is a heuristic trying to capture this idea: words that don’t occur in many documents but which occur a lot in your document are important to the document’s content.
This intuition has been formalized in many approaches to modeling texts, but TF-IDF is a great introduction to term weighting, because of its simplicity. Its value is underpinned by the latter part: IDF.
To understand IDF, three new concepts need to be appreciated: a corpus, corpus frequency, and document frequency.
A corpus is your full collection of documents. A corpus-level statistic like IDF works best when the corpus is composed of similar documents, which makes it easier to see how each individual document deviates from the norm, but you can fall back on information from a large general corpus of all English texts. Google’s NGrams Dataset provides both corpus frequency and document frequency for their Google Books data. For the examples below, I use a sample of 29,296 English-language literature texts from the HathiTrust Research Center, counted from the Extracted Features dataset using my Feature Reader library.
Corpus frequency and document frequency can be easily confused. Corpus Frequency (CF) is the total count of a term in the corpus, while Document Frequency (DF) is the count of documents that have the given term. Remember that the document is your logical unit of a text; in some cases it could be a book, but it could also be pages if that suits your needs better. Consider the term ‘Paris’ in a corpus that has three books:
‘Paris’ occurs once in the first book (i.e. $TF = 1$), 30 times in the second book, and does not occur in the third book. Corpus frequency, being a total count of the term’s occurrences, is 31 (i.e. $CF = 1 + 30 + 0 = 31$). Document Frequency (DF) is 2, because ‘Paris’ only shows up in two documents.
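The difference between the two counts is easy to see in code. This sketch reproduces the three-book ‘Paris’ example; the counts for ‘the’ are invented purely for contrast, and the function names are illustrative.

```python
from collections import Counter

# Per-book term counts. The 'Paris' counts come from the example above;
# the counts for 'the' are invented for contrast.
books = [
    Counter({"Paris": 1, "the": 50}),
    Counter({"Paris": 30, "the": 80}),
    Counter({"the": 65}),
]


def corpus_frequency(term, docs):
    """Total number of occurrences of the term across all documents."""
    return sum(doc[term] for doc in docs)  # a Counter returns 0 for missing terms


def document_frequency(term, docs):
    """Number of documents that contain the term at least once."""
    return sum(1 for doc in docs if doc[term] > 0)


print(corpus_frequency("Paris", books))    # 31
print(document_frequency("Paris", books))  # 2
```

Note that the 30 occurrences in the second book inflate CF but count only once toward DF.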
Inverse document frequency (IDF) is – ahem – the inverse of document frequency, as in $\frac{1}{DF}$. It is usually not actually calculated like this, which I’ll get to in a moment, but ‘IDF’ is used generally to refer to this concept, that a term which is seen in more documents ends up with a smaller weight. Inverse Corpus Frequency has also been tried, but for various reasons IDF is more elegant and less prone to peculiarities of individual documents.
The calculation for IDF is usually adjusted to account for unintuitive scaling (e.g. a rare word seen in two documents probably isn’t half as interesting as a rare word seen in one document), to normalize values against corpus size, and to avoid a denominator of 0 for unseen words. The calculation we’ll use is $IDF = \log\left(1 + \frac{N}{DF}\right)$, where $N$ is the number of documents in the corpus. The following graphic shows how different weighting is applied with different DFs for a 100-document or 1000-document corpus; it covers, for example, the IDF for a term that occurs in 40 documents out of a 100-document corpus. You can see that using $\frac{1}{DF}$ drops down too quickly and doesn’t care about how big your corpus is. Don’t use it.
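To make the scaling argument concrete, here is a small sketch comparing a log-scaled IDF with the naive inverse. The exact formula is an assumption: the code uses one common smoothed variant, log(1 + N/DF) with the natural logarithm, and the function names `idf` and `naive_idf` are illustrative.

```python
import math


def idf(df, n_docs):
    """Log-scaled IDF, assuming the variant log(1 + N / DF) with natural log."""
    if df < 1:
        raise ValueError("this variant needs a term seen in at least one document")
    return math.log(1 + n_docs / df)


def naive_idf(df):
    """The literal inverse of document frequency, 1 / DF."""
    return 1 / df


for df in (1, 2, 40, 100):
    print(df, naive_idf(df), idf(df, 100), idf(df, 1000))

# A term seen in two documents keeps most of the weight of a term seen in one
# under the log-scaled version, but only half of it under the naive version.
log_ratio = idf(2, 1000) / idf(1, 1000)
naive_ratio = naive_idf(2) / naive_idf(1)
```

The same DF also gets a larger weight in the bigger corpus, which is the corpus-size sensitivity that $\frac{1}{DF}$ lacks.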
So, how does our representation of Anne of Green Gables look if we weigh our terms against a corpus of 1000 novels? Let’s multiply TF by IDF, using books as our ‘document’:
|Marilla, ., ,, the, “, I, and, Anne, to, a, of, it, you, in, was, her, Diana, that, n’t, she, Matthew, ’s, Lynde, Avonlea, be, had, with, as, for, said, is, Barry, at, so, have, ?, Gables, do, on, me, Manila, all, but, Stacy, Josie, Ruby, up, Mrs., Gillis, not, ’m, would, Gilbert, he, Pye, Blythe, It, ;, if, out, my, But, when, Cuthbert, did, ’ll, just, over, She, were, Allan, Rachel, there, think, about, could, like, little, them, never, Shirley, your, very, one, been, know, ’ve, ’d, they, Oh, this, are, what, !, from, ANNE, by, Andrews, going, The|
We’re getting there! Our TF-IDF heuristic is beginning to match our intuitive notion of what words are representative of the document, at least more than earlier. Still, let’s be honest, many problems remain: ‘and’, ‘to’, ‘a’, and other uninteresting words are still weighted highly.
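The multiplication itself can be sketched on a toy corpus. All counts here are invented, the IDF variant log(1 + N/DF) is an assumption, and the helper names are illustrative; the point is only to show raw TF still dominating the ranking.

```python
import math
from collections import Counter


def document_frequencies(corpus):
    """Map each term to the number of documents it appears in."""
    df = Counter()
    for doc in corpus:
        df.update(doc.keys())  # each document counts a term at most once
    return df


def tf_idf(doc, corpus):
    """Raw term count times log(1 + N / DF), an assumed IDF variant."""
    df = document_frequencies(corpus)
    n_docs = len(corpus)
    return {term: count * math.log(1 + n_docs / df[term])
            for term, count in doc.items()}


# Invented counts for four tiny 'books'; the first is the one we score.
corpus = [
    Counter({"the": 100, "and": 60, "Marilla": 50, "Anne": 20}),
    Counter({"the": 90, "and": 70, "whale": 15}),
    Counter({"the": 120, "and": 50, "Anne": 1}),
    Counter({"the": 80, "and": 40, "garden": 5}),
]

scores = tf_idf(corpus[0], corpus)
ranked = sorted(scores, key=scores.get, reverse=True)
print(ranked)
```

In this toy case a distinctive name rises to the top, yet ‘the’ and ‘and’ still outrank ‘Anne’, because their raw counts are large enough to overwhelm a small IDF.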
Make TF Great Again
There are still two problems: the problems inherent to Term Frequency, and the large and variable document size for book-level term weighting.
TF is the trouble child of TF*IDF. Raw term count is not a great measure for comparisons because it scales terribly. The word ‘the’ is 337 times more common in Anne of Green Gables than ‘Charlottetown’, a city 30 miles from the book’s setting, and even IDF doesn’t correct the weighting enough to tell us that the latter is more notable. Manning, Raghavan & Schütze describe the TF problem succinctly and outline popular alternatives.
For this case, I’ll use logarithmic smoothing, which replaces the raw term count with a logarithmic function of that count. With this smoothing, adding extra occurrences of a term has diminishing returns (i.e. going from 100 to 101 occurrences is much less significant than going from 4 to 5). This effect is useful because word frequencies usually follow a power-law distribution, so the most popular words are exponentially more common.
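The diminishing returns are easy to check numerically. The exact smoothing function is an assumption here: this sketch uses log(1 + TF), one common form of log normalization, and `log_tf` is an illustrative name.

```python
import math


def log_tf(count):
    """Log-smoothed term frequency, assuming the variant log(1 + count)."""
    return math.log(1 + count)


# Gain from one extra occurrence at a low count versus a high count
gain_low = log_tf(5) - log_tf(4)
gain_high = log_tf(101) - log_tf(100)
print(gain_low, gain_high)

# Raw counts that differ by a factor of 100 end up much closer after smoothing
print(log_tf(1000) / log_tf(10))
```

One extra occurrence matters far more at 4 than at 100, so very frequent terms no longer swamp everything else purely through volume.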
A smaller issue is in using books, which are very long, as a document unit. Whether a term occurs even just once in an entire book is a low bar, so there end up being many terms with a very high DF and subsequently low IDF weight. This loses some nuance, affecting the ability to differentiate between the somewhat common and the very common. There are many ways to deal with the length issue: the corpus could be modified so that IDF is calculated only over shorter books, or one can increase the threshold to count when terms occur more than simply once. Yet another possibility is to use pages rather than books: they are of a similar size throughout the corpus, and not extremely large. This makes a slight difference in practice, so I’ll use the page as a document frame.
(As a methodological aside, changing our document frame also affects our TF, which is calculated for a document (‘TF of what?’). TF for a book is now actually the summed TF for a collection of pages, which doesn’t affect us here.)
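A toy sketch of the change in document frame: DF is counted over pages, while a book’s TF is the sum of its page-level counts. The data and names are invented for illustration and do not reflect the structure of any particular dataset.

```python
from collections import Counter

# Invented data: each book is a list of per-page term counts.
books = {
    "book_1": [Counter({"the": 12, "Paris": 1}), Counter({"the": 9})],
    "book_2": [
        Counter({"the": 10, "Paris": 4}),
        Counter({"the": 11, "Paris": 2}),
        Counter({"the": 8}),
    ],
}


def book_tf(pages):
    """Book-level TF as the sum of page-level counts."""
    total = Counter()
    for page in pages:
        total.update(page)  # updating with a Counter adds its counts
    return total


def df_by_book(term, books):
    """Number of books with the term on at least one page."""
    return sum(1 for pages in books.values() if any(p[term] > 0 for p in pages))


def df_by_page(term, books):
    """Number of pages, across all books, that contain the term."""
    return sum(1 for pages in books.values() for p in pages if p[term] > 0)


n_pages = sum(len(pages) for pages in books.values())
print(df_by_book("Paris", books), "of", len(books), "books")
print(df_by_page("Paris", books), "of", n_pages, "pages")
print(book_tf(books["book_2"])["Paris"])
```

At the book level ‘Paris’ and ‘the’ look identical (both are in every book), while at the page level they separate: ‘Paris’ is on 3 of 5 pages and ‘the’ is on all 5.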
With log-normalized TF and page-level IDF, the highest-weighted terms now reflect what matters about the particular book:
|Marilla, Avonlea, Lynde, Gables, Gillis, Matthew, Pye, Manila, Diana, Anne, Josie, Stacy, ANNE, Ruby, Blythe, Cuthbert, MARILLA, Barry, Spurgeon, Carmody, Blewett, Anne-girl, Andrews, Allan, Gilbert, Sloane, Shining, Shirley, gable, Willowmere, Newbridge, geometry, Phillips, Prissy, brooch, Charlottetown, Rachel, Spencer, Moody, Mayflowers, Haunted, Green, Cordelia, GABLES, raspberry, Buote, Slope, Sands, asylum, wincey, Sunday-school, Spencervale, picnic, Orchard, Bell, CONCERT, INVITED, Jane, Entrance, Waters, SURPRISED, Sloanes, recite, concert, Boulter, AVONLEA, CUTHBERT, Minnie, fruit-cake, Elaine, brook, firs, manse, liniment, MATTHEW, Rogerson, LYNDE, Dryad, ridge-pole, firry, Avery, cake, Path, sleeves, orphan, ipecac, spruce, Birch, Josephine, spare-room, Gil, Idlewild, Hammond, Bubble, Lovers, birches, scholarship, Debating, amethyst, SOLEMN|
Words with highest TF*IDF scores in Anne of Green Gables, case-sensitive and using log normalized TF
And a little further down the list, starting with the 300th highest-weighted term:
|narcissi, yard, aisle, real, grove, PUPILS, behave, knitting, teach, roses, hair, rustic, braids, algebra, ROAD, aggravating, sighed, sateen, STACY, headaches, Class, Vow, flat, Camelot, woods, orphans, ORGANIZED, Wednesday, lily, ridiculous, Hopeton, frills, me, bottle, mare, decidedly, red, clasped, Bertram, out-of-doors, anybody, although, PROPERLY, ANTICIPATION, wo, OUT, providential, rapturously, Friday, wreath, Harris, perfectly, shadings, stay, icecream, ’d, beads, plum, gingham, EPOCH, I’ll, forgive, station-master, moonshine, flowers, skinny, prim, prayer, pedlar, students, cows, ’m, FORMED, recited, scrumptious, carrots, headland, shortcoming, afternoon, rainbows, afterlight, encoring, afternoons, Snow, green, liked, suppose, log, winter, TO, layer, apple-trees, buttercups, poetical, Tillie, hill, /, ambitions, Christmas, nonsense|
You can hover over the interactive graphic to examine the full distribution (clicking ‘autoscale’ will expand to the full 8000 terms).
TF-IDF is a heuristically derived metric. It is an attempt to codify an intuitive notion, that popular terms are less interesting, specific, or discriminating, and has remained popular for decades because it seems to work well. In information retrieval, where the concept was first introduced for ranking search results, many modern probabilistic models have indirectly recreated their own versions of this intuition. There have also been recent attempts to provide a theoretical explanation for the function of IDF.
Here, TF-IDF was used toward a qualitative goal, prioritizing words in a way that feels right. In practice, TF-IDF is better used for quantitative purposes, for representing documents in a less noisy way. For example, if representing documents as vectors in a term-document matrix (as search engines did back when TF-IDF was created), using term frequency for the values results in a comparison mainly of the high-frequency terms, while TF-IDF values provide more discriminatory power.
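The vector comparison can be illustrated with cosine similarity on two toy documents. The counts and IDF values are hypothetical, and cosine similarity is one common way to compare such vectors, assumed here rather than taken from this post.

```python
import math


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    shared = set(u) & set(v)
    dot = sum(u[t] * v[t] for t in shared)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v)


def weight(doc, idf):
    """Scale each raw term count by that term's IDF weight."""
    return {term: count * idf[term] for term, count in doc.items()}


# Hypothetical counts for two documents on different subjects
doc_a = {"the": 100, "and": 60, "orphan": 8}
doc_b = {"the": 110, "and": 55, "whale": 9}

# Hypothetical IDF weights: small for common words, large for rare ones
idf = {"the": 0.7, "and": 0.7, "orphan": 5.0, "whale": 6.0}

raw_similarity = cosine(doc_a, doc_b)
weighted_similarity = cosine(weight(doc_a, idf), weight(doc_b, idf))
print(raw_similarity, weighted_similarity)
```

With raw counts the two documents look nearly identical because ‘the’ and ‘and’ dominate both vectors; after weighting, the distinctive terms pull them apart.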
There’s nothing particularly complex about the idea of term weighting. I suspect that much of the reason that TF*IDF has persisted so long is that it is an elegant formulation of a simple, sensible idea: that words which occur in many documents don’t discriminate like those that occur in few. That’s it.
Want the code to calculate TF*IDF term weights with the HTRC Extracted Features data? I’ll post it next week.