Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.
I’ve uploaded two datasets, both crunched from 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.
Continue reading “A Dataset of Term Stats in Literature”
This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
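As a rough illustration of the intuition, here is a minimal sketch of one common TF-IDF variant: raw term count multiplied by log(N / document frequency). The function name `tf_idf` and the toy documents are made up for this example, and the full post may use a different weighting scheme.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Weight terms for each document in docs (a list of token lists).

    Returns one {term: weight} dict per document, using
    weight = raw term count * log(N / document frequency).
    """
    n_docs = len(docs)

    # Document frequency: how many documents contain each term at least once.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Inverse document frequency: a term found in every document scores 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    weighted = []
    for tokens in docs:
        tf = Counter(tokens)
        weighted.append({term: freq * idf[term] for term, freq in tf.items()})
    return weighted


# Toy corpus, invented for illustration only.
docs = [
    "the whale chased the ship".split(),
    "the ship sailed home".split(),
    "the garden was quiet".split(),
]
weights = tf_idf(docs)
```

In this toy corpus, "the" appears in every document and so gets a weight of zero, while "whale", which appears in only one document, outweighs "ship", which appears in two.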
Continue reading “Term Weighting for Humanists”
Based on an XKCD comic, the Up-Goer Five text editor only lets you write using the one thousand most common words in English. Here are my attempts to describe what I do in crowdsourcing and information retrieval using only common words.
How do you find something from a lot of written stuff? If there are hundreds of things or more, you can’t look at all of them! One way we can find things is to use the words to understand the ideas. Then, when you search with a question, we can find the ideas that you are asking about and find the written things that answer your question. However, words and ideas aren’t exactly the same thing, so we look for ways to make better how a computer understands the ideas in your question and in the stuff you’re looking through.
When people get together on computers, they make fun, cool, and strange things. My job is to understand why they do it, and how we can work together to fix problems in the same ways.
In Historical Note: Information Retrieval and the Future of an Illusion (1988), Don Swanson offers his experienced perspective on IR and on problems that we’ve ignored. He suggests that we explicate so-called ‘postulates of impotence’: statements of what cannot be done.
Swanson offers some postulates of impotence himself, as well as postulates of fertility. I like the idea of a research community formalizing the lies they’ve been telling themselves, so I’ve reproduced a truncated version of his postulates below.
In his postulates of impotence (PI), Swanson argues that fully automatic indexing and retrieval is not effectively possible. While he admits that computing brings many benefits of scale and speed, his PIs seek to remind us that those benefits don’t necessarily make us better at retrieval.
In his postulates of fertility (PF), Swanson offers a little-explored area where IR can help: making connections between disparate information that had not been considered previously. He cites scientific fields as a place where there is limited discussion and citation across field boundaries, but where doing so is extremely useful.
Do you have your own postulates of impotence for your field?
Continue reading “Postulates of Impotence”