When talking about quantitative features in text analysis, the term (token) count is king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at the margins of a page can reveal intuitively sensible patterns in text.
I’ve uploaded two datasets crunched from 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.
This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
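To make the intuition concrete, here is a minimal sketch of one common TF-IDF variant: raw term count multiplied by the natural log of (number of documents / number of documents containing the term). The function name `tf_idf` and the three toy documents are assumptions for illustration, and other weightings (smoothed IDF, length-normalized TF) are just as legitimate.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Score each term in each document.

    docs: a list of documents, each a list of tokens.
    Returns one dict per document mapping term -> tf * idf.
    """
    n_docs = len(docs)

    # Document frequency: in how many documents does each term appear?
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))

    # Inverse document frequency: terms found everywhere get log(1) = 0.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}

    scores = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        scores.append({term: count * idf[term] for term, count in tf.items()})
    return scores


# Toy corpus, invented for this example.
docs = [
    "the whale and the sea".split(),
    "the garden and the rose".split(),
    "the sea and the storm".split(),
]
scores = tf_idf(docs)

# Print the terms of the first document, most distinctive first.
for term, score in sorted(scores[0].items(), key=lambda kv: -kv[1]):
    print(f"{term:>6}  {score:.3f}")
```

In the first document, "the" appears twice but scores zero because it occurs in every document, while "whale" appears once and scores highest because no other document contains it. That is the "not all words are equally valuable" idea in miniature.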
I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur at the far right and left sides of the text. The Feature Reader provides easy parsing of the dataset format and in-memory access to different views of the features. This new version works in service of the SciPy stack of data analysis tools – particularly Pandas. I’ve also transferred the code to the HathiTrust Research Center organization, and it is the first version that can be installed with pip:
pip install htrc-feature-reader
If you want to jump into using the HTRC Feature Reader, the README walks you through the classes and their methods, the documentation provides more low-level detail, and the examples folder features Jupyter notebooks with various small tutorials. One such example is how to plot sentiment in the style of Jockers’s plot arcs. The focus of this post is explaining the new version of the Feature Reader.
At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect is dependent on what institutional partners provide.
For a sense of what that is, below is a list of the most common MARC fields. Crunching this didn’t involve any special access through the Research Center: you could easily access the same records via HathiTrust’s Bibliographic API (and hey, some code!).
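As a rough illustration of how such a field tally could be computed (this is not the code linked above), the sketch below counts how many records contain each MARC field tag. It assumes you have already retrieved each record as a MARC-XML string; the helper names `marc_tags` and `field_coverage` and the two tiny sample records are invented for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def marc_tags(marcxml):
    """Return the set of MARC field tags present in one MARC-XML record."""
    root = ET.fromstring(marcxml)
    tags = set()
    for elem in root.iter():
        # Drop any namespace prefix, e.g. '{http://www.loc.gov/MARC21/slim}datafield'.
        name = elem.tag.rsplit("}", 1)[-1]
        if name in ("controlfield", "datafield") and "tag" in elem.attrib:
            tags.add(elem.attrib["tag"])
    return tags


def field_coverage(records):
    """Map each tag to the fraction of records that contain it at least once."""
    counts = Counter()
    for record in records:
        counts.update(marc_tags(record))
    return {tag: n / len(records) for tag, n in counts.items()}


# Two made-up, heavily abbreviated records; real MARC records have many more fields.
NS = 'xmlns="http://www.loc.gov/MARC21/slim"'
records = [
    f'<record {NS}><controlfield tag="008">x</controlfield>'
    f'<datafield tag="245"><subfield code="a">A title</subfield></datafield>'
    f'<datafield tag="050"><subfield code="a">PS3545</subfield></datafield></record>',
    f'<record {NS}><controlfield tag="008">y</controlfield>'
    f'<datafield tag="245"><subfield code="a">Another title</subfield></datafield></record>',
]

coverage = field_coverage(records)
for tag, share in sorted(coverage.items(), key=lambda kv: (-kv[1], kv[0])):
    print(tag, f"{share:.0%}")
```

Each tag is counted once per record, so a repeated field (several subject headings, say) does not inflate its coverage. Field 050 is where a Library of Congress call number would normally sit.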
The good news is that at the scale of the HathiTrust’s collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge long before the full collection has been processed. The bad news is that you can’t be sure that fields are missing at random rather than in some biased way. For that, you’ll have to look more closely at a field that you’re interested in.
This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.
The richness of language can be under-appreciated because of its mundane nature. James Somers’s essay “You’re probably using the wrong dictionary” recently turned me on to old dictionaries, which – with colorful descriptions and honest uncertainties – gratify much more than what we’ve come to expect of dictionaries. While modern dictionaries give you matter-of-fact descriptions of words you don’t know, older dictionaries have a vivid, more exciting style that is equally likely to enlighten you about words you do know. Tracking down references made by John McPhee about his own dictionary, Somers recommends Webster’s Revised Unabridged 1913 dictionary.
Reading Webster’s 1913 is a satisfying exercise. What strikes me most, however, are the descriptions of slang, colloquialisms, and vulgarities. These are terms or uses which are informal, conversational; the dictionary’s etymology for slang notes its roots in ‘having no just reason for being.’ With these entries, a work now seen as a record of American English is defining language which, by its own description, is “unauthorized”.
The tension results in a wonderful series of entries, some that are very familiar to us:
Click the image for a map of topics in the Day of Digital Humanities. This was a product of a failed method that I was working on. I meant to share this on Twitter, but compression made it look terrible.