Understanding Classified Languages in the HathiTrust


The HTRC Extracted Features (EF) dataset provides two forms of language information: volume-level bibliographic metadata (what the library record says) and machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and at where they conflict with the existing language metadata.

Of the 4.8 million volumes in the dataset, there are 379,839 where the most-likely language across all pages differs from the bibliographic language, about 8% of the collection. The reasons for these discrepancies are not always clear: they can indicate issues with the language classifier or with the human cataloguing.
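
To make the comparison concrete, here is a minimal sketch of one way to flag such a volume. It assumes that "most-likely language" can be approximated by a simple majority vote over page tags, which may not be exactly how the figures above were computed. The function names and the sample volume are hypothetical, and the sketch assumes both sources use the same language-code scheme.

```python
from collections import Counter

def majority_language(page_langs):
    """Return the most common page-level tag, ignoring pages with no tag."""
    counts = Counter(lang for lang in page_langs if lang)
    if not counts:
        return None  # no page was classified at all
    return counts.most_common(1)[0][0]

def language_conflict(bib_lang, page_langs):
    """True when the page-level majority disagrees with the library record."""
    top = majority_language(page_langs)
    return top is not None and top != bib_lang

# Hypothetical volume: catalogued as French, but mostly classified as English.
pages = ["en", "en", None, "fr", "en"]
print(majority_language(pages))        # en
print(language_conflict("fr", pages))  # True
```

A majority vote treats every page equally, so a long run of front matter or blank pages can sway the result. Weighting by page length or classifier confidence would be a natural refinement.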

When do you trust the bibliographic record and when do you trust the machine classifier? The simple answer is neither: you trust them when they agree, and stay leery when they don’t.

Continue reading “Understanding Classified Languages in the HathiTrust”

A Dataset of Term Stats in Literature

Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.

I’ve uploaded two datasets, crunched from 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.
Continue reading “A Dataset of Term Stats in Literature”

Term Weighting for Humanists

This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
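
As a rough illustration of the intuition, here is a minimal sketch of one common TF-IDF variant: raw term count multiplied by log(N / document frequency). Many other variants exist (log-scaled counts, smoothed IDF), so this is not necessarily the formulation the full post uses. The function name and the toy documents are made up for the example.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: the number of documents each term appears in.
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    # IDF: terms found in fewer documents get a larger weight.
    idf = {term: math.log(n_docs / count) for term, count in df.items()}
    weights = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term counts within this document
        weights.append({term: freq * idf[term] for term, freq in tf.items()})
    return weights

# Toy corpus (hypothetical): three very short "documents".
docs = [
    "the whale and the sea".split(),
    "the garden and the house".split(),
    "the whale hunt".split(),
]
scores = tf_idf(docs)
# Show the first document's terms, highest weight first.
print(sorted(scores[0].items(), key=lambda kv: -kv[1]))
```

In the first document, "the" appears in every text and so gets a weight of zero, while "sea", which appears only there, is weighted highest. That is the sense in which not all words are equally valuable.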

Continue reading “Term Weighting for Humanists”