Understanding Classified Languages in the HathiTrust

lang-split.PNG

The HTRC Extracted Features (EF) dataset provides two forms of language information: the volume-level bibliographic metadata (what the library record says), as well as machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and where they conflict with existing language metadata.

Of the 4.8 million volumes in the dataset, there are 379,839 books where the most-likely language across all pages is different from the bibliographic language, about 8% of the collection. The reasons for these discrepancies are not always clear, and they can indicate issues with the language classifier, or the human cataloguing.

When do you trust the bibliographic record and when do you trust the machine classifier? The simple answer is neither: you trust in when they agree, and stay leery of when they don’t. Continue reading “Understanding Classified Languages in the HathiTrust”