The HTRC Extracted Features (EF) dataset provides two forms of language information: volume-level bibliographic metadata (what the library record says) and machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and examined where they conflict with the existing bibliographic language metadata.
Of the 4.8 million volumes in the dataset, there are 379,839 books where the most-likely language across all pages differs from the bibliographic language, about 8% of the collection. The reasons for these discrepancies are not always clear; they can indicate problems with either the language classifier or the human cataloguing.
When do you trust the bibliographic record, and when do you trust the machine classifier? The simple answer is neither: trust them when they agree, and stay leery when they don’t. While the bibliographic record is usually more reliable, the machine-classified pages are useful for books of undetermined language and for certain cases of mixed languages. In repeated dives into the original scans in the HathiTrust Digital Library to cross-check the languages, I found that a machine tag that disagrees with the book metadata overwhelmingly points to non-prose content like tables and illustrations.
Below I’ve collected a set of pointers to help others make sense of the two types of language information in the Extracted Features dataset.
The greatest value in comparing the machine tags and bibliographic metadata is in finding agreement. It is hard to tell the reasons when they disagree, but when they agree (about 93.7% of the time), it’s an additional signal for helping you focus on clean textual content. The EF dataset already provides features that help filter out noisy information like paratext and headers/footers; unexpected language classification is another hint toward poor OCR or non-textual content.
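As a minimal sketch of this agreement check, assuming you have already reduced each volume to its bibliographic language code and a flat list of per-page machine tags (the data layout here is illustrative, not the EF schema):

```python
from collections import Counter

def majority_language(page_tags):
    """Return the most common machine-classified language across pages,
    ignoring pages the classifier left untagged (None)."""
    counts = Counter(tag for tag in page_tags if tag)
    return counts.most_common(1)[0][0] if counts else None

def agrees_with_bib(bib_lang, page_tags):
    """True when the page-level majority matches the bibliographic record.
    Agreement is the useful signal: it suggests clean, prose-like content."""
    return majority_language(page_tags) == bib_lang

# Illustrative volume: mostly English pages with a little classifier noise.
pages = ["en", "en", "de", "en", None, "en"]
print(agrees_with_bib("en", pages))  # True
```

Volumes that fail this check are the 8% worth treating with suspicion, for the reasons discussed above.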
Books with a lot of variety in classified languages between pages are likely to be non-textual – like musical scores, data tables, or illustrations – rather than multilingual texts. This was disappointing: I had hoped they would reveal unknown mixed texts. With the benefit of hindsight, it makes sense that a bevy of language classifications more likely indicates a total classification failure, and the types of materials that would lead to one. However…
Books whose page-level classifications are dominated by two fairly balanced languages tend to be multilingual texts. Manually reviewing a subsample of 20 English books from this category, I found that only 7 were failures of the classifier (six of which were mainly non-prose, like tabular information and illustrations), leaving 13 books, or 65%, that are indeed multilingual. Of these, 5 were dictionaries, phrasebooks, or writing about the grammar of another language; 5 were side-by-side translations (e.g. New Orleans senate and Canadian parliamentary documents); 2 were non-English texts published with English commentary; and 1 was a multi-language book bibliography.
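The “two fairly balanced languages” pattern can be operationalized as a share threshold on the top two page languages. A sketch, with a made-up 30% cutoff (the threshold is my illustrative choice, not from the EF documentation):

```python
from collections import Counter

def looks_bilingual(page_tags, min_share=0.3):
    """Flag volumes where the two most common page languages are fairly
    balanced: each covers at least `min_share` of the classified pages.
    The 0.3 default is illustrative, not an EF-documented value."""
    counts = Counter(tag for tag in page_tags if tag)
    total = sum(counts.values())
    if total == 0 or len(counts) < 2:
        return False
    (_, top), (_, second) = counts.most_common(2)
    return top / total >= min_share and second / total >= min_share

# A side-by-side translation pattern: alternating English/French pages.
print(looks_bilingual(["en", "fr"] * 10))           # True
# A mostly-English volume with a couple of noisy pages.
print(looks_bilingual(["en"] * 18 + ["fr", "de"]))  # False
```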
Certain languages are poorly¹ machine-classified, particularly non-European languages. If you are interested in Macedonian, Korean, Kannada, Albanian, Estonian, or Hindi, it is best to ignore the EF machine-classified language probabilities.
In cases like Macedonian and Korean, it appears that the classifier tries to tag many texts as the language, failing most of the time (low precision). On the other end, there is a fair amount of agreement between the classifier and bibliographic metadata for the usual European languages (English, Spanish, German, French, Dutch), as well as Hungarian, Russian, Polish, and Hebrew.
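To illustrate why the low-precision pattern sinks the F1 measure used here, consider a toy classifier that tags many texts as Korean but is rarely right. The rates below are made up for illustration, not measured from the dataset:

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Low precision drags F1 down even when recall is high --
# the over-tagging pattern described above.
print(round(f1(0.10, 0.90), 2))  # 0.18
```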
Ignore the machine tags for Latin books! The biggest blindspot for the machine classifier is the lack of a model for Latin. There are 102,810 texts bibliographically classified as Latin, all of which are incorrectly labelled by the machine. The most common doppelgangers for Latin are Italian, Romanian, and French. The next biggest blindspot is the relatively trivial 4,453 volumes in Ottoman Turkish, most of which were tagged as Arabic.
Besides Latin, the most commonly confused languages are shown below:
| machine tag | bib language | count |
| --- | --- | --- |
We already know that a poor classification model is responsible for incorrect ‘Korean’ tags. For the fairly reliably classified European languages, I was reluctant to say that the machine “confused” one language for another, as it might also be a correction for an incorrect bibliographic record. However, in a manual review, the machine classifier tended to be wrong. For example, in a sample of 20 English volumes that the classifier labelled as German, all were indeed English. They included lots of numbers and science terms – a NASA technical report, a city commissioner’s report, a geological survey, a scientific journal – which provides a sense of what the EF classifier thinks is German. It also shows how the page-level classifications can be useful: even if a page isn’t German, there is nonetheless meaning in its classification as such.
Machine classifications can make sense of books without a known language in the bibliographic record. Though the machine classifier looks poor when in conflict with the more reliable bibliographic metadata, it is in fact effective most of the time. In particular, it can be used for the 10.4k texts in the EF dataset with an undetermined bibliographic language, or the 4.3k texts listed as multilingual. In another small sample of 20 books, the EF page-level language tags correctly classified 16-17² of them.
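For those undetermined volumes, one way to turn page-level classifications into a volume-level guess is to average per-page language probabilities and keep the winner only when it clears a confidence floor. A sketch, assuming per-page probability mappings in a simplified layout (the field shapes and the 0.5 threshold are illustrative, not from the EF schema):

```python
def infer_volume_language(page_probs, min_confidence=0.5):
    """Aggregate per-page language probabilities into a volume-level guess,
    e.g. for volumes catalogued as undetermined. Returns None when no
    language clears the (illustrative) average-probability threshold."""
    totals = {}
    n_pages = 0
    for probs in page_probs:
        if not probs:
            continue  # skip pages the classifier could not score
        n_pages += 1
        for lang, p in probs.items():
            totals[lang] = totals.get(lang, 0.0) + p
    if n_pages == 0:
        return None
    best = max(totals, key=totals.get)
    return best if totals[best] / n_pages >= min_confidence else None

# A confidently German volume, with one unscored (e.g. blank) page.
pages = [{"de": 0.9, "en": 0.1}, {"de": 0.8, "en": 0.2}, {}]
print(infer_volume_language(pages))  # de
```

The confidence floor matters: per the earlier sections, a volume where no single language dominates is more often a classification failure than a genuinely mixed text.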
¹ The measure used here is the F1 score, the harmonic mean of precision and recall.
² One book was classified as Swahili, though like a cataloguer before me, I was stumped: I suspected but couldn’t confirm that it is Kokonga.