At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features, like author gender and language inference, the metadata generally arrives as-is from institutional providers via HathiTrust. HathiTrust sets some baseline guidelines for partners, but beyond those, coverage depends on what each institutional partner provides.
For a sense of what that coverage looks like, below is a list of the most common MARC fields. Crunching these numbers didn’t require any special access through the Research Center: you could easily retrieve the same records via HathiTrust’s Bibliographic API (and hey, some code!).
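As a quick sketch of what that looks like, here is a small helper that builds a Bibliographic API URL for a single record; the identifier types listed are the ones the API documents (the example OCLC number is just an illustration):

```python
from urllib.parse import quote

BIB_API = "https://catalog.hathitrust.org/api/volumes"

def bib_api_url(id_type, id_value, level="brief"):
    """Build a HathiTrust Bibliographic API URL for one record.

    id_type is an identifier type the API accepts, e.g. 'oclc',
    'lccn', 'issn', 'isbn', 'htid', or 'recordnumber'.
    level is 'brief' or 'full' (full includes the MARC-XML).
    """
    return f"{BIB_API}/{level}/{id_type}/{quote(str(id_value))}.json"

# Fetching the JSON record is then one call away (needs network access):
# import json, urllib.request
# with urllib.request.urlopen(bib_api_url("oclc", 424023)) as resp:
#     record = json.load(resp)
```

Requesting the `full` level gets you the MARC record itself, which is what you’d parse to tally field coverage.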
The good news is that at the scale of HathiTrust’s collection, even a small random sample of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other distributions converge much, much earlier. The bad news is that you can’t assume the records with missing fields differ from the included ones only at random. To rule out that kind of bias, you’ll have to look more closely at the particular field you’re interested in.
This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.
Continue reading “MARC Fields in the HathiTrust”
Creating a Twitter bot is a great exercise for formalizing a simple concept in a concrete implementation. Some of the best bots demonstrate this simplicity: a nugget of an idea, with the nuance in the details. To implement a bot usually requires some programming, some data wrangling, and a server. However, it can be easier. By patching together some open datasets and a hosted version of a generative grammar, I’ll describe how to build a simple bot in 20 minutes. Continue reading “Your First Twitter Bot, in 20 minutes”
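The post doesn’t name a specific tool, but the generative-grammar idea it relies on can be sketched in a few lines: a dictionary maps symbols to possible expansions, and `#symbol#` placeholders are recursively replaced until none remain. The grammar below is entirely made up for illustration:

```python
import random

# A toy grammar: each key is a symbol, each value a list of
# possible expansions. '#name#' marks a symbol to expand.
GRAMMAR = {
    "origin": ["The #adjective# #noun# #verb#."],
    "adjective": ["old", "curious", "forgotten"],
    "noun": ["menu", "dictionary", "record"],
    "verb": ["waits", "whispers", "endures"],
}

def expand(symbol, grammar, rng=random):
    """Pick one expansion for symbol, then recursively fill in
    any #placeholders# it contains."""
    template = rng.choice(grammar[symbol])
    while "#" in template:
        start = template.index("#")
        end = template.index("#", start + 1)
        key = template[start + 1:end]
        template = template[:start] + expand(key, grammar, rng) + template[end + 1:]
    return template
```

A bot built this way is just `expand("origin", GRAMMAR)` on a timer; the craft is all in writing the grammar.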
I’m an information scientist with a digital humanities background, specializing in large-scale text analysis, crowd systems, and information retrieval over novel datasets.
Look at my CV, or contact me for a chat.
With data from NYPL Labs’ What’s on the Menu?
I just put up a modest reference repository with various slices of data on US names. I included an estimate of names among US-born citizens today, by cross-referencing baby names data and population age distribution for 2014, and gender probabilities by name. Find it on Github.
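The gender-probabilities part of that cross-referencing is simple to sketch: aggregate name/sex counts, then normalize per name. The counts below are invented for illustration (the real baby-names files have one row per name, sex, and year):

```python
from collections import defaultdict

# Hypothetical rows in the style of the US baby-names data:
# (name, sex, count). Numbers here are illustrative only.
rows = [
    ("Leslie", "F", 700), ("Leslie", "M", 300),
    ("Mary", "F", 995), ("Mary", "M", 5),
]

def gender_probabilities(rows):
    """Return {name: {sex: probability}} from (name, sex, count) rows."""
    totals = defaultdict(lambda: {"F": 0, "M": 0})
    for name, sex, count in rows:
        totals[name][sex] += count
    return {
        name: {sex: c / sum(counts.values()) for sex, c in counts.items()}
        for name, counts in totals.items()
    }
```

The age-distribution cross-referencing works the same way, weighting each birth-year’s counts by the share of that cohort still living in 2014.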
The richness of language can be under-appreciated because of its mundane nature. James Somers’s essay You’re probably using the wrong dictionary recently turned me on to old dictionaries, which – with colorful descriptions and honest uncertainties – gratify much more than what we’ve come to expect of dictionaries. While modern dictionaries give you matter-of-fact descriptions of words you don’t know, older dictionaries have a vivid, more exciting style that is equally likely to enlighten you about words you do know. Tracking down references made by John McPhee about his own dictionary, Somers recommends Webster’s Revised Unabridged 1913 dictionary.
Reading Webster’s 1913 is a satisfying exercise. What strikes me most, however, are the descriptions of slang, colloquialisms, and vulgarities. These are terms or uses that are informal and conversational; the dictionary’s own etymology for slang notes its roots in ‘having no just reason for being.’ With these entries, a work now seen as a record of American English is defining language which, by its own description, is “unauthorized”.
The tension results in a wonderful series of entries, some that are very familiar to us:
Continue reading “Old Slang: Appreciating Webster’s with Bots”