About Me

I’m an information scientist with expertise in crowd systems, text and data mining, and information retrieval.

As a post-doctoral researcher at the HathiTrust Research Center, I work on massive-scale text analysis, teaching computers to read and using those methods to track cultural and historical trends across centuries.


Projects

HT+Bookworm

The HT+Bookworm project creates robust and novel ways of visualizing trends in the 14-million-volume HathiTrust Digital Library. Century-spanning digital library collections are valuable resources for computational social science and the digital humanities, but such scale also makes basic exploration and hypothesis-building difficult. HT+Bookworm allows scholars to ask questions within detailed sub-facets of the collection, e.g. How … Continue reading “HT+Bookworm”

Extracted Features Dataset

The Extracted Features (EF) Dataset provides access to 13.6 million books as preprocessed, extracted features, including part-of-speech-tagged page-level term counts, character counts, and line and sentence information. It is an unparalleled resource for tracking historical, linguistic, cultural, and structural trends across ages, languages, and topics. … Continue reading “Extracted Features Dataset”
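For a concrete sense of what the dataset looks like in practice, here is a minimal loading sketch using the HTRC Feature Reader (the Python library described in the posts below). The file path is a placeholder, and the exact method names assume a recent htrc-features release.

    from htrc_features import FeatureReader

    # Path is a placeholder; EF files ship as bzipped JSON, one per volume.
    fr = FeatureReader(['data/sample.basic.json.bz2'])

    for vol in fr.volumes():
        print(vol.title)
        # Page-level, part-of-speech-tagged term counts as a pandas DataFrame.
        tokens = vol.tokenlist(pages=True, pos=True)
        print(tokens.head())
        # Counts of the characters ending each line, i.e. the right margin.
        margins = vol.end_line_chars()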

Ubiquitous Text Analysis

Ubiquitous text analysis looks at the affordances of text analysis in context, such as through the TAToo web widget and Bookworm. I have also applied text visualization at a smaller scale, as a mnemonic device for tracking the progress of conceptual trends within a book.


  • Beyond tokens: what character counts say about a page - When talking about quantitative features in text analysis, token counts are king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at the margins of a page can show us intuitively sensible patterns in text.
  • Pico Safari: Active Gaming in Integrated Environments - With the recent release of Pokémon Go, I’m posting my presentation notes from designing a similar game called Pico Safari in collaboration with Lucio Gutierrez, Garry Wong, and Calen Henry in late 2009. The concept of virtual creatures in the real world follows so nicely from the technological affordances of the past few years, with … Continue reading "Pico Safari: Active Gaming in Integrated Environments"
  • Understanding Classified Languages in the HathiTrust - The HTRC Extracted Features (EF) dataset provides two forms of language information: the volume-level bibliographic metadata (what the library record says), as well as machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and where they conflict … Continue reading "Understanding Classified Languages in the HathiTrust"
  • A Dataset of Term Stats in Literature - Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset. I’ve uploaded two datasets, crunched over 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.
  • Term Weighting for Humanists - This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: not all words are equally valuable. Here, I introduce one of the foundational ways people have tried to formalize this intuition: TF-IDF (a minimal sketch follows this list).
  • HTRC Feature Reader 2.0 - I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur in the far right … Continue reading "HTRC Feature Reader 2.0"
  • Git tip: Automatically converting iPython notebook READMEs to Markdown - A small but useful tip today on using iPython notebooks for a git project README while keeping an auto-generated version in the Markdown format that GitHub prefers (a hook sketch follows this list).
  • MARC Fields in the HathiTrust - At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect is … Continue reading "MARC Fields in the HathiTrust"
  • Your First Twitter Bot, in 20 minutes - “I think it was the Pres. at dawn with the Spin Back Knuckle.” — bad Clue guesses (@BadClues), September 6, 2015. Creating a Twitter bot is a great exercise in formalizing a simple concept as a concrete implementation (a minimal bot sketch follows this list). Some of the best bots demonstrate this simplicity: a nugget of an idea, with the nuance in … Continue reading "Your First Twitter Bot, in 20 minutes"
  • I’m on the Job Market! - I’m an information scientist with a digital humanities background, specializing in large-scale text analysis, crowd systems, and information retrieval over novel datasets. Look at my CV, or contact me for a chat.
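Since the Term Weighting for Humanists post turns on TF-IDF, here is a minimal sketch of the standard formulation. The toy corpus and the particular variant (raw term frequency, log-scaled inverse document frequency) are illustrative assumptions, not necessarily the post’s.

    import math
    from collections import Counter

    # Toy corpus: each "document" is a list of tokens (placeholder data).
    docs = [
        "the whale sank the ship".split(),
        "the ship sailed home".split(),
        "the captain called me ishmael".split(),
    ]

    def tf_idf(term, doc, docs):
        tf = Counter(doc)[term]                      # term frequency in this doc
        df = sum(1 for d in docs if term in d)       # docs containing the term
        idf = math.log(len(docs) / df) if df else 0  # rarer terms weigh more
        return tf * idf

    print(tf_idf("whale", docs[0], docs))  # distinctive term: high weight
    print(tf_idf("the", docs[0], docs))    # appears in every doc: weight 0.0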
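For the notebook-README tip, one way to automate the conversion is a git pre-commit hook. This sketch is my own assumption about the wiring, not necessarily the post’s exact recipe; it shells out to nbconvert, which supports Markdown output.

    #!/usr/bin/env python
    # Save as .git/hooks/pre-commit and mark it executable (chmod +x).
    import subprocess

    # Regenerate README.md from the notebook before every commit.
    subprocess.run(["jupyter", "nbconvert", "--to", "markdown", "README.ipynb"],
                   check=True)
    # Stage the regenerated Markdown so it rides along with the commit.
    subprocess.run(["git", "add", "README.md"], check=True)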
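Finally, for the Twitter-bot exercise, a minimal posting sketch assuming the tweepy library and its classic OAuth flow; the post may well use a different client, and the credentials and canned phrases below are placeholders of my own invention.

    import random
    import tweepy

    # Placeholder credentials from a Twitter developer app.
    auth = tweepy.OAuthHandler("CONSUMER_KEY", "CONSUMER_SECRET")
    auth.set_access_token("ACCESS_TOKEN", "ACCESS_TOKEN_SECRET")
    api = tweepy.API(auth)

    # A @BadClues-style guess, assembled from canned parts.
    suspects = ["the Pres.", "Col. Mustard", "the butler"]
    places = ["at dawn", "in the study", "on the veranda"]
    weapons = ["the Spin Back Knuckle", "a candlestick", "a rope"]
    tweet = "I think it was {} {} with {}.".format(
        random.choice(suspects), random.choice(places), random.choice(weapons))

    api.update_status(tweet)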