I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur in the far right and left sides of the text. The Feature Reader provides easy parsing of the dataset format and in-memory access to different views of the features. This new version works in service of the SciPy stack of data analysis tools – particularly Pandas. I’ve also transferred the code to the HathiTrust Research Center organization, and it is the first version that can be installed by pip:
pip install htrc-feature-reader
If you want to jump into using the HTRC Feature Reader, the README walks you through the classes and their methods, the documentation provides more low-level detail, and the examples folder features Jupyter notebooks with various small tutorials. One such example is how to plot sentiment in the style of Jockers’s plot arcs. The focus of this post is explaining the new version of the Feature Reader.
Much of this release has been about paring down, moving away from Yet Another Thing to Learn toward Scaffolding for the Things You Are (Likely) Already Doing. The first version of the Feature Reader used custom dictionaries and classes for counting and folding results. It was quick for basic needs, but lacked flexibility. I found over the past two years that I would ignore the library and write custom code to work in Pandas. Now, the library returns Pandas DataFrames where it’s sensible. Pandas is the secret sauce that makes data science in Python competitive with R, so supporting it is important for expert users. I also hope that for newcomers the Feature Reader can provide practical problems to get people comfortable with Pandas and the larger stack of libraries that it co-exists with (i.e. NumPy, SciPy, matplotlib, Seaborn, statsmodels, scikit-learn, IPython, and Jupyter).
Why use the library? The most apparent benefit is that these are activities you’ll be doing over and over, so it is simply easier to ask for something like vol.tokenlist(pos=False, case=False) than to parse the JSON, convert to a DataFrame, fold words by case, and sum part-of-speech tags.
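To make the by-hand steps concrete, here is a minimal sketch of what that boilerplate looks like. It assumes the page-level JSON nests token counts as features → pages → body → tokenPosCount, uses a tiny made-up volume, and swaps the DataFrame for a plain Counter so that it runs with only the standard library. The function name is hypothetical and is not part of the Feature Reader.

```python
import json
from collections import Counter

# A tiny made-up stand-in for one volume's JSON. The nesting
# (features -> pages -> body -> tokenPosCount) is an assumption here.
raw = json.dumps({
    "features": {"pages": [
        {"seq": "00000001", "body": {"tokenPosCount": {
            "The": {"DT": 1}, "the": {"DT": 3}, "Walk": {"NN": 1}}}},
        {"seq": "00000002", "body": {"tokenPosCount": {
            "walk": {"VB": 2, "NN": 1}, "the": {"DT": 2}}}},
    ]}
})


def folded_token_counts(volume_json):
    """Volume-wide counts per lowercased token, summed over POS tags."""
    volume = json.loads(volume_json)
    counts = Counter()
    for page in volume["features"]["pages"]:
        for token, pos_counts in page["body"]["tokenPosCount"].items():
            # Fold case with lower() and drop the POS split by summing it.
            counts[token.lower()] += sum(pos_counts.values())
    return counts


counts = folded_token_counts(raw)
print(counts.most_common())
```

Even in this stripped-down form there are three nested levels to get right before any analysis starts, which is the part the library takes off your hands.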
However, it is not simply the ease of having boilerplate. Part of what I have been focusing on is improving performance for common functions. The main performance bottleneck is reading the data, from file to Pandas DataFrame. I’ve been profiling different approaches for this set of steps and, while there are likely many optimizations to eke out, the current code is 5-6x as fast as what I expect a common user (i.e. thinking of me here!) would do if writing their own parse-to-Pandas code for a script. If you hope to process hundreds of thousands of books, that difference matters.
There are other efficiency tricks that the HTRC Feature Reader helps with. Very little is processed at initialization – the raw data is prepared into something more manageable only when you ask for it – and it is cached when it is processed. For example, since most features are available at the page level, if you ask for token counts for page 1, we’ll only create a DataFrame from the JSON data of that page; inversely, if you ask the full volume’s token counts, asking for pages afterward will pull from a cached full-volume DataFrame rather than processing the raw data again.
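The process-on-demand-then-cache pattern can be sketched roughly as follows. This is an illustration of the idea only: LazyVolume and its methods are hypothetical names, plain dicts stand in for DataFrames, and for simplicity the sketch caches per page rather than slicing pages out of a cached full-volume table.

```python
class LazyVolume:
    """Hypothetical stand-in, not the Feature Reader's actual class."""

    def __init__(self, raw_pages):
        # Initialization is cheap: keep a reference, process nothing yet.
        self._raw_pages = raw_pages
        self._page_cache = {}
        self._volume_cache = None
        self.parse_calls = 0  # counts how often raw data is processed

    def _parse(self, page_index):
        # Stand-in for the expensive raw-JSON-to-table step.
        self.parse_calls += 1
        return dict(self._raw_pages[page_index])

    def page_counts(self, page_index):
        # Process a page only the first time it is requested.
        if page_index not in self._page_cache:
            self._page_cache[page_index] = self._parse(page_index)
        return self._page_cache[page_index]

    def volume_counts(self):
        # Build the volume-wide totals once, reusing cached pages.
        if self._volume_cache is None:
            total = {}
            for i in range(len(self._raw_pages)):
                for token, n in self.page_counts(i).items():
                    total[token] = total.get(token, 0) + n
            self._volume_cache = total
        return self._volume_cache


vol = LazyVolume([{"the": 3}, {"the": 2, "walk": 1}])
vol.page_counts(0)     # processes page 0 only
vol.volume_counts()    # processes page 1, reuses page 0
vol.page_counts(1)     # served from the cache
print(vol.parse_calls)
```

The counter shows the payoff: each page's raw data is processed once, no matter how many times or in what order the views are requested.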
from htrc_features import FeatureReader

# paths: a list of paths to Extracted Features files
fr = FeatureReader(paths)
for volume in fr.volumes():
    for page in volume.pages():
        do_something()
In-memory iteration will save your machine!
What else has changed
In addition to the rewrite for Pandas support, performance improvements, transfer to the HTRC, and listing on PyPI for installing through pip, a few smaller updates have been pushed.
- Support for advanced feature files. With the Extracted Features v.0.2 dataset, we split out some features to an ‘advanced’ file, for features that are useful for only a few researchers. These are supported in the library now, though not too deeply, because we’re moving away from the basic/advanced split in future EF releases.
- Download URL utility. The extracted features dataset is distributed using Rsync, based on the HathiTrust ID of each volume. If you want to download a specific file, the Feature Reader now has a utility for getting the rsync URL from an id.
- Test suite. The library now has a test suite written for pytest. Before every push to GitHub, the code is tested in Python 2 and Python 3. Just to make sure nothing breaks 🙂
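For a sense of what the download URL utility in the list above has to do, here is a rough sketch of turning a volume ID into a relative path. It assumes a Pairtree-style layout (library prefix, then the cleaned ID split into two-character directories) and a .json.bz2 file name. The function name and the example ID are made up for illustration, and the library's own utility is the one to rely on for real downloads.

```python
def rsync_path_sketch(htid):
    """Hypothetical helper: build a Pairtree-style relative path for an ID.

    The directory layout and the file extension are assumptions made
    for this sketch, not a statement of the dataset's actual layout.
    """
    # Split "library.local_id" on the first dot only.
    libid, local_id = htid.split(".", 1)
    # Pairtree character mapping for filesystem-unfriendly characters.
    clean = local_id.replace("/", "=").replace(":", "+").replace(".", ",")
    # Break the cleaned ID into two-character directory names.
    pairs = [clean[i:i + 2] for i in range(0, len(clean), 2)]
    return "{lib}/pairtree_root/{dirs}/{clean}/{lib}.{clean}.json.bz2".format(
        lib=libid, dirs="/".join(pairs), clean=clean)


# A made-up ID, chosen to exercise the character mapping.
print(rsync_path_sketch("xyz.ab12:34/5"))
```

The sketch skips the hex-encoding that the Pairtree specification applies to other unusual characters, so it only covers the common cases.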
The focus for the Feature Reader library in the coming months is pedagogical: it provides a foundation that can make introductory text analysis concepts more welcoming, so I hope that we can use it for teaching eager new researchers. In addition to growing the tutorials in the GitHub examples folder and pursuing other tutorial venues, there is also an upcoming Python-only update to the Within-Book Topic Modeling work that I released on top of the Feature Reader a few years ago, one that will be much easier to use.