HTRC Feature Reader 2.0

I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur at the far right and left sides of the text. The Feature Reader provides easy parsing of the dataset format and in-memory access to different views of the features. This new version works in service of the SciPy stack of data analysis tools – particularly Pandas. I’ve also transferred the code to the HathiTrust Research Center organization, and this is the first version that can be installed with pip:

pip install htrc-feature-reader

If you want to jump into using the HTRC Feature Reader, the README walks you through the classes and their methods, the documentation provides more low-level detail, and the examples folder features Jupyter notebooks with various small tutorials. One such example is how to plot sentiment in the style of Jockers’s plot arcs. The focus of this post is explaining the new version of the Feature Reader.

Chart from the Within Books Sentiment Trends tutorial

Continue reading “HTRC Feature Reader 2.0”

Your First Twitter Bot, in 20 minutes

Creating a Twitter bot is a great exercise for formalizing a simple concept in a concrete implementation. Some of the best bots demonstrate this simplicity: a nugget of an idea, with the nuance in the details. Implementing a bot usually requires some programming, some data wrangling, and a server. However, it can be easier. By patching together some open datasets and a hosted version of a generative grammar, I’ll describe how to build a simple bot in 20 minutes.

Continue reading “Your First Twitter Bot, in 20 minutes”

Running Maps

Running in Boston

Motorola’s now-discontinued MotoACTV sportswatch gives you the commendable option to download all your running routes.

With a touch of data hacking, some manual editing to remove redundant routes, and some beautiful map tiles from Stamen, I ended up with a nice record of the places that I visited in 2012/13 and the parts of my town that I explored.

Continue reading “Running Maps”

Low-Effort Crowdsourcing

Sentence generation with choice-based typing. The program prompts a user to choose one of two words that are likely to come after the previous words, allowing them to generate a whole sentence by low-effort interaction.—programmed by Jeff Bigham

How small can a crowdsourcing contribution be?

At November’s CrowdCamp workshop, a group of us got together and prototyped a number of sample systems to see how low-effort crowdsourcing would work. We posted a report at Follow the Crowd.

Our prototypes were silly at times, but helped us think about the mixture of low-effort input methods and non-distracting user contexts where low-effort crowdsourcing would work.

The ideas we prototyped, available at Github, include:

  • A binary tweeting interface that lets you type sentences using a choice between common words
  • A passive image voting interface that captures a user’s smile as a ‘like’
  • A browser extension proof-of-concept that lets a worker complete tasks while a page is loading
  • A hot-or-not style interface for choosing the better of two choices. The twist is that you’re choosing using affirmative grunts, so you can play it while listening (or pretending to listen?) to somebody!

Uh-huh. Yeah.

The emotive voting interface ‘likes’ an image if you smile while the image is on the screen, and ‘dislikes’ if you frown.
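The choice-based typing idea behind the binary tweeting prototype can be sketched in a few lines. This is only an illustration under my own assumptions, not the prototype's code: it uses a simple bigram count over a tiny made-up corpus, and the function names (build_bigrams, two_choices, compose) are hypothetical.

```python
from collections import Counter, defaultdict

def build_bigrams(corpus):
    """Count which words follow each word in a list of sentences."""
    following = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            following[prev][nxt] += 1
    return following

def two_choices(following, prev_word):
    """Return up to two of the most frequent words seen after prev_word."""
    counts = following.get(prev_word, Counter())
    return [word for word, _ in counts.most_common(2)]

def compose(following, start, picks):
    """Build a sentence from a start word and a sequence of binary picks (0 or 1)."""
    sentence = [start]
    for pick in picks:
        options = two_choices(following, sentence[-1])
        if not options:
            break  # nothing is known to follow the last word
        # Fall back to the only option when just one exists.
        sentence.append(options[min(pick, len(options) - 1)])
    return " ".join(sentence)

# Made-up corpus, purely for illustration.
corpus = [
    "the crowd is large",
    "the crowd is helpful",
    "the crowd is helpful",
    "the task is small",
    "the crowd votes quickly",
]
model = build_bigrams(corpus)
print(two_choices(model, "the"))         # ['crowd', 'task']
print(compose(model, "the", [0, 0, 0]))  # the crowd is helpful
```

Each pick is a single binary decision, which is what keeps the contribution low-effort: the user never types, only chooses.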

Details at Follow the Crowd. The team was Jeff Bigham, Kotaro Hara, Rajan Vaish, Haoqi Zhang, and myself.

Progress Bar Timer


A general purpose productivity tracker

I just published Progress Bar Timer in the Chrome Web Store. It lets you set up general purpose trackers in the form of progress bars. There are counters, timers, and clocks. Code is at Github, so feel free to submit bug reports and suggestions.

The application was designed around my productivity habits but – spurred by the sense of the public eye on Github – I’ve tried to make it useful to others. My favorite use has been to combine a counter and a clock side-by-side. For example, during my field exam, I maintained a bar of word count progress alongside a bar showing where in the two-week writing period I was.

Progress Bar

Continue reading “Progress Bar Timer”

Crowdsourcing Swift

Earlier this week I led a class on the topic of Crowdsourcing. Since our discussion was focused primarily on Human Computation research, I took the opportunity to show a live demonstration of Mechanical Turk.

After some thought about which perception-based tasks could be outsourced to workers and return meaningful results between the beginning and end of class, I settled on a modern-day rewrite of Jonathan Swift’s A Modest Proposal.

If you haven’t read Swift’s famous 18th century satire, I encourage you to do so. In it, Swift describes the plights of the impoverished in Ireland before offering a solution: for the poor to sell their children to the wealthy for food. The brilliance of the piece is in the cold rhetoric being used to argue for such a shocking proposition.

Of course, a modern read of A Modest Proposal as a satire is different from a completely naive read, one where the reveal of the proposal is truly a shock and where there’s a risk that a reader may not recognize it as satire at all.

This is why the idea of paying workers to rewrite it in plain English, sentence-by-sentence with no context, provided much amusement. What would these workers think, looking at a sentence written in such unassuming prose and deciphering it, only to realize that it is about cannibalism? Even better, I suspected most wouldn’t realize it, except for those rewriting a select few sentences.

I compartmentalized the task into two steps: rewriting and voting. To constrain the rewriting task and keep turkers from simply offering back the same line, I had the rewrites done as tweets, which is to say written in 140 characters or less. Each line was rewritten either two or three times (starting with three, I lowered the count after observing less noise than expected) before being promoted to the voting stage. In the voting stage, workers were presented with the original sentence and rewrites, and chose the best one.
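The two-step flow can be sketched roughly as follows. This is an illustrative Python sketch, not the code used for the experiment; the function names and the exact acceptance rules (rejecting empty rewrites and case-insensitive copies) are my assumptions, and the sample strings are placeholders rather than real worker submissions.

```python
from collections import Counter

TWEET_LIMIT = 140  # character limit described above

def accept_rewrite(original, rewrite):
    """Accept a rewrite only if it fits in a tweet and is not a copy of the original."""
    cleaned = rewrite.strip()
    if not cleaned or len(cleaned) > TWEET_LIMIT:
        return False
    return cleaned.lower() != original.strip().lower()

def ready_for_voting(rewrites, needed=3):
    """A sentence is promoted to voting once it has enough accepted rewrites."""
    return len(rewrites) >= needed

def pick_winner(votes):
    """Return the rewrite with the most votes (ties go to the one voted for first)."""
    return Counter(votes).most_common(1)[0][0]

# Placeholder data, purely for illustration.
original = "An original sentence written in formal, roundabout prose."
submitted = ["A plain version.", original, "Another plain version."]
accepted = [r for r in submitted if accept_rewrite(original, r)]

if ready_for_voting(accepted, needed=2):
    votes = ["A plain version.", "Another plain version.", "A plain version."]
    print(pick_winner(votes))
```

Lowering the needed count from three to two, as described above, only changes the promotion threshold; the voting step stays the same.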

The rewriting and voting modules were written in PHP and MySQL over the weekend, and then modified to fit into Mechanical Turk tasks using Amazon’s Command Line Tools. I paid $0.11 for each rewritten sentence and $0.02 for each vote. At 64 sentences, this cost around twenty dollars, though the rewriting wage was notably higher than comparable tasks on the site.
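As a rough check on that figure, the cost can be bounded from the numbers given. The number of votes collected per sentence is not stated, so the value below is an assumption, and any Mechanical Turk fees are left out.

```python
SENTENCES = 64
REWRITE_CENTS = 11  # $0.11 per rewritten sentence
VOTE_CENTS = 2      # $0.02 per vote

# Assumption: the post does not say how many votes each sentence received.
VOTES_PER_SENTENCE = 3

def total_cents(rewrites_per_sentence, votes_per_sentence):
    """Total cost in cents, kept as integers to avoid floating-point drift."""
    rewriting = SENTENCES * rewrites_per_sentence * REWRITE_CENTS
    voting = SENTENCES * votes_per_sentence * VOTE_CENTS
    return rewriting + voting

low = total_cents(2, VOTES_PER_SENTENCE)   # every sentence rewritten twice
high = total_cents(3, VOTES_PER_SENTENCE)  # every sentence rewritten three times

print(f"${low / 100:.2f} to ${high / 100:.2f}")  # $17.92 to $24.96
```

Since some sentences were rewritten three times and others twice, the actual total falls between the two bounds, which fits the figure of around twenty dollars.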

I have a somewhat hesitant relationship with paid human computation. Crowdsourcing with volunteers forces the organizer to be considerate of the crowds and offer them a satisfying intrinsic reward, but once you’re paying them it’s easy to see people as simply labour, because they are. Though this isn’t inherently bad, it introduces a slippery slope to an exploitative relationship. A Modest Proposal criticizes such dehumanization of citizens by using a systems-level look, appropriate considering the experiment was partially a response to Soylent, a Microsoft Office plug-in for outsourcing document proofing on Turk.

The crowdsourced Swift is on Twitter now, repulsing people with his views over the upcoming week. Follow him at @swiftsays.