Beyond tokens: what character counts say about a page

When talking about quantitative features in text analysis, token count is king, but other features can help infer a page's content and context. Here I demonstrate visually how the characters at the margins of a page reveal intuitively sensible patterns in text.

Continue reading “Beyond tokens: what character counts say about a page”

A Dataset of Term Stats in Literature

Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.

Crunched for 235,000 Language and Literature (i.e. LCC Class P) volumes, I’ve uploaded two datasets: IDF values for the terms, and a more generally useful dataset of frequency counts.
Continue reading “A Dataset of Term Stats in Literature”

Add user pseudonyms in data analysis

When analyzing anonymous user data in a team, I often take an extra step to help discussion: converting user identifiers to popular English name pseudonyms.

Pseudonyms tend to make the data more welcoming to team members who aren't working directly with it, and they help you follow trends and outliers. They also help with visual sanity checks during analysis: names are simply easier to remember, so you spot problems more quickly when inspecting the data.

Popular baby names are readily provided by the Social Security Administration, and I usually keep a derivative text list handy. In the simplest case, you can convert each unique ID into a name. When I want to safeguard against name assignments changing as the data changes, I save the ID>Name mapping in a basic CSV.

Below is a very basic example written in R to show how easy it is to do:
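A minimal sketch of the idea follows; the sample data, file names, and one-name-per-line `names.txt` list (derived from the SSA baby names data) are placeholders, not part of the original post:

```r
# Load a list of popular names, one per line (hypothetical file)
names <- readLines("names.txt")

# Example user data with opaque identifiers (placeholder data)
df <- data.frame(user_id = c("u_93af", "u_12bc", "u_93af", "u_77de"),
                 clicks  = c(3, 7, 5, 1))

# Assign one name per unique ID, in order of appearance
ids <- unique(df$user_id)
pseudonyms <- setNames(names[seq_along(ids)], ids)
df$user <- pseudonyms[df$user_id]

# Save the ID>Name mapping so assignments stay stable across runs
write.csv(data.frame(id = ids, name = pseudonyms),
          "id_to_name.csv", row.names = FALSE)
```

On later runs you would read `id_to_name.csv` back in and only assign fresh names to IDs that aren't already in the mapping, keeping pseudonyms consistent as the data grows.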