When talking about quantitative features in text analysis, token counts are king, but other features can help infer a page's content and context. Here I show visually how the characters at the margins of a page reveal intuitively sensible patterns in text.
I've uploaded two datasets crunched from 235,000 Language and Literature (i.e., LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.
Continue reading “A Dataset of Term Stats in Literature”
With data from NYPL Labs’ What’s on the Menu?
I just put up a modest reference repository with various slices of data on US names. It includes an estimate of names among US-born citizens today, built by cross-referencing baby names data with the 2014 population age distribution, plus gender probabilities by name. Find it on GitHub.
When analyzing anonymous user data in a team, I often take an extra step to aid discussion: converting user identifiers to popular English name pseudonyms.
Pseudonyms tend to make the data more welcoming to team members who aren't working directly with it, and they help you follow trends and outliers. They also help with visual sanity checks during analysis: names are simply easier to remember, so you spot problems more readily when inspecting the data.
Popular baby names are readily provided by the Social Security Administration, and I usually keep a derivative text list handy. In the simplest case, you convert each unique ID into a name. When I want to guard against name assignments changing as the data changes, I save the ID-to-name mapping in a basic CSV.
Below is a very basic example written in R to show how easy it is:
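Since the original code listing isn't reproduced here, the following is a minimal sketch of the approach under the assumptions above; the names list, the sample IDs, and the output file name are all hypothetical.

```r
# Hypothetical names list; in practice, load this from a text file
# derived from the SSA baby names data, e.g. readLines("names.txt")
names_list <- c("James", "Mary", "Robert", "Patricia", "Linda",
                "Michael", "Elizabeth", "William", "Barbara", "David")

# Anonymous identifiers as they might appear in the data
user_ids <- c("8f3a", "21bc", "8f3a", "e902")

# Assign each unique ID a name from the list
ids <- unique(user_ids)
pseudonyms <- setNames(names_list[seq_along(ids)], ids)

# Replace each occurrence of an ID with its pseudonym
labeled <- pseudonyms[user_ids]

# Persist the ID-to-name mapping so assignments stay stable
# as the underlying data changes
write.csv(data.frame(id = ids, name = pseudonyms[ids]),
          "id_to_name.csv", row.names = FALSE)
```

On later runs, reading the saved CSV back (e.g. with `read.csv`) and reusing its mapping for already-seen IDs keeps the pseudonyms consistent across analyses.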