A Dataset of Term Stats in Literature

Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.

I’ve uploaded two datasets crunched from 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.

The stats provided for each term in the frequency dataset are book frequency – how many books the term occurs in; page frequency – how many pages the term occurs on; and corpus frequency – the total count of the term across all 235k volumes. The IDF dataset provides inverse book frequency and inverse page frequency values calculated from these stats. Crunching the IDF from the frequencies is simple (the two lines of code are at the bottom of this post), but I’m sharing the dataset as a convenience.

With an extracted features volume loaded through the HTRC Feature Reader, you can calculate term weights for the volume with the IDF dataset and the short calculate_tfidf() function described below:

tokenlist = vol.term_volume_freqs(pos=False, page_freq=False)
calculate_tfidf(tokenlist, idf_weights)

Loading Dataset

The precomputed IDF dataset can be loaded directly; if you would rather start from the raw counts, the IBF (IDF with the book as document) and IPF (IDF with the page as document) weights take just two lines to calculate, shown after the loading snippet:

import pandas as pd
import numpy as np
idf_weights = pd.read_csv('classP-idf.csv.bz2', compression='bz2').set_index('token')
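
The two-line calculation, starting from the raw frequency dataset, looks like this (it is the same formula repeated in the “Calculate IDF” section at the end of this post):

all_freqs = pd.read_csv('classP-stats.csv.bz2', compression='bz2',
                        encoding='utf-8').set_index('token')
# max(DF) is used as an approximation of N
for DF in ['BF', 'PF']:
    all_freqs['I'+DF] = all_freqs[DF].rtruediv(all_freqs[DF].max()).add(1).apply(np.log10)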

Applying IDF weights to HTRC Feature Files

calculate_tfidf() takes a “token, count” dataframe and an IDF dataframe. It returns TF*IDF weights, with log-normalized TF and using pages as the IDF document frame.

def calculate_tfidf(tokencounts, idf_df):
    tfidf = pd.merge(tokencounts.set_index('token'), idf_df, left_index=True, right_index=True)
    tfidf['TF'] = tfidf['count'].add(1).apply(np.log10)
    tfidf['TF*IDF'] = tfidf['TF'] * tfidf['IPF']
    return tfidf.sort_values('TF*IDF', ascending=False)

To calculate weights for a volume in the Extracted Features dataset, you can use the vol.term_volume_freqs() method of the HTRC Feature Reader to extract the token counts needed for calculate_tfidf():

from htrc_features import FeatureReader
vol = next(FeatureReader("frankenstein.json.bz2").volumes())
tl = vol.term_volume_freqs(pos=False, page_freq=False)
weighted = calculate_tfidf(tl, idf_weights)

That’s it. Here are the top words from Frankenstein, excluding proper nouns (one way to do that filtering is sketched after the list):

cottagers, daemon, 17—, fiend, protectors, sensations, creator, hovel, murderer, ice-raft, sledge, cottage, miserable, endeavoured, monster, misery, ice, endured, quitted, horror, abhorred, wretchedness, feelings, ardently, murdered, despair, fellow-creatures, misfortunes, agony, benevolent, tranquillity, beheld, wretch, ardour
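
The proper-noun filtering isn’t shown above. Here is one way it might be done, as a sketch: it assumes that term_volume_freqs(pos=True) returns a table with 'token', 'pos', and 'count' columns, using Penn Treebank part-of-speech tags.

# Keep part-of-speech information and drop rows tagged as proper nouns (NNP/NNPS)
tl_pos = vol.term_volume_freqs(pos=True, page_freq=False)
tl_common = tl_pos[~tl_pos['pos'].isin(['NNP', 'NNPS'])]
tl_common = tl_common.groupby('token', as_index=False)['count'].sum()
weighted = calculate_tfidf(tl_common, idf_weights)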

Advanced

The post above is complete, but you may want to customize the process, or to crunch the frequency statistics for such a large subset yourself. Those options are covered here.

Using case-insensitive IDF

If you want lowercase stats, create a lowercase column from the raw frequency dataset and use groupby('token').sum():

all_freqs = pd.read_csv('classP-stats.csv.bz2', compression='bz2',
                        encoding='utf-8').set_index('token')
all_freqs_lower = all_freqs.copy()
all_freqs_lower['token'] = all_freqs_lower.index.str.lower()
all_freqs_lower = all_freqs_lower.groupby('token').sum()
for DF in ['BF', 'PF']:
    all_freqs_lower['I'+DF] = all_freqs_lower[DF].rtruediv(all_freqs_lower[DF].max()).add(1).apply(np.log10)

More control

To use a non-normalized TF, case-insensitive token counts, or book frequency rather than page frequency, here is a more detailed calculate_tfidf:

def calculate_tfidf(tokencounts, idf_df, df='PF', case=True, log_tf=True):
    ''' Takes a "token, count" DataFrame and returns TF*IDF weights '''
    tc = tokencounts.copy()
    if not case:
        tc['token'] = tc['token'].str.lower()
        tc = tc.groupby('token', as_index=False).sum()
    tfidf = pd.merge(tc.set_index('token'), idf_df, left_index=True, right_index=True)
    if log_tf:
        tfidf['TF'] = tfidf['count'].add(1).apply(np.log10)
    else:
        tfidf['TF'] = tfidf['count']
    tfidf['TF*I'+df] = tfidf['TF'] * tfidf['I'+df]
    return tfidf.sort_values('TF*I'+df, ascending=False)
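
As a usage example, one could weight case-insensitive counts against book frequency with a raw (non-logged) TF, reusing the lowercased stats from the previous section:

tl = vol.term_volume_freqs(pos=False, page_freq=False)
weighted = calculate_tfidf(tl, all_freqs_lower, df='BF', case=False, log_tf=False)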

Crunching stats

Here is the process that I took to calculate book frequency (i.e. DF where the document unit is a book), page frequency (i.e. DF where the document unit is a page), and corpus frequency (i.e. the total number of a term’s occurrences in the corpus).

Get a List of HathiTrust IDs

In this instance, I used the HathiTrust Solr Proxy, which is an index of the public domain holdings in the HathiTrust Research Center. Using the Solr query language, I searched for all books with a call number of P* (Language and Literature) and a language of English. Here is one way to download the list; note the parameters in the URL:

curl 'http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=callnosort:P*&rows=1000000&fq=+language:English&fl=id&wt=csv' >all-english-P.txt

The response has only one field (‘id’) and returns up to 1 million IDs as CSV. There are actually 234,994 results: if you change the response format to JSON or XML, a “numFound” field gives that value (a quick way to check it is sketched after the sample below). The CSV response looks as follows:

id
mdp.39015030727963
uc2.ark:/13960/t5cc0v81s
miun.adx6300.0001.001
mdp.39015052604397
pur1.32754004380477
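
If you want to check that count yourself, one quick way (a sketch, assuming the proxy is still responding) is to request a JSON response with zero rows and read numFound:

import json
from urllib.request import urlopen
# Ask Solr for zero rows; we only need the total count from the JSON response
url = ('http://chinkapin.pti.indiana.edu:9994/solr/meta/select/'
       '?q=callnosort:P*&rows=0&fq=+language:English&fl=id&wt=json')
with urlopen(url) as response:
    numfound = json.load(response)['response']['numFound']
print(numfound)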

Convert IDs to URLs to download the files

The Extracted Features dataset provides unigram counts per page, with each document in its own file accessible through rsync. To get the URL for a file, you can convert the ID using utils.id_to_rsync() from the HTRC Feature Reader.

from htrc_features import utils
# Read IDs
with open("all-english-P.txt", "r") as f:
    ids = f.readlines()
# Strip newlines and convert to rsync URLs, skipping the header line ('id')
urls = [utils.id_to_rsync(id.strip()) for id in ids[1:]]
with open("english-P-urls.txt", "w") as f:
    f.write("\n".join(urls))

english-P-urls.txt should look like this:

basic/mdp/pairtree_root/39/01/50/30/72/79/63/39015030727963/mdp.39015030727963.basic.json.bz2
basic/uc2/pairtree_root/ar/k+/=1/39/60/=t/5c/c0/v8/1s/ark+=13960=t5cc0v81s/uc2.ark+=13960=t5cc0v81s.basic.json.bz2

Download the feature files

rsync -av --files-from=english-P-urls.txt data.sharc.htrc.illinois.edu:pd-features/ /MY/LOCAL/FOLDER

Calculate the data in chunks

Processing all the files at once would be too memory-intensive, and processing them one after another would be too slow. Instead, we’ll process smaller batches of files into the desired page frequency/book frequency/term frequency tables, run many of these processes at once, then fold all the tables together.

For this step, I use GNU Parallel to distribute the processes in parallel. There are two Python scripts.

The first script, map-stats.py, takes a number of feature file paths (however many you want to send it). It reads the files and crunches a table of book, page, and term frequencies (BF, PF, TF) for each term. The table is saved to disk as a pickle (if you don’t know what a pickle is, don’t worry: it’s just a serialization of the Python object, so that it can be loaded back easily in the next script).
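
map-stats.py itself isn’t reproduced in this post. A minimal sketch of what such a map script could look like follows; it is a reconstruction under assumptions rather than the original script: it assumes that the Feature Reader’s vol.tokenlist() returns a per-page table with a 'count' column indexed by page, section, and token, and it writes each chunk’s table to a uniquely named pickle under --outpath.

import argparse
import os
import uuid
import pandas as pd
from htrc_features import FeatureReader

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--outpath', default='.')
    parser.add_argument('paths', nargs='+')
    args = parser.parse_args()
    stats = []
    for vol in FeatureReader(args.paths).volumes():
        # Per-page token counts for each page body, with POS tags collapsed
        tl = vol.tokenlist(pos=False, case=True).reset_index()
        grouped = tl.groupby('token')['count']
        stats.append(pd.DataFrame({
            'BF': 1,               # the term occurs in this book
            'PF': grouped.size(),  # number of pages the term occurs on
            'TF': grouped.sum()    # total occurrences in this book
        }))
    # Fold the per-volume tables into a single table for this chunk of files
    combined = pd.concat(stats).groupby(level='token').sum()
    # uuid is just a convenient way to get a unique filename per chunk
    combined.to_pickle(os.path.join(args.outpath, uuid.uuid4().hex + '.pickle'))

if __name__ == '__main__':
    main()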

The second script, merge_stats_pickles.py, takes the paths of multiple data pickles, opens them, and sums the rows for redundant terms. This approach matches the MapReduce pattern, in its informal sense: the first script maps the data across parallel processes, and the second reduces their outputs into a single output.
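
A matching sketch of the reducer, under the same assumptions; it folds however many pickles it is handed into one table and optionally applies the --min-pf filter described further below.

import argparse
import os
import uuid
import pandas as pd

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--outpath', default='.')
    parser.add_argument('--min-pf', type=int, default=0,
                        help='drop terms that occur on fewer pages than this')
    parser.add_argument('paths', nargs='+')
    args = parser.parse_args()
    # Sum the rows for terms that appear in more than one input pickle
    merged = pd.concat(pd.read_pickle(p) for p in args.paths).groupby(level='token').sum()
    if args.min_pf:
        merged = merged[merged['PF'] >= args.min_pf]
    merged.to_pickle(os.path.join(args.outpath, uuid.uuid4().hex + '.pickle'))

if __name__ == '__main__':
    main()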

cat english-P-urls.txt | parallel --eta -j90% -n50 python map-stats.py --outpath all-P-pickle/

What’s happening here? cat reads the list of paths that we used for rsync. Assuming that your computer kept the same structure and you’re at the root (where the ‘basic’ directory from the ‘basic/mdp/pairtree_root/and/so/on’ directory structure lives), these point to the correct Extracted Features files on your system. This list is passed to parallel, which splits it into chunks of 50 files (-n50 says “send 50 arguments to python”) and starts parallel jobs equal to 90% of your CPU cores (-j90%). What I’m referring to as a job is running the Python script at the end of that command (python map-stats.py --outpath pickle/directory FILEPATH(1) FILEPATH(2) ... FILEPATH(N)).

Your bottleneck is likely to be memory, so n should be small enough that each chunk can be processed within the available RAM divided by the number of jobs. The --eta flag is optional: it gives you an estimate of how long the run will take. For example, on my server running 32 jobs in parallel, this is what the processing looks like, with an estimate of about 3 hours:

Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 10704s Left: 2244 AVG: 4.80s local:29/106/100%/5.9s

Neat! The DataFrames of information for 50 files at a time are saved in all-P-pickle. Once all of those are processed, we’ll combine those files.

Because this is such a large job, it likely can’t be reduced all at once with our easy, uncomplicated method. The quick and dirty (but effective) solution is simply to reduce multiple times: for example, combining 10 of those DataFrames at a time. The reducer loads the files into memory, reduces the information, then saves it, so a conservative n is a better policy now. You can also filter the data, for example throwing out information for terms that occur on fewer than x pages, to drastically reduce the size and filter out OCR junk. This is provided as an option called --min-pf.

To reduce the files, send chunks of the exported data to merge_stats_pickles.py. For example:

find all-P-pickle/ -name "*.pickle" | parallel --eta -n10 -j90% python merge_stats_pickles.py --outpath all-P-pickle2/
find all-P-pickle2/ -name "*.pickle" | parallel --eta -n5 -j90% python merge_stats_pickles.py --outpath all-P-pickle3/ --min-pf 1
find all-P-pickle3/ -name "*.pickle" | parallel --eta -n5 -j90% python merge_stats_pickles.py --outpath all-P-pickle4/ --min-pf 2
find all-P-pickle4/ -name "*.pickle" | parallel --eta -n100 -j1 python merge_stats_pickles.py --outpath final-stats/ --min-pf 3

The last process results in a single DataFrame export, which you may want to rename to something descriptive. The in-between exports can be deleted; the script doesn’t do so automatically.

The final frame of frequency stats for each term can be imported into Python using Pandas.

import pandas as pd
data = pd.read_pickle('path/to/export.pickle')

If you would like to share the data, note that it is generally considered bad practice to share pickle files, because they can be crafted maliciously. I trimmed the data to terms with PF > 10 and saved it to a compressed CSV:

data[data['PF']>10].sort_values('PF', ascending=False).to_csv('final_P_trimmed.csv.bz2', compression='bz2', encoding='utf-8')

Calculate IDF

import numpy as np
# max(DF) is used as an approximation of N
for DF in ['BF', 'PF']:
    data['I'+DF] = data[DF].rtruediv(data[DF].max()).add(1).apply(np.log10)

To save the IDF values as a CSV (this is the classP-idf.csv.bz2 file loaded at the top of the post):

data.sort_values('PF', ascending=False)[['IBF', 'IPF']].to_csv('classP-idf.csv.bz2', compression='bz2', encoding='utf-8')
