I’ve crunched term statistics for 235,000 Language and Literature (i.e. LCC Class P) volumes and uploaded two datasets: IDF values for the terms, and a more generally useful dataset of frequency counts.
The frequency dataset provides three stats for each term: book frequency (how many books the term occurs in), page frequency (how many pages the term occurs on), and corpus frequency (the total count of the term across all 235k volumes). The IDF dataset provides inverse book frequency and inverse page frequency values calculated from these stats. Crunching the IDF from the frequencies is simple (the two lines of code are at the bottom of this post), but I’m sharing the dataset as a convenience.
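For intuition, here is the smoothed formula used at the bottom of this post, log10(1 + N/df) with N approximated by the largest document frequency, applied to an invented two-term table (the values are made up; only the column names match the dataset):

```python
import numpy as np
import pandas as pd

# Invented frequencies for two terms (columns named as in the dataset)
freqs = pd.DataFrame({'BF': [1000, 10], 'PF': [50000, 25]},
                     index=['the', 'daemon'])

# Smoothed inverse frequencies: log10(1 + N/df), approximating N by max(df)
for DF in ['BF', 'PF']:
    freqs['I' + DF] = np.log10(1 + freqs[DF].max() / freqs[DF])
```

A ubiquitous word like ‘the’ lands near log10(2) ≈ 0.3, while a rare term gets a weight several times larger.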
With an extracted features volume loaded through the HTRC Feature Reader, you can calculate term weights for the volume with the IDF dataset and the short calculate_tfidf() function described below:

```python
tokenlist = vol.term_volume_freqs(pos=False, page_freq=False)
calculate_tfidf(tokenlist, idf_weights)
```
The frequency dataset has raw counts; the two lines that derive the IBF weights (IDF for book-as-document) and IPF weights (IDF for page-as-document) from them appear at the bottom of this post. To load the precomputed IDF dataset:
```python
import pandas as pd
import numpy as np

idf_weights = pd.read_csv('classP-idf.csv.bz2', compression='bz2').set_index('token')
```
Applying IDF weights to HTRC Feature Files
calculate_tfidf() takes a “token, count” DataFrame and an IDF DataFrame. It returns TF*IDF weights, with a log-normalized TF and using pages as the IDF document frame.
```python
def calculate_tfidf(tokencounts, idf_df):
    tfidf = pd.merge(tokencounts.set_index('token'), idf_df,
                     left_index=True, right_index=True)
    tfidf['TF'] = tfidf['count'].add(1).apply(np.log10)
    tfidf['TF*IDF'] = tfidf['TF'] * tfidf['IPF']
    return tfidf.sort_values('TF*IDF', ascending=False)
```
To calculate weights for a volume in the Extracted Features dataset, you can use the vol.term_volume_freqs() method of the HTRC Feature Reader to extract the token counts needed for calculate_tfidf():

```python
vol = next(FeatureReader("frankenstein.json.bz2").volumes())
tl = vol.term_volume_freqs(pos=False, page_freq=False)
weighted = calculate_tfidf(tl, idf_weights)
```
That’s it. Here are the top words from Frankenstein, excluding proper nouns:
|cottagers, daemon, 17—, fiend, protectors, sensations, creator, hovel, murderer, ice-raft, sledge, cottage, miserable, endeavoured, monster, misery, ice, endured, quitted, horror, abhorred, wretchedness, feelings, ardently, murdered, despair, fellow-creatures, misfortunes, agony, benevolent, tranquillity, beheld, wretch, ardour|
The post above is complete, but you may want to make customizations, or hope to crunch frequency statistics for such a large subset yourself. The rest of this post addresses those cases.
Using case-insensitive IDF
If you want lowercase stats, create a lowercase token column from the raw frequency dataset, merge the counts, and recompute the inverse frequencies:
```python
all_freqs = pd.read_csv('classP-stats.csv.bz2', compression='bz2',
                        encoding='utf-8').set_index('token')

# Fold case: lowercase the tokens, then sum the counts of merged terms
all_freqs_lower = all_freqs.reset_index()
all_freqs_lower['token'] = all_freqs_lower['token'].str.lower()
all_freqs_lower = all_freqs_lower.groupby('token').sum()

# Recompute the inverse frequencies from the merged counts
for DF in ['BF', 'PF']:
    all_freqs_lower['I'+DF] = (all_freqs_lower[DF]
                               .rtruediv(all_freqs_lower[DF].max())
                               .add(1).apply(np.log10))
```
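As a quick sanity check of the case-folding step, here is the same regroup-and-sum applied to an invented three-row table (all values are made up):

```python
import numpy as np
import pandas as pd

# Invented raw-frequency rows with mixed-case tokens
all_freqs = pd.DataFrame({'token': ['The', 'the', 'daemon'],
                          'BF': [40, 10, 5],
                          'PF': [400, 100, 50]})

# Lowercase, then sum the counts of merged terms
lower = all_freqs.copy()
lower['token'] = lower['token'].str.lower()
lower = lower.groupby('token').sum()

# Recompute inverse frequencies from the merged counts
for DF in ['BF', 'PF']:
    lower['I'+DF] = lower[DF].rtruediv(lower[DF].max()).add(1).apply(np.log10)
```

After folding, ‘The’ and ‘the’ collapse into a single row with PF = 500, and that merged row gets the minimum weight log10(2).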
To use a non-normalized TF, case-insensitive token counts, or book frequency rather than page frequency, here is a more flexible version of calculate_tfidf():
```python
def calculate_tfidf(tokencounts, idf_df, df='PF', case=True, log_tf=True):
    ''' Takes a "token, count" DataFrame and returns TF*IDF weights '''
    tc = tokencounts.copy()
    if not case:
        tc['token'] = tc['token'].str.lower()
        tc = tc.groupby('token', as_index=False).sum()
    tfidf = pd.merge(tc.set_index('token'), idf_df,
                     left_index=True, right_index=True)
    if log_tf:
        tfidf['TF'] = tfidf['count'].add(1).apply(np.log10)
    else:
        tfidf['TF'] = tfidf['count']
    tfidf['TF*I'+df] = tfidf['TF'] * tfidf['I'+df]
    return tfidf.sort_values('TF*I'+df, ascending=False)
```
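To see the options in action, the sketch below repeats the function and runs it on invented counts and IDF weights, exercising the case-insensitive, book-level path:

```python
import numpy as np
import pandas as pd

def calculate_tfidf(tokencounts, idf_df, df='PF', case=True, log_tf=True):
    tc = tokencounts.copy()
    if not case:
        tc['token'] = tc['token'].str.lower()
        tc = tc.groupby('token', as_index=False).sum()
    tfidf = pd.merge(tc.set_index('token'), idf_df,
                     left_index=True, right_index=True)
    tfidf['TF'] = tfidf['count'].add(1).apply(np.log10) if log_tf else tfidf['count']
    tfidf['TF*I' + df] = tfidf['TF'] * tfidf['I' + df]
    return tfidf.sort_values('TF*I' + df, ascending=False)

# Invented token counts (mixed case) and invented IDF weights
tl = pd.DataFrame({'token': ['Daemon', 'daemon', 'the'], 'count': [3, 2, 90]})
idf = pd.DataFrame({'IBF': [2.0, 0.3], 'IPF': [2.5, 0.2]},
                   index=pd.Index(['daemon', 'the'], name='token'))

# Case-insensitive, book-level weighting: 'Daemon' and 'daemon' merge to count 5
out = calculate_tfidf(tl, idf, df='BF', case=False)
```

Despite ‘the’ having the higher raw count, the rarer merged ‘daemon’ wins under its larger IBF weight.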
Here is the process that I took to calculate book frequency (i.e. DF where the document unit is a book), page frequency (i.e. DF where the document unit is a page), and corpus frequency (i.e. the total number of a term’s occurrences in the corpus).
Get List of HathiTrust IDs
In this instance, I used the HathiTrust Solr Proxy, which is an index of public domain holdings in the HathiTrust Research Center. Using the Solr query language, I searched for all books with a call number of P* (Language and Literature) and a language of English. Here is one way to download the list; note the parameters in the URL:

```shell
curl 'http://chinkapin.pti.indiana.edu:9994/solr/meta/select/?q=callnosort:P*&rows=1000000&fq=+language:English&fl=id&wt=csv' > all-english-P.txt
```
The response has only one field (‘id’) and returns up to 1 million ids as CSV. There are actually 234,994 results: if you change the response format to JSON or XML, a “numFound” field gives that value. The CSV response looks as follows:
```
id
mdp.39015030727963
uc2.ark:/13960/t5cc0v81s
miun.adx6300.0001.001
mdp.39015052604397
pur1.32754004380477
```
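If you do request JSON, pulling out numFound is one dictionary lookup. The response snippet below is a trimmed, invented example of the standard Solr structure:

```python
import json

# Trimmed, invented example of a Solr JSON response
response = '{"response": {"numFound": 234994, "start": 0, "docs": []}}'
num_found = json.loads(response)['response']['numFound']
```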
Convert IDs to URLs to download the files
The Extracted Features dataset provides unigram counts per page, with each document as its own file accessible through rsync. To get the URL for a file, you can convert the ID using utils.id_to_rsync() from the HTRC Feature Reader.
```python
from htrc_features import utils

# Read ids
with open("all-english-P.txt", "r") as f:
    ids = f.readlines()

# Strip newlines and convert to urls, skipping the header on the first line
urls = [utils.id_to_rsync(id.strip()) for id in ids[1:]]

with open("english-P-urls.txt", "w") as f:
    f.write("\n".join(urls))
```
english-P-urls.txt should look like this:
Download the feature files
```shell
rsync -av --files-from=english-P-urls.txt data.sharc.htrc.illinois.edu:pd-features/ /MY/LOCAL/FOLDER
```
Calculate the data in chunks
Processing all the files at once would be too memory-intensive, and processing them in series would be too slow. Instead, we’ll process smaller batches of files into our desired page frequency/book frequency/term frequency tables, running many of these processes at once, then fold all the tables together.
For this step, I use GNU Parallel to distribute the processes in parallel. There are two Python scripts.
The first script, map-stats.py, takes a number of feature file paths (however many you want to send it). The script reads the files and crunches a table of book, page, and term frequencies, with the columns term, BF, PF, TF. The table is saved to disk as a pickle (if you don’t know what a pickle is, don’t worry: it’s just a serialization of the Python object, so we can load it back into the next script easily).
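The script itself isn’t reproduced here, but its core aggregation might look like the sketch below, which assumes the per-page token counts have already been parsed into long-format rows of (id, page, token, count); the function name and toy rows are invented:

```python
import pandas as pd

def crunch_stats(page_counts):
    """Per-term stats from long-format (id, page, token, count) rows:
    BF = books containing the term, PF = pages containing it, TF = total count."""
    return page_counts.groupby('token').agg(
        BF=('id', 'nunique'), PF=('page', 'size'), TF=('count', 'sum'))

# Invented toy rows standing in for parsed Extracted Features pages
rows = pd.DataFrame([('vol1', 1, 'daemon', 2), ('vol1', 2, 'daemon', 1),
                     ('vol2', 1, 'daemon', 4), ('vol1', 1, 'the', 10)],
                    columns=['id', 'page', 'token', 'count'])
stats = crunch_stats(rows)
# stats.to_pickle(outpath) in the real script
```

Here ‘daemon’ appears in 2 books, on 3 pages, 7 times in total.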
The second script, merge_stats_pickles.py, takes the paths of multiple data pickles, opens them, and sums the rows for redundant terms. This approach matches the MapReduce pattern in its informal sense: the first script maps the data to parallel processes, and the second script reduces their outputs into a single output.
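The reduce step can be sketched in a few lines: concatenate the chunk tables and sum rows that share a token, optionally filtering low-page-frequency terms (the function name and toy tables below are invented):

```python
import pandas as pd

def merge_stats(frames, min_pf=0):
    """Sum the BF/PF/TF rows of terms shared across chunk tables,
    optionally dropping terms that occur on min_pf pages or fewer."""
    merged = pd.concat(frames).groupby(level=0).sum()
    return merged[merged['PF'] > min_pf]

# Two invented chunk tables with one overlapping term
a = pd.DataFrame({'BF': [1, 1], 'PF': [2, 1], 'TF': [3, 1]}, index=['the', 'rare'])
b = pd.DataFrame({'BF': [1], 'PF': [4], 'TF': [9]}, index=['the'])

combined = merge_stats([a, b], min_pf=1)
# 'the' sums to BF=2, PF=6, TF=12; 'rare' (PF=1) is filtered out
```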
```shell
cat english-P-urls.txt | parallel --eta -j90% -n50 python map-stats.py --outpath all-P-pickle/
```
What’s happening here?
cat reads the list of paths that we used for rsync. Assuming that your computer kept the same directory structure and you’re at the root (where the ‘basic’ directory from the ‘basic/mdp/pairtree_root/and/so/on’ structure lives), these should point to the correct Extracted Features files on your system. The list is passed to parallel, which splits it into chunks of 50 files (-n50 says “send 50 arguments to python”) and starts parallel jobs equal to 90% of your CPU cores (-j90%). What I’m referring to as a job is a run of the Python script at the end of that command: python map-stats.py --outpath pickle/directory FILEPATH(1) FILEPATH(2) ... FILEPATH(N).
Your bottleneck is likely to be memory: n should be small enough that each chunk can be processed within available RAM / number of jobs.
--eta flag is optional: it gives you an estimate of how long the run will take. For example, on my server running 32 jobs in parallel, this is what the processing looks like, with an estimate of about 3 hours:

```
Computer:jobs running/jobs completed/%of started jobs/Average seconds to complete
ETA: 10704s Left: 2244 AVG: 4.80s  local:29/106/100%/5.9s
```
Neat! The DataFrames of information for 50 files at a time are saved in all-P-pickle. Once all of those are processed, we’ll combine them.
Because this is such a large job, it likely can’t be reduced all at once with our easy, uncomplicated method. The quick and dirty (but effective) solution is simply to reduce multiple times: maybe combining 10 of those DataFrames at a time. The reducer loads the files into memory, reduces the info, then saves it, so a conservative n is a better policy now. You can also filter the data, for example throwing out terms that occur on fewer than x pages, to drastically reduce the size and filter out OCR junk. This is provided as an option called --min-pf.
To reduce the files, send chunks of the exported data to merge_stats_pickles.py. For example:
```shell
find all-P-pickle/ -name "*.pickle" | parallel --eta -n10 -j90% python merge_stats_pickles.py --outpath all-P-pickle2/
find all-P-pickle2/ -name "*.pickle" | parallel --eta -n5 -j90% python merge_stats_pickles.py --outpath all-P-pickle3/ --min-pf 1
find all-P-pickle3/ -name "*.pickle" | parallel --eta -n5 -j90% python merge_stats_pickles.py --outpath all-P-pickle4/ --min-pf 2
find all-P-pickle4/ -name "*.pickle" | parallel --eta -n100 -j1 python merge_stats_pickles.py --outpath final-stats/ --min-pf 3
```
The last process results in a single DataFrame export, which you may want to rename to something descriptive. The in-between exports can be deleted; the script doesn’t do so automatically.
The final frame of frequency stats for the terms can be imported into Python using pandas.
```python
import pandas as pd
data = pd.read_pickle('path/to/export.pickle')
```
If you would like to share the data, note that it is generally considered bad practice to share pickle files, because they can be used maliciously. I trimmed the data to terms that have PF > 10 and saved it to a compressed CSV.
```python
data[data['PF'] > 10].sort_values('PF', ascending=False).to_csv('final_P_trimmed.csv.bz2', compression='bz2', encoding='utf-8')
```
Finally, here are the two lines promised at the top of the post, deriving the IDF weights from the frequency stats:

```python
import numpy as np

# max(DF) is used as an approximation of N
for DF in ['BF', 'PF']:
    data['I'+DF] = data[DF].rtruediv(data[DF].max()).add(1).apply(np.log10)
```
To save the weights as a CSV:

```python
data.sort_values('PF', ascending=False)[['IBF', 'IPF']].to_csv('final_P_trimmed.csv.bz2', compression='bz2', encoding='utf-8')
```