Computational text analysis (revisited)

To gain new insights into favorite authors’ writing styles

If you read enough works by any given writer, you may begin to notice word choices, punctuation, and/or other quirks unique to that writer’s style. For example, Isaac Asimov, I have noticed, appears to like the words sardonic and sardonically, and he is a master of the em dash. Frank Herbert seems to use the word redolent quite often in his Dune series. And Franz Kafka had a really bad habit of leaving his masterpieces unfinished (Tupac demonstrated that premature death is no excuse, here).

I once mentioned my observation of Asimov’s use of sardonic[ally] to another sci-fi junkie who has, like me, digested a lot of Asimov over the years. He had an interesting hypothesis: Asimov doesn’t actually use sardonic[ally] very often. Rather, he uses these words sometimes, whereas other writers don’t use them at all. Thus, the rare instances in which these words crop up in Asimov’s writing stand out not because of their frequency, but because of their novelty. Using some entry-level computational text-analysis methods, I decided to put this hypothesis to the test.

I selected Asimov’s seminal Foundation trilogy as my sample, as it is within these novels that I recall first noticing Asimov’s use of sardonic[ally] (and I happened to already have the trilogy in a single PDF). I started off by running the stats script from the collection of Useful Python Scripts for Texts (UPST). It had trouble detecting paragraphs, but indicated that the trilogy (front matter, about-the-author sections, and introductions included) contains 19,609 sentences and 226,504 words (both of which seem like reasonable numbers). The next step was to find each instance of sardonic[ally] within the text and compare that count with the total number of words in the trilogy. I had planned on using some of the other UPST scripts to get some cool statistics, but the rest of the scripts don’t seem to like the formatting of my text file, and I only had so much time to troubleshoot the problem. Short on time, I instead opted to employ the tried-and-true cmd+f (or ctrl+f for PC and [most] Linux users) method. I searched for “sardonic,” which also catches “sardonically,” since the latter contains the former. As it turns out, there are only 15 instances of sardonic[ally] within the entire trilogy, which works out to (roughly) 1 instance per 15,000 words. Maybe it just seemed like more because these books are single-sitting reads.
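The cmd+f tally could also be scripted. Below is a minimal Python sketch of that idea; it is not the UPST code, and the function names (count_word_family, count_words, words_per_hit) are made up for illustration. The word-splitting rule is also an assumption, so the word total it reports for a given text may differ somewhat from what the UPST stats script gives.

```python
import re


def count_word_family(text, stem):
    """Count words that start with `stem`, ignoring case.

    Searching for the stem "sardonic" also catches "sardonically",
    much like a plain substring search would.
    """
    pattern = re.compile(r"\b" + re.escape(stem) + r"\w*", re.IGNORECASE)
    return len(pattern.findall(text))


def count_words(text):
    """Rough word count: runs of word characters, allowing inner apostrophes.

    This tokenization rule is an assumption, not the UPST definition of a word.
    """
    return len(re.findall(r"\w+(?:['’]\w+)*", text))


def words_per_hit(total_words, hits):
    """Average number of words per occurrence (infinite if there are no hits)."""
    return total_words / hits if hits else float("inf")


if __name__ == "__main__":
    # A tiny made-up sample, just to show the counting functions at work.
    sample = "He smiled sardonically. Sardonic, aren't you?"
    print(count_word_family(sample, "sardonic"), "hits in", count_words(sample), "words")

    # The figures reported above: 15 hits in 226,504 words.
    print(round(words_per_hit(226504, 15)), "words per hit")
```

Plugging in the numbers from the trilogy (15 hits in 226,504 words) gives about 15,100 words per occurrence, which is where the "roughly 1 in 15,000" figure comes from. To run this on a whole book, the text would first need to be extracted from the PDF into a plain string, a step this sketch leaves out.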

Anyway, wanting to test this further, I turned to Google. As it turns out, plenty of other people have noticed Asimov’s use of these words. A discussion on the sci-fi Stack Exchange reveals how others have investigated this with various computational text-analysis methods (thus saving me some time). One member compared the frequency with which sardonic occurs in science fiction to its occurrence rate in all texts. Another used Google Ngrams to show that both sardonic and sardonically were (relatively) popular during the time period in which Asimov began working on his popular Foundation and Robot series.

Taken together, these data support the idea that Asimov doesn’t actually use sardonic[ally] very often. He just uses it more than others do, which makes sense considering both his main genre and the fact that the words were more popular during his peak writing years than they are now. (I still contend that he employs the em dash far more often than other writers, though.)

Written on September 26, 2016 by Josh Guberman