Robustness of emotion extraction from 20th Century English books

Today I’ll give a short talk at the Big Humanities Workshop, held in conjunction with the 2013 IEEE International Conference on Big Data, about our research on the emotional content of English-language books.

In previous work we analysed the usage of emotion-related words in the Google Books database, and reported three main findings:

  1. the existence of distinct periods of positive and negative “moods”, detectable through automatic analysis of the texts.
  2. a steady decrease in the usage of emotion-related words throughout the century.
  3. a divergence between American English and British English books, with the former getting more “emotional” starting from the 1960s.

The next step was to run additional analyses to check the robustness of these results. In detail, we re-ran the same analysis with the latest (2012) version of the Google Books corpus, which contains approximately 3 million more books than the one we used originally. We compared the results of different, independent ways of scoring the emotional content of the texts: originally we used WordNet-Affect, which we now compare with Linguistic Inquiry and Word Count and the “Hedonometer”. We ran a more detailed statistical analysis, to check whether a few high-frequency mood words might determine on their own the trends for specific emotions, obscuring the role of the numerous low-frequency terms. Finally, we compared our original results with trends obtained by considering only terms tagged as adjectives or adverbs, which are considered reliable indicators of emotional content (part-of-speech information was not present in the first version of the Google corpus).
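To make the high-frequency check more concrete, here is a minimal sketch in Python. It is not the procedure used in the paper: it assumes that per-year word counts are already available, scores a mood as the share of tokens that belong to a word list, and then repeats the scoring after dropping the most frequent term. The function names, the tiny “joy” list and all the counts are made up for illustration.

```python
from math import sqrt


def mood_score(counts_by_year, totals_by_year, lexicon):
    """Share of all tokens in each year that belong to the lexicon."""
    return {
        year: sum(counts.get(word, 0) for word in lexicon) / totals_by_year[year]
        for year, counts in counts_by_year.items()
    }


def drop_most_frequent(counts_by_year, lexicon, k):
    """Return the lexicon without its k most frequent terms (summed over all years)."""
    overall = {
        word: sum(counts.get(word, 0) for counts in counts_by_year.values())
        for word in lexicon
    }
    ranked = sorted(lexicon, key=lambda word: overall[word], reverse=True)
    return set(ranked[k:])


def pearson(xs, ys):
    """Plain Pearson correlation between two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    sd_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (sd_x * sd_y)


# Toy, made-up counts (NOT real Google Books data).
counts_by_year = {
    1900: {"happy": 900, "glad": 60, "cheerful": 40},
    1950: {"happy": 700, "glad": 45, "cheerful": 30},
    2000: {"happy": 500, "glad": 30, "cheerful": 20},
}
totals_by_year = {1900: 1_000_000, 1950: 1_000_000, 2000: 1_000_000}
joy = {"happy", "glad", "cheerful"}  # hypothetical mini word list

# Trend with the full list, and with the single most frequent word removed.
full = mood_score(counts_by_year, totals_by_year, joy)
reduced_lexicon = drop_most_frequent(counts_by_year, joy, k=1)
reduced = mood_score(counts_by_year, totals_by_year, reduced_lexicon)

years = sorted(full)
r = pearson([full[y] for y in years], [reduced[y] for y in years])
print(f"correlation between full and reduced trends: {r:.3f}")
```

If the trend survives the removal of the most frequent words (a correlation close to 1 between the two series), it is not driven by those words alone; a low correlation would suggest that a few common terms dominate the signal.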

Overall, we were happy to see that the original results proved to be quite robust (especially results #2 and #3). The next step would now be to understand what they mean (to me, the decrease in emotional content is especially interesting), assuming that they do not derive from some idiosyncrasy of the Google database. Apparently the official Proceedings of the IEEE Big Data Conference are not out yet, but here you can find a preprint of our contribution (thanks to Bill, coauthor together with Alex Bentley).

Unfortunately I will not be physically in a room in Santa Clara, California, to present my talk. It would have been very interesting for me to get to know more of the “Digital Humanities” world (to me, books are just one kind of artefact useful for studying more general cultural dynamics; they happen to be convenient to quantify and have temporal depth, and some speak, in this regard, of long data). Hopefully there will be other occasions. Also, my remote talk will end up being after 11 pm Bristol time, and after Puccini’s La Bohème, so if you, reader, are at the workshop, I apologise in advance…
