Books Average Previous Decade of Economic Misery

Almost one year ago, we published a paper describing a large-scale analysis of cultural and literary trends, carried out using the Google Books Ngram corpus. In particular, we showed that, through a relatively simple extraction of emotion-related words (words semantically related to “main” emotions like joy, sadness, anger, etc.), it was possible to detect some clear tendencies: a general decline in the emotional “tone” of books published in the twentieth century (or at least in the frequency of emotion words), a divergence between American and British English (with the former being more emotional), and, finally, the existence of distinct periods of “literary mood” in the last century.

Related to the last point, PLOS ONE has just published a follow-up to this research, in which we correlate this literary mood with economic trends over the past century. The image below shows the main point of our study.

misery

The red line is what we called the “Literary Misery Index” (how “sad” books are in a certain year, on average), which we extracted from the books in the Google corpus, while the black line is an 11-year moving average of the economic Misery Index (how “bad” the economy is in a certain year), a well-known economic index obtained by adding the inflation and unemployment rates. The two trends are strongly correlated (you can read more in the Bristol University press release here, and, of course, in the original paper).
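To make the two ingredients concrete, here is a minimal sketch of how a misery index, its 11-year moving average, and a correlation with another yearly series could be computed. The function names are mine, and the choice of a trailing window (the current year plus the ten before it) is an assumption for illustration; the paper describes the exact procedure.

```python
from math import sqrt

def misery_index(inflation, unemployment):
    """Economic misery index per year: inflation rate plus unemployment rate."""
    return [i + u for i, u in zip(inflation, unemployment)]

def trailing_moving_average(values, window=11):
    """Mean of each value and the (window - 1) values before it.

    Years without a full window behind them are skipped, so the result
    is (window - 1) items shorter than the input.
    """
    return [sum(values[k - window + 1:k + 1]) / window
            for k in range(window - 1, len(values))]

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length series."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Usage, with yearly lists inflation, unemployment and literary_misery
# that all start in the same year (hypothetical names):
#   smoothed = trailing_moving_average(misery_index(inflation, unemployment))
#   r = pearson(literary_misery[10:], smoothed)  # drop years with no full window
```

Because the smoothed series starts ten years later than the raw one, the literary series has to be trimmed by the same amount before the two are correlated.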

As with the previous work, we are glad that we received some media attention (see for example The New York Times and The Guardian), which generated quite a lot of buzz. Not surprisingly, this included some criticism. It is interesting that, while some commenters think that we are “stating the obvious”, others accuse us of applying a “crude” causal determinism, and of defending the implausible claim that the economy “dictates” literature and culture.

Personally, I am more sympathetic to the state-the-obvious side of the debate, so I am not going to write about it (but: we are able to substantiate an “obvious” claim – economic conditions influence cultural mood – with empirical data, as well as to refine it, for example by providing a possible estimate of the time lag). Regarding the other side of the debate, I would not say that the economy “dictates” literature, but it is quite plausible that economic conditions have an effect on mood. This is not just common sense: many studies link, for example, financial strain to depressive symptoms (here), or to general psychological distress (here). If the Google corpus is a good barometer of a culture’s mood, our results are not particularly surprising. This does not mean of course that all books published, for example, in the 80s, were gloomy (I feel like I am underestimating the intelligence of the readers, but some journalists seem to criticise our result on this shaky basis), or that the economy alone has a causal effect on literature or culture.

On a related note, given that I can safely assume that most of the “crude determinism” critics come from literary or, more generally, humanities departments: I like to imagine that a well-known German philosopher, who was once highly praised there, would be very supportive of our work!

KarlMarx

Reference

Bentley R.A., Acerbi A., Ormerod P., Lampos V. (2014), Books Average Previous Decade of Economic Misery, PLoS ONE, 9 (1): e83147.

“Happiness” in 20th Century English books

Just to give an idea of the analysis mentioned in the previous post, the plot below shows the trend for a rough measure of the “happiness” of the books present in the Google Books database. For WordNet-Affect (WNA) this is obtained, simplifying a little, by subtracting the cumulative score of the “Sadness” category from that of the “Joy” category, while for Linguistic Inquiry and Word Count (LIWC) the two (equivalent) categories are called “Positive emotions” and, again, “Sadness”. Values above zero indicate generally ‘happy’ periods, and values below zero indicate generally ‘sad’ periods.
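As a rough illustration of the subtraction described above, the sketch below standardises two yearly series and takes their difference. The z-transformation step and the function names are assumptions of mine (the text only says “simplifying a little”), so this should be read as one plausible way to build such an index, not as the exact computation behind the plot.

```python
from statistics import mean, pstdev

def z_scores(series):
    """Standardise a yearly series to mean 0 and standard deviation 1."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]

def happiness_index(joy, sadness):
    """Joy minus Sadness, each standardised first (assumed step).

    Positive values mark years that are 'happier' than the period's
    average, negative values mark 'sadder' years.
    """
    return [j - s for j, s in zip(z_scores(joy), z_scores(sadness))]
```

Standardising first puts the two categories on the same scale, so that the category with more (or more frequent) words does not dominate the difference.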

fig1

This result is interesting to me not so much because we can discover something new about the last century (even though I wonder why the 80s seem to be so sad), but because if (i) two independent ways of scoring the emotional content of texts, (ii) through a quite rough analysis of (iii) an enormous database of books, give highly correlated trends, this means that there is a meaningful “signal” that we can extract (which cannot be taken for granted).

We also performed an analogous analysis using a tool called “Hedonometer” (HED – see the plot below). In this case the results are quite different, even though some similarities are present, e.g. the positive peak of the 20s, the negative peak corresponding to the Second World War, and the post-80s increase in happiness. The reason is probably that LIWC and WNA are conceptually quite different from HED. LIWC and WNA are basically “lists” of words related to specific emotions (so, for example, the first five words, alphabetically, in LIWC’s “Sadness” category are: abandon*, ache*, aching, agoniz*, agony), while HED uses a list of generic words that are not directly related to emotional states but were rated by human subjects as particularly happy or sad. So, for example, HED scores the presence in texts of words such as “terrorism” or “Christmas”.
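The conceptual difference between the two kinds of tools can be sketched in a few lines. The five patterns are the LIWC “Sadness” entries quoted above; everything else (the function names, the rating values, and the way scores are aggregated) is hypothetical and only meant to show list matching versus rated words, not how either tool is actually implemented.

```python
from fnmatch import fnmatchcase

# Entries quoted in the text; a trailing * matches any word ending.
SADNESS_PATTERNS = ["abandon*", "ache*", "aching", "agoniz*", "agony"]

# Made-up ratings for illustration only, NOT real Hedonometer values.
ILLUSTRATIVE_RATINGS = {"christmas": 7.9, "terrorism": 1.5, "agony": 2.0}

def list_score(tokens, patterns):
    """List-based scoring: fraction of tokens that match the emotion list."""
    hits = sum(any(fnmatchcase(t, p) for p in patterns) for t in tokens)
    return hits / len(tokens)

def rated_score(tokens, ratings):
    """Rating-based scoring: mean rating of the tokens that have one.

    Returns None when no token in the text has a rating.
    """
    rated = [ratings[t] for t in tokens if t in ratings]
    return sum(rated) / len(rated) if rated else None
```

With the first approach a word like “christmas” is invisible unless it sits on an emotion list, while with the second it contributes to the score even though it does not name an emotion.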

fig2

One interesting thing to notice regarding HED is that it is the only index that “tracks” the effect of the First World War. Also, comparing the absolute values of our results (the right y-axis in the plot above) with the values obtained for contemporary Twitter messages (see here), it seems that, in general, books tend to be slightly more “sad” than tweets.

If you are interested in more details, and in the other analyses, the preprint of our contribution can be found here.

Robustness of emotion extraction from 20th Century English books

Today I’ll give a short talk at the Big Humanities Workshop, held in conjunction with the 2013 IEEE International Conference on Big Data, on our research on the emotional content of English-language books.

In a previous work we analysed the usage of emotion-related words using the Google Books database. We reported there three main findings:

  1. the existence of distinct periods of positive and negative “moods” detectable through automatic analysis of the texts.
  2. a steady decrease in the usage of emotion-related words throughout the century.
  3. a divergence between American English and British English books, with the former getting more “emotional” starting from the 1960s.

The next step has been to perform additional analyses to check the robustness of these results. In detail, we re-ran the same analysis with the latest (2012) version of the Google Books corpus (which contains approximately 3 million more books than the one we used originally); we compared the results of different, independent ways of scoring the emotional content of the texts (originally we used WordNet-Affect, which we now compare with Linguistic Inquiry and Word Count and “Hedonometer”); we ran more detailed statistical analyses (to check the effect of high-frequency mood words that might determine on their own the trends for specific emotions, obscuring the role of the numerous low-frequency terms); and, finally, we compared our original results with trends obtained by considering only terms tagged as adjectives or adverbs, which are considered reliable indicators of emotional content (Part-Of-Speech information was not present in the first version of the Google corpus).

Overall, we were happy to see that the original results proved to be quite robust (especially results #2 and #3). The next step would now be to understand what they mean – to me, the decrease in emotional content is especially interesting – assuming that they do not derive from some idiosyncrasy of the Google database. Apparently the official Proceedings of the IEEE Big Data Conference are not available yet, but here you can find a preprint of our contribution (thanks to Bill, coauthor together with Alex Bentley).

Unfortunately I will not be physically in a room in Santa Clara, California, to present my talk. It would have been very interesting for me to get to know more of the “Digital Humanities” world (to me, books are just one kind of artefact useful for studying more general cultural dynamics, and it happens that they are convenient to quantify and have temporal depth – some talk, in this regard, of long data). Hopefully there will be other occasions. Also, my remote talk will end up being after 11 pm Bristol time, and after Puccini’s La Bohème, so if you, reader, are at the workshop, I apologise in advance…

Normalization biases in Google Ngram

The number of books present in the Google Ngram database varies considerably through the years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.

totWords

In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project) word frequency is obtained by normalizing the word count with the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words would be affected by the influx of data, special characters, etc. that may have increased in recent books. In contrast, the word “the” would be a good representative of “real” writing and “real” sentences. If this is correct, we would expect tot-normalized words to be biased towards a decrease in frequency in recent years.
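The two normalizations differ only in the denominator, as the following minimal sketch shows. All the counts are made up for illustration, and the function name is mine; the toy numbers are chosen so that the total word count grows a little faster than the count of “the” in the last year, which is the situation the recent-trash argument describes.

```python
def normalize(word_counts, denominators):
    """Divide each yearly count of a word by that year's denominator."""
    return [count / denom for count, denom in zip(word_counts, denominators)]

# Made-up yearly counts for one word, for all words, and for "the".
word = [50, 100, 200]
total = [1_000, 2_000, 4_400]  # inflated in the last year by non-text "trash"
the = [60, 120, 240]           # grows at the same pace as the word itself

tot_normalized = normalize(word, total)  # tot-normalization
the_normalized = normalize(word, the)    # the-normalization
```

In this toy case the tot-normalized frequency drops in the last year while the the-normalized one stays flat, even though the underlying usage of the word has not changed.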

Overall, the total number of words and the number of occurrences of “the” are, not surprisingly, strongly correlated (a visualisation is provided below – each data point represents one year from 1900 to 2000 inclusive), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?

TotVSThe

To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the Part of Speech database, stemmed them (but the results are the same for the non-stemmed words) and searched for those words in the Google Ngram database, limiting the search to the years from 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).

allAverages

Even on visual inspection (you can click on the image for a larger version) it seems quite clear that the frequencies of the same words tend to decrease in the case of the tot-normalization and to increase in the case of the the-normalization. If we average the repetitions the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
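The sampling, averaging and z-transformation steps described above can be sketched as follows. This is a simplified reconstruction under stated assumptions: freq_by_word is a hypothetical dictionary mapping each candidate word to its already normalized yearly frequencies, the function names and the fixed random seed are mine, and stemming and the actual database lookup are left out.

```python
import random
from statistics import mean, pstdev

def average_series(series_list):
    """Year-by-year mean of several equal-length frequency series."""
    return [mean(year_values) for year_values in zip(*series_list)]

def z_transform(series):
    """Rescale a series to mean 0 and standard deviation 1."""
    m, s = mean(series), pstdev(series)
    return [(x - m) / s for x in series]

def repeated_random_average(freq_by_word, n_words=100, n_repeats=100, seed=0):
    """Average trend of random words, z-transformed.

    Each repetition draws n_words distinct words and averages their
    series; the repetitions are then averaged and standardised.
    """
    rng = random.Random(seed)
    vocabulary = sorted(freq_by_word)
    repetitions = []
    for _ in range(n_repeats):
        sample = rng.sample(vocabulary, n_words)
        repetitions.append(average_series([freq_by_word[w] for w in sample]))
    return z_transform(average_series(repetitions))
```

Running this once on tot-normalized frequencies and once on the-normalized ones would give two curves on the same scale, which is what makes the opposite drifts easy to compare.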

zScores

If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalized) but it also shows that with the the-normalization the opposite problem is present, that is, word frequency artificially increases in recent years. We have a few ideas to explain why this should be the case, but they need to be tested.

These biases do not represent a problem when comparing trends of words (like “Sherlock Holmes” vs “Frankenstein”) as long as they are normalized in the same way (obviously). However, if one takes a single word, or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, showing a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).

References

Michel et al, 2011, Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 331 (6014) 

Bentley et al, 2012, Word Diffusion and Climate Science, PLoS ONE, 7 (11)

Acerbi A, Lampos V, Garnett P, Bentley RA, 2013, The Expression of Emotions in 20th Century Books, PLoS ONE, 8 (3)

The Expression of Emotions in 20th Century Books

We just published a paper in the journal PLoS ONE in which we analysed the usage of words with emotional content in English-language books, using the enormous database provided by Google Books (the version we used contains more than 5 million books).

We found, for example, that there is a general, steady decrease in the usage of words with emotional content throughout the last century, with the interesting exception of words associated with “fear”, which show the opposite trend starting from the 70s. We also found that American and British books are quite different in their trends regarding emotional content, with American books being more ‘emotional’ than British ones. Perhaps surprisingly, this divergence is only observable from the 60s, while, before that, books in the two variants of English showed pretty much the same trends.

These findings resonate well with the popular narrative, but it is great (at least from my quantitative-scientific-minded-anthropological point of view) that we can support it with data, and that we will be able to use those data to dig further into it. Of course many big questions are open: for example, we don’t know what caused those changes – but hopefully our results could provide a starting point to study this – and we don’t know what the relationship is between changes in books and broader cultural changes. My hope is that, given the amount of data, and the fact that Google Books is not explicitly biased towards successful or influential books, we may be able to detect genuine long-term cultural changes rather than just ‘literary’ ones.

The paper is open access and can be found here. We had quite a lot of press interest: articles, not surprisingly, varied from enthusiastic to skeptical, and from accurate and scientifically sound to sort-of sensationalist (and I learned a new British word: boffin), but overall I am happy with what happened. Philip Ball wrote a great (and ‘neat’) piece for Nature about our work, and I was briefly interviewed by Adam Rutherford for the BBC Radio 4 science programme Material World.

References

Acerbi A., Lampos V., Garnett P., Bentley R. A. (2013), The expression of emotions in 20th century books, PLOS ONE, 8(3), e59030