I found, thanks to Twitter-induced serendipity (others call it procrastination), the lyrics of the songs included in the annual Billboard Top-100 from 1965 to 2015 (i.e., allowing for a few missing ones, ~5,000 songs). On GitHub you can find, together with the raw data, some clarifications on how the data were collected and their limitations, plus a pointer to a nice analysis already done.
Just to give an idea of the analysis mentioned in the previous post, the plot below shows the trend for a rough measure of the “happiness” of the books present in the Google Books database. For WordNet-Affect (WNA) this is obtained, simplifying a little, by subtracting the cumulative score of the “Sadness” category from that of the “Joy” category, while for Linguistic Inquiry and Word Count (LIWC) the two (equivalent) categories are called “Positive emotions” and, again, “Sadness”. Values above zero indicate generally ‘happy’ periods, and values below zero indicate generally ‘sad’ periods.
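To make the "Joy minus Sadness" idea concrete, here is a minimal sketch in Python. It is not the code used for the plot: the word lists are tiny hypothetical stand-ins for the real WNA/LIWC categories, the function names are made up for illustration, and dividing by the total number of words is an assumed simplification of whatever normalization the full analysis applies.

```python
from collections import Counter

# Hypothetical toy word lists; the real WNA/LIWC categories are far larger.
JOY = {"happy", "joy", "delight"}
SADNESS = {"sad", "grief", "sorrow"}


def happiness_score(tokens):
    """Relative frequency of 'joy' words minus that of 'sadness' words.

    Normalizing by the token count is an assumption made for this sketch.
    """
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    joy = sum(counts[word] for word in JOY)
    sad = sum(counts[word] for word in SADNESS)
    return (joy - sad) / len(tokens)


def score_by_year(texts_by_year):
    """Map each year to the score of its (whitespace-tokenized) text."""
    return {
        year: happiness_score(text.lower().split())
        for year, text in texts_by_year.items()
    }
```

With this convention a year scores above zero when "joy" words outnumber "sadness" words, and below zero otherwise.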
This result is interesting to me not so much because we can discover something new about the last century (even though I wonder why the 80s seem to be so sad), but because if (i) two independent ways to score the emotional content of texts, (ii) through a quite rough analysis of (iii) an enormous database of books, give highly correlated trends, this means that there is a meaningful “signal” that we can extract (which cannot be taken for granted).
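One simple way to quantify "highly correlated trends" is the Pearson correlation between the two yearly series. The sketch below is only an illustration under that assumption (the function name and the choice of Pearson's coefficient are mine, not necessarily what was used for the plot):

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length series."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two series of equal length (at least 2 points)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / sqrt(var_x * var_y)
```

Feeding it the yearly WNA scores and the yearly LIWC scores (in the same year order) gives a value close to 1 when the two trends rise and fall together.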
We also performed an analogous analysis using a tool called “Hedonometer” (HED; see the plot below). In this case the results are quite different, even though some similarities are present, e.g. the positive peak of the 20s, the negative peak corresponding to the Second World War, and the post-80s increase in happiness. The reason is probably that LIWC and WNA are conceptually quite different from HED. LIWC and WNA are basically “lists” of words related to specific emotions (so, for example, the first 5 words, alphabetically, in LIWC’s category of “Sadness” are: abandon*, ache*, aching, agoniz*, agony), while HED uses a list of generic words not directly related to emotional states, but evaluated by human subjects as particularly happy or sad. So, for example, HED scores the presence in texts of words such as “terrorism” or “Christmas”.
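The conceptual difference between the two families of tools can be sketched in a few lines of Python. This is an illustration, not either tool's actual implementation: the patterns are the five LIWC “Sadness” entries quoted above (with a trailing `*` read as "any word starting with this stem"), the ratings are made-up numbers rather than real Hedonometer values, and taking a plain average of the ratings is an assumption.

```python
# The five LIWC "Sadness" entries quoted in the text.
SADNESS_PATTERNS = ["abandon*", "ache*", "aching", "agoniz*", "agony"]

# Made-up ratings for illustration only (higher = happier).
RATINGS = {"christmas": 8.0, "terrorism": 1.5, "table": 5.0}


def matches(pattern, word):
    """A trailing '*' matches any word starting with the stem."""
    if pattern.endswith("*"):
        return word.startswith(pattern[:-1])
    return word == pattern


def list_based_count(tokens):
    """LIWC/WNA style: count tokens belonging to an emotion word list."""
    return sum(any(matches(p, w) for p in SADNESS_PATTERNS) for w in tokens)


def rating_based_score(tokens):
    """HED style: average the human ratings of the rated tokens."""
    rated = [RATINGS[w] for w in tokens if w in RATINGS]
    return sum(rated) / len(rated) if rated else None
```

For the tokens `["abandoned", "christmas", "agony", "terrorism"]`, the list-based count sees only "abandoned" and "agony", while the rating-based score sees only "christmas" and "terrorism". The two approaches can react to completely different words in the same text, which is one plausible reason for the diverging trends.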
One interesting thing to notice regarding HED is that it is the only index that “tracks” the effect of the First World War. Also, comparing the absolute values of our results (the right y-axis in the plot above) with the values obtained for contemporary Twitter messages (see here), it seems that, in general, books tend to be slightly more “sad” than tweets.
If you are interested in more details, and in the other analyses, the preprint of our contribution can be found here.