Movie scripts dataset

I uploaded a dataset on figshare (here). From the description there:

This dataset contains 1,093 movie scripts collected from the website, each in a separate text file. The file imsdb_sample.txt contains the titles of all the movies (the corresponding file names are in the form Script_TITLE.txt).

The website was crawled in January 2017. Some scripts are not present because they were missing from the website or because they were uploaded as PDF files. Please note that (i) the original scripts were uploaded to the website by individual users, so they might not correspond exactly to the movie scripts, and typos may be present; (ii) HTML formatting was not consistent across the website, and so neither is the formatting of the resulting text files.

Even considering (i) and (ii), the quality seems good on average, and the dataset can easily be used for text-mining tasks.
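As a quick sketch of how the dataset could be loaded for text mining, relying only on the Script_TITLE.txt naming convention described above (the function name is my own, not part of the dataset):

```python
from pathlib import Path

def load_scripts(folder):
    """Load all Script_*.txt files in a folder into a {title: text} dict."""
    scripts = {}
    for path in Path(folder).glob("Script_*.txt"):
        title = path.stem[len("Script_"):]  # "Script_Alien" -> "Alien"
        scripts[title] = path.read_text(encoding="utf-8", errors="ignore")
    return scripts
```

With the folder of text files in place, `load_scripts` returns a dictionary ready for tokenization or any other downstream processing.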


@CultEvoBot retired (for the time being)

Almost three years ago I programmed a simple twitterbot (see here): a Python script that posted, every hour when available, news or blog posts related to cultural evolution – hence the name @CultEvoBot. While the goal of the endeavour was mainly to see how difficult it was to build something like that (it was easy!), and potentially to use what I learnt for other projects (I never did, but who knows!), @CultEvoBot was relatively useful and, most of the time, posted links to interesting sources.


Interesting regularities in human behaviour: older authors write happier books

[Second post in the series “Things that I probably will not develop into a proper paper, but find interesting enough to write about here”. The first is on the 20th-century decrease of turnover rate in popular culture.]

In the last couple of years, part of my research has been dedicated to exploring the emotional content of published books, using the material in the Google Books Ngram Corpus. Our analyses produced some interesting results. While analyses like ours need to be carefully weighed and possibly reproduced with various samples (but this should always happen…), I think that tools like the Google Books Corpus represent an extraordinary opportunity, as my goal is to study human culture in a scientific/quantitative framework.


Detecting cultural transmission biases in real-life dynamics

Many studies of cultural evolution have focused on how transmission biases affect the likelihood of cultural traits being transmitted. The concepts are quite intuitive. A useful distinction is between content biases, when the intrinsic features of a cultural trait make it more likely to spread (the effectiveness of a tool may be a content bias, but so may a sexual hint in an image), and context biases, when the likelihood is determined by the context, as when we tend to dress like our friends/coworkers (conformist bias; although one can do the opposite and prefer unpopular cultural traits), or as when I was trying to have a young Axl Rose haircut (prestige bias – see also my picture on the left).

Some interesting works in cultural evolution have examined, with analytical and simulation models, the adaptivity of transmission biases (e.g. did my Axl Rose haircut make me rich and/or attractive? It did not but, on average, prestige bias may be useful), or the long-term dynamics of transmission biases in idealised situations (e.g. how fast will a new cultural trait “invade” a population of conformist individuals? Or of anti-conformists?). Other works have investigated, in controlled experiments, whether people are indeed subject to transmission biases when copying from others (they are, with caveats).

What is partly missing is an understanding of the impact of transmission biases on real-life cultural dynamics. We recently had a paper accepted in Evolution and Human Behavior that tackles this problem. In brief (much more is in the paper!): (1) we focused on the turnover of popular traits, i.e. on how many new traits enter a top list of a certain size for a certain cultural domain (like here); (2) we derived predictions about what the turnover would look like if there were no biases, that is, if everybody just copied others at random (the neutral model of cultural evolution); and (3) we showed how these predictions differ when biases are present.


The turnover of some cultural domains, for example recent baby names in the USA, looks like the red line in the figure above, signalling that people tend to prefer relatively uncommon names. The turnover of others, like early baby names, the musical preferences of users who subscribed to genre-based groups (“80s Gothic”, “Acid Jazz”), or the usage of colour terms in English-language books, looks instead like the blue line, signalling a conformist bias or a content-based bias (which I call “attraction”).

Overall, turnover can be calculated whenever we have periodical top lists or, more generally, whenever we can “count” the frequency of items through time. Given the ubiquity of this kind of information in digital form, one can use this methodology to infer individual behaviour from population-level, aggregate data for several cultural domains.
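As a minimal sketch, the turnover between consecutive top lists can be counted as the number of items present in one list but absent from the previous one (an illustrative implementation of the idea, not the exact code from the paper):

```python
def turnover(top_lists):
    """Given a sequence of top lists (one per time step, each a list of
    items), return how many new items entered the list at each step."""
    return [len(set(curr) - set(prev))
            for prev, curr in zip(top_lists, top_lists[1:])]

# e.g. turnover([["a", "b"], ["a", "c"]]) counts one new entry ("c")
```

Averaging these per-step counts across a long series gives the turnover rate that can then be compared with neutral-model predictions.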

Acerbi, A. and Bentley, R.A. (2014), Biases in cultural transmission shape the turnover of popular traits, Evolution and Human Behavior, in press.

Normalization biases in Google Ngram

The number of books present in the Google Ngram database varies considerably through the years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.


In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project), word frequency is obtained by normalizing the word count by the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words is affected by the influx of data, special characters, etc. that may have increased in recent books. The word “the”, on the contrary, would be a good proxy for “real” writing and “real” sentences. If this is correct, we would expect tot-normalized word frequencies to be biased towards a decrease in recent years.
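The two normalizations can be sketched as follows, using hypothetical yearly-count dictionaries (this is my own illustration, not the Culturomics code):

```python
def tot_normalize(word_counts, total_counts):
    """Frequency of a word per year, relative to all words that year."""
    return {year: word_counts[year] / total_counts[year]
            for year in word_counts}

def the_normalize(word_counts, the_counts):
    """Frequency of a word per year, relative to the yearly
    count of the word 'the'."""
    return {year: word_counts[year] / the_counts[year]
            for year in word_counts}
```

The only difference between the two is the denominator; the biases discussed below come entirely from how those denominators drift over the years.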

Overall, the total number of words and the number of “the” are, not surprisingly, strongly correlated (a visualisation is provided below – each data point represents one year from 1900 to 2000 inclusive), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper-right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?


To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the Part of Speech database, stemmed them (but the results are the same for the non-stemmed words), and searched for those words in the Google Ngram database, limiting the search to the years from 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random-word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).


Even at visual inspection (you can click on the image for a larger version), it seems quite clear that the frequencies of the same words tend to decrease in the case of the tot-normalization and to increase in the case of the the-normalization. If we average the repetitions, the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
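The z-transform and the averaging across repetitions could look like this (a sketch of the procedure described above; the actual analysis code is not shown in the post):

```python
from statistics import mean, stdev

def z_transform(series):
    """Standardize a yearly frequency series to zero mean, unit variance."""
    m, s = mean(series), stdev(series)
    return [(x - m) / s for x in series]

def average_repetitions(repetitions):
    """Average z-transformed series across repetitions, year by year."""
    z = [z_transform(r) for r in repetitions]
    return [mean(year_values) for year_values in zip(*z)]
```

Because each series is standardized before averaging, the two normalizations can be plotted on the same scale, while upward and downward trends are preserved.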


If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalized), but it also shows that the the-normalization has the opposite problem, that is, word frequency artificially increases in recent years. We have a few ideas about why this should be the case, but they need to be tested.

These biases do not represent a problem when comparing trends between words (like “Sherlock Holmes” vs. “Frankenstein”), as long as they are normalized in the same way (obviously). However, if one takes a single word or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, showing a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).
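A comparison against random-word baselines could be sketched like this, using a deliberately crude trend measure of my own choosing (not the method used in the paper):

```python
from statistics import mean

def relative_trend(target_freqs, random_freqs_sets):
    """Trend of a target word set minus the mean trend of random word
    sets normalized the same way, so the normalization bias cancels."""
    def trend(series):
        # crude trend: mean of the late half minus mean of the early half
        half = len(series) // 2
        return mean(series[half:]) - mean(series[:half])
    baseline = mean(trend(s) for s in random_freqs_sets)
    return trend(target_freqs) - baseline
```

A positive value means the target set rises more than random words do under the same normalization, so the residual trend is attributable to the words themselves rather than to the normalization.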


Michel et al., 2011, Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 331 (6014)

Bentley et al., 2012, Word Diffusion and Climate Science, PLoS ONE, 7 (11)

Acerbi A., Lampos V., Garnett P., Bentley R.A., 2013, The Expression of Emotions in 20th Century Books, PLoS ONE, 8 (3)