Detecting cultural transmission biases in real-life dynamics

Many studies of cultural evolution have focused on how transmission biases affect the likelihood of cultural traits being transmitted. The concepts are quite intuitive. A useful distinction is between content biases, when the intrinsic features of a cultural trait make it more likely to spread (the effectiveness of a tool may be a content bias, but so may a sexual hint in an image), and context biases, when the likelihood is determined by the context, as when we tend to dress like our friends/coworkers (conformist bias; but one can do the opposite and prefer unpopular cultural traits), or as when I was trying to have a young Axl Rose haircut (prestige bias – see also my picture on the left).

Some interesting works in cultural evolution have used analytical and simulation models to examine the adaptivity of transmission biases (e.g. did my Axl Rose haircut make me rich and/or attractive? It did not, but, on average, prestige bias may be useful) or to examine the long-term dynamics of transmission biases in idealised situations (e.g. how fast will a new cultural trait “invade” a population of conformist individuals? Or of anti-conformists?). Other works investigated, in controlled experiments, whether people are indeed subject to transmission biases when copying from others (they are, with caveats).

What is partly missing is an understanding of the impact of transmission biases on real-life cultural dynamics. We recently had a paper accepted in Evolution and Human Behavior that tackles this problem. In brief (much more is in the paper!): (1) we focused on the turnover of popular traits, i.e. on how many new traits enter a top list of a certain size for a certain cultural domain (like here); (2) we derived predictions on how the turnover would look if there were no biases, that is, if everybody just copied others at random (the neutral model of cultural evolution); and (3) we showed how these predictions differ when biases are instead present.


The turnover of some cultural domains, for example recent baby names in the USA, looks like the red line in the figure above, signalling that people tend to prefer relatively uncommon names. The turnover of others, like early baby names, musical preferences of users who subscribed to genre-based groups (“80s Gothic”, “Acid Jazz”), or the usage of colour terms in English-language books, looks instead like the blue line, signalling a conformist bias, or a content-based bias (which I call “attraction”).

Overall, turnover can be calculated when we have periodical top lists, or, more generally, when we can “count” the frequency of items through time. Given the ubiquity of this kind of information in digital form, one can use this methodology to infer individual behaviour from population-level, aggregate data for several cultural domains.
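The basic turnover measure is easy to compute from consecutive top lists. A minimal sketch (the toy name lists below are illustrative, not the paper's data):

```python
# Turnover: how many items in the top list at time t were not
# in the top list at time t-1.

def turnover(top_lists):
    """top_lists: a list of lists, each holding the top-y items for one period.
    Returns one turnover value per pair of consecutive periods."""
    return [len(set(curr) - set(prev))
            for prev, curr in zip(top_lists, top_lists[1:])]

# toy example: three yearly top-3 lists of baby names
lists = [["Emma", "Olivia", "Ava"],
         ["Emma", "Mia", "Ava"],
         ["Noah", "Mia", "Ava"]]
print(turnover(lists))  # [1, 1] — one new name enters the top 3 each year
```

Averaging these values over many periods gives the turnover profile that can then be compared against the neutral-model prediction.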

Acerbi, A. and Bentley, R.A. (2014), Biases in cultural transmission shape the turnover of popular traits, Evolution and Human Behavior, in press.

Normalization biases in Google Ngram

The number of books in the Google Ngram database varies considerably through the years. As the plot below shows, even considering only the last century, the number of words increases roughly tenfold between 1900 and 2000.


In the Google Ngram Viewer (and in the 2011 Science paper that introduced the Culturomics project), word frequency is obtained by normalizing the word count with the total number of words for each year (tot-normalization). Others (for example Bentley et al. 2012, Acerbi et al. 2013) preferred to normalize using the yearly count of the word “the” (the-normalization). The rationale (let’s call it the “recent-trash” argument) is that the raw number of words would be affected by the influx of data, special characters, etc. that may have increased in recent books. On the contrary, the word “the” would be a good proxy for “real” writing and “real” sentences. If this is correct, we would expect tot-normalized words to be biased towards a decrease in frequency in recent years.
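The two normalizations differ only in the denominator. A sketch with made-up yearly counts (the numbers are purely illustrative assumptions, not taken from the Ngram data):

```python
# Two ways to turn yearly word counts into frequencies.
# All counts below are hypothetical toy numbers.

word_count = {1900: 120, 1950: 400, 2000: 900}       # count of some word
total_count = {1900: 1e6, 1950: 4e6, 2000: 1e7}      # all words that year
the_count = {1900: 6e4, 1950: 2.3e5, 2000: 5.4e5}    # count of "the" that year

# tot-normalization: divide by the total word count (as in the Ngram Viewer)
tot_norm = {y: word_count[y] / total_count[y] for y in word_count}

# the-normalization: divide by the yearly count of "the"
# (as in Bentley et al. 2012 and Acerbi et al. 2013)
the_norm = {y: word_count[y] / the_count[y] for y in word_count}

print(tot_norm[2000], the_norm[2000])
```

If the total count inflates faster than the count of “the” in recent years (the recent-trash argument), the same word count yields a lower tot-normalized frequency than the-normalized one, exactly the divergence examined below.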

Overall, the total number of words and the number of “the” are, not surprisingly, strongly correlated (a visualisation is provided below – each data point represents one year from 1900 to 2000 included), but, again not surprisingly, small differences exist. Most importantly, the differences are small but consistent: for example, in recent years, the count of “the” is consistently lower (in proportion) than the total count of words (as the upper right corner of the plot shows). This is indeed what would be expected according to the recent-trash argument. However, the question is: what is the influence of these differences?


To try to answer this question, I re-analysed some data I had collected to test the new (July 2012) version of the database. In short, I extracted 100 random words from the Part of Speech database, stemmed them (but the results are the same for the non-stemmed words), and searched for those words in the Google Ngram database, limiting the search from 1900 to 2000. I repeated this operation 100 times (making a total of 10,000 random-word searches). I tried both normalizations: the plots below show the same 100 repetitions (averaged) for the tot-normalization (left) and the the-normalization (right).


Even at visual inspection, it seems quite clear that the frequencies of the same words tend to decrease in the case of the tot-normalization and to increase in the case of the the-normalization. If we average the repetitions, the effect is more striking (I also z-transformed the data to have the same scale in the two plots, but this does not change the trends).
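The averaging and z-transform step can be sketched as follows (the toy frequency series are illustrative assumptions, not the actual Ngram results):

```python
# Average the repetitions year by year, then z-transform the resulting
# series so that series with different scales can be plotted together.
import statistics

def z_transform(series):
    """Standardize a series to mean 0 and (sample) standard deviation 1."""
    mu = statistics.mean(series)
    sd = statistics.stdev(series)
    return [(x - mu) / sd for x in series]

# toy data: two repetitions, each a frequency series over three years
reps = [[0.010, 0.009, 0.008],
        [0.012, 0.010, 0.007]]

avg = [statistics.mean(col) for col in zip(*reps)]  # average over repetitions
print(z_transform(avg))
```

Since the z-transform is a linear rescaling, it changes only the scale, not the direction of the trend, which is why it cannot create or hide the decrease/increase discussed above.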


If I am not missing something, this confirms the recent-trash argument (words do tend to decrease in frequency when tot-normalized), but it also shows that the the-normalization has the opposite problem, that is, word frequency artificially increases in recent years. We have a few ideas to explain why this should be the case, but they need to be tested.

These biases do not represent a problem when comparing trends of words (like “Sherlock Holmes” vs. “Frankenstein”) as long as they are normalized in the same way (obviously). However, if one takes a single word, or, especially, a set of semantically related words (e.g. words associated with emotions, religion, economy, etc.) to analyse their “absolute” trends, the normalization might create unwanted effects. One way to avoid this is to compare them with trends from random words normalized in the same way (as we did in our recent paper, showing a general decrease in the use of words related to emotions, with the exception of words associated with “fear”).
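The random-word baseline idea can be sketched like this. All series and the sampling below are illustrative assumptions, not the data or code from the paper:

```python
# Compare a target word set against a baseline of random words normalized
# the same way; the shared normalization bias cancels in the difference.
import random

def mean_series(series_list):
    """Element-wise mean of several equal-length series."""
    return [sum(col) / len(col) for col in zip(*series_list)]

# hypothetical normalized frequencies, one value per year
target = [[0.90, 0.80, 0.60],        # e.g. emotion-related words
          [1.00, 0.70, 0.50]]
random_pool = [[1.00, 0.90, 0.85], [0.95, 0.90, 0.80],
               [1.05, 1.00, 0.90], [1.00, 0.95, 0.85]]

baseline = mean_series(random.sample(random_pool, 3))
trend = [t - b for t, b in zip(mean_series(target), baseline)]
print(trend)  # residual trend of the target set relative to random words
```

Because both the target set and the random words carry the same normalization artifact, any decline that survives the subtraction is more plausibly a real change in usage.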


Michel et al., 2011, Quantitative Analysis of Culture Using Millions of Digitized Books, Science, 331 (6014)

Bentley et al., 2012, Word Diffusion and Climate Science, PLoS ONE, 7 (11)

Acerbi, A., Lampos, V., Garnett, P., Bentley, R.A., 2013, The Expression of Emotions in 20th Century Books, PLoS ONE, 8 (3)