Movie scripts dataset

I uploaded a dataset to figshare (here). From the description there:

This dataset contains 1,093 movie scripts collected from the website imsdb.com, each in a separate text file. The file imsdb_sample.txt contains the titles of all movies (corresponding file names are in the form Script_TITLE.txt).

The website was crawled in January 2017. Some scripts are not present as they were missing in imsdb.com or because they were uploaded as pdf files. Please notice that (i) the original scripts were uploaded on the website by individual users, so that they might not correspond exactly to the movie scripts and typos may be present; (ii) html formatting was not consistent in the website, and so neither is the formatting of the resulting text files.

Even considering (i) and (ii), the quality seems good on average and the dataset can be easily used for text-mining tasks.
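As a starting point for such text-mining tasks, here is a minimal Python sketch that loads the scripts and counts words. It assumes that all files sit in one local folder (the name `imsdb_scripts` is hypothetical), that imsdb_sample.txt lists one title per line, and that each title maps to a file name exactly as Script_TITLE.txt. The description above does not confirm these details, so adjust as needed.

```python
from collections import Counter
from pathlib import Path
import re


def script_path(folder, title):
    """Build the expected file name for a title (Script_TITLE.txt)."""
    return Path(folder) / f"Script_{title}.txt"


def read_titles(folder, index_name="imsdb_sample.txt"):
    """Read the titles, assuming one title per line (an assumption)."""
    text = (Path(folder) / index_name).read_text(encoding="utf-8", errors="replace")
    return [line.strip() for line in text.splitlines() if line.strip()]


def word_counts(text):
    """Lower-case the text and count alphabetic tokens."""
    return Counter(re.findall(r"[a-z']+", text.lower()))


def load_scripts(folder):
    """Return {title: text} for every listed title whose file exists."""
    scripts = {}
    for title in read_titles(folder):
        path = script_path(folder, title)
        if path.exists():  # some listed scripts may be absent
            scripts[title] = path.read_text(encoding="utf-8", errors="replace")
    return scripts


if __name__ == "__main__":
    folder = "imsdb_scripts"  # hypothetical local folder
    if Path(folder).is_dir():
        scripts = load_scripts(folder)
        total = Counter()
        for text in scripts.values():
            total.update(word_counts(text))
        print(len(scripts), "scripts loaded")
        print(total.most_common(10))
```

Reading with `errors="replace"` is a deliberate choice here: since the formatting of the files is not consistent, a stray byte should not stop the whole load.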


@CultEvoBot retired (for the time being)

Almost three years ago I programmed a simple Twitterbot (see here), namely a Python script that posted every hour, when available, news or blog posts related to cultural evolution, hence the name @CultEvoBot. While the goal of the endeavour was mainly to see how difficult it was to build something like that (it was easy!), and potentially to use what I learnt for other projects (I never did, but who knows!), @CultEvoBot was relatively useful and posted links to interesting sources most of the time.


Interesting regularities in human behaviour: older authors write happier books

[Second post of the series “Things that I probably will not develop in a proper paper, but I find interesting enough to write here”. The first is on the twentieth-century decrease in turnover rate in popular culture]

In the last couple of years, part of my research has been dedicated to exploring the emotional content of published books, using the material in the Google Books Ngram Corpus. Our analysis produced some interesting results. While analyses like ours need to be carefully weighed and possibly reproduced with various samples (but this should always happen…), I think that tools like the Google Books Corpus represent an extraordinary opportunity, as my goal is to study human culture in a scientific/quantitative framework.


Books Average Previous Decade of Economic Misery

Almost one year ago, we published a paper in which we described a large-scale analysis of cultural/literary trends, carried out using the Google Books Ngram Corpus. In particular, we showed that, through a relatively simple extraction of emotion-related words (words semantically related to “main” emotions like joy, sadness, anger, etc.), it was possible to detect some clear tendencies: a general decline in the emotional “tone” of books published in the twentieth century (or at least in the frequencies of emotion words), a divergence between American and British English (with the former being more emotional), and, finally, the existence of distinct periods of “literary mood” in the last century.

Related to the last point, PLOS ONE just published a follow-up to this research, in which we correlate this literary mood with the economic trends of the past century. The image below shows the main point of our study.

[Figure: Literary Misery Index and moving average of the economic Misery index]

The red line is what we called the “Literary Misery Index” (how “sad” books are in a certain year, on average), which we extracted from the books in the Google Corpus, while the black line is an 11-year moving average of the economic Misery index (how “bad” the economy is in a certain year), a well-known economic index calculated by adding the inflation and unemployment rates. The two trends are strongly correlated (you can read more in the Bristol University press release here, and, of course, in the original paper).
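The arithmetic behind the black line can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the window is taken to be trailing (the year itself plus the ten before it), which this post does not specify, the function names are mine, and the numbers in the demo are made up rather than real economic data.

```python
from math import sqrt


def misery_index(inflation, unemployment):
    """Economic misery index per year: inflation rate plus unemployment rate."""
    return {year: inflation[year] + unemployment[year]
            for year in inflation if year in unemployment}


def trailing_average(series, window=11):
    """Average each year with the (window - 1) years before it.

    Years that lack a full window of consecutive data are skipped.
    """
    averaged = {}
    for year in sorted(series):
        years = range(year - window + 1, year + 1)
        if all(y in series for y in years):
            averaged[year] = sum(series[y] for y in years) / window
    return averaged


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


if __name__ == "__main__":
    # Made-up rates (percent), for illustration only.
    inflation = {1970 + i: 2.0 + (i % 5) for i in range(20)}
    unemployment = {1970 + i: 5.0 + (i % 3) for i in range(20)}
    smoothed = trailing_average(misery_index(inflation, unemployment))
    for year, value in smoothed.items():
        print(year, round(value, 2))
```

To compare the smoothed index with a literary index, one would pass the values for the years both series share to `pearson`.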

As with the previous work, we are glad we had some media attention (see for example The New York Times and The Guardian), which generated quite a lot of buzz. Not surprisingly, this included some criticism. It is interesting that, while some commenters think that we are “stating the obvious”, others accuse us of applying a “crude” causal determinism, and of defending the implausible claim that the economy “dictates” literature and culture.

Personally, I am more sympathetic to the state-the-obvious side of the debate, so I am not going to write about it (but: we are able to substantiate an “obvious” claim, that economic conditions influence cultural mood, with empirical data, as well as to provide some refinement, for example a possible estimate of a time lag). Regarding the other side of the debate, I would not say that the economy “dictates” literature, but it is quite plausible that economic conditions may have an effect on mood. This is not just common sense: many studies link, for example, financial strain and depressive symptoms (here), or general psychological distress (here). If the Google corpus is a good barometer of a culture's mood, our results are not particularly surprising. This does not mean of course that all books published, for example, in the 80s, were gloomy (I feel like I am underestimating the intelligence of the readers, but some journalists seem to criticise our result on this shaky basis), or that the economy alone has a causal effect on literature or culture.

On a related note, given that I can safely assume that most of the “crude determinism” critics come from literary or, in general, humanities departments: I like to imagine that a well-known German philosopher, who was once highly praised there, would be very supportive of our work!

[Image: Karl Marx]

Reference

Bentley R.A., Acerbi A., Ormerod P., Lampos V. (2014), Books Average Previous Decade of Economic Misery, PLoS ONE, 9 (1): e83147.

Meet @CultEvoBot, my first Twitterbot

[Image: @CultEvoBot]

In the last few days, for independent reasons, (i) I was told the Horse_ebooks story (in short, an “artistic” project where humans pretended to be a Twitterbot and gained around 200K followers; if you don’t know anything about it, please read the Wikipedia page and the links cited in the References there, it is quite interesting), (ii) I stumbled upon this page with a few examples of Twitterbots worth following (at least according to digitaltrends.com), and, finally, (iii) I was pointed to this NYTimes article (from August 2013) on social bots (claiming, among other things, that only 35% of Twitter users are humans). This seemed enough to me to try and see how difficult it was to set up a Twitterbot.

A Twitterbot is a program that produces automated posts via Twitter (surprise!). In my case, @CultEvoBot is a short Python script that every hour, when my laptop is on, uses Google news search or Google blogs search (after having flipped a coin to decide) and searches there for “cultural evolution”. It then goes through the links proposed and, if one is not in its log file of past links, posts it on Twitter with the title provided by Google (and adds it to its log file). That’s all (it also follows its followers, which is completely useless at the moment, among other things because I am the only follower, but might be useful in the future).
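The logic just described can be sketched as follows. This is not the actual @CultEvoBot code: `search` and `post` are hypothetical callables standing in for the search and Twitter steps (no real API is called), the function names are mine, and stopping after the first unseen link in each run is an assumption. The hourly scheduling is left out and could be handled by any scheduler.

```python
import random


def choose_source(rng=random):
    """Flip a coin to pick between news search and blog search."""
    return "news" if rng.random() < 0.5 else "blogs"


def load_log(path):
    """Return the set of links already posted (empty if there is no log yet)."""
    try:
        with open(path, encoding="utf-8") as handle:
            return {line.strip() for line in handle if line.strip()}
    except FileNotFoundError:
        return set()


def run_once(search, post, log_path, query="cultural evolution", rng=random):
    """Post the first unseen link for the query and return it, or None.

    `search(source, query)` must return (title, link) pairs and
    `post(text)` must publish the text; both are hypothetical stand-ins.
    """
    seen = load_log(log_path)
    for title, link in search(choose_source(rng), query):
        if link not in seen:
            post(f"{title} {link}")
            # Remember the link so that it is never posted twice.
            with open(log_path, "a", encoding="utf-8") as handle:
                handle.write(link + "\n")
            return link
    return None


if __name__ == "__main__":
    import os
    import tempfile

    def fake_search(source, query):
        # Stand-in results; a real bot would query a search service here.
        return [("An example post", "https://example.org/post")]

    log_file = os.path.join(tempfile.mkdtemp(), "posted_links.txt")
    run_once(fake_search, print, log_file)  # prints the title and link
    run_once(fake_search, print, log_file)  # nothing new, so nothing is posted
```

Passing the search and posting steps in as functions keeps the duplicate check testable without network access, and either step can be swapped without touching the rest.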

So basically, @CultEvoBot does not do much more than provide links to potentially interesting sources; still, I am pretty satisfied with the result. Programming a Twitterbot, even one with more elaborate functions (like replying to specific users or posts, re-tweeting, etc.), seems quite straightforward, and I can imagine that I will be able to use them in the future for scientific (or artsy) projects, even though at the moment I don’t have any specific idea (suggestions welcome).