From time to time we publish work by our staff, or by students who, during their training, wrote a deep analysis and report of their project’s findings.
Could you train an algorithm to take in a text and say how funny it is? When I started Israel Tech Challenge’s Data Science concentration, I might have thought that was a joke – but after just eight weeks, that was the task I set myself for my capstone Machine Learning project. After learning about the theoretical background and practice of many aspects of ML, we were abruptly thrust into the real world! We would have to build an end-to-end machine learning project from scratch: from data handling to exploration, feature engineering to modelling, and everything else. I scoured the world of open data (using this great list of hundreds of datasets) and settled on a fun challenge: using Natural Language Processing, I would try to answer the question “what is funny?”
My dataset came from the New Yorker caption contest. Each week since 2005, the New Yorker magazine has published an uncaptioned cartoon on the back page, challenging readers to supply a funny caption, attracting roughly 5,000 weekly submissions.
(See the example above.) Despite many brilliant submissions, I’d never yet managed to win this coveted prize, and it was starting to look less and less likely. My idea? Replace the ivory tower of cartoon-editor judgment with a scientific, quantifiable, replicable score – enter the FunnyScore™.
The data collection for this project was done by the New Yorker itself. On the submission page, the magazine offers readers the opportunity to rank how funny other people’s captions are (using software called NEXT1, originally developed to test how genes express related diseases), and uploads the results to a GitHub repository. In addition to lightening the editors’ workload, this provides labelled captions for training an algorithm, reverse engineering the competition and earning undying glory.
Gleaning insight from this corpus is not a trivial challenge. For example, below is an image taken from a recent contest.
A well-trained neural network might detect some of the following semantic attributes in this image: matryoshka dolls, Russian women, police, detectives, lineup. Three captions a user is asked to vote on are2:
“Looks like an inside inside inside inside inside job.”
Ryan Spiers, San Francisco, Calif.
“What makes you think the Russians were involved?”
Andrew Ward, Swarthmore, Pa.
“We gotta get the short one to open up.”
Steve Everhart, Tyrone, Pa.
The first caption uses an attribute implicit in an object present in the image, but not one of the image’s first-order components. The second caption makes puckish reference to current events, while the third relies on a subtle pun. Clearly, explaining why this is funny would be difficult for an algorithm (as even some humans might struggle!).
Nevertheless, some have already pondered what heuristics might be used in this competition, and in addition to the extensive body of work on linguistic features of humour, some scholars have already taken the opportunity to analyse the New Yorker’s corpus for insights specific to this data3. In 2015, The Verge built a caption generator for this contest by taking all of the previous finalists and running them through a Markov generator. The results, seen here, are far from perfect.
There is an even simpler place to begin. Previous winner Patrick House described his tips in an article in Slate. Among the simplest are:
Use common, simple, monosyllabic words. Steer clear of proper nouns…If you must use proper nouns, make them universally recognizable…Excepting first names, only nine proper nouns have ever appeared in a winning caption.
These features are easy to formulate and examine using Natural Language Processing (NLP), and they give us a sense of a good “rules-based” baseline to measure against more sophisticated algorithms.
After making friends with Git’s Large File Storage (LFS) system and downloading the data, I began some exploratory data analysis. To simplify the project, I decided to focus purely on the problem of NLP, and not to work on the image-based aspect of the contest. I combined all of the separate contests into one large Pandas dataframe, calculated relevant statistics (more on that in a minute), and saved the output to a CSV (since keeping the full list in memory during calculations caused noticeable lag in response time).
For each caption, I calculated the following features:
- Readability, using both the Flesch-Kincaid and Automated Readability Index (ARI) scores,
- Rank (among its peers for the same contest),
- Weighted sum of “funny” and “somewhat funny” votes,
And several linguistic features, using the TextBlob library, such as the number of:
- Auxiliary verbs (be, have, do, could…),
- Indefinite articles (a, an),
- Negation words (no, not),
- Proper nouns,
- Question words,
As well as a sentiment score, using TextBlob’s built-in sentiment analysis. Below are some examples of code and Jupyter notebook cells.
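As a taste of what these feature calculations look like, here is a minimal sketch of a per-caption feature extractor. The real pipeline used TextBlob’s part-of-speech tags and sentiment analysis; the word lists, the `caption_features` name, and the manual Automated Readability Index formula below are illustrative stand-ins.

```python
import re

# Illustrative word lists; a real extractor would use POS tags (e.g. TextBlob).
AUXILIARIES = {"be", "is", "are", "was", "were", "have", "has", "had", "do",
               "does", "did", "can", "could", "will", "would", "shall",
               "should", "may", "might", "must"}
NEGATIONS = {"no", "not", "never"}
QUESTION_WORDS = {"who", "what", "when", "where", "why", "how", "which"}

def caption_features(caption: str) -> dict:
    """Crude linguistic features for one caption, computed via word lists."""
    words = re.findall(r"[a-z']+", caption.lower())
    n_words = max(len(words), 1)
    n_letters = sum(len(w) for w in words)
    n_sentences = max(len(re.findall(r"[.!?]+", caption)), 1)
    # Automated Readability Index: 4.71*(chars/words) + 0.5*(words/sentences) - 21.43
    ari = 4.71 * n_letters / n_words + 0.5 * n_words / n_sentences - 21.43
    return {
        "n_words": len(words),
        "n_chars": len(caption),
        "auxiliaries": sum(w in AUXILIARIES for w in words),
        "indefinite_articles": sum(w in {"a", "an"} for w in words),
        "negations": sum(w in NEGATIONS for w in words),
        "question_words": sum(w in QUESTION_WORDS for w in words),
        "ari": round(ari, 2),
    }

feats = caption_features("What makes you think the Russians were involved?")
```

Running this over the whole dataframe (one dict per caption) gives the feature matrix used in the modelling stage below.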
Data Exploration, Approaches, and Challenges
The bulk of the captions were rated in one of three categories: unfunny, somewhat funny, or funny, with a minority rated using a 1-vs-1 “which is funnier?” question, referred to internally as “duelling bandits”4. I decided to focus on the three-category ratings, treating the task as a classification problem in which a caption is assigned to the class with the highest number of votes.
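The labelling rule at the end of this paragraph, assigning each caption to its plurality class, can be sketched as follows (the function name and the tie-breaking order are my own choices, which the post leaves open):

```python
def label_caption(unfunny: int, somewhat_funny: int, funny: int) -> str:
    """Assign a caption to the rating category with the most votes.
    Ties are broken in the order listed here -- an arbitrary choice."""
    votes = {"unfunny": unfunny, "somewhat_funny": somewhat_funny, "funny": funny}
    return max(votes, key=votes.get)
```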
Some of the basic operations of text pre-processing are removing common stop words and punctuation, stemming, and lemmatization5. (The latter two are ways of reducing inflected forms of words to a simpler common form: “walks” and “walking” to “walk”, for example.) I combined all of the captions into one text and looked for common words and bigrams (“trickle-down economics”, “doctor’s orders”, and “candid camera” were all among the top bigrams).
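A stripped-down version of the stop-word removal and bigram count might look like this. The real pipeline used TextBlob (and later spaCy); the tiny stop-word list here is illustrative, and this sketch skips stemming entirely.

```python
import re
from collections import Counter

# A tiny illustrative stop-word list; a real pipeline would use a full one.
STOP_WORDS = {"the", "a", "an", "of", "to", "in", "and", "is", "it",
              "that", "for", "on", "this", "my", "you", "i"}

def tokenize(caption: str) -> list:
    """Lowercase, keep letters and apostrophes, drop stop words."""
    words = re.findall(r"[a-z']+", caption.lower())
    return [w for w in words if w not in STOP_WORDS]

def top_bigrams(captions, n=10):
    """Count adjacent word pairs across all captions after stop-word removal."""
    counts = Counter()
    for caption in captions:
        tokens = tokenize(caption)
        counts.update(zip(tokens, tokens[1:]))
    return counts.most_common(n)

pairs = top_bigrams(["The doctor's orders", "Doctor's orders, stat!"], n=1)
```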
After separating the captions by image, I found examples of recurring jokes. One example will suffice:
A common trigram is “Mom’s apple pie”.
Also common are “loophole” and “in the oven” (as in a “bun in the oven”).
One way to visualise the recurring “themes” of each caption is via a wordcloud6.
Still, I wanted to use Machine Learning in a predictive capacity. Could I train an algorithm to identify a “funny” caption, using the linguistic features that I had labelled?
The first problem I had to deal with was the issue of imbalanced data. Put bluntly, it’s hard to be funny – and now we have data to back this up.
Picking a metric. In the above chart, the green line shows how the number of ‘unfunny’ votes was highest for almost every caption. This leads to what has been termed the accuracy paradox: it would be possible to get a high accuracy score with a classifier that judged every caption to be unfunny, since so many of the captions garnered this score. One way of dealing with this is to use alternative measures such as precision and recall7, which measure, respectively, the fraction of selected items that are relevant and the fraction of relevant items that were selected.
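To make the accuracy paradox concrete, here is a small sketch (with toy labels, not the real dataset) showing how accuracy flatters the degenerate “everything is unfunny” classifier while recall exposes it:

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy labels: mostly 'unfunny' (0), a few 'funny' (1) -- not the real data.
y_true        = [0, 0, 0, 0, 0, 0, 0, 1, 1, 1]
y_all_unfunny = [0] * 10                        # judges every caption unfunny
y_model       = [0, 0, 0, 1, 0, 0, 0, 1, 1, 0] # an imperfect but real attempt

acc_degenerate = accuracy_score(y_true, y_all_unfunny)                    # 0.7
recall_degenerate = recall_score(y_true, y_all_unfunny, zero_division=0)  # 0.0
recall_model = recall_score(y_true, y_model)                              # 2/3
```

Despite 70% accuracy, the degenerate classifier finds zero of the funny captions; the imperfect model finds two out of three.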
But in order to train an algorithm to learn the features of a good caption, it may not be very helpful to train it on so many bad ones8. Three possible responses to this are upsampling (reusing data from the minority class/es), downsampling (dropping data from the majority class/es), and weighting samples before training. Each method has its advantages and pitfalls, much discussed in the literature. I made a quick call to downsample for the sake of saving time, sacrificing some of the richness of the dataset. I ended up with 73,184 captions, half ‘good’ (an aggregate FunnyScore™ greater than 0.2) and half ‘bad’. I then randomly selected 20% of these as my test set.
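The downsampling step boils down to a few lines of pandas. This is a generic sketch, not the project’s exact code, and the `is_good` column name is hypothetical:

```python
import pandas as pd

def downsample(df: pd.DataFrame, label_col: str, seed: int = 0) -> pd.DataFrame:
    """Randomly drop rows from the larger class(es) so that every class
    ends up the size of the smallest one."""
    minority_size = df[label_col].value_counts().min()
    balanced = pd.concat(
        group.sample(n=minority_size, random_state=seed)
        for _, group in df.groupby(label_col)
    )
    # Shuffle so the classes aren't stored in contiguous blocks.
    return balanced.sample(frac=1, random_state=seed).reset_index(drop=True)

df = pd.DataFrame({"is_good": [0] * 10 + [1] * 4, "x": range(14)})
out = downsample(df, "is_good")
```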
I also decided to frame the task as a regression problem (using the Mean Squared Error metric: the mean of the squared deviations between the predictions and the actual answers). Here the target was the score I’d assigned each caption, aggregating the number of “funny” ratings plus 0.5 × the “somewhat funny” ratings, which let me use more of the data.
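The aggregation just described can be sketched as a one-liner. The post does not spell out the normalisation, so dividing by the total vote count is my assumption:

```python
def funny_score(funny: int, somewhat_funny: int, unfunny: int) -> float:
    """Aggregate per-caption votes into one regression target: full credit
    for 'funny', half credit for 'somewhat funny', averaged over all votes.
    (Normalising by the total vote count is an assumption.)"""
    total = funny + somewhat_funny + unfunny
    return (funny + 0.5 * somewhat_funny) / total if total else 0.0
```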
Baseline Model. At the beginning of a Data Science project, it’s always a good idea to get a baseline prediction: one which, while not requiring too much work, gives us a rough estimate of what we could achieve using a simple method. By measuring how much better the final score is than the baseline, the data scientist can get a sense of how much value her method is adding. I used linear regression (an essential part of every data scientist’s toolkit, which posits a linear relationship between the features and the target variable) as my baseline, and saved the FunnyScores™ for later comparison.
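In scikit-learn the baseline takes only a few lines. The real feature matrix isn’t reproduced in this post, so the sketch below fits a linear baseline on random stand-in data:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Stand-in data: one row of linguistic features per caption, y = its score.
rng = np.random.default_rng(0)
X = rng.random((200, 5))
y = X @ np.array([0.1, -0.05, 0.2, 0.0, 0.05]) + rng.normal(0, 0.02, 200)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

baseline = LinearRegression().fit(X_train, y_train)
baseline_mse = mean_squared_error(y_test, baseline.predict(X_test))
```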
Hyperparameter optimisation. I then proceeded to decision tree methods (a single decision tree, a Random Forest, and the currently popular XGBoost) and tried to find good parameters. When the parameter dimensionality isn’t prohibitively large, searching for good hyperparameters is basically a matter of taking an educated guess and trying lots of options around it, while using a consistent metric. Happily, scikit-learn includes a “grid search” hyperparameter optimisation feature, which lets the user plug in a general range of values and then runs the algorithm multiple times over a matrix of possible parameters, attempting to optimise a given success metric. After running the grid search on five parameters (see the code below), I was satisfied that I had “good enough” parameters, and was ready to compare the output of my models9.
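A grid search of this kind looks roughly like the sketch below. It is shown with scikit-learn’s `RandomForestRegressor` on random stand-in data (a smaller grid than the five-parameter one described above); the same `GridSearchCV` pattern applies to XGBoost’s scikit-learn wrapper.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.random((100, 5))   # stand-in features
y = rng.random(100)        # stand-in targets

param_grid = {
    "n_estimators": [50, 100],
    "max_depth": [3, 5, None],
}
search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    scoring="neg_mean_squared_error",  # one consistent metric for every candidate
    cv=3,
)
search.fit(X, y)
best = search.best_params_
```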
I found that the XGBoost algorithm got the best result: it was able to predict a caption’s score with a MSE of just .01266, beating out Linear Regression (.012712) and Random Forest (.013027). The most important features were:
- Proper nouns (confirming Patrick House’s claim)
- Length of caption (in characters)
The Random Forest algorithm, while ranking the features fairly similarly, also added average word length. Since the underlying standard deviation of the dataset was 0.49, these results are clearly much better than a string of random guesses. And while I did not have time to optimise the parameters more exhaustively, the fact that the performance of these different methods was so similar suggested that, for this problem, the more critical steps were coming up with the right features and formulating the problem properly.
Conclusions and Further Ideas
What this means in practice is that, using only our knowledge of the words of a caption, we are able to predict its aggregate score with a reasonably small error. We can also use this to test our own caption ideas, to get an idea of how they will be received by the community of users who rate New Yorker captions online. We can get a sense of what the decision-tree-based models consider important using the ‘feature_importances_’ attribute in scikit-learn. (I had also tried to use an explainer algorithm called LIME10, which attempts to demystify black-box algorithms. However, I quickly realised that trying to pinpoint the “funny” words in a sentence is a pointless task: the humour is an emergent property.)
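Extracting that ranking is a one-liner once a tree-based model is fitted. The sketch below uses synthetic data rigged so that one column dominates, and the feature names are hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Synthetic data where the last column drives the target, for illustration.
rng = np.random.default_rng(1)
X = rng.random((300, 3))
y = 0.05 * X[:, 0] + 0.9 * X[:, 2] + rng.normal(0, 0.01, 300)

model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X, y)
names = ["avg_word_len", "question_words", "proper_nouns"]  # hypothetical names
ranking = sorted(zip(names, model.feature_importances_), key=lambda t: -t[1])
```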
Owing to the intense time constraints of the final project, there were many other features I wasn’t able to implement. For example, I had hoped to add the presence (or count) of spelling errors as a feature. The principle behind a spellchecker is simple: Peter Norvig once wrote a basic Python implementation on a plane flight!11 The most naïve model would simply check each word for presence in a predefined vocabulary of correct English words. However, the many slang words, contractions and, worst of all, proper nouns in the dataset meant that checking for errors without a laboriously built custom vocabulary was beyond the scope of the project.
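The naïve vocabulary check just described fits in a few lines, and also shows exactly where it breaks (the toy vocabulary here is illustrative; a real one would come from a large word list):

```python
import re

def misspelling_count(caption: str, vocabulary: set) -> int:
    """Count words absent from a reference vocabulary -- the naive check
    described above. Slang and proper nouns show up as false positives."""
    words = re.findall(r"[a-z']+", caption.lower())
    return sum(w not in vocabulary for w in words)

# Toy vocabulary for illustration.
vocab = {"the", "dog", "ate", "my", "homework"}
```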
An idea I had more success with was pulling topics from news events. By adding a timestamp for each caption as a feature, I planned to use correlation with current events as an additional variable (perhaps a binary ‘refers to current event’ feature), since jokes about current events might increase the perceived funniness of a caption. There is no shortage of APIs for pulling news headline data for a historical time period; through Google News, for example, it can be done with a single command in the terminal.
However, in practice this proved more complicated than expected. News headline APIs required more processing in order to be distilled into major ‘named entities’, while I found alternative ‘news topics’ APIs to contain too little information to be valuable. I wasn’t able to find much correlation between caption contents and news topics, though I suspect that some more fine-grained NLP engineering would have brought connections to light. I did end up creating a plotting function to visualise the frequency of common named entities within the dataset over time – in the below example I plot the predictable spike in captions related to the 2016 US presidential election (such as ‘Trump’, ‘Hillary’, ‘fake news’, and ‘Russia’ – perhaps unsurprisingly, Trump features by far the most).
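The counting step behind that plot boils down to a pandas groupby over time. Here is a minimal sketch of it; the `caption` and `timestamp` column names are hypothetical:

```python
import pandas as pd

def term_frequency_over_time(df: pd.DataFrame, term: str) -> pd.Series:
    """Monthly count of captions mentioning `term`, case-insensitively.
    Assumes columns 'caption' (str) and 'timestamp' (datetime64)."""
    mask = df["caption"].str.contains(term, case=False, regex=False)
    return (df.loc[mask]
              .set_index("timestamp")
              .resample("MS")["caption"]   # "MS" = month-start bins
              .count())

df = pd.DataFrame({
    "caption": ["Trump again", "nothing here", "trump and Hillary", "cats"],
    "timestamp": pd.to_datetime(
        ["2016-10-02", "2016-10-05", "2016-11-01", "2016-11-03"]),
})
counts = term_frequency_over_time(df, "trump")
```

Plotting the resulting series for terms such as ‘Trump’ or ‘Russia’ reproduces the kind of spike described above.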
This project was both entertaining and educational. In data science, there is simply no substitute for getting your hands dirty working with an actual dataset. And while deploying my first complete machine learning project, I also gained some tips on how I might finally win the caption contest…
Acknowledgements: thanks to ITC Data Science lead Luis Voloch, who read over this post and provided helpful suggestions; and to Danielle Cohen and the whole ITC team for all their support throughout the program.
- Originally developed at the University of Wisconsin to test how certain genes expressed related diseases. The caption contest allowed the team to test-run the algorithm, according to cartoons editor Emma Allen.
- This was accessed on 3/10/2018 at 15:50. The captions displayed change on each pageload (in addition to the weekly change of contest), based on the lilUCB algorithm – see below.
- See Shahaf et al., Inside Jokes: Identifying Humorous Cartoon Captions. Also see this WIRED article on the work of a team from the University of Colorado.
- Captions are selected using the “multi-armed bandit” algorithm, in which all captions start with an equal probability of being selected, and those with higher scores are slowly displayed more often. The algorithm iteratively attempts to maximise return and minimise regret. For details of the specific algorithm, see “lil’ UCB: An Optimal Exploration Algorithm for Multi-Armed Bandits”; for a simple explanation and implementation of an MAB algorithm in Python, see here.
- Later on I also used spaCy, which streamlines a lot of these tasks right ‘out of the box’.
- I used Andreas Mueller’s implementation, found here and on PyPi.
- Or the F1 score, which combines both.
- In the wild, a lot of data may be like this. A workshop held during ITC really drove home the problem for me: in the ad-tech industry, for example, click-through rates are very low, so when measuring the effectiveness of different ad campaigns it is necessary to focus on the campaigns that were successful. Looking at the entire data set will result in the “signal” (useful ads) being swamped by the “noise”.
- This project was done under fairly intense time constraints, in gaps between our regular classes, and the point was mainly educational – hence my decision not to spend too much time on hyperparameter optimization.
- Or Local Interpretable Model-Agnostic Explanations, from Ribeiro et al.’s paper here. A simple implementation, lime, is available in PyPi.
- In this valuable post, he shares the code and some of the probability theory behind it.