Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak
2024-06-18

Summary
This paper presents a new dataset and evaluation methods for understanding how well AI can generate funny captions for cartoons. The dataset includes over 250 million human ratings of more than 2.2 million captions submitted to The New Yorker's weekly cartoon caption contest.
What's the problem?
While AI has made progress in generating text, it still struggles to produce humor that resonates with people. Existing methods for fine-tuning models to be funny are not reliably effective, and even strong models fall short of human contestants at writing humorous captions. Without large-scale humor benchmarks, it is also hard to tell which models are genuinely good at humor or how they can be improved.
What's the solution?
To address this, the authors collected a large-scale dataset of crowd-sourced ratings of cartoon captions from a wide audience. They used this data to benchmark a range of AI models, including state-of-the-art ones like GPT-4 and Claude, on how well they generate funny captions compared to human submissions, and they tested preference-based fine-tuning methods such as RLHF and DPO. The study also proposes ranking-based evaluations of caption quality that combine human judgments with GPT-4 assessments, helping to identify strengths and weaknesses in the models' humor generation capabilities.
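As a rough illustration (not the authors' actual evaluation code), a ranking of this kind can be computed from pairwise preference judgments, whether those come from crowd raters or from an LLM judge. The caption names and judgments below are invented for the example.

```python
# Illustrative sketch: rank captions by empirical win rate over pairwise
# preference judgments (e.g., from human raters or a GPT-4 judge).
# All data here is made up for the example.
from collections import defaultdict

# Hypothetical pairwise judgments: (winner, loser)
judgments = [
    ("human_top_caption", "gpt4_caption"),
    ("human_top_caption", "claude_caption"),
    ("gpt4_caption", "claude_caption"),
    ("gpt4_caption", "claude_caption"),
    ("claude_caption", "gpt4_caption"),
]

wins = defaultdict(int)
games = defaultdict(int)
for winner, loser in judgments:
    wins[winner] += 1
    games[winner] += 1
    games[loser] += 1

# Rank captions by fraction of pairwise comparisons they win.
ranking = sorted(games, key=lambda c: wins[c] / games[c], reverse=True)
for caption in ranking:
    print(f"{caption}: win rate {wins[caption] / games[caption]:.2f}")
```

In practice one would aggregate many more judgments per caption (and could fit a Bradley-Terry-style model rather than raw win rates), but the principle is the same: model-generated captions are ranked against human submissions using preference comparisons.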
Why it matters?
This research is significant because it provides valuable insights into how AI can better understand and generate humor. By creating a comprehensive dataset and evaluation framework, it supports the development of AI systems that can produce funny content, which could enhance applications such as chatbots, social media, and entertainment. Ultimately, this work aims to bridge the gap between human creativity and machine learning in the realm of humor.
Abstract
We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT-4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT-4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.
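For readers unfamiliar with the preference-based fine-tuning methods named in the abstract, below is a minimal sketch of the DPO objective applied to caption preference pairs. It assumes sequence log-probabilities have already been computed under the policy and a frozen reference model; the tensor values, the beta setting, and the `dpo_loss` helper are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of the DPO loss on caption preference pairs (illustrative;
# not the paper's code). Inputs are summed sequence log-probabilities for the
# caption humans preferred ("chosen") and the one they did not ("rejected").
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of caption pairs."""
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Push the policy to assign higher relative likelihood to the funnier caption.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy example with made-up log-probabilities for a batch of 3 preference pairs.
policy_chosen = torch.tensor([-12.0, -9.5, -14.2])
policy_rejected = torch.tensor([-11.0, -10.0, -13.0])
ref_chosen = torch.tensor([-12.5, -9.8, -14.0])
ref_rejected = torch.tensor([-11.2, -9.9, -13.1])
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```

The paper's finding is that objectives of this kind, while effective for many alignment tasks, do not straightforwardly translate into funnier captions on this dataset.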