Unraveling the Capabilities of Language Models in News Summarization
Abdurrahman Odabaşı, Göksel Biricik
2025-02-03

Summary
This paper benchmarks different AI language models to see how well they can summarize news articles. The researchers evaluated 20 models, focusing especially on smaller ones, to find out which could produce good summaries without needing many training examples.
What's the problem?
As new AI language models are being created, it's important to know how well they can handle tasks like summarizing news articles. The problem is that we don't have a clear picture of how these different models perform, especially the smaller ones that might be more practical for everyday use. Also, the researchers wanted to see if giving the models a few examples (few-shot learning) would help them do better than just asking them to summarize without any examples (zero-shot learning).
What's the solution?
The researchers tested 20 different AI models on three different sets of news articles. They used several ways to judge how good the summaries were, including automatic scoring, human evaluation, and even using another AI to judge the quality. They tried both zero-shot and few-shot approaches to see which worked better. Surprisingly, they found that giving examples didn't always help and sometimes made the summaries worse. They traced this mainly to the reference ("gold") summaries used as examples, which often weren't very good themselves.
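To make the zero-shot versus few-shot comparison concrete, here is a minimal Python sketch of the two prompting settings. The prompt wording and the demonstration format are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of the two prompting settings compared in the study.
# The prompt wording and demonstration format are assumptions for
# illustration; the paper's exact prompts are not reproduced here.

def zero_shot_prompt(article: str) -> str:
    """Ask a model to summarize with no demonstration examples."""
    return (
        "Summarize the following news article in a few sentences.\n\n"
        f"Article:\n{article}\n\nSummary:"
    )

def few_shot_prompt(article: str, demos: list[tuple[str, str]]) -> str:
    """Prepend (article, reference summary) pairs before the target article.

    The study found that when these reference ("gold") summaries are of poor
    quality, adding them can make the generated summaries worse, not better.
    """
    parts = ["Summarize the last news article in a few sentences, following the examples."]
    for demo_article, demo_summary in demos:
        parts.append(f"Article:\n{demo_article}\n\nSummary:\n{demo_summary}")
    parts.append(f"Article:\n{article}\n\nSummary:")
    return "\n\n".join(parts)
```

In the zero-shot setting the model sees only the instruction and the article; in the few-shot setting it also sees a handful of worked examples drawn from the dataset's reference summaries.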
Why it matters?
This research matters because it helps us understand which AI models are best at summarizing news, a task that could save people a lot of time in their daily lives. It shows that some smaller models can do almost as well as the big, famous ones like GPT-3.5 and GPT-4. This is important because smaller models are often easier and cheaper to use. The study also reveals that the way we prompt these models (with or without examples) can have unexpected effects, which is crucial information for improving AI technology in the future. Overall, this research helps guide the development of better, more efficient AI tools for handling and summarizing information.
Abstract
Given the recent introduction of multiple language models and the ongoing demand for improved Natural Language Processing tasks, particularly summarization, this work provides a comprehensive benchmarking of 20 recent language models, focusing on smaller ones, for the news summarization task. We systematically test the capabilities and effectiveness of these models in summarizing news article texts written in different styles and presented in three distinct datasets. Specifically, in this study we focus on zero-shot and few-shot learning settings, and we apply a robust evaluation methodology that combines different evaluation concepts, including automatic metrics, human evaluation, and LLM-as-a-judge. Interestingly, including demonstration examples in the few-shot learning setting did not enhance the models' performance and, in some cases, even led to worse quality of the generated summaries. This issue arises mainly from the poor quality of the gold summaries used as references, which negatively impacts the models' performance. Furthermore, our study's results highlight the exceptional performance of GPT-3.5-Turbo and GPT-4, which generally dominate due to their advanced capabilities. However, among the public models evaluated, certain models such as Qwen1.5-7B, SOLAR-10.7B-Instruct-v1.0, Meta-Llama-3-8B, and Zephyr-7B-Beta demonstrated promising results. These models showed significant potential, positioning them as competitive alternatives to large models for the task of news summarization.
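As a rough illustration of the automatic-metric part of such an evaluation, the sketch below scores a generated summary against a reference summary with ROUGE via the rouge-score package. The choice of ROUGE variants and settings is an assumption for illustration; the abstract only states that automatic metrics were combined with human evaluation and LLM-as-a-judge.

```python
# Illustrative only: automatic scoring of a generated summary against a
# reference summary with ROUGE (rouge-score package). The specific metrics
# and configuration are assumptions, not the paper's reported setup.
from rouge_score import rouge_scorer

_scorer = rouge_scorer.RougeScorer(["rouge1", "rouge2", "rougeL"], use_stemmer=True)

def rouge_f1(reference: str, generated: str) -> dict[str, float]:
    """Return ROUGE-1/2/L F1 scores of `generated` against `reference`."""
    return {name: s.fmeasure for name, s in _scorer.score(reference, generated).items()}

# Example usage:
# rouge_f1("The council approved the budget on Tuesday.",
#          "On Tuesday, the council passed the new budget.")
```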