Brevity Constraints Reverse Performance Hierarchies in Language Models

MD Azizul Hakim

2026-04-02

Summary

This research explores a surprising issue with large language models: sometimes, bigger isn't better. The paper shows that on certain tasks, larger models actually perform *worse* than smaller ones, and explains why this happens and how to fix it.

What's the problem?

The researchers found that when testing language models on a variety of problems, larger models (those with more parameters) unexpectedly underperformed smaller models on about 8% of benchmark problems. The gap was substantial: on those problems, the larger models were about 28 percentage points less accurate, despite having 10-100x more parameters. The core issue wasn't that the large models lacked the capability to solve these problems, but that something in how they responded made them worse at these specific tasks.

What's the solution?

The researchers discovered that large language models tend to be overly verbose: they spontaneously write much longer responses than necessary, and this overelaboration introduces errors that lower accuracy. The fix was remarkably simple: tell the large models to be brief. Constraining response length improved accuracy by 26 percentage points overall and even reversed the performance gap on challenging tasks like mathematical reasoning and scientific knowledge. In those areas, the large models, once constrained to be concise, actually outperformed the smaller models.
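As a rough illustration (this is a hypothetical sketch, not code from the paper), a brevity constraint can be applied by prepending an explicit length instruction to the question before it is sent to a model; the helper name and word limit below are assumptions for illustration only:

```python
def with_brevity_constraint(question: str, max_words: int = 50) -> str:
    """Wrap a question with an explicit brevity instruction.

    Hypothetical helper illustrating scale-aware prompt adaptation:
    a large model is asked the same question, but with a cap on
    response length to suppress error-prone overelaboration.
    """
    instruction = (
        f"Answer in at most {max_words} words. "
        "Give only the final answer with minimal working."
    )
    return f"{instruction}\n\n{question}"

# The constrained prompt replaces the bare question when querying the model.
prompt = with_brevity_constraint("What is the boiling point of water at sea level?")
print(prompt)
```

The same question text is preserved; only the framing changes, which is why the paper attributes the original gap to correctable prompt design rather than to the models' underlying capabilities.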

Why it matters?

This work is important because it shows that simply making language models bigger doesn't automatically make them better. It highlights the need for 'scale-aware prompt engineering,' meaning we need to adjust *how* we ask these models questions based on their size. This has practical implications: by prompting models for shorter answers, we can improve their accuracy and also reduce the computational resources needed to run them, making them more efficient and cost-effective.

Abstract

Standard evaluation protocols reveal a counterintuitive phenomenon: on 7.7% of benchmark problems spanning five datasets, larger language models underperform smaller ones by 28.4 percentage points despite 10-100x more parameters. Through systematic evaluation of 31 models (0.5B-405B parameters) across 1,485 problems, we identify the mechanism as spontaneous scale-dependent verbosity that introduces errors through overelaboration. Causal intervention experiments demonstrate this reflects correctable prompt design rather than fundamental capability limitations. Constraining large models to produce brief responses improves accuracy by 26 percentage points and reduces performance gaps by up to two-thirds. Most critically, brevity constraints completely reverse performance hierarchies on mathematical reasoning and scientific knowledge benchmarks, with large models achieving 7.7-15.9 percentage point advantages over small models -- direct inversions of the original gaps. These reversals prove large models possess superior latent capabilities that universal prompting masks. We validate findings through three independent contamination tests and demonstrate inverse scaling operates continuously across the full parameter spectrum, with dataset-specific optimal scales ranging from 0.5B to 3.0B parameters. Our results establish that maximizing large model performance requires scale-aware prompt engineering rather than universal evaluation protocols, with immediate implications for deployment: prompt adaptation simultaneously improves accuracy and reduces computational costs.