The Atomic Instruction Gap: Instruction-Tuned LLMs Struggle with Simple, Self-Contained Directives

Henry Lim, Kwan Hui Lim

2025-10-22

Summary

This paper investigates how well large language models, specifically those trained to follow instructions, handle very simple, direct commands. It turns out they are not as reliable as we might expect: small changes in how those commands are presented can easily throw them off.

What's the problem?

Even though these language models are great at complex reasoning, they struggle with basic instruction-following. The researchers noticed that changing something as simple as whether options in a multiple-choice question are labeled with letters (A, B, C), numbers (1, 2, 3), or Roman numerals (I, II, III) significantly impacts their performance. This shows the models aren't truly *understanding* the instruction, but are instead picking up on superficial formatting cues. The core issue is a lack of robustness in following even the most basic directions.
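To make the manipulation concrete, here is a minimal sketch (the question, options, and helper names are illustrative, not from the paper) of how one multiple-choice item looks under the three label schemes. The content is identical in every variant; only the labels differ:

```python
# Render the same MMLU-style question under three label schemes.
# Only the labels change; question and options stay identical.

ROMAN = ["I", "II", "III", "IV", "V", "VI", "VII", "VIII", "IX", "X"]

def label(i, scheme):
    """Return the label for option index i under the given scheme."""
    if scheme == "alpha":
        return chr(ord("A") + i)      # A, B, C, ...
    if scheme == "numeric":
        return str(i + 1)             # 1, 2, 3, ...
    if scheme == "roman":
        return ROMAN[i]               # I, II, III, ...
    raise ValueError(f"unknown scheme: {scheme}")

def render(question, options, scheme):
    """Format a question with its options labeled under one scheme."""
    lines = [question]
    lines += [f"{label(i, scheme)}. {opt}" for i, opt in enumerate(options)]
    return "\n".join(lines)

q = "What is the capital of France?"
opts = ["Berlin", "Madrid", "Paris", "Rome"]
for scheme in ("alpha", "numeric", "roman"):
    print(render(q, opts, scheme))
    print()
```

A model that genuinely follows the instruction should answer these three prompts identically; the paper's finding is that accuracy instead shifts substantially across schemes.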

What's the solution?

The researchers tested 20 instruction-tuned language models on modified versions of standard benchmarks (MMLU and MMLU-Pro). They systematically changed how answer choices were labeled, switching between letters, numbers, and Roman numerals, while keeping the questions and answers identical. They also tested the models with and without explicit instructions, and even removed the answer contents entirely, leaving only the labels, to see whether the models could at least pick a label at random as directed; except with numeric labels, most could not even match that random-choice baseline. Finally, they gave the models three worked examples (few-shot prompting), but this did not meaningfully improve robustness. They then analyzed the models' generated outputs to pinpoint where the label errors occurred.
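The evaluation conditions described above can be sketched as prompt variants built from one question. This is a minimal illustration under my own assumptions; the exact instruction wording and function names are hypothetical, not taken from the paper:

```python
# Build prompt variants for one question: with/without an explicit
# instruction, and with option contents kept or stripped (labels only).

def make_prompt(question, labeled_options,
                with_instruction=True, drop_contents=False):
    """labeled_options is a list of (label, text) pairs."""
    labels = [lab for lab, _ in labeled_options]
    if drop_contents:
        # Contents removed: only the bare labels remain as choices.
        body = "\n".join(labels)
    else:
        body = "\n".join(f"{lab}. {txt}" for lab, txt in labeled_options)
    parts = []
    if with_instruction:
        parts.append(f"Answer with exactly one of: {', '.join(labels)}.")
    parts += [question, body]
    return "\n".join(parts)

opts = [("A", "Berlin"), ("B", "Madrid"), ("C", "Paris"), ("D", "Rome")]
q = "What is the capital of France?"

print(make_prompt(q, opts))                                  # instructed
print(make_prompt(q, opts, with_instruction=False))          # uninstructed
print(make_prompt(q, opts, drop_contents=True))              # labels only
```

In the labels-only condition, any model that follows the instruction should still emit a valid label, so accuracy against the random-choice baseline isolates pure instruction adherence from content understanding.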

Why it matters?

This research is important because it reveals a fundamental weakness in how we're currently training these powerful language models. If they can't reliably follow simple instructions, it casts doubt on their ability to consistently handle more complex tasks. It highlights the need for new training methods and better ways to evaluate these models, specifically focusing on their ability to understand and execute atomic, or very basic, instructions.

Abstract

Instruction-tuned large language models (IT-LLMs) exhibit strong zero-shot reasoning, yet their ability to execute simple, self-contained instructions remains underexplored, despite this being foundational to complex instruction-following. We evaluate 20 IT-LLMs on modified MMLU and MMLU-Pro benchmarks, by systematically varying the format of option labels (alphabetic, numeric, Roman) while keeping their meaning identical, under four paradigms: (1) With explicit instructions, label changes cause large performance shifts (e.g., -30.45% for Roman vs. numeric), revealing instruction-format bias. (2) Without instructions, performance drops further (up to -10.84%) and label sensitivity intensifies, underscoring the role of explicit guidance. (3) When option contents are removed, models fail random-choice baselines except with numeric labels, suggesting weak adherence to atomic directives. (4) Three-shot exemplars yield no significant gains in robustness or fidelity, and generation analyses show persistent label errors, especially for non-numeric formats. Across model sizes, larger LLMs achieve higher accuracy but remain inconsistent in instruction adherence. These results expose the insufficiencies of current instruction-tuning paradigms and highlight the need for evaluation methods and training strategies that explicitly target atomic instruction-following.