Many-Tier Instruction Hierarchy in LLM Agents

Jingyu Zhang, Tianjian Li, William Jurayj, Hongyuan Zhan, Benjamin Van Durme, Daniel Khashabi

2026-04-15

Many-Tier Instruction Hierarchy in LLM Agents

Summary

This paper investigates how well large language models (LLMs) can handle conflicting instructions when acting as agents, like virtual assistants or automated problem-solvers.

What's the problem?

Currently, LLMs are given instructions from various places – the initial setup, the user, and even tools they use themselves. When these instructions clash, the model needs to know which one to prioritize to stay safe and work correctly. Existing methods assume a simple hierarchy with only a few levels of importance, like 'system instructions are always most important' and 'user requests come next'. However, real-world situations are much more complex, with many different sources of instructions needing to be ranked, and a simple system doesn't cut it.

What's the solution?

The researchers propose a new approach called 'ManyIH' which stands for Many-Tier Instruction Hierarchy. This allows for a much more detailed and flexible system of prioritizing instructions, potentially with dozens of different levels. To test this, they created a challenging benchmark called 'ManyIH-Bench' consisting of over 850 tasks that require models to sort through up to 12 levels of conflicting instructions, covering both coding and general instruction-following scenarios. These tasks were designed to mimic real-world agent applications.

Why it matters?

The experiments showed that even the most advanced LLMs struggle with this complex instruction prioritization, achieving only around 40% accuracy. This highlights a critical weakness in current AI agents and emphasizes the urgent need for new techniques that can reliably resolve conflicts between instructions in a scalable and nuanced way, making these agents more dependable and safe to use.

Abstract

Large language model agents receive instructions from many sources-system messages, user prompts, tool outputs, and more-each carrying different levels of trust and authority. When these instructions conflict, models must reliably follow the highest-privilege instruction to remain safe and effective. The dominant paradigm, instruction hierarchy (IH), assumes a fixed, small set of privilege levels (typically fewer than five) defined by rigid role labels (e.g., system > user). This is inadequate for real-world agentic settings, where conflicts can arise across far more sources and contexts. In this work, we propose Many-Tier Instruction Hierarchy (ManyIH), a paradigm for resolving instruction conflicts among instructions with arbitrarily many privilege levels. We introduce ManyIH-Bench, the first benchmark for ManyIH. ManyIH-Bench requires models to navigate up to 12 levels of conflicting instructions with varying privileges, comprising 853 agentic tasks (427 coding and 426 instruction-following). ManyIH-Bench composes constraints developed by LLMs and verified by humans to create realistic and difficult test cases spanning 46 real-world agents. Our experiments show that even the current frontier models perform poorly (~40% accuracy) when instruction conflict scales. This work underscores the urgent need for methods that explicitly target fine-grained, scalable instruction conflict resolution in agentic settings.

View Paper