Position: The Complexity of Perfect AI Alignment -- Formalizing the RLHF Trilemma
Subramanyam Sahoo, Aman Chadha, Vinija Jain, Divya Chaudhary
2025-11-27
Summary
This paper explores a core challenge in building AI systems, specifically large language models, that are helpful and harmless. It focuses on 'Reinforcement Learning from Human Feedback' (RLHF), a common technique used to train these models, and reveals a fundamental limitation in trying to make them simultaneously safe, fair, and reliable.
What's the problem?
The problem is that training AI with human feedback involves a three-way trade-off: making the AI represent the values of *everyone* (representativeness), keeping the training process manageable in time and resources (tractability), and making the AI behave consistently even when faced with tricky or unusual inputs (robustness). The paper argues that no system can fully achieve all three at once, which is why the authors call it a 'trilemma'. In particular, truly representing a diverse global population's values while also being robust would require an impossibly large amount of data and computing power.
What's the solution?
The researchers use a mathematical approach, combining tools from statistical learning theory and robust optimization, to *prove* this limitation. They show that achieving both good representation of diverse values and strong robustness requires computational effort that grows exponentially with the dimensionality of the contexts the AI might encounter. They also point out that current RLHF methods sidestep the trilemma by collecting feedback only from small, homogeneous annotator pools, which means the resulting models do not truly represent everyone's views.
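To get a feel for what the paper's claimed lower bound implies, here is a small illustrative sketch (not code from the paper; the function names and the polynomial degree are assumptions for illustration). It contrasts the abstract's Omega(2^{d_context}) operation count against a generous polynomial compute budget as the context dimensionality grows:

```python
# Illustrative sketch only: compares the paper's claimed exponential lower
# bound on required operations against a polynomial compute budget.
# `required_ops` and `polynomial_budget` are hypothetical names, and the
# cubic budget is an arbitrary choice for comparison.

def required_ops(d_context: int) -> int:
    """Operation count at the claimed Omega(2^{d_context}) lower bound."""
    return 2 ** d_context

def polynomial_budget(d_context: int, degree: int = 3) -> int:
    """A polynomial compute budget, d_context**degree, for comparison."""
    return d_context ** degree

# The exponential requirement rapidly dwarfs the polynomial budget.
for d in (10, 20, 40, 80):
    print(f"d_context={d}: need {required_ops(d):.2e}, "
          f"budget {polynomial_budget(d):.2e}")
```

With each added context dimension the required effort doubles, so for any fixed polynomial budget there is some dimensionality beyond which the requirement can no longer be met, which is the shape of the intractability argument.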
Why does it matter?
This work is important because it explains why current AI systems often exhibit problems like bias, a tendency to agree with users even when wrong ('sycophancy'), and a collapse of diverse preferences into a few common ones. It provides a theoretical understanding of these issues, moving beyond just observing them. By understanding these fundamental limits, researchers can focus on smarter ways to balance these competing goals and build more reliable and equitable AI systems, even if perfect alignment is impossible.
Abstract
Reinforcement Learning from Human Feedback (RLHF) is widely used for aligning large language models, yet practitioners face a persistent puzzle: improving safety often reduces fairness, scaling to diverse populations becomes computationally intractable, and making systems robust often amplifies majority biases. We formalize this tension as the Alignment Trilemma: no RLHF system can simultaneously achieve (i) epsilon-representativeness across diverse human values, (ii) polynomial tractability in sample and compute complexity, and (iii) delta-robustness against adversarial perturbations and distribution shift. Through a complexity-theoretic analysis integrating statistical learning theory and robust optimization, we prove that achieving both representativeness (epsilon <= 0.01) and robustness (delta <= 0.001) for global-scale populations requires Omega(2^{d_context}) operations, which is super-polynomial in the context dimensionality. We show that current RLHF implementations resolve this trilemma by sacrificing representativeness: they collect only 10^3--10^4 samples from homogeneous annotator pools while 10^7--10^8 samples are needed for true global representation. Our framework provides a unified explanation for documented RLHF pathologies including preference collapse, sycophancy, and systematic bias amplification. We conclude with concrete directions for navigating these fundamental trade-offs through strategic relaxations of alignment requirements.
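As a rough sanity check on the sample-complexity gap the abstract cites, the arithmetic below uses only the figures stated there (10^3-10^4 samples collected vs. 10^7-10^8 needed); the variable names are illustrative:

```python
# Illustrative arithmetic using the abstract's figures: how far short of
# "global representation" do current annotation budgets fall?

import math

collected = (10**3, 10**4)   # samples current RLHF pipelines collect
needed = (10**7, 10**8)      # samples the paper claims are required

# Shortfall in orders of magnitude, best case and worst case.
best_gap = math.log10(needed[0] / collected[1])
worst_gap = math.log10(needed[1] / collected[0])
print(f"shortfall: {best_gap:.0f} to {worst_gap:.0f} orders of magnitude")
```

Even in the most favorable pairing, the collected data falls three orders of magnitude short, which is the quantitative basis for the claim that current implementations resolve the trilemma by sacrificing representativeness.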