SR-GRPO: Stable Rank as an Intrinsic Geometric Reward for Large Language Model Alignment

Yixuan Tang, Yi Yang

2025-12-04

Summary

This paper introduces a new way to improve large language models (LLMs) without relying heavily on humans to tell them what's good or bad. It focuses on finding a signal *within* the model itself that indicates how well it's performing.

What's the problem?

Currently, aligning LLMs with what humans want is difficult. Collecting human ratings of LLM responses is expensive, and those ratings vary depending on who you ask. When you instead train a reward model to stand in for human judgment, the LLM can 'game the system', finding ways to earn high rewards without actually improving. Finally, having the model judge its own work isn't reliable either, because its self-evaluations shift depending on how the question is phrased.

What's the solution?

The researchers propose a measure called 'stable rank'. It captures how spread out information is across the model's internal representations: a good response distributes information across many representation dimensions, while a poor response concentrates it in just a few. They then used stable rank as a reward signal to further train the model with a technique called Stable Rank Group Relative Policy Optimization (SR-GRPO). This lets the model improve without any external human feedback.
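To make the idea concrete, here is a minimal sketch of a stable-rank computation, assuming the standard definition the abstract describes: the ratio of total variance to dominant-direction variance of a hidden-state matrix, i.e. the squared Frobenius norm over the squared top singular value. The function name and the toy matrices are illustrative, not from the paper.

```python
import numpy as np

def stable_rank(hidden_states: np.ndarray) -> float:
    """Stable rank of a hidden-state matrix H (tokens x dim):
    ||H||_F^2 / ||H||_2^2, the ratio of total variance to the
    variance along the single dominant direction. Higher values
    mean information is spread across more dimensions."""
    s = np.linalg.svd(hidden_states, compute_uv=False)
    total_variance = np.sum(s ** 2)      # ||H||_F^2
    dominant_variance = s[0] ** 2        # ||H||_2^2
    return float(total_variance / dominant_variance)

# A rank-1 matrix (every row points the same way) has stable rank ~1.
rank_one = np.outer(np.ones(8), np.random.randn(16))
print(stable_rank(rank_one))   # ~1.0

# An orthogonal-row matrix spreads variance evenly: stable rank = 8.
spread = np.eye(8, 16)
print(stable_rank(spread))     # 8.0
```

The measure is bounded between 1 and the matrix rank, which is why it serves as a natural "how distributed is this representation" score.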

Why it matters?

This research is important because it shows we can potentially build better LLMs that are more aligned with human preferences without constantly needing human input. This makes the process of improving these models more scalable and less reliant on subjective opinions, paving the way for more reliable and trustworthy AI systems.

Abstract

Aligning Large Language Models (LLMs) with human preferences typically relies on external supervision, which faces critical limitations: human annotations are scarce and subjective, reward models are vulnerable to reward hacking, and self-evaluation methods suffer from prompt sensitivity and biases. In this work, we propose stable rank, an intrinsic, annotation-free quality signal derived from model representations. Stable rank measures the effective dimensionality of hidden states by computing the ratio of total variance to dominant-direction variance, capturing quality through how information distributes across representation dimensions. Empirically, stable rank achieves 84.04% accuracy on RewardBench and improves task accuracy by an average of 11.3 percentage points over greedy decoding via Best-of-N sampling. Leveraging this insight, we introduce Stable Rank Group Relative Policy Optimization (SR-GRPO), which uses stable rank as a reward signal for reinforcement learning. Without external supervision, SR-GRPO improves Qwen2.5-1.5B-Instruct by 10% on STEM and 19% on mathematical reasoning, outperforming both learned reward models and self-evaluation baselines. Our findings demonstrate that quality signals can be extracted from internal model geometry, offering a path toward scalable alignment without external supervision.
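The Best-of-N result mentioned in the abstract can be sketched as follows: generate N candidate responses, then pick the one whose hidden-state matrix has the highest stable rank. The helper names and the toy "hidden states" below are hypothetical; in practice the matrices would come from the model's actual layer activations for each candidate.

```python
import numpy as np

def stable_rank(H: np.ndarray) -> float:
    """||H||_F^2 / ||H||_2^2 for a hidden-state matrix H."""
    s = np.linalg.svd(H, compute_uv=False)
    return float(np.sum(s ** 2) / s[0] ** 2)

def best_of_n(candidates):
    """Select the (text, hidden_states) pair with the highest
    stable rank -- an annotation-free Best-of-N scorer."""
    return max(candidates, key=lambda c: stable_rank(c[1]))

# Toy example: a distributed representation beats a near-rank-1 one.
rng = np.random.default_rng(0)
concentrated = np.outer(np.ones(10), rng.standard_normal(32))
distributed = rng.standard_normal((10, 32))
text, _ = best_of_n([("answer A", concentrated),
                     ("answer B", distributed)])
print(text)  # "answer B"
```

The same score can serve as the per-sample reward inside GRPO: advantages are computed from the stable ranks of a group of sampled responses instead of from a learned reward model, which is the core of SR-GRPO.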