
ASPO: Asymmetric Importance Sampling Policy Optimization

Jiakang Wang, Runze Liu, Lei Lin, Wenping Hu, Xiu Li, Fuzheng Zhang, Guorui Zhou, Kun Gai

2025-10-08


Summary

This paper focuses on a problem with how large language models are improved after their initial training, specifically when reinforcement learning is used to fine-tune them based on the outcomes of their answers. It introduces a new method, ASPO, to make this improvement process more effective.

What's the problem?

When fine-tuning large language models with reinforcement learning, a common practice is to adjust the model based on how much each individual 'token' (think of a word or part of a word) contributes to a good outcome. However, the paper identifies that the way these tokens are weighted is skewed. Tokens that already have a high probability get *too* much emphasis, while low-probability but potentially valuable tokens are suppressed. This imbalance hinders the model's learning and can cause training to converge prematurely.
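The imbalance can be illustrated with a tiny numeric sketch of standard PPO-style token weighting. The probabilities and the uniform positive advantage below are hypothetical values chosen for illustration, not from the paper:

```python
import numpy as np

# Hypothetical token probabilities under the old and current policies.
old_probs = np.array([0.70, 0.05])   # token A already likely, token B rare
new_probs = np.array([0.80, 0.04])

# PPO-style importance-sampling ratios: pi_new / pi_old.
ratios = new_probs / old_probs

# With outcome supervision, every token in a good answer shares the
# same positive advantage.
advantage = 1.0

# Token-level weights applied to the policy-gradient update.
weights = ratios * advantage
print(weights)  # the already-likely token receives the larger update weight
```

Here the high-probability token gets a ratio above 1 while the rare token's ratio falls below 1, so the update amplifies the former and dampens the latter, even though both received the same positive reward.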

What's the solution?

The researchers propose a new technique called Asymmetric Importance Sampling Policy Optimization, or ASPO. Essentially, ASPO flips the weighting for tokens that are already likely to be correct, so they don't dominate the learning process. It also adds a safety mechanism to prevent the model from making overly drastic changes during training, ensuring a more stable learning process. This allows the model to better consider and learn from less common, but potentially important, tokens.
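The flipping idea can be sketched as follows. This is a minimal illustrative implementation, not the paper's exact formulation: the function name, the `clip` threshold, and the `tanh`-based softening are assumptions standing in for ASPO's actual soft dual-clipping mechanism.

```python
import numpy as np

def aspo_token_weights(new_probs, old_probs, advantages, clip=2.0):
    """Sketch of ASPO-style weighting: flip the IS ratio for
    positive-advantage tokens, then soft-clip extreme values.
    `clip` and the tanh softening are illustrative choices."""
    ratios = new_probs / old_probs
    # Flip (take the reciprocal of) the ratio where the advantage is
    # positive, so low-probability tokens get the larger update weight.
    flipped = np.where(advantages > 0, 1.0 / ratios, ratios)
    # Soften extreme ratios toward `clip` instead of hard truncation,
    # so some gradient still flows through clipped tokens.
    softened = np.where(flipped > clip,
                        clip + np.tanh(flipped - clip),
                        flipped)
    return softened * advantages

# With the same hypothetical probabilities as before, the rare token
# now receives the larger update weight.
w = aspo_token_weights(np.array([0.80, 0.04]),
                       np.array([0.70, 0.05]),
                       np.array([1.0, 1.0]))
print(w)
```

The key design point is the asymmetry: negative-advantage tokens keep the ordinary ratio, while positive-advantage tokens use its reciprocal, aligning the two update directions as the abstract describes.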

Why it matters?

This work is important because it reveals a fundamental flaw in current methods for improving large language models. By correcting the way tokens are weighted during Reinforcement Learning, ASPO leads to more stable training, prevents the model from getting stuck, and ultimately improves its performance on challenging tasks like coding and mathematical reasoning. It provides a better understanding of how to effectively fine-tune these powerful models.

Abstract

Recent Large Language Model (LLM) post-training methods rely on token-level clipping mechanisms during Reinforcement Learning (RL). However, we identify a fundamental flaw in this Outcome-Supervised RL (OSRL) paradigm: the Importance Sampling (IS) ratios of positive-advantage tokens are mismatched, leading to unbalanced token weighting for positive and negative tokens. This mismatch suppresses the update of low-probability tokens while over-amplifying already high-probability ones. To address this, we propose Asymmetric Importance Sampling Policy Optimization (ASPO), which uses a simple yet effective strategy that flips the IS ratios of positive-advantage tokens, aligning their update direction with the learning dynamics of negative ones. ASPO further incorporates a soft dual-clipping mechanism to stabilize extreme updates while maintaining gradient flow. Comprehensive experiments on coding and mathematical reasoning benchmarks demonstrate that ASPO significantly mitigates premature convergence, improves training stability, and enhances final performance over strong GRPO-based baselines. Our analysis provides new insights into the role of token-level weighting in OSRL and highlights the critical importance of correcting IS in LLM RL. The code and models of ASPO are available at https://github.com/wizard-III/Archer2.0.