Astrolabe: Steering Forward-Process Reinforcement Learning for Distilled Autoregressive Video Models

Songchun Zhang, Zeyue Xue, Siming Fu, Jie Huang, Xianghao Kong, Y Ma, Haoyang Huang, Nan Duan, Anyi Rao

2026-03-23

Summary

This paper introduces a new method, called Astrolabe, for improving the quality of videos created by artificial intelligence. Specifically, it focuses on making these AI-generated videos more appealing to human viewers.

What's the problem?

AI models that generate videos efficiently sometimes create content that doesn't quite look right to people – motion or composition may be off, or the result is simply unappealing. Existing methods to fix this with reinforcement learning are either too computationally expensive, requiring a lot of processing power and memory, or they require repeating the costly distillation process that produced the model in the first place, which is also time-consuming.

What's the solution?

Astrolabe tackles this by using a clever approach to reinforcement learning that doesn't need to re-train the whole model or use huge amounts of computing power. It learns by comparing good and bad examples *while* the video is being generated, making small adjustments on the fly. It also breaks down long videos into smaller chunks, updating the AI's 'understanding' of what looks good for each chunk while still keeping the overall video consistent. Finally, it uses multiple ways to judge the video's quality to prevent the AI from finding loopholes to get a good score without actually improving the video.
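To make the "comparing good and bad examples" idea concrete, here is a minimal sketch of a negative-aware contrastive objective. This is a hypothetical simplification, not the paper's actual implementation: it contrasts each generated sample's reward against the batch average, then pushes the model's log-likelihood up for above-average samples and down for below-average ones, which gives an improvement direction without unrolling the reverse process.

```python
def negative_aware_weights(rewards):
    """Label each sample +1 (positive) or -1 (negative) by comparing
    its reward against the batch-mean baseline (a simplifying assumption;
    the paper's exact contrast rule may differ)."""
    baseline = sum(rewards) / len(rewards)
    return [1.0 if r >= baseline else -1.0 for r in rewards]

def contrastive_loss(log_probs, rewards):
    """Weighted negative log-likelihood: minimizing this raises the
    likelihood of positive samples and lowers it for negative ones."""
    weights = negative_aware_weights(rewards)
    return -sum(w * lp for w, lp in zip(weights, log_probs)) / len(rewards)
```

With two samples where the first scores above the baseline, minimizing this loss increases the first sample's log-probability and decreases the second's, which is the "implicit policy improvement direction" described above.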

Why it matters?

This research is important because it makes it more practical to create high-quality videos with AI. By making the process more efficient and scalable, it opens the door to more realistic and engaging AI-generated video content, which has applications in areas like entertainment, education, and communication.

Abstract

Distilled autoregressive (AR) video models enable efficient streaming generation but frequently misalign with human visual preferences. Existing reinforcement learning (RL) frameworks are not naturally suited to these architectures, typically requiring either expensive re-distillation or solver-coupled reverse-process optimization that introduces considerable memory and computational overhead. We present Astrolabe, an efficient online RL framework tailored for distilled AR models. To overcome existing bottlenecks, we introduce a forward-process RL formulation based on negative-aware fine-tuning. By contrasting positive and negative samples directly at inference endpoints, this approach establishes an implicit policy improvement direction without requiring reverse-process unrolling. To scale this alignment to long videos, we propose a streaming training scheme that generates sequences progressively via a rolling KV-cache, applying RL updates exclusively to local clip windows while conditioning on prior context to ensure long-range coherence. Finally, to mitigate reward hacking, we integrate a multi-reward objective stabilized by uncertainty-aware selective regularization and dynamic reference updates. Extensive experiments demonstrate that our method consistently enhances generation quality across multiple distilled AR video models, serving as a robust and scalable alignment solution.
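The streaming training scheme can be illustrated with a small scheduling sketch. The function below (hypothetical names and parameters, not the authors' code) partitions a long frame sequence into local clip windows that receive RL updates, while a rolling cache of the most recent prior frames serves as frozen conditioning context, standing in for the rolling KV-cache.

```python
from collections import deque

def streaming_rl_schedule(num_frames, clip_len, cache_len):
    """Build per-clip training steps: each step lists the frozen context
    frames (from a bounded rolling cache) and the local clip window that
    actually receives RL updates."""
    cache = deque(maxlen=cache_len)  # rolling cache of prior frames
    schedule = []
    for start in range(0, num_frames, clip_len):
        window = list(range(start, min(start + clip_len, num_frames)))
        schedule.append({"context": list(cache), "update": window})
        cache.extend(window)  # newest frames evict the oldest
    return schedule
```

Because updates touch only the current window while generation still conditions on cached context, memory cost stays bounded by the clip and cache sizes rather than growing with video length, which is what makes the alignment scalable to long videos.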