MARS: Enabling Autoregressive Models Multi-Token Generation

Ziqi Jin, Lei Wang, Ziwei Luo, Aixin Sun

2026-04-09

Summary

This paper introduces a new method called MARS, which aims to speed up how quickly large language models generate text without sacrificing the quality of the output.

What's the problem?

Normally, language models create text one word (or 'token') at a time, even when the next few words are highly predictable from what's already been written. This is slow and inefficient, especially for long pieces of text. Existing methods to speed things up often require structural changes to the model or running multiple models at once, which adds complexity and cost.

What's the solution?

MARS works by fine-tuning the existing language model to predict *multiple* tokens in a single processing step. It doesn't change the model's basic structure or add any new parts; it simply continues training the model on existing instruction data. The authors also developed a smarter way to store previously calculated information ('KV caching') to further boost speed, along with a mechanism to dynamically adjust generation speed based on how busy the system is. Essentially, the model learns to 'fill in the blanks' more efficiently.
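The dynamic speed adjustment can be pictured as a simple acceptance rule: each forward pass proposes a block of tokens with confidence scores, and the system keeps the longest prefix whose confidence stays above a threshold. The sketch below is a hypothetical illustration of that idea; `accept_tokens` and its inputs are invented for this example, not taken from the paper.

```python
def accept_tokens(predictions, threshold):
    """Accept the longest prefix of predicted (token, confidence) pairs
    whose confidence stays at or above the threshold. The first token is
    always accepted, which recovers ordinary one-token-per-pass
    autoregressive decoding as a fallback."""
    accepted = [predictions[0][0]]  # standard AR step: always take one token
    for token, confidence in predictions[1:]:
        if confidence < threshold:
            break  # stop at the first low-confidence prediction
        accepted.append(token)
    return accepted

# One forward pass proposes a block of candidate tokens with confidences.
block = [("The", 0.99), ("cat", 0.91), ("sat", 0.42), ("on", 0.95)]

conservative = accept_tokens(block, threshold=0.5)  # fewer tokens per pass
aggressive = accept_tokens(block, threshold=0.3)    # more tokens per pass
```

Lowering the threshold accepts more tokens per forward pass (higher throughput, more risk of lower-quality output); raising it is more conservative. This single parameter is what lets a serving system trade latency against quality on the fly, without swapping models.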

Why it matters?

This research is important because it offers a simple and effective way to make large language models faster without needing to overhaul their design. This means quicker response times for users, lower costs for running these models, and the ability to handle more requests simultaneously. The ability to adjust speed on the fly is also a big win for real-world applications where demand fluctuates.

Abstract

Autoregressive (AR) language models generate text one token at a time, even when consecutive tokens are highly predictable given earlier context. We introduce MARS (Mask AutoRegreSsion), a lightweight fine-tuning method that teaches an instruction-tuned AR model to predict multiple tokens per forward pass. MARS adds no architectural modifications, no extra parameters, and produces a single model that can still be called exactly like the original AR model with no performance degradation. Unlike speculative decoding, which maintains a separate draft model alongside the target, or multi-head approaches such as Medusa, which attach additional prediction heads, MARS requires only continued training on existing instruction data. When generating one token per forward pass, MARS matches or exceeds the AR baseline on six standard benchmarks. When allowed to accept multiple tokens per step, it maintains baseline-level accuracy while achieving 1.5-1.7x throughput. We further develop a block-level KV caching strategy for batch inference, achieving up to 1.71x wall-clock speedup over AR with KV cache on Qwen2.5-7B. Finally, MARS supports real-time speed adjustment via confidence thresholding: under high request load, the serving system can increase throughput on the fly without swapping models or restarting, providing a practical latency-quality knob for deployment.