OmniDraft: A Cross-vocabulary, Online Adaptive Drafter for On-device Speculative Decoding

Ramchalam Kinattinkara Ramakrishnan, Zhaocong Yuan, Shaojie Zhuo, Chen Feng, Yicheng Lin, Chenzheng Su, Xiaopeng Zhang

2025-07-08

Summary

This paper talks about OmniDraft, a new system that lets a single lightweight AI draft model work together with many different large language models on devices like phones or laptops. The draft model speeds up text generation by guessing several tokens ahead for the larger model to verify, and it keeps learning over time to improve its guesses.
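
The "guess ahead, then verify" idea works roughly like this minimal sketch, where draft_next and target_next are toy stand-ins introduced here for illustration, not the models used in the paper:

```python
# A minimal sketch of speculative decoding: a small draft model guesses several
# tokens ahead, and the large target model checks them all in one step.
# draft_next and target_next are hypothetical toy stand-ins.

def draft_next(prefix):
    """Cheap draft model: guesses the next token quickly."""
    return prefix[-1] + 1 if prefix else 0

def target_next(prefix):
    """Large target model: slower, but its answer is authoritative."""
    return len(prefix)

def speculative_step(prefix, k=4):
    # 1) The draft proposes k tokens ahead of the target.
    proposed, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposed.append(tok)
        ctx.append(tok)
    # 2) The target checks the proposals and keeps the agreeing prefix.
    accepted, ctx = [], list(prefix)
    for tok in proposed:
        if target_next(ctx) == tok:
            accepted.append(tok)
            ctx.append(tok)
        else:
            break
    # 3) The target always adds one token of its own, so decoding never stalls.
    accepted.append(target_next(ctx))
    return list(prefix) + accepted

print(speculative_step([0, 1, 2]))  # several tokens accepted per target call
```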

What's the problem?

The problem is that a traditional draft model has to share its target model's vocabulary and be trained to match that model's behavior, so each target model needs its own specially paired drafter. This makes it hard to reuse a drafter across different models or adapt it to changing user needs, and it raises maintenance costs.

What's the solution?

The researchers created OmniDraft with an online n-gram cache that maps between different vocabularies, enabling speculative decoding even when the draft and target models don't use the same tokens. The system also continuously fine-tunes the draft model using feedback from the target model, so it adapts dynamically to user data and improves over time. Finally, it uses adaptive drafting techniques to balance speed and accuracy during text generation.
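
To make the n-gram cache idea concrete, here is a small sketch of how accepted text can be used to map draft-model tokens onto target-model tokens. The character-level and word-level toy tokenizers and the greedy alignment rule are assumptions for illustration, not OmniDraft's actual tokenizers or matching procedure:

```python
# Minimal sketch of a cross-vocabulary n-gram cache: the draft and target
# models tokenize text differently, so text the target has accepted is used
# to learn which draft-token n-grams correspond to which target tokens.

def draft_tokenize(text):
    return list(text)            # toy character-level "draft" vocabulary

def target_tokenize(text):
    return text.split()          # toy word-level "target" vocabulary

class NgramCache:
    def __init__(self):
        self.map = {}            # draft-token n-gram -> target token

    def update(self, accepted_text):
        # Online step: align draft n-grams with target tokens on shared text.
        for word in target_tokenize(accepted_text):
            self.map[tuple(draft_tokenize(word))] = word

    def translate(self, draft_tokens):
        # Rewrite a drafted token sequence in the target vocabulary, greedily
        # matching the longest cached n-gram at each position.
        out, i = [], 0
        while i < len(draft_tokens):
            if draft_tokens[i] == " ":   # whitespace has no word-level token
                i += 1
                continue
            for j in range(len(draft_tokens), i, -1):
                ngram = tuple(draft_tokens[i:j])
                if ngram in self.map:
                    out.append(self.map[ngram])
                    i = j
                    break
            else:
                break                    # unseen span: hand back what we have
        return out

cache = NgramCache()
cache.update("speculative decoding on device")            # learned online
print(cache.translate(draft_tokenize("decoding on device")))
# ['decoding', 'on', 'device']  -> tokens the target model can verify
```

In this toy version, anything the cache has not yet seen simply isn't translated; the real system keeps updating the cache and the draft model online, so coverage improves as the user keeps generating text.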

Why it matters?

This matters because it makes on-device AI applications more efficient, flexible, and personalized. A single draft model can speed up many different large language models while adapting to different tasks and users, which reduces costs and improves the user experience.

Abstract

OmniDraft is a unified framework enabling a single draft model to dynamically adapt to various target models, addressing cross-vocabulary mismatches and improving decoding speed for on-device LLM applications.