SONAR-LLM: Autoregressive Transformer that Thinks in Sentence Embeddings and Speaks in Tokens

Nikita Dragunov, Temurbek Rahmatullaev, Elizaveta Goncharova, Andrey Kuznetsov, Anton Razzhigaev

2025-08-12

Summary

This paper introduces SONAR-LLM, a new type of language model that reasons over text in whole-sentence units by using sentence embeddings, yet is still trained and generates text one token at a time. It combines sentence-level semantic modeling with the traditional token-level training objective used for ordinary language models.

What's the problem?

Traditional language models predict one token (a word or word piece) at a time, which becomes slow and inefficient for long texts. Newer models that try to reason in larger chunks, such as whole sentences, often lose fine-grained detail or are difficult to train because they give up the token-level training signal.

What's the solution?

The paper introduces SONAR-LLM, which predicts the meaning of the next sentence as a point in the SONAR sentence-embedding space but is trained with token-level cross-entropy feedback. It uses a frozen SONAR sentence encoder and decoder, so the model reasons in sentence-sized chunks while keeping the accuracy and stability of token-level training. This lets it generate text quickly and at good quality without complex sampling methods such as diffusion.
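To make the training recipe concrete, here is a minimal PyTorch sketch of one training step under the assumptions above: a trainable decoder-only transformer predicts the next sentence's embedding, and a frozen sentence decoder maps that embedding to token logits so a token-level cross-entropy loss can be backpropagated into the transformer. The names `backbone`, `sonar_encoder`, and `sonar_decoder` are hypothetical placeholders, not the paper's released API.

```python
import torch
import torch.nn as nn

class SonarLLMSketch(nn.Module):
    # Hypothetical sketch: backbone is a trainable decoder-only transformer over
    # sentence embeddings; sonar_encoder / sonar_decoder stand in for the frozen
    # SONAR sentence encoder (text -> embedding) and decoder (embedding -> token logits).
    def __init__(self, backbone, sonar_encoder, sonar_decoder):
        super().__init__()
        self.backbone = backbone
        self.sonar_encoder = sonar_encoder
        self.sonar_decoder = sonar_decoder
        for module in (self.sonar_encoder, self.sonar_decoder):
            for p in module.parameters():
                p.requires_grad_(False)  # encoder/decoder stay frozen

    def training_step(self, sentences, target_token_ids):
        # 1) Embed each sentence of the document with the frozen encoder.
        with torch.no_grad():
            sent_embs = self.sonar_encoder(sentences)            # (num_sents, d)

        # 2) The transformer reads the embedding sequence and, at each position,
        #    predicts the embedding of the *next* sentence.
        pred_embs = self.backbone(sent_embs[:-1].unsqueeze(0))    # (1, num_sents-1, d)

        # 3) The frozen decoder turns each predicted embedding into token logits;
        #    gradients flow through it back into the backbone only.
        logits = self.sonar_decoder(pred_embs.squeeze(0))         # (num_sents-1, T, vocab)

        # 4) Token-level cross-entropy against the tokens of the real next sentences.
        loss = nn.functional.cross_entropy(
            logits.flatten(0, 1),        # (N*T, vocab)
            target_token_ids.flatten(),  # (N*T,) token ids, padded with -100
            ignore_index=-100,
        )
        return loss
```

The key design choice the sketch tries to capture is that the prediction target lives in embedding space, but the loss is still computed on tokens, so no separate embedding-regression or diffusion objective is needed.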

Why it matters?

This matters because it lets models generate long texts more efficiently while keeping strong understanding, by blending sentence-level reasoning with token-level detail. That can make AI faster and better at tasks like writing, summarizing, or chatting, especially with long documents or conversations.

Abstract

SONAR-LLM is a decoder-only transformer that operates in the SONAR sentence-embedding space while being trained with a token-level cross-entropy objective; it achieves competitive text generation quality without diffusion sampling.
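At inference time, generation proceeds sentence by sentence rather than token by token. The loop below is a hedged sketch of how that could look with the placeholder modules from the earlier example; in particular, `decode_to_text` is a hypothetical helper for turning one predicted embedding into a sentence string and is not an API from the paper.

```python
def generate(model, prompt_sentences, max_sentences=20, eos_sentence="</s>"):
    # Sentence-level autoregression: embed the context, predict the next
    # sentence embedding, decode it to text, append, and repeat.
    history = list(prompt_sentences)
    for _ in range(max_sentences):
        embs = model.sonar_encoder(history)                    # (num_sents, d)
        next_emb = model.backbone(embs.unsqueeze(0))[0, -1]    # predicted next-sentence embedding
        next_sentence = model.sonar_decoder.decode_to_text(next_emb)  # hypothetical helper
        if next_sentence.strip() == eos_sentence:
            break
        history.append(next_sentence)
    return history
```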