Bolmo: Byteifying the Next Generation of Language Models
Benjamin Minixhofer, Tyler Murray, Tomasz Limisiewicz, Anna Korhonen, Luke Zettlemoyer, Noah A. Smith, Edoardo M. Ponti, Luca Soldaini, Valentin Hofmann
2025-12-22
Summary
This paper introduces Bolmo, a new type of language model that works directly with individual bytes instead of larger pieces of text. It's available in two sizes, 1 billion and 7 billion parameters, making it relatively small but still powerful.
What's the problem?
Traditional language models break down text into 'subwords' – common chunks of characters. While effective, this method can struggle with understanding individual characters and is limited by its fixed subword vocabulary. Existing attempts to build language models that work directly with bytes haven't performed as well as subword-based models, and training them from scratch is very resource-intensive.
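To make the contrast concrete, here is a toy illustration (not Bolmo's or any real model's actual tokenizer): a greedy longest-match subword tokenizer with a small made-up vocabulary, compared with the raw UTF-8 byte view of the same word. The subword view hides individual characters behind a few opaque tokens, while the byte view exposes every character directly.

```python
# Toy fixed vocabulary for illustration only (hypothetical, not a real
# tokenizer's vocabulary).
TOY_VOCAB = {"straw", "berry", "s", "t", "r", "a", "w", "b", "e", "y"}

def subword_tokenize(text, vocab, max_len=8):
    """Greedy longest-match segmentation over a fixed vocabulary."""
    tokens, i = [], 0
    while i < len(text):
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in vocab:
                tokens.append(text[i:j])
                i = j
                break
        else:
            tokens.append(text[i])  # unknown character falls through
            i += 1
    return tokens

word = "strawberry"
print(subword_tokenize(word, TOY_VOCAB))  # ['straw', 'berry']
# The model sees only 2 opaque tokens; counting the three 'r's requires
# knowledge the token IDs themselves do not expose.

print(list(word.encode("utf-8")))
# [115, 116, 114, 97, 119, 98, 101, 114, 114, 121]
# A byte-level model sees every character, so character-level questions
# can be answered directly from the input.
```

This is why byte-level models tend to do better on character-understanding tasks: the information a subword model must memorize per token is present in the byte-level input itself.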
What's the solution?
The researchers took existing subword-level language models and 'byteified' them, meaning they converted them to work with individual bytes. They designed a special architecture for Bolmo that allows it to effectively learn from the original subword model, requiring only a small amount of additional training – less than 1% of what it would take to train a model from scratch. This process allows Bolmo to understand characters better and even improve performance on tasks like coding, while maintaining similar overall performance to the original subword model.
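The distillation idea can be sketched in miniature. The snippet below is a hypothetical simplification, not the paper's implementation: at byte positions that close a subword, a byte-level student is trained to match the subword teacher's full next-token distribution via cross-entropy (the function names, shapes, and the boundary mask are illustrative assumptions).

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    z = logits - logits.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, boundary_mask):
    """Cross-entropy between teacher and student distributions,
    averaged over subword-final byte positions only.

    student_logits: (num_bytes, vocab) student predictions per byte
    teacher_logits: (num_bytes, vocab) teacher predictions, aligned to
                    the byte that closes each subword (illustrative)
    boundary_mask:  (num_bytes,) True where a subword ends
    """
    p_teacher = softmax(teacher_logits)
    log_p_student = np.log(softmax(student_logits))
    ce = -(p_teacher * log_p_student).sum(axis=-1)  # per-position loss
    return ce[boundary_mask].mean()

rng = np.random.default_rng(0)
teacher = rng.normal(size=(6, 10))          # 6 bytes, vocab of 10
mask = np.array([False, True, False, False, True, True])
# A student that matches the teacher exactly attains the teacher's
# entropy, the minimum of this objective.
print(distill_loss(teacher, teacher, mask))
```

Because the target is the teacher's full distribution rather than sampled text, only a small training budget is needed to transfer the teacher's behavior, which is consistent with the under-1%-of-pretraining cost reported above.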
Why it matters?
This work demonstrates that byte-level language models can be a practical and competitive alternative to subword-level models. Bolmo achieves performance comparable to, and sometimes better than, existing models, while also being faster and easier to adapt and improve using existing tools and techniques. This opens the door for wider use of byte-level models in various applications.
Abstract
We introduce Bolmo, the first family of competitive fully open byte-level language models (LMs) at the 1B and 7B parameter scales. In contrast to prior research on byte-level LMs, which focuses predominantly on training from scratch, we train Bolmo by byteifying existing subword-level LMs. Byteification enables overcoming the limitations of subword tokenization – such as insufficient character understanding and efficiency constraints due to the fixed subword vocabulary – while performing at the level of leading subword-level LMs. Bolmo is specifically designed for byteification: our architecture resolves a mismatch between the expressivity of prior byte-level architectures and subword-level LMs, which makes it possible to employ an effective exact distillation objective between Bolmo and the source subword model. This allows for converting a subword-level LM to a byte-level LM by investing less than 1% of a typical pretraining token budget. Bolmo substantially outperforms all prior byte-level LMs of comparable size, and outperforms the source subword-level LMs on character understanding and, in some cases, coding, while coming close to matching the original LMs' performance on other tasks. Furthermore, we show that Bolmo can achieve inference speeds competitive with subword-level LMs by training with higher token compression ratios, and can be cheaply and effectively post-trained by leveraging the existing ecosystem around the source subword-level LM. Our results finally make byte-level LMs a practical choice competitive with subword-level LMs across a wide set of use cases.