T-FREE: Tokenizer-Free Generative LLMs via Sparse Representations for Memory-Efficient Embeddings
Björn Deiseroth, Manuel Brack, Patrick Schramowski, Kristian Kersting, Samuel Weinbach
2024-06-28
Summary
This paper introduces T-FREE, a new approach to building large language models (LLMs) that eliminates the need for traditional tokenizers. Instead of breaking text into smaller pieces (tokens), T-FREE represents each word directly through a sparse activation pattern derived from its character triplets (groups of three consecutive characters).
What's the problem?
Traditional tokenizers, which are essential for processing language in AI models, have several problems. They add computational overhead, and their large vocabularies waste memory in the model's embedding and output layers. Moreover, because a tokenizer is fitted to a reference corpus, it often performs poorly on languages that are underrepresented in that corpus, making models less effective for many users around the world.
What's the solution?
To address these issues, the authors developed T-FREE, which uses sparse activation patterns over character triplets to represent words. Instead of looking up a word in a large learned vocabulary, T-FREE captures it through the structure of its letters, which inherently exposes morphological similarities between related words. This allows the model's embedding layers to be shrunk by more than 85% while maintaining competitive performance in understanding and generating language. Additionally, because the representation does not depend on a reference corpus, T-FREE improves cross-lingual transfer, letting models adapt to new languages without a separate tokenizer for each one.
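To make the core idea concrete, here is a minimal sketch of mapping a word to a sparse set of embedding-row indices via its character triplets. The padding convention, the hash function, and the table size (`vocab_size`) are illustrative assumptions, not the paper's exact scheme (T-FREE's actual design uses multiple hash-based activations per triplet and careful collision handling):

```python
import hashlib

def trigram_activations(word, vocab_size=8192):
    """Sketch: represent a word by the set of embedding rows activated
    by its character triplets. NOTE: padding, hashing, and table size
    are illustrative choices, not the paper's exact configuration."""
    padded = f" {word.lower()} "  # whitespace padding marks word boundaries
    # All overlapping 3-character windows of the padded word.
    trigrams = [padded[i:i + 3] for i in range(len(padded) - 2)]
    indices = set()
    for tri in trigrams:
        # Hash each triplet into one slot of the (small) embedding table.
        digest = hashlib.md5(tri.encode("utf-8")).digest()
        indices.add(int.from_bytes(digest[:4], "big") % vocab_size)
    return trigrams, sorted(indices)

trigrams, idx = trigram_activations("words")
# "words" yields the triplets: ' wo', 'wor', 'ord', 'rds', 'ds '
```

The word's embedding would then be the sum of the embedding-table rows at the active indices. Because related forms like "words" and "word" share most of their triplets, they activate overlapping rows, which is how morphological similarity is exploited without a learned vocabulary.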
Why it matters?
This research is important because it offers a more efficient way to process language in AI systems. By reducing the reliance on traditional tokenization methods, T-FREE can help create faster and more effective language models that work better across various languages. This could lead to improved AI applications in translation, content creation, and other areas where understanding language is crucial.
Abstract
Tokenizers are crucial for encoding information in Large Language Models, but their development has recently stagnated, and they contain inherent weaknesses. Major limitations include computational overhead, ineffective vocabulary use, and unnecessarily large embedding and head layers. Additionally, their performance is biased towards a reference corpus, leading to reduced effectiveness for underrepresented languages. To remedy these issues, we propose T-FREE, which directly embeds words through sparse activation patterns over character triplets, and does not require a reference corpus. T-FREE inherently exploits morphological similarities and allows for strong compression of embedding layers. In our exhaustive experimental evaluation, we achieve competitive downstream performance with a parameter reduction of more than 85% on these layers. Further, T-FREE shows significant improvements in cross-lingual transfer learning.