Scaling Embedding Layers in Language Models

Da Yu, Edith Cohen, Badih Ghazi, Yangsibo Huang, Pritish Kamath, Ravi Kumar, Daogao Liu, Chiyuan Zhang

2025-02-04

Summary

This paper introduces SCONE, a new method that improves how AI language models store and use word meanings. It focuses on making the models' embedding layers more scalable without increasing the computation needed while the model is actually running.

What's the problem?

Language models, like those used in chatbots or translation tools, rely on embedding layers to store information about words. As these models grow larger to handle more complex tasks, their embedding layers become harder to manage because they take up a lot of memory and slow down processing. Expanding these layers often leads to higher costs and slower speeds, making it difficult to scale the models effectively.

What's the solution?

The researchers created SCONE, which handles embeddings more cleverly by adding representations for frequently occurring word sequences, called n-grams. These n-gram embeddings are learned by a separate model during training, then precomputed and stored in ordinary memory (off the accelerator), so they don't slow the model down during use. SCONE lets the system scale in two ways: by increasing the number of cached n-gram embeddings, or by enlarging the model that learns them, all while keeping the computation per inference step fixed. Tests showed that SCONE could outperform a 1.9-billion-parameter baseline model while using only about half the computation (FLOPS) at inference time.
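The core idea can be illustrated with a small sketch. The code below is a hypothetical simplification, not the paper's implementation: in SCONE the n-gram embeddings are produced by a separately trained model and stored off-accelerator, while here they are just hand-filled dictionaries. The sketch shows the inference-time lookup: each token gets its usual embedding, plus the embedding of the longest cached n-gram ending at that position, if one exists.

```python
# Illustrative sketch of SCONE-style contextualized lookup (hypothetical
# values; real n-gram embeddings are learned by a separate model and
# precomputed into an off-accelerator cache).

DIM = 4  # embedding dimension (toy size)

# Ordinary per-token embedding table.
token_emb = {
    "the": [0.1] * DIM,
    "cat": [0.2] * DIM,
    "sat": [0.3] * DIM,
}

# Precomputed cache of embeddings for frequent n-grams, keyed by the
# tuple of tokens ending at the current position.
ngram_emb = {
    ("the", "cat"): [0.05] * DIM,
    ("the", "cat", "sat"): [0.02] * DIM,
}
MAX_N = 3  # longest cached n-gram

def embed(tokens):
    """Return one vector per token: its base embedding plus the embedding
    of the longest cached n-gram ending at that position (if any)."""
    out = []
    for i, tok in enumerate(tokens):
        vec = list(token_emb[tok])
        # Longest-match lookup among cached n-grams ending here.
        for n in range(min(MAX_N, i + 1), 1, -1):
            key = tuple(tokens[i - n + 1 : i + 1])
            if key in ngram_emb:
                vec = [a + b for a, b in zip(vec, ngram_emb[key])]
                break
        out.append(vec)
    return out

vectors = embed(["the", "cat", "sat"])
```

Because the cache is only read (never computed through) at inference, growing it adds memory but no extra FLOPS per token, which is the scaling property the paper exploits.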

Why it matters?

This research is important because it makes AI systems more efficient and practical for real-world use. By moving large embedding tables off the accelerator into ordinary memory and keeping inference computation fixed, SCONE allows more capable language models to be deployed without requiring more expensive hardware. This could lead to better AI tools for tasks like writing assistance, customer support, and education while keeping costs lower and performance high.

Abstract

We propose SCONE (Scalable, Contextualized, Offloaded, N-gram Embedding), a method for extending input embedding layers to enhance language model performance as layer size scales. To avoid increased decoding costs, SCONE retains the original vocabulary while introducing embeddings for a set of frequent n-grams. These embeddings provide contextualized representation for each input token and are learned with a separate model during training. During inference, they are precomputed and stored in off-accelerator memory with minimal impact on inference speed. SCONE enables two new scaling strategies: increasing the number of cached n-gram embeddings and scaling the model used to learn them, all while maintaining fixed inference-time FLOPS. We show that scaling both aspects allows SCONE to outperform a 1.9B parameter baseline across diverse corpora, while using only half the inference-time FLOPS.