Efficient Pretraining Length Scaling

Bohong Wu, Shen Yan, Sijun Zhang, Jianqiao Lu, Yutao Zeng, Ya Wang, Xun Zhou

2025-04-23

Summary

This paper introduces a new way to train large language models, called the PHD-Transformer framework, which makes it easier and faster for these models to handle longer pieces of text during training.

What's the problem?

The problem is that when language models are trained on longer texts, they usually need much more computing power and memory, which slows down training and makes it very expensive. This limits how well the models can learn from big, complex documents.

What's the solution?

The researchers developed the PHD-Transformer, which uses a smarter way to manage memory, specifically the key-value (KV) cache, the store of intermediate results that grows with the length of the text being processed. By keeping this cache small, the model can be trained on longer data without using as many resources, and it actually performs better on a variety of tests.
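To make the KV cache idea concrete, here is a minimal, generic sketch of how a decoder reuses cached keys and values during generation, and why that cache grows with text length. This is a standard illustration of KV caching, not the paper's actual PHD-Transformer method; all names (`KVCache`, `attend_step`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class KVCache:
    """Stores keys/values for tokens already processed, so each new
    token can attend to the history without recomputing it."""
    def __init__(self, d_model):
        self.keys = np.empty((0, d_model))
        self.values = np.empty((0, d_model))

    def append(self, k, v):
        # One more row per token: memory grows linearly with length.
        self.keys = np.vstack([self.keys, k])
        self.values = np.vstack([self.values, v])

def attend_step(q, k, v, cache):
    """One decoding step: cache this token's K/V, then attend the
    query over the full cached history."""
    cache.append(k, v)
    scores = cache.keys @ q / np.sqrt(q.shape[-1])  # (t,)
    weights = softmax(scores)
    return weights @ cache.values                   # (d_model,)

d = 8
rng = np.random.default_rng(0)
cache = KVCache(d)
for t in range(4):  # decode 4 tokens
    q, k, v = rng.normal(size=(3, d))
    out = attend_step(q, k, v, cache)

print(cache.keys.shape)  # (4, 8): one cached row per token seen
```

The cache here stores one key/value row per token, which is exactly the cost that balloons on long texts; work like this paper targets that growth so longer training sequences stay affordable.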

Why it matters?

This matters because it lets AI models learn from longer, more realistic material, making them better at understanding and generating long, complex texts. That is useful for tasks like summarizing books, analyzing reports, or answering complicated questions.

Abstract

A novel PHD-Transformer framework enables efficient length scaling during pre-training with optimized KV cache management, achieving improvements across multiple benchmarks.