Patch-Level Training for Large Language Models
Chenze Shao, Fandong Meng, Jie Zhou
2024-07-18

Summary
This paper introduces patch-level training, a method for training large language models (LLMs) that reduces the time and compute required for training without sacrificing performance.
What's the problem?
Training large language models is extremely resource-intensive because the model must be trained on every individual token (a word or piece of a word) in a massive corpus. Standard token-level training therefore processes an enormous number of tokens, which drives up computational cost and training time.
What's the solution?
To address this, the authors introduce patch-level training, which groups several consecutive tokens into larger units called patches. Instead of predicting the next token, the model is trained to predict the next patch, so it processes much shorter sequences and needs far less computation. After this patch-level phase has covered most of the training data, the model switches to conventional token-level training on the remaining data so that it matches the way it will be used at inference time. In experiments, this two-stage approach cut overall training costs roughly in half while matching the performance of standard token-level training.
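To make the idea concrete, here is a minimal, hypothetical sketch of patch-level training in PyTorch. It assumes (these details are illustrative, not taken verbatim from the paper) that a patch embedding is the mean of the embeddings of K consecutive tokens, and that each patch position is trained to predict all K tokens of the following patch with a shared output head.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical, simplified sketch of patch-level training.
# Assumptions: a patch embedding is the mean of K consecutive token embeddings,
# and each patch position predicts all K tokens of the *next* patch.

class PatchLevelLM(nn.Module):
    def __init__(self, vocab_size, d_model=512, n_layers=4, n_heads=8, patch_size=4):
        super().__init__()
        self.K = patch_size
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, token_ids):
        # token_ids: (B, T) with T divisible by the patch size K
        B, T = token_ids.shape
        P = T // self.K
        tok_emb = self.embed(token_ids)                          # (B, T, D)
        # Compress K consecutive token embeddings into one patch embedding.
        patch_emb = tok_emb.view(B, P, self.K, -1).mean(dim=2)   # (B, P, D)
        causal = nn.Transformer.generate_square_subsequent_mask(P).to(token_ids.device)
        hidden = self.backbone(patch_emb, mask=causal)           # (B, P, D)
        return self.head(hidden)                                 # (B, P, V)


def patch_level_loss(model, token_ids):
    """Cross-entropy for predicting every token of the next patch."""
    logits = model(token_ids)                     # (B, P, V)
    B, P, V = logits.shape
    K = model.K
    targets = token_ids.view(B, P, K)             # tokens grouped by patch
    # Patch position p is trained to predict the K tokens of patch p + 1.
    pred = logits[:, :-1].unsqueeze(2).expand(-1, -1, K, -1)    # (B, P-1, K, V)
    gold = targets[:, 1:]                                       # (B, P-1, K)
    return F.cross_entropy(pred.reshape(-1, V), gold.reshape(-1))


# Usage: run patch-level training on most of the data, then switch to ordinary
# next-token training (not shown) so the model matches the inference setup.
model = PatchLevelLM(vocab_size=32000)
tokens = torch.randint(0, 32000, (2, 64))         # 64 tokens -> 16 patches
loss = patch_level_loss(model, tokens)
loss.backward()
```

Because the backbone now sees sequences that are K times shorter, the bulk of the training compute in this phase shrinks accordingly; the follow-up token-level phase restores the usual next-token setup used at inference.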
Why it matters?
This research is important because it makes training large language models more efficient and accessible. By reducing the resources required for training, more researchers and developers can work on creating advanced AI systems. This could lead to faster advancements in natural language processing technologies, making them more practical for various applications, like chatbots and translation services.
Abstract
As Large Language Models (LLMs) achieve remarkable progress in language understanding and generation, their training efficiency has become a critical concern. Traditionally, LLMs are trained to predict the next token in a sequence. Despite the success of token-level training, it suffers from considerable computational costs due to the need to process an extensive number of tokens. To mitigate this issue, this paper introduces patch-level training for LLMs, which reduces the sequence length by compressing multiple tokens into a single patch. During patch-level training, we feed the language model shorter sequences of patches and train it to predict the next patch, thereby processing the majority of the training data at a significantly reduced computational cost. Following this, the model continues token-level training on the remaining training data to align with the inference mode. Experiments on a diverse range of models (370M-2.7B parameters) demonstrate that patch-level training can reduce overall computational costs to 0.5×, without compromising the model performance compared to token-level training. Source code: https://github.com/shaochenze/PatchTrain.
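As a rough back-of-the-envelope check on the 0.5× figure (the formula and example numbers below are an illustration, not a derivation taken from the paper): if a fraction $\lambda$ of the training tokens is processed at patch level with patch size $K$, and the remainder at token level, the relative training cost is approximately

$$\text{relative cost} \approx \frac{\lambda}{K} + (1 - \lambda).$$

For example, $K = 4$ with $\lambda = 2/3$ gives $\tfrac{1}{6} + \tfrac{1}{3} = \tfrac{1}{2}$, consistent with the reported halving of overall compute.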