POA: Pre-training Once for Models of All Sizes
Yingying Zhang, Xin Guo, Jiangwei Lao, Lei Yu, Lixiang Ru, Jian Wang, Guo Ye, Huimei He, Jingdong Chen, Ming Yang
2024-08-05

Summary
This paper presents POA (Pre-training Once for All), a self-supervised pre-training framework that trains a single network once and then lets models of many different sizes be extracted from it, making it easier to meet different tasks and deployment requirements.
What's the problem?
Most pre-training methods produce a single model of a fixed size per run. If you need models of several sizes to satisfy different computation or storage constraints, you have to pre-train each one separately from scratch, which is slow and requires a large amount of computational power.
What's the solution?
To solve this problem, the authors developed the POA framework, which adds an 'elastic student' branch to a standard teacher-student self-distillation setup. At each pre-training step, a sub-network is randomly sampled from the intact student to form the elastic student, and both students are trained to match the teacher's outputs. Because many different sub-networks are trained this way within a single pre-training run, models of many sizes can be extracted afterwards without retraining from scratch. The method has been tested on several evaluation protocols and downstream tasks and has been shown to outperform comparable pre-training approaches. A simplified sketch of one such training step follows.
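The sketch below is a minimal, hypothetical illustration of this tri-branch step, not the authors' released implementation: the backbone is a toy two-layer MLP, the 'elastic' sub-network is obtained simply by slicing the hidden width, and the names ToyStudent, distill_loss, and poa_step are invented for this example.

```python
# Minimal sketch of one POA-style pre-training step (toy setup, not POA's code).
import copy
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyStudent(nn.Module):
    """Two-layer MLP standing in for a ViT/Swin/ResNet backbone."""
    def __init__(self, dim_in=128, dim_hidden=256, dim_out=64):
        super().__init__()
        self.fc1 = nn.Linear(dim_in, dim_hidden)
        self.fc2 = nn.Linear(dim_hidden, dim_out)

    def forward(self, x, width_ratio=1.0):
        # Elastic forward: keep only the first `width_ratio` fraction of the
        # hidden units, emulating a sampled sub-network of the full student.
        h = int(self.fc1.out_features * width_ratio)
        w1, b1 = self.fc1.weight[:h], self.fc1.bias[:h]
        w2 = self.fc2.weight[:, :h]
        x = F.relu(F.linear(x, w1, b1))
        return F.linear(x, w2, self.fc2.bias)

def distill_loss(student_out, teacher_out, temp=0.1):
    """Cross-entropy between sharpened teacher and student distributions."""
    t = F.softmax(teacher_out / temp, dim=-1).detach()
    s = F.log_softmax(student_out / temp, dim=-1)
    return -(t * s).sum(dim=-1).mean()

def poa_step(student, teacher, opt, views, momentum=0.996):
    """One tri-branch step: teacher vs. intact student vs. elastic student."""
    v1, v2 = views                             # two augmented views of the same images
    with torch.no_grad():
        t_out = teacher(v1)                    # teacher sees one view, no gradients
    s_out = student(v2)                        # intact student, full width
    ratio = random.choice([0.25, 0.5, 0.75])
    e_out = student(v2, width_ratio=ratio)     # randomly sampled elastic student

    loss = distill_loss(s_out, t_out) + distill_loss(e_out, t_out)
    opt.zero_grad()
    loss.backward()
    opt.step()

    # EMA update keeps the teacher a slow-moving average of the student.
    with torch.no_grad():
        for pt, ps in zip(teacher.parameters(), student.parameters()):
            pt.mul_(momentum).add_(ps, alpha=1 - momentum)
    return loss.item()

if __name__ == "__main__":
    student = ToyStudent()
    teacher = copy.deepcopy(student).requires_grad_(False)
    opt = torch.optim.AdamW(student.parameters(), lr=1e-4)
    views = (torch.randn(8, 128), torch.randn(8, 128))
    print(poa_step(student, teacher, opt, views))
```

In the actual POA framework the backbone is a ViT, Swin Transformer, or ResNet and the elastic student is sampled from the intact student's architecture rather than by a simple width slice, but the control flow conveyed here is the same: sample a sub-network, distill both students against a momentum teacher, and repeat.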
Why it matters?
This research matters because it collapses many separate pre-training runs into a single one, making it faster and cheaper to obtain models that fit different compute and storage budgets. With POA, developers can save substantial time and resources while still reaching strong performance, which is particularly valuable in computer vision, where the same foundation backbone is often deployed at many different sizes.
Abstract
Large-scale self-supervised pre-training has paved the way for one foundation model to handle many different vision tasks. Most pre-training methodologies train a single model of a certain size at one time. Nevertheless, various computation or storage constraints in real-world scenarios require substantial efforts to develop a series of models with different sizes to deploy. Thus, in this study, we propose a novel tri-branch self-supervised training framework, termed POA (Pre-training Once for All), to tackle this issue. Our approach introduces an innovative elastic student branch into a modern self-distillation paradigm. At each pre-training step, we randomly sample a sub-network from the original student to form the elastic student and train all branches in a self-distilling fashion. Once pre-trained, POA allows the extraction of pre-trained models of diverse sizes for downstream tasks. Remarkably, the elastic student facilitates the simultaneous pre-training of multiple models with different sizes, which also acts as an additional ensemble of models of various sizes to enhance representation learning. Extensive experiments, including k-nearest neighbors, linear probing evaluation and assessments on multiple downstream tasks, demonstrate the effectiveness and advantages of our POA. It achieves state-of-the-art performance using ViT, Swin Transformer and ResNet backbones, producing around a hundred models with different sizes through a single pre-training session. The code is available at: https://github.com/Qichuzyy/POA.
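As a companion to the training sketch above, the snippet below illustrates, under the same toy assumptions, how a smaller standalone model might be carved out of the pre-trained student by copying width-sliced weights. The helper extract_submodel is hypothetical and is not part of the released POA code at the repository above.

```python
# Hedged sketch of "extract once pre-trained": carve a smaller standalone model
# out of the full toy student from the previous example (illustrative only).
import torch
import torch.nn as nn

def extract_submodel(full: "ToyStudent", width_ratio: float) -> nn.Module:
    """Build a free-standing MLP whose hidden width is a fraction of the
    pre-trained student's, initialised from the corresponding weight slices."""
    h = int(full.fc1.out_features * width_ratio)
    sub = nn.Sequential(
        nn.Linear(full.fc1.in_features, h),
        nn.ReLU(),
        nn.Linear(h, full.fc2.out_features),
    )
    with torch.no_grad():
        sub[0].weight.copy_(full.fc1.weight[:h])
        sub[0].bias.copy_(full.fc1.bias[:h])
        sub[2].weight.copy_(full.fc2.weight[:, :h])
        sub[2].bias.copy_(full.fc2.bias)
    return sub

# Usage: carve out several deployment-sized models from one pre-training run.
# for r in (0.25, 0.5, 0.75, 1.0):
#     model = extract_submodel(student, r)   # ready for k-NN / linear probing
```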