Training Noise Token Pruning
Mingxing Rao, Bohan Jiang, Daniel Moyer
2024-12-02

Summary
This paper introduces Training Noise Token (TNT) Pruning, a method that makes vision transformers more efficient by rethinking how input tokens are pruned during training.
What's the problem?
Vision transformers, which are used for tasks like image classification, split each image into many tokens, and the cost of attention grows quadratically with the token count, making these models slow and resource-intensive. Traditional methods reduce this cost by discarding tokens outright, but that hard, discrete decision is difficult to optimize during training and can lower the performance of the final model.
What's the solution?
TNT Pruning relaxes the hard decision of dropping a token into a soft one. Instead of removing tokens during training, it adds continuous noise to them, which keeps optimization smooth and helps the model learn which tokens are most important. At deployment, the learned importance scores can still be used to drop tokens discretely, so the efficiency gains are preserved. The researchers also connect their approach to rate-distortion theory from signal processing and demonstrate its effectiveness through experiments on the ImageNet dataset, showing that TNT Pruning outperforms previous pruning methods.
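To make the idea concrete, here is a minimal PyTorch sketch of the training-time relaxation. This is not the authors' implementation: the module name TNTNoise, the linear importance_head, and the noise_scale parameter are all illustrative assumptions, and the paper may compute token importance differently.

```python
# Hedged sketch of noise-based token relaxation, NOT the paper's code.
import torch
import torch.nn as nn

class TNTNoise(nn.Module):
    """Adds noise to low-importance tokens during training instead of
    dropping them, keeping the objective differentiable."""

    def __init__(self, dim: int, noise_scale: float = 1.0):
        super().__init__()
        # Lightweight per-token importance predictor (an assumption; the
        # paper may derive importance differently, e.g. from attention).
        self.importance_head = nn.Linear(dim, 1)
        self.noise_scale = noise_scale

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, num_tokens, dim)
        importance = torch.sigmoid(self.importance_head(tokens))  # (B, N, 1)
        if self.training:
            # Low-importance tokens receive stronger additive Gaussian
            # noise, a continuous relaxation of dropping them outright.
            noise = torch.randn_like(tokens) * self.noise_scale
            return tokens + (1.0 - importance) * noise
        # At inference the importance scores can instead drive hard
        # pruning (see the deployment sketch after the abstract).
        return tokens
```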
Why it matters?
This research is significant because it provides a more effective way to train vision transformers, making them faster and more efficient while maintaining or improving accuracy. By optimizing how these models handle input data, TNT Pruning can benefit computer vision applications such as image recognition and classification, which are increasingly important in fields like healthcare, security, and autonomous vehicles.
Abstract
In this work we present Training Noise Token (TNT) Pruning for vision transformers. Our method relaxes the discrete token-dropping condition to continuous additive noise, providing smooth optimization during training while retaining the computational gains of discrete dropping in deployment settings. We provide theoretical connections to the Rate-Distortion literature, and empirical evaluations on the ImageNet dataset using ViT and DeiT architectures demonstrate TNT's advantages over previous pruning methods.
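The abstract notes that discrete dropping is retained at deployment. A hedged sketch of that step is shown below: keep the top-k tokens ranked by a learned importance score and discard the rest. The function name prune_tokens and the keep_ratio parameter are assumptions for illustration, not the paper's interface.

```python
# Hedged sketch of deployment-time pruning under assumed names.
import torch

def prune_tokens(tokens: torch.Tensor,
                 importance: torch.Tensor,
                 keep_ratio: float = 0.7) -> torch.Tensor:
    # tokens: (batch, num_tokens, dim); importance: (batch, num_tokens)
    batch, num_tokens, dim = tokens.shape
    k = max(1, int(num_tokens * keep_ratio))
    # Indices of the k highest-importance tokens per example.
    topk = importance.topk(k, dim=1).indices          # (batch, k)
    idx = topk.unsqueeze(-1).expand(-1, -1, dim)      # (batch, k, dim)
    return tokens.gather(1, idx)                      # (batch, k, dim)
```

Because hard pruning happens only at inference, the training graph stays fully differentiable, which is the point of the continuous-noise relaxation.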