Efficient Test-Time Scaling for Small Vision-Language Models
Mehmet Onurcan Kaya, Desmond Elliott, Dim P. Papadopoulos
2025-10-06
Summary
This paper explores ways to improve the performance of smaller vision-language models (programs that understand both images and text) without requiring heavy computation.
What's the problem?
Smaller vision-language models are good because they don't need as much processing power as larger ones, but they aren't as accurate or flexible when dealing with new situations or tasks. Existing methods to boost their performance during use (called 'test-time scaling') often require a lot of computation, defeating the purpose of using a small model in the first place.
What's the solution?
The researchers propose two techniques that improve these smaller models *during* use, without needing extra data or a lot of computing power. The first, Test-Time Augmentation (TTAug), generates several slightly modified versions of the input image and combines the model's outputs at the token level, with no parameter updates at all. The second, Test-Time Adaptation (TTAdapt), uses the consensus answers from TTAug as pseudolabels to lightly adjust the model's parameters for each input. Importantly, neither technique requires external supervision or permanent retraining: TTAug leaves the model untouched, and TTAdapt's adjustments are lightweight and per-input, so the base model can be reused unchanged.
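The token-level aggregation behind TTAug can be sketched with a toy stand-in for the model. Everything here is an illustrative assumption, not the paper's implementation: `toy_model` fakes per-token probability distributions, `augment` is a placeholder for real image augmentations, and the 4-token vocabulary is made up.

```python
def toy_model(view):
    # Stand-in for a small VLM: returns a probability distribution over a
    # 4-token vocabulary for each of 2 output positions. The augmentation
    # index perturbs the distribution slightly, mimicking view-dependent noise.
    _image, k = view
    noise = (k % 3) / 100.0
    base = [[0.1, 0.6, 0.2, 0.1],
            [0.3, 0.3, 0.3, 0.1]]
    return [[p + noise if v == 0 else p for v, p in enumerate(token)]
            for token in base]

def augment(image, k):
    # Placeholder for the k-th image augmentation (crop, flip, jitter, ...).
    return (image, k)

def ttaug_predict(image, num_augs=4):
    # TTAug-style aggregation: run the model on several augmented views,
    # average the per-token distributions, then take the argmax token at
    # each position. No parameters are updated.
    outputs = [toy_model(augment(image, k)) for k in range(num_augs)]
    tokens = []
    for t in range(len(outputs[0])):
        vocab_size = len(outputs[0][t])
        avg = [sum(out[t][v] for out in outputs) / num_augs
               for v in range(vocab_size)]
        tokens.append(max(range(vocab_size), key=avg.__getitem__))
    # In TTAdapt, `tokens` would additionally serve as a consensus
    # pseudolabel for a lightweight parameter update on this input.
    return tokens
```

Calling `ttaug_predict("photo.jpg")` averages the four views' token distributions before decoding, which is the sense in which outputs are aggregated "at the token level" rather than by picking one whole answer.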
Why it matters?
This work is important because it shows how to get better performance out of smaller, more efficient vision-language models. This is crucial for applications where computing resources are limited, like on phones or embedded devices, and allows these models to be used in more places without sacrificing accuracy.
Abstract
Small Vision-Language Models (VLMs) provide a computationally efficient alternative to larger models, at the cost of weaker generalization abilities and downstream task performance. These shortcomings could be addressed by test-time scaling techniques, but existing methods are typically computationally demanding, contradicting the resource-efficient design goals of small models. To address these limitations, we propose two novel and efficient test-time scaling strategies that leverage the model-internal features rather than external supervision: (i) Test-Time Augmentation (TTAug), which generates multiple augmented inputs and aggregates outputs at the token level without parameter updates, and (ii) Test-Time Adaptation (TTAdapt), which adapts model parameters during inference using consensus-based pseudolabels from TTAug. Through extensive experiments across nine benchmarks, we demonstrate consistent performance improvements while maintaining computational efficiency suitable for resource-constrained environments. The generality of our approach is demonstrated both within models at different scales and across different VLMs without additional tuning.