Perception Encoder: The best visual embeddings are not at the output of the network
Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, Junke Wang, Marco Monteiro, Hu Xu, Shiyu Dong, Nikhila Ravi, Daniel Li, Piotr Dollár, Christoph Feichtenhofer
2025-04-18
Summary
This paper introduces Perception Encoder, a new approach for getting the best information out of visual AI models by reading features from the middle of the network rather than only its final output.
What's the problem?
Most AI systems that work with images or videos use only the representation produced at the very end of the network. But the final layer is tuned to the training objective, not necessarily to the task at hand, so for many downstream tasks it is not the most useful or detailed representation. Systems that rely on it alone may be missing better ways to understand and describe visual content.
What's the solution?
The researchers trained their model with contrastive vision-language learning, then used alignment techniques to pull out 'embeddings', which are like summaries of what the model sees, from the intermediate layers of the network rather than just the last one. These intermediate embeddings turned out to be more powerful for a wide range of image and video tasks, leading to state-of-the-art results.
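The core idea, reading out a middle layer's features instead of the final output, can be sketched with a toy network. This is a minimal NumPy stand-in, not the authors' code: the layer count, dimensions, and the choice of which layer to read are all illustrative assumptions, and the paper's actual alignment methods additionally tune the model so the chosen layer matches the downstream task.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a vision encoder: a stack of simple layers
# (random linear maps plus a nonlinearity). Sizes are illustrative.
DIM = 16
N_LAYERS = 8
weights = [rng.standard_normal((DIM, DIM)) / np.sqrt(DIM)
           for _ in range(N_LAYERS)]

def forward_with_intermediates(x):
    """Run the network and keep every layer's output, not just the last."""
    activations = []
    h = x
    for w in weights:
        h = np.tanh(h @ w)
        activations.append(h)
    return activations

x = rng.standard_normal((1, DIM))
acts = forward_with_intermediates(x)

final_embedding = acts[-1]        # the usual choice: the network's output
intermediate_embedding = acts[4]  # a middle layer, as the paper advocates
```

In a real vision transformer the same effect is achieved with feature hooks on the chosen blocks; the point here is only that every layer already produces an embedding, and the best one for a given task need not be the last.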
Why it matters?
This matters because it shows that AI systems can get much better at understanding visual information simply by changing where inside the model the most useful features are read from, without redesigning the model itself. That can translate into smarter image search, better video analysis, and improved performance in many real-world applications.
Abstract
Perception Encoder, trained via contrastive vision-language learning, achieves state-of-the-art performance across various image and video tasks using intermediate embeddings extracted through alignment methods.