Φeat: Physically-Grounded Feature Representation
Giuseppe Vecchio, Adrien Kaiser, Rouffet Romain, Rosalie Martin, Elena Garces, Tamy Boubekeur
2025-11-19
Summary
This paper introduces a new way to build the foundational 'brains' for computer vision systems, focusing on making them understand the physical properties of objects, not just what the objects *are*.
What's the problem?
Current computer vision systems, while good at identifying objects, struggle when they need to understand how light interacts with surfaces or how an object's material affects its appearance. They conflate what something *is* with how it *looks*, which changes with lighting, viewpoint, and shape, making it hard for them to reason about the physical world.
What's the solution?
The researchers created a system called Φeat that learns by looking at many different views of the same material – think different shapes made of wood, or the same wood under different lighting. The system isn't *told* what the material is; it learns to recognize materials based on how they reflect light and their surface texture by comparing different views. This 'self-supervised' learning process helps it separate the material properties from things like shape and lighting.
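The contrastive idea described above can be sketched with a standard InfoNCE-style loss, where two embeddings of the same material (rendered under different shape or lighting) form a positive pair and all other samples in the batch act as negatives. This is a minimal illustration, not the paper's actual implementation; the function name, batch layout, and temperature value are assumptions.

```python
import numpy as np

def info_nce(z_a, z_b, temperature=0.1):
    """Toy InfoNCE loss: row i of z_a and z_b are two views of the
    same material (e.g. different shape or lighting); every other
    row in the batch serves as a negative. (Illustrative sketch.)"""
    # L2-normalise so the dot product is cosine similarity
    z_a = z_a / np.linalg.norm(z_a, axis=1, keepdims=True)
    z_b = z_b / np.linalg.norm(z_b, axis=1, keepdims=True)
    logits = z_a @ z_b.T / temperature           # (N, N) similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # positives sit on the diagonal: view i matches view i
    return -np.mean(np.diag(log_prob))

rng = np.random.default_rng(0)
z1 = rng.normal(size=(8, 16))
z2 = z1 + 0.05 * rng.normal(size=(8, 16))  # perturbed "second view"
print(info_nce(z1, z2))
```

Minimizing this loss pulls embeddings of the same material together while pushing different materials apart, which is what encourages the representation to ignore shape and lighting.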
Why does it matter?
This work is important because it shows computers can learn to 'see' the physical world without needing a ton of labeled data. This could lead to better robots that understand how objects behave, more realistic computer graphics, and generally smarter vision systems that aren't fooled by changes in lighting or viewpoint.
Abstract
Foundation models have emerged as effective backbones for many vision tasks. However, current self-supervised features entangle high-level semantics with low-level physical factors, such as geometry and illumination, hindering their use in tasks requiring explicit physical reasoning. In this paper, we introduce Φeat, a novel physically-grounded visual backbone that encourages a representation sensitive to material identity, including reflectance cues and geometric mesostructure. Our key idea is to employ a pretraining strategy that contrasts spatial crops and physical augmentations of the same material under varying shapes and lighting conditions. While similar data have been used in high-end supervised tasks such as intrinsic decomposition or material estimation, we demonstrate that a pure self-supervised training strategy, without explicit labels, already provides a strong prior for tasks requiring robust features invariant to external physical factors. We evaluate the learned representations through feature similarity analysis and material selection, showing that Φeat captures physically-grounded structure beyond semantic grouping. These findings highlight the promise of unsupervised physical feature learning as a foundation for physics-aware perception in vision and graphics.
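The abstract's "material selection" evaluation can be pictured as thresholding the cosine similarity between a query patch's feature and every other patch's feature. The sketch below is a hypothetical toy version: the feature extractor (Φeat's backbone) is assumed and not shown, and the function name and threshold are illustrative.

```python
import numpy as np

def select_material(query_feat, patch_feats, threshold=0.8):
    """Toy material selection: mark patches whose feature lies close
    (in cosine similarity) to a user-chosen query feature.
    (Illustrative; the real backbone producing the features is assumed.)"""
    q = query_feat / np.linalg.norm(query_feat)
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    sims = p @ q                     # cosine similarity per patch
    return sims >= threshold         # boolean selection mask

# 4 patches: the first two share the query's material direction
feats = np.array([[1.0, 0.1], [0.9, 0.2], [0.0, 1.0], [-1.0, 0.2]])
mask = select_material(feats[0], feats)
print(mask)  # → [ True  True False False]
```

If the features are truly invariant to shape and lighting, patches of the same material cluster tightly and a simple threshold like this suffices to select them.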