Thinking with Camera: A Unified Multimodal Model for Camera-Centric Understanding and Generation
Kang Liao, Size Wu, Zhonghua Wu, Linyi Jin, Chao Wang, Yikai Wang, Fei Wang, Wei Li, Chen Change Loy
2025-10-13
Summary
This paper introduces Puffin, a new artificial intelligence model that can both understand and create images based on camera viewpoints, essentially giving it a strong sense of spatial awareness.
What's the problem?
Traditionally, AI models that understand images and those that generate images from different viewpoints have been developed separately. This means they don't work together well, and it's hard for an AI to truly 'think' about a scene from different perspectives like a human does. Existing models struggle to connect what a camera 'sees' with how we describe things visually and spatially.
What's the solution?
The researchers created Puffin, which combines understanding and generation in one system. They treated the camera's parameters (such as its orientation and field of view) as if they were a language the AI could understand, allowing Puffin to learn how camera parameters relate to what appears in an image and how to describe it. They trained Puffin on Puffin-4M, a large-scale dataset of images paired with camera information and descriptions, and used a diffusion model to create new images from different viewpoints. They also combined global camera parameters with detailed pixel-level camera maps to make generation more accurate and flexible.
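To make the two ingredients above concrete, here is a minimal sketch of (a) serializing global camera parameters as text, so a language model can condition on them, and (b) deriving a pixel-wise camera map from those same parameters. The function names, the exact text format, and the choice of a per-pixel latitude map are illustrative assumptions, not the paper's actual implementation; the sketch assumes a simple pinhole camera described by roll, pitch, and vertical field of view.

```python
import numpy as np

def camera_as_text(roll_deg, pitch_deg, vfov_deg):
    # Hypothetical serialization: render global camera parameters as a short
    # textual phrase that a vision-language model could condition on.
    return (f"camera: roll {roll_deg:.1f} deg, pitch {pitch_deg:.1f} deg, "
            f"vertical FoV {vfov_deg:.1f} deg")

def latitude_map(h, w, roll_deg, pitch_deg, vfov_deg):
    # One plausible pixel-wise camera map: the latitude (elevation angle
    # relative to the horizon) of the viewing ray through each pixel.
    f = (h / 2) / np.tan(np.radians(vfov_deg) / 2)  # focal length in pixels
    ys, xs = np.mgrid[0:h, 0:w]
    x = xs - (w - 1) / 2                # pixel offsets from the image center
    y = ys - (h - 1) / 2
    rays = np.stack([x, y, np.full_like(x, f, dtype=float)], axis=-1)
    rays /= np.linalg.norm(rays, axis=-1, keepdims=True)
    r, p = np.radians(roll_deg), np.radians(pitch_deg)
    # Rotate rays from the camera frame into the world frame:
    # roll about the optical axis, then pitch about the horizontal axis.
    Rz = np.array([[np.cos(r), -np.sin(r), 0.0],
                   [np.sin(r),  np.cos(r), 0.0],
                   [0.0,        0.0,       1.0]])
    Rx = np.array([[1.0, 0.0,        0.0],
                   [0.0, np.cos(p), -np.sin(p)],
                   [0.0, np.sin(p),  np.cos(p)]])
    world = rays @ (Rx @ Rz).T
    # Image y points down, so world "up" is -y; latitude is the elevation
    # of each unit ray with respect to gravity.
    return np.degrees(np.arcsin(np.clip(-world[..., 1], -1.0, 1.0)))
```

With an upright camera (zero roll and pitch), pixels above the image center get positive latitudes and pixels below get negative ones, so the map encodes where the horizon falls, which is the kind of spatially grounded cue the generation side can exploit.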
Why it matters?
Puffin is important because it's a step towards AI that can truly understand and interact with the physical world. It can be used for things like imagining what a scene looks like from a different angle, exploring virtual environments, or even giving advice on photography. By releasing the model and data, the researchers hope to encourage further research in this area of 'spatial intelligence'.
Abstract
Camera-centric understanding and generation are two cornerstones of spatial intelligence, yet they are typically studied in isolation. We present Puffin, a unified camera-centric multimodal model that extends spatial awareness along the camera dimension. Puffin integrates language regression and diffusion-based generation to interpret and create scenes from arbitrary viewpoints. To bridge the modality gap between cameras and vision-language, we introduce a novel paradigm that treats camera as language, enabling thinking with camera. This guides the model to align spatially grounded visual cues with photographic terminology while reasoning across geometric context. Puffin is trained on Puffin-4M, a large-scale dataset of 4 million vision-language-camera triplets. We incorporate both global camera parameters and pixel-wise camera maps, yielding flexible and reliable spatial generation. Experiments demonstrate Puffin's superior performance over specialized models for camera-centric generation and understanding. With instruction tuning, Puffin generalizes to diverse cross-view tasks such as spatial imagination, world exploration, and photography guidance. We will release the code, models, dataset pipeline, and benchmark to advance multimodal spatial intelligence research.