COMPACT: COMPositional Atomic-to-Complex Visual Capability Tuning
Xindi Wu, Hee Seung Hwang, Polina Kirichenko, Olga Russakovsky
2025-05-01
Summary
This paper introduces COMPACT, a new method that helps AI models get better at understanding and reasoning over both pictures and words, even when there isn't much training data available.
What's the problem?
Most AI systems need enormous amounts of training data to learn complicated tasks that mix images and language, which makes training expensive and slow, especially for smaller projects.
What's the solution?
The researchers developed COMPACT, a method that teaches the AI complex visual and language skills by combining simple, atomic capabilities into more complex ones, so it can perform well on tough tasks without needing as much data as usual.
Why does it matter?
This matters because it makes powerful AI tools more practical for everyone, allowing better image and language understanding even when resources are limited, which is helpful for education, research, and creative projects.
Abstract
COMPACT, a compositional visual capability tuning method, improves the performance of multimodal large language models on complex vision-language tasks while using less data than traditional visual instruction tuning.