Gradient-Attention Guided Dual-Masking Synergetic Framework for Robust Text-based Person Retrieval
Tianlu Zheng, Yifan Zhang, Xiang An, Ziyong Feng, Kaicheng Yang, Qichuan Ding
2025-09-12
Summary
This paper focuses on improving how well models can learn to represent people in images and match them to text descriptions, building on the widely used CLIP vision-language model.
What's the problem?
Adapting CLIP to person retrieval is tricky because large-scale, high-quality datasets pairing person images with accurate descriptions are scarce. In addition, CLIP's global contrastive learning can be misled by irrelevant words in the descriptions and does not reliably focus on the fine-grained details that make individuals uniquely identifiable.
What's the solution?
The researchers tackled this by first creating a massive new dataset called WebPerson, containing 5 million person-centric images paired with automatically generated and carefully filtered descriptions; a multimodal large language model (MLLM) was used to filter the web-sourced images and produce accurate captions. They then developed GA-DMS, a technique that adaptively masks distracting words in the descriptions and trains the model to predict the most informative text tokens, improving its ability to learn fine-grained representations of people.
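To make the masking idea concrete, the following is a minimal PyTorch sketch of one plausible reading of the gradient-attention score: each text token is ranked by combining the attention it receives from the sentence-level token with the gradient of the contrastive loss with respect to its embedding, and the lowest-scoring tokens are masked as noise. The function names, tensor shapes, and the multiplicative combination are illustrative assumptions, not the authors' exact formulation.

```python
import torch

def gradient_attention_scores(text_feats, attn_weights, loss):
    """Hypothetical sketch: score text tokens by gradient x attention.

    text_feats   : (B, L, D) token embeddings the loss depends on
    attn_weights : (B, L) attention paid to each token (e.g., from [EOS])
    loss         : scalar contrastive loss computed from these features
    """
    # Gradient of the loss w.r.t. each token embedding (saliency signal).
    grads, = torch.autograd.grad(loss, text_feats, retain_graph=True)
    grad_mag = grads.norm(dim=-1)          # (B, L)
    # Assumed combination: tokens with low attention AND low gradient
    # magnitude are treated as noise candidates.
    return grad_mag * attn_weights         # (B, L)

def mask_noisy_tokens(token_ids, scores, mask_id, ratio=0.15):
    """Replace the lowest-scoring `ratio` of tokens with a mask id."""
    k = max(1, int(scores.size(1) * ratio))
    low_idx = scores.topk(k, dim=1, largest=False).indices  # (B, k)
    masked = token_ids.clone()
    masked.scatter_(1, low_idx, mask_id)   # in-place fill with mask token
    return masked
```

In the paper the same kind of score also appears to drive the second half of the dual masking, selecting informative tokens for the prediction objective; this sketch covers only the noise-masking half.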
Why it matters?
This work matters because better person representation learning underpins many applications, such as video surveillance, security systems, and person re-identification. By pairing a larger, cleaner dataset with a more refined learning method, the researchers advance the state of the art in this field.
Abstract
Although Contrastive Language-Image Pre-training (CLIP) exhibits strong performance across diverse vision tasks, its application to person representation learning faces two critical challenges: (i) the scarcity of large-scale annotated vision-language data focused on person-centric images, and (ii) the inherent limitations of global contrastive learning, which struggles to maintain discriminative local features crucial for fine-grained matching while remaining vulnerable to noisy text tokens. This work advances CLIP for person representation learning through synergistic improvements in data curation and model architecture. First, we develop a noise-resistant data construction pipeline that leverages the in-context learning capabilities of MLLMs to automatically filter and caption web-sourced images. This yields WebPerson, a large-scale dataset of 5M high-quality person-centric image-text pairs. Second, we introduce the GA-DMS (Gradient-Attention Guided Dual-Masking Synergetic) framework, which improves cross-modal alignment by adaptively masking noisy textual tokens based on the gradient-attention similarity score. Additionally, we incorporate masked token prediction objectives that compel the model to predict informative text tokens, enhancing fine-grained semantic representation learning. Extensive experiments show that GA-DMS achieves state-of-the-art performance across multiple benchmarks.
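The masked token prediction objective described in the abstract resembles a standard masked-language-modeling head applied to the informative tokens. Here is a hedged PyTorch sketch under that assumption; `MaskedTokenHead`, the fused cross-modal features it consumes, and all tensor shapes are hypothetical illustrations rather than the paper's actual architecture.

```python
import torch.nn as nn
import torch.nn.functional as F

class MaskedTokenHead(nn.Module):
    """Hypothetical MLM-style head: predict the vocabulary ids of the
    informative text tokens that were masked out, from fused features."""

    def __init__(self, dim, vocab_size):
        super().__init__()
        self.proj = nn.Linear(dim, vocab_size)

    def forward(self, fused_feats, target_ids, mask_positions):
        # fused_feats   : (B, L, D) cross-modal token features
        # target_ids    : (B, L) original token ids before masking
        # mask_positions: (B, L) bool, True where a token was masked
        logits = self.proj(fused_feats)        # (B, L, V)
        # Cross-entropy only over the masked positions, as in BERT-style
        # masked token prediction.
        return F.cross_entropy(
            logits[mask_positions],            # (N_masked, V)
            target_ids[mask_positions],        # (N_masked,)
        )
```

In training, a loss like this would presumably be summed with the global contrastive objective, which is the usual way such auxiliary objectives are combined.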