Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

William Yang, Xindi Wu, Zhiwei Deng, Esin Tureci, Olga Russakovsky

2025-11-03

Beyond Objects: Contextual Synthetic Data Generation for Fine-Grained Classification

Summary

This paper focuses on improving how we use artificial intelligence to create synthetic (fake) images for training other AI systems, specifically for fine-grained classification, where the goal is to tell apart very similar categories.

What's the problem?

Generating useful synthetic training images is tricky. If you fine-tune the image generator too closely on a small number of real images, it overfits: it memorizes those specific examples and stops producing diverse outputs. That lack of variety limits how well the final classifier can recognize different variations of the category you're trying to identify.

What's the solution?

The researchers developed a new method called BOB (BeyondOBjects). Before fine-tuning the image generator on the few real images, they first extract class-agnostic characteristics such as the scene background and the object's pose. They then explicitly condition the generator on these characteristics *while* it learns from the real images, but marginalize them out, effectively ignoring them, when actually generating the synthetic images. This prevents the generator from memorizing the background and pose of the specific real examples and encourages it to create a wider variety of realistic images.
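To make the condition-then-marginalize idea concrete, here is a minimal sketch of how it could look at the prompt level. All names, attribute lists, and prompt templates below are illustrative assumptions, not the authors' actual implementation or API:

```python
# Hedged sketch of BOB-style attribute conditioning (assumed prompt-level view).
# BACKGROUNDS and POSES stand in for class-agnostic attributes that, per the
# paper, would be extracted from the few real images; here they are hardcoded.
import random

BACKGROUNDS = ["on an airport tarmac", "against a cloudy sky", "parked in a hangar"]
POSES = ["side view", "three-quarter view", "seen from below"]

def finetune_prompt(class_name: str, background: str, pose: str) -> str:
    """During fine-tuning, spell out the attributes in the prompt so the
    generator attributes background/pose to the conditioning text rather
    than baking them into the class token."""
    return f"a photo of a {class_name}, {pose}, {background}"

def generation_prompt(class_name: str) -> str:
    """During generation, marginalize the attributes out: here, by
    resampling them at random so the synthetic set stays diverse."""
    return f"a photo of a {class_name}, {random.choice(POSES)}, {random.choice(BACKGROUNDS)}"
```

The key design choice the sketch illustrates: the same attribute slots appear in both phases, but fine-tuning fills them with the attributes observed in the real images, while generation samples them freely.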

Why it matters?

This research is important because it allows us to create better training data using AI-generated images, even when we only have a few real examples. This is especially useful for fine-grained tasks like identifying different types of aircraft or birds, where collecting a large set of real, labeled images can be difficult and expensive. The method consistently outperformed previous techniques; in three of the four benchmarks, training with just five real images plus BOB's synthetic data even beat training with ten real images alone.

Abstract

Text-to-image (T2I) models are increasingly used for synthetic dataset generation, but generating effective synthetic training data for classification remains challenging. Fine-tuning a T2I model with a few real examples can help improve the quality of synthetic training data; however, it may also cause overfitting and reduce diversity in the generated samples. We propose a fine-tuning strategy BOB (BeyondOBjects) to mitigate these concerns for fine-grained classification. Given a small set of real examples, we first extract class-agnostic attributes such as scene background and object pose. We then explicitly condition on these attributes during fine-tuning of the T2I model and marginalize them out during generation. This design mitigates overfitting, preserves the T2I model's generative prior, reduces estimation errors, and further minimizes unintended inter-class associations. Extensive experiments across multiple T2I models, backbones, and datasets show that our method achieves state-of-the-art performance in low-shot fine-grained classification when augmented with synthetic data. Concretely, BOB outperforms DataDream by 7.4% on the Aircraft dataset (from 50.0% to 57.4% when fine-tuning a CLIP classifier with five real images augmented with 100 synthetic images). In three of the four benchmarks, fine-tuning downstream models with 5 real images augmented with BOB achieves better performance than fine-tuning with 10 real images. Collectively, BOB outperforms prior art in 18 of 24 experimental settings, with 2+% accuracy improvements in 14 of these settings.