Taming Generative Synthetic Data for X-ray Prohibited Item Detection

Jialong Sun, Hongguang Zhu, Weizhe Liu, Yunda Sun, Renshuai Tao, Yunchao Wei

2025-11-24

Summary

This paper introduces a new way to create realistic X-ray images for training security systems to detect prohibited items, like weapons or explosives.

What's the problem?

Currently, training these security systems requires a huge number of X-ray images that have been labeled to show where the prohibited items are. Getting these images and labeling them is a slow, expensive, and difficult process. Existing methods try to create these images artificially, but they usually involve a two-step process where objects are cut out of existing images and pasted into new backgrounds, which still requires a lot of manual work.

What's the solution?

The researchers developed a system called Xsyn that generates X-ray images directly from text descriptions. Imagine typing 'a handgun in a backpack' and the system creates a realistic X-ray image of that scenario. They improved this process in two key ways: first, they refine the bounding box marking the item's location using the diffusion model's own cross-attention maps, which indicate where each word in the prompt influenced the image; second, they model background occlusion so the cluttered, overlapping look of real-world X-ray scans is reproduced. This whole process happens in one stage, eliminating the need for manual object cutting and pasting.

Why it matters?

This research is important because it offers a way to create the necessary training data for security systems much more efficiently and cheaply. By automatically generating high-quality X-ray images, Xsyn can help improve the accuracy of these systems in detecting dangerous items, ultimately making security checks faster and more effective. Their method also outperforms previous synthesis approaches, improving detection accuracy by 1.2% mAP, and the gains hold across multiple X-ray security datasets and detector architectures.

Abstract

Training prohibited item detection models requires a large amount of X-ray security images, but collecting and annotating these images is time-consuming and laborious. To address data insufficiency, X-ray security image synthesis methods composite images to scale up datasets. However, previous methods primarily follow a two-stage pipeline, where they implement labor-intensive foreground extraction in the first stage and then composite images in the second stage. Such a pipeline introduces inevitable extra labor cost and is not efficient. In this paper, we propose a one-stage X-ray security image synthesis pipeline (Xsyn) based on text-to-image generation, which incorporates two effective strategies to improve the usability of synthetic images. The Cross-Attention Refinement (CAR) strategy leverages the cross-attention map from the diffusion model to refine the bounding box annotation. The Background Occlusion Modeling (BOM) strategy explicitly models background occlusion in the latent space to enhance imaging complexity. To the best of our knowledge, compared with previous methods, Xsyn is the first to achieve high-quality X-ray security image synthesis without extra labor cost. Experiments demonstrate that our method outperforms all previous methods with 1.2% mAP improvement, and the synthetic images generated by our method are beneficial to improve prohibited item detection performance across various X-ray security datasets and detectors. Code is available at https://github.com/pILLOW-1/Xsyn/.
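The Background Occlusion Modeling idea described in the abstract, operating on latents rather than pixels, can be sketched as a simple blend of an item latent with a background-clutter latent. The function name, latent shape, and convex-blend rule are illustrative assumptions; the paper's actual occlusion model in latent space is likely more elaborate.

```python
import numpy as np

def occlude_latent(item_latent, background_latent, alpha=0.4):
    """Convex blend of item and background latents.
    alpha controls occlusion strength (0 = no occlusion).
    Hypothetical stand-in for BOM's latent-space occlusion model."""
    return (1 - alpha) * item_latent + alpha * background_latent

rng = np.random.default_rng(0)
z_item = rng.normal(size=(4, 8, 8))  # e.g. a 4-channel diffusion latent
z_bg = rng.normal(size=(4, 8, 8))    # latent encoding background clutter
z_mixed = occlude_latent(z_item, z_bg)
print(z_mixed.shape)  # (4, 8, 8)
```

The intuition is that X-ray imaging is transmissive, so overlapping objects superimpose rather than hide one another; mixing in the latent space lets the decoder render that overlap as plausible imaging complexity.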