Towards Open-Vocabulary Industrial Defect Understanding with a Large-Scale Multimodal Dataset
TsaiChing Ni, ZhenQi Chen, YuanFu Yang
2026-01-09
Summary
This paper introduces IMDD-1M, a large-scale dataset of one million images of defective products, each paired with a detailed text description of the defect, and uses it to build a powerful AI model for quality control in manufacturing.
What's the problem?
Currently, inspecting products for defects in factories is often done manually or with specialized AI systems that need large amounts of training data for each specific defect type. Building these specialized systems is expensive and time-consuming, and it is hard to create a single system that recognizes a wide variety of defects across different materials. Until now, there was no large, publicly available dataset combining images with detailed text descriptions of industrial defects to help researchers develop more general AI solutions.
What's the solution?
The researchers created IMDD-1M, a dataset of one million defect images spanning more than 60 materials and over 400 flaw types, each paired with a detailed text description. They then used this dataset to train a diffusion-based vision-language model from scratch. The model understands both images and text, and can be quickly adapted to specific factory inspection tasks with only a small amount of additional training data.
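The summary does not spell out how diffusion models learn, so here is a minimal sketch of the standard mechanism they rely on: an image is progressively blended with Gaussian noise, and the model is trained to predict that noise given the noisy image, the timestep, and (in a text-conditioned model like this one) the defect description. The function names and the cosine schedule below are illustrative choices, not details from the paper.

```python
import numpy as np

def cosine_alpha_bar(t, T):
    """Cumulative signal-retention schedule: 1.0 at t=0, approaching 0 at t=T."""
    return np.cos((t / T) * np.pi / 2) ** 2

def add_noise(x0, t, T, rng):
    """Forward diffusion q(x_t | x_0): blend the clean image with Gaussian noise."""
    a_bar = cosine_alpha_bar(t, T)
    eps = rng.standard_normal(x0.shape)
    xt = np.sqrt(a_bar) * x0 + np.sqrt(1.0 - a_bar) * eps
    # During training, a network would be asked to recover eps from
    # (xt, t, text_embedding); its loss is the squared error on eps.
    return xt, eps

rng = np.random.default_rng(0)
x0 = rng.standard_normal((8, 8))          # stand-in for a defect image patch
xt, eps = add_noise(x0, t=500, T=1000, rng=rng)
```

At t=0 the schedule returns 1, so the "noisy" image is exactly the clean one; as t grows, the image is increasingly dominated by noise the model must learn to undo.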
Why it matters?
This work is important because it provides a foundation for building more efficient and adaptable AI systems for quality control in manufacturing. By using a large, general dataset and a powerful AI model, factories can potentially reduce the cost and effort of inspecting products, improve product quality, and quickly adapt to new types of defects without needing to retrain everything from scratch. It moves us closer to 'smart' factories that can automatically identify and address quality issues.
Abstract
We present IMDD-1M, the first large-scale Industrial Multimodal Defect Dataset comprising 1,000,000 aligned image-text pairs, designed to advance multimodal learning for manufacturing and quality inspection. IMDD-1M contains high-resolution real-world defects spanning over 60 material categories and more than 400 defect types, each accompanied by expert-verified annotations and fine-grained textual descriptions detailing defect location, severity, and contextual attributes. This dataset enables a wide spectrum of applications, including classification, segmentation, retrieval, captioning, and generative modeling. Building upon IMDD-1M, we train a diffusion-based vision-language foundation model from scratch, specifically tailored for industrial scenarios. The model serves as a generalizable foundation that can be efficiently adapted to specialized domains through lightweight fine-tuning. With less than 5% of the task-specific data required by dedicated expert models, it achieves comparable performance, highlighting the potential of data-efficient foundation model adaptation for industrial inspection and generation, paving the way for scalable, domain-adaptive, and knowledge-grounded manufacturing intelligence.
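The abstract says the foundation model is "efficiently adapted to specialized domains through lightweight fine-tuning" but does not name the scheme. A common lightweight approach, sketched below purely as an illustration (the paper may use something else), is a low-rank update in the style of LoRA: the pretrained weight stays frozen and only two small matrices are trained, so adaptation touches a small fraction of the parameters.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical frozen weight from one layer of a pretrained foundation model.
d_in, d_out, rank = 64, 64, 4
W = rng.standard_normal((d_out, d_in)) * 0.02   # frozen during adaptation

# Low-rank adapters: only A and B are trained, giving
# rank * (d_in + d_out) trainable parameters instead of d_in * d_out.
A = rng.standard_normal((rank, d_in)) * 0.01
B = np.zeros((d_out, rank))   # zero-init: adaptation starts at the pretrained model

def adapted_forward(x):
    """Forward pass with the low-rank update W + B @ A folded in."""
    return x @ (W + B @ A).T

x = rng.standard_normal((2, d_in))
y = adapted_forward(x)   # identical to the frozen model until B is updated
```

Because B starts at zero, the adapted layer initially reproduces the pretrained model exactly; fine-tuning then moves only A and B, which is what makes this kind of adaptation cheap in both parameters and task-specific data.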