ECHO: Efficient Chest X-ray Report Generation with One-step Block Diffusion
Lifeng Chen, Tianqi You, Hao Liu, Zhimin Bao, Jile Jiao, Xiao Han, Zhicai Ou, Tao Sun, Xiaofeng Mou, Xiaojie Jin, Yi Xu
2026-04-13
Summary
This paper introduces a new method, called ECHO, for automatically writing reports based on chest X-ray images. It aims to make this process faster and more efficient for doctors.
What's the problem?
Currently, computer programs that generate these reports are often slow because they write the report one word at a time. Newer methods based on 'diffusion' models are faster because they can generate many words in parallel, but they still need several refinement steps. Cutting those steps down to one tends to produce reports that are incoherent or inaccurate, because the simplified model predicts each word independently and does not fully capture how the words in a report relate to each other.
What's the solution?
ECHO addresses this problem with a technique called 'Direct Conditional Distillation', which teaches the model the relationships between words in the report by learning from the model's own multi-step diffusion outputs. It also uses a 'Response-Asymmetric Diffusion' training strategy to make training more efficient. Together, these allow ECHO to generate each block of the report in a single step without losing quality or accuracy.
Why it matters?
This research is important because it speeds up chest X-ray report generation by about eight times compared with conventional word-by-word (autoregressive) methods, while also scoring higher on report-quality metrics and maintaining clinical accuracy. This could reduce the workload for radiologists and help them diagnose patients more quickly.
Abstract
Chest X-ray report generation (CXR-RG) has the potential to substantially alleviate radiologists' workload. However, conventional autoregressive vision-language models (VLMs) suffer from high inference latency due to sequential token decoding. Diffusion-based models offer a promising alternative through parallel generation, but they still require multiple denoising iterations. Compressing multi-step denoising to a single step could further reduce latency, but often degrades textual coherence due to the mean-field bias introduced by token-factorized denoisers. To address this challenge, we propose ECHO, an efficient diffusion-based VLM (dVLM) for chest X-ray report generation. ECHO enables stable one-step-per-block inference via a novel Direct Conditional Distillation (DCD) framework, which mitigates the mean-field limitation by constructing unfactorized supervision from on-policy diffusion trajectories to encode joint token dependencies. In addition, we introduce a Response-Asymmetric Diffusion (RAD) training strategy that further improves training efficiency while maintaining model effectiveness. Extensive experiments demonstrate that ECHO surpasses state-of-the-art autoregressive methods, improving RaTE and SemScore by 64.33% and 60.58% respectively, while achieving an 8× inference speedup without compromising clinical accuracy.
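
The mean-field bias mentioned in the abstract can be illustrated with a toy example. The sketch below is not the paper's method or code. It assumes a hypothetical two-token "report" distribution (the phrases and probabilities are invented for illustration) and shows that sampling each position independently from its marginal, as a token-factorized one-step denoiser effectively does, puts probability mass on token combinations that never occur under the true joint distribution.

```python
from itertools import product

# Hypothetical toy joint distribution over two-token phrases.
# These phrases and probabilities are illustrative assumptions, not data from the paper.
joint = {("no", "effusion"): 0.5, ("mild", "edema"): 0.5}


def marginals(joint_dist):
    """Compute the per-position marginal distributions of a joint over token tuples."""
    length = len(next(iter(joint_dist)))
    margs = [dict() for _ in range(length)]
    for seq, p in joint_dist.items():
        for i, tok in enumerate(seq):
            margs[i][tok] = margs[i].get(tok, 0.0) + p
    return margs


def mean_field(margs):
    """Product of per-position marginals: the distribution an independent
    (token-factorized) one-step sampler would draw sequences from."""
    out = {}
    for combo in product(*(m.items() for m in margs)):
        seq = tuple(tok for tok, _ in combo)
        prob = 1.0
        for _, p in combo:
            prob *= p
        out[seq] = prob
    return out


factorized = mean_field(marginals(joint))

# Mass assigned to sequences that have zero probability under the true joint,
# such as ("no", "edema") or ("mild", "effusion").
off_support_mass = sum(p for seq, p in factorized.items() if seq not in joint)
print(off_support_mass)
```

In this toy case the marginals are matched exactly, yet the factorized distribution still spreads mass over mixed phrases that the joint never produces. Per the abstract, this kind of gap is what DCD targets by supervising with unfactorized signals that encode joint token dependencies.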