Key Features

Performs parallel region captioning with a multimodal diffusion language model.
Generates descriptions for multiple masked regions in one denoising process.
Uses efficient parallel prompting to pack many regions into one prompt.
Uses structured attention masking to avoid cross-region interference.
Shares global image context across region generation streams.
Builds on LLaDA-8B-Instruct-style diffusion language modeling.
Releases model weights, training data, and ParaDLC-Bench.
Provides links to the paper, code, models, data, benchmark, and demo videos.

The model builds on PerceptionDLM-Base, pairing a visual encoder with a discrete diffusion language model backbone. Efficient parallel prompting packs multiple region masks into one prompt, and structured attention masking isolates each region's generation stream while still sharing global image context.
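
To make the masking idea concrete, here is a minimal sketch of how such a mask could be laid out. It is not the project's implementation: the function name `build_structured_mask`, the token layout (shared image context first, then one contiguous stream per region), and the choice to keep shared context from attending to region streams are all assumptions made for illustration.

```python
def build_structured_mask(num_global, region_lengths):
    """Build a boolean attention mask for one packed prompt (illustrative sketch).

    Assumed layout: `num_global` shared context tokens first, followed by one
    contiguous stream of tokens per region. mask[q][k] is True when query
    token q is allowed to attend to key token k.
    """
    # Segment id per token: 0 = shared context, 1..R = region streams.
    segment = [0] * num_global
    for region_id, length in enumerate(region_lengths, start=1):
        segment.extend([region_id] * length)

    total = len(segment)
    mask = [[False] * total for _ in range(total)]
    for q in range(total):
        for k in range(total):
            # Every token may read the shared context. A region token may also
            # read its own stream in both directions (no causal ordering, as is
            # usual for diffusion-style decoding), but never another region's.
            mask[q][k] = segment[k] == 0 or segment[k] == segment[q]
    return mask


if __name__ == "__main__":
    # Two shared context tokens, then region streams of length 2 and 1.
    for row in build_structured_mask(2, [2, 1]):
        print("".join("1" if allowed else "0" for allowed in row))
```

Under these assumptions, each region stream sees the global context plus itself, so many regions can be denoised in one pass without their captions influencing one another.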


PerceptionDLM is useful for dense visual perception, multi-region captioning, and systems that need many localized descriptions quickly. The project releases code, model weights, training data, and ParaDLC-Bench for evaluating parallel detailed localized captioning.
