The model builds on PerceptionDLM-Base, combining a visual encoder with a discrete diffusion language model backbone. Efficient parallel prompting packs multiple region masks into one prompt, and structured attention masking isolates each region's generation stream while still sharing global image context.
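As a rough illustration of the structured attention masking idea, the sketch below builds a boolean attention mask for one packed sequence. It is not the project's implementation: the function name `build_region_attention_mask`, the token layout (image tokens first, then one contiguous stream per region), bidirectional attention inside each stream, and image tokens ignoring the region streams are all assumptions made for the example.

```python
def build_region_attention_mask(num_image_tokens, region_lengths):
    """Return a square boolean mask; mask[q][k] is True if query q may attend to key k.

    Assumed layout (hypothetical): [image tokens | region 0 stream | region 1 stream | ...].
    """
    total = num_image_tokens + sum(region_lengths)
    mask = [[False] * total for _ in range(total)]

    # Image tokens attend to each other, forming the shared global context.
    for q in range(num_image_tokens):
        for k in range(num_image_tokens):
            mask[q][k] = True

    start = num_image_tokens
    for length in region_lengths:
        end = start + length
        for q in range(start, end):
            # Every region stream can read the shared image context.
            for k in range(num_image_tokens):
                mask[q][k] = True
            # Within a stream, attention is bidirectional (assumed here
            # because the backbone is a diffusion model, not autoregressive).
            for k in range(start, end):
                mask[q][k] = True
        # Tokens of other regions stay False, so streams never see each other.
        start = end
    return mask


if __name__ == "__main__":
    # Two image tokens, then region streams of length 1 and 2.
    for row in build_region_attention_mask(2, [1, 2]):
        print(" ".join("1" if allowed else "0" for allowed in row))
    # 1 1 0 0 0
    # 1 1 0 0 0
    # 1 1 1 0 0
    # 1 1 0 1 1
    # 1 1 0 1 1
```

Under these assumptions, each region's rows are nonzero only in the image columns and in that region's own block, so several captions can be generated in one forward pass without leaking text between regions.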
PerceptionDLM is useful for dense visual perception, multi-region captioning, and systems that need many localized descriptions quickly. The project releases code, model weights, training data, and ParaDLC-Bench for evaluating parallel detailed localized captioning.


