PerceptionDLM

Free Multimodal Open-Source

Key Features

Performs parallel region captioning with a multimodal diffusion language model.

Generates descriptions for multiple masked regions in one denoising process.

Uses efficient parallel prompting to pack many regions into one prompt.

Uses structured attention masking to avoid cross-region interference.

Shares global image context across region generation streams.

Builds on LLaDA-8B-Instruct-style diffusion language modeling.

Releases model weights, training data, and ParaDLC-Bench.

Provides paper, code, models, data, benchmark, and direct demo videos.

The model builds on PerceptionDLM-Base, combining a visual encoder with a discrete diffusion language model backbone. Efficient parallel prompting packs multiple region masks into one prompt, and structured attention masking isolates each region's generation stream while still sharing the global image context.
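To make the masking idea concrete, here is a minimal sketch of how such an attention mask could be laid out. It is an illustration under stated assumptions, not the project's actual implementation: the packed sequence is assumed to be a block of shared image tokens followed by one contiguous token span per region, attention is assumed to be bidirectional within a span, and the helper names `build_segment_ids` and `build_attention_mask` are hypothetical.

```python
from typing import List


def build_segment_ids(num_image_tokens: int, region_lengths: List[int]) -> List[int]:
    """Label each position in the packed prompt with a segment id.

    Assumed layout: segment 0 is the shared image context, and segment
    k >= 1 is the token span (prompt plus caption slots) of the k-th region.
    """
    segment_ids = [0] * num_image_tokens
    for k, length in enumerate(region_lengths, start=1):
        segment_ids.extend([k] * length)
    return segment_ids


def build_attention_mask(segment_ids: List[int]) -> List[List[bool]]:
    """Return mask[q][k] == True when query position q may attend to key k.

    Rule used in this sketch: a query sees keys in its own segment and keys
    in the shared image segment. Region spans never see each other, and
    image tokens see only other image tokens.
    """
    n = len(segment_ids)
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        for k in range(n):
            same_stream = segment_ids[q] == segment_ids[k]
            key_is_global = segment_ids[k] == 0
            mask[q][k] = same_stream or key_is_global
    return mask


# Toy example: 2 image tokens, then two regions of 2 and 3 tokens.
segment_ids = build_segment_ids(2, [2, 3])
mask = build_attention_mask(segment_ids)
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

In the printed grid, each region's rows mark the image columns and that region's own columns only, which is the sense in which the streams stay isolated while sharing global context. A real model would apply an equivalent mask inside its attention layers rather than build nested Python lists.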

PerceptionDLM is useful for dense visual perception, multi-region captioning, and systems that need many localized descriptions quickly. The project releases code, model weights, training data, and ParaDLC-Bench for evaluating parallel detailed localized captioning.
