Key Features

Text-guided image editing via natural language instructions
Strict source consistency preserving original identities and layouts
Channel-wise latent concatenation for efficient reference guidance
Learnable meta-tokens bridging VLM and diffusion modalities
High-throughput inference at 2K resolution in 4 seconds on H100
Compact design fitting 24GB GPU memory with 3.6B total parameters
Multi-resolution support from 384x384 to 2048x2048
Four-stage training including DPO for human-preferred outputs

At its core, VIBE concatenates reference-image latents with the noisy diffusion latents along the channel dimension. This keeps the token count, and therefore the attention cost, unchanged, whereas sequence-based methods append reference tokens and inflate attention costs. Learnable meta-tokens injected into the frozen VLM bridge the two modalities: they contextualize instructions such as 'add a cat to the sofa' or 'change the background to nighttime,' and the resulting conditioning features pass through a lightweight transformer connector to guide the diffusion process. The architecture supports edits ranging from attribute adjustments and object removal to background replacement and targeted additions, while preserving anatomical accuracy and semantic fidelity in complex scenes.
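To make the cost argument concrete, here is a minimal sketch in plain Python that compares token counts for channel-wise and sequence-wise concatenation. It is not VIBE's code: the function names, the VAE downsampling factor of 8, and the patch size of 2 are illustrative assumptions, not values taken from the model.

```python
def token_count(height, width, vae_downsample=8, patch=2):
    """Tokens the diffusion transformer sees for one image.

    vae_downsample and patch are hypothetical defaults for illustration.
    """
    stride = vae_downsample * patch
    return (height // stride) * (width // stride)


def channel_concat(noisy, reference):
    """Join two latents shaped [channels][height][width] along channels.

    The spatial grid, and so the token count, is unchanged; only the
    number of channels per token grows.
    """
    if len(noisy[0]) != len(reference[0]) or len(noisy[0][0]) != len(reference[0][0]):
        raise ValueError("latents must share the same spatial size")
    return noisy + reference


def attention_pairs(tokens):
    """Self-attention compares every token with every token."""
    return tokens * tokens


base_tokens = token_count(2048, 2048)
channel_tokens = base_tokens          # channel-wise: same sequence length
sequence_tokens = 2 * base_tokens     # sequence-wise: reference tokens appended

print("channel-wise tokens:", channel_tokens)
print("sequence-wise tokens:", sequence_tokens)
print("attention cost ratio:",
      attention_pairs(sequence_tokens) // attention_pairs(channel_tokens))
```

Under these assumptions, appending the reference as extra tokens doubles the sequence length and quadruples the number of attention pairs, while channel-wise concatenation leaves both unchanged and only widens the input projection.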


VIBE is trained in four stages: connector alignment on text-to-image (T2I) data; pre-training on mixed editing triplets with injected T2I samples at multiple resolutions up to 2048x2048; supervised fine-tuning on 6.8 million curated pairs; and Diffusion-DPO for preference alignment. On the ImgEdit and GEdit benchmarks, it outperforms larger models in preservation-heavy tasks. Training data draws from remastered UltraEdit, real-world triplets, virtual try-ons, and self-bootstrapped compositions, augmented with photometric tweaks and identity preservation prompts, and filtered for artifacts and inconsistencies. VIBE is openly released, so creators and developers can deploy fast, consistent image editing in pipelines from local workstations to edge devices.
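For readers unfamiliar with the final stage, the following is a rough sketch of a Diffusion-DPO style loss for a single preference pair, written in plain Python. It follows the general form of the published Diffusion-DPO objective rather than VIBE's actual training code; the function names, the scalar error inputs, and the `beta` default are assumptions for illustration.

```python
import math


def softplus(x):
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def diffusion_dpo_loss(err_w_policy, err_w_ref, err_l_policy, err_l_ref, beta=1.0):
    """Loss for one (preferred, rejected) pair at one noise level.

    Each err_* is a squared denoising error: *_w for the preferred image,
    *_l for the rejected one, under the trained policy or the frozen
    reference model. beta is a hypothetical scale that absorbs any
    timestep weighting.
    """
    # Negative when the policy improves more on the preferred image
    # than on the rejected one, relative to the reference model.
    diff = (err_w_policy - err_w_ref) - (err_l_policy - err_l_ref)
    # -log(sigmoid(-beta * diff)) == softplus(beta * diff)
    return softplus(beta * diff)


neutral = diffusion_dpo_loss(0.2, 0.2, 0.2, 0.2)
aligned = diffusion_dpo_loss(0.1, 0.2, 0.3, 0.2)
misaligned = diffusion_dpo_loss(0.3, 0.2, 0.1, 0.2)
print(neutral, aligned, misaligned)
```

The loss falls below log 2 when the policy denoises the preferred image better than the reference does while doing worse on the rejected image, and rises above log 2 in the opposite case.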
