At its core, VIBE fuses reference-image latents with the noisy diffusion latents by channel-wise concatenation, which keeps the token count, and with it the attention cost, unchanged, unlike sequence-based methods that condition by appending extra tokens. The modalities are bridged by learnable meta-tokens injected into a frozen VLM, which contextualizes instructions such as 'add a cat to the sofa' or 'change the background to nighttime'; the resulting conditioning features are routed through a lightweight transformer connector to guide the diffusion process. This architecture supports a wide range of edits, from attribute adjustments and object removal to background replacement and targeted additions, while preserving anatomical accuracy and semantic fidelity in complex scenes.
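To make the channel-wise fusion concrete, here is a minimal PyTorch sketch of the idea: the reference latent is stacked on the channel axis and projected back to the backbone's channel width, so the spatial token count the attention layers see never grows. Module and dimension names (`ChannelWiseFusion`, `latent_channels=16`) are illustrative assumptions, not VIBE's released code.

```python
import torch
import torch.nn as nn

class ChannelWiseFusion(nn.Module):
    """Fuse a reference-image latent with the noisy diffusion latent along the
    channel axis, so the spatial token count seen by the backbone stays fixed.
    Names and sizes are illustrative, not taken from the released model."""

    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # A 1x1 conv projects the doubled channel dimension back to what the
        # diffusion backbone expects, instead of appending extra tokens.
        self.proj = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=1)

    def forward(self, noisy_latent: torch.Tensor, ref_latent: torch.Tensor) -> torch.Tensor:
        # Both inputs are (B, C, H, W). Concatenating on channels, not on the
        # token/sequence axis, keeps attention cost independent of the reference.
        fused = torch.cat([noisy_latent, ref_latent], dim=1)  # (B, 2C, H, W)
        return self.proj(fused)                               # (B, C, H, W)

if __name__ == "__main__":
    B, C, H, W = 1, 16, 64, 64
    noisy, ref = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
    print(ChannelWiseFusion(C)(noisy, ref).shape)  # torch.Size([1, 16, 64, 64])
```

By contrast, a sequence-based scheme would append the reference tokens to the latent sequence, doubling its length and inflating self-attention cost quadratically, which is the overhead this design avoids.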
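The meta-token bridge can be sketched in the same spirit: a small set of learnable vectors is prepended to the instruction embeddings of a frozen VLM, and the hidden states read back at those positions are refined by a lightweight transformer connector into conditioning features. The VLM interface, token count, and dimensions below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaTokenBridge(nn.Module):
    """Prepend learnable meta-tokens to the instruction embeddings of a frozen
    VLM, read the VLM's hidden states back at those positions, and refine them
    with a small transformer 'connector' into conditioning features for the
    diffusion backbone. Interface and dimensions are assumptions."""

    def __init__(self, vlm: nn.Module, vlm_dim: int = 1024,
                 cond_dim: int = 768, num_meta_tokens: int = 64):
        super().__init__()
        self.vlm = vlm.eval()
        for p in self.vlm.parameters():            # VLM weights stay frozen
            p.requires_grad_(False)
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, vlm_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=vlm_dim, nhead=8, batch_first=True)
        self.connector = nn.TransformerEncoder(layer, num_layers=2)
        self.to_cond = nn.Linear(vlm_dim, cond_dim)

    def forward(self, instruction_embeds: torch.Tensor) -> torch.Tensor:
        # instruction_embeds: (B, T, vlm_dim), e.g. the embedded tokens of
        # "add a cat to the sofa". Only the meta-tokens and the connector are
        # updated during training; the VLM parameters never change.
        B = instruction_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(B, -1, -1)
        hidden = self.vlm(torch.cat([meta, instruction_embeds], dim=1))
        meta_out = hidden[:, : self.meta_tokens.size(0)]   # read back meta slots
        return self.to_cond(self.connector(meta_out))      # (B, N_meta, cond_dim)

if __name__ == "__main__":
    # Stand-in for the frozen VLM: any module mapping (B, T, D) -> (B, T, D).
    stub_vlm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
        num_layers=1)
    cond = MetaTokenBridge(stub_vlm)(torch.randn(2, 12, 1024))
    print(cond.shape)  # torch.Size([2, 64, 768])
```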
VIBE is trained in four stages: connector alignment on text-to-image data; pre-training on mixed editing triplets with T2I injections at multiple resolutions up to 2048x2048; supervised fine-tuning on 6.8 million curated pairs; and Diffusion-DPO for preference alignment. The result is strong performance on the ImgEdit and GEdit benchmarks, where it outperforms larger models on preservation-heavy tasks. Training data is drawn from a remastered UltraEdit, real-world triplets, virtual try-on sets, and self-bootstrapped compositions, augmented with photometric perturbations and identity-preservation prompts, and rigorously filtered for artifacts and inconsistencies. Released openly, VIBE lets creators and developers deploy fast, consistent image editing in pipelines ranging from local workstations to edge devices.
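For the final stage, the standard Diffusion-DPO objective compares the denoising error of the trainable model against a frozen reference copy on preferred versus dispreferred edits, rewarding the model when it improves on the preferred sample more than on the rejected one. The sketch below shows that loss in PyTorch; whether VIBE uses exactly this weighting and beta value is an assumption.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_pred_w, eps_pred_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta: float = 2500.0):
    """Preference loss over a (preferred, dispreferred) pair of edits.
    eps_pred_* come from the trainable model, eps_ref_* from a frozen copy,
    noise_* is the true added noise; beta is an illustrative strength."""
    # Per-sample denoising error, averaged over all non-batch dimensions.
    err = lambda pred, target: ((pred - target) ** 2).flatten(1).mean(dim=1)
    # How much better each model does on the preferred vs. dispreferred edit.
    model_margin = err(eps_pred_w, noise_w) - err(eps_pred_l, noise_l)
    ref_margin = err(eps_ref_w, noise_w) - err(eps_ref_l, noise_l)
    # Lowering the preferred-sample error relative to the frozen reference
    # pushes the logit positive and the loss toward zero.
    return -F.logsigmoid(-beta * (model_margin - ref_margin)).mean()

if __name__ == "__main__":
    shape = (4, 16, 64, 64)
    print(float(diffusion_dpo_loss(*[torch.randn(shape) for _ in range(6)])))
```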


