At its core, VIBE fuses reference-image latents with the noisy diffusion latents by channel-wise concatenation, which keeps the token count, and with it the attention cost, unchanged, unlike sequence-based methods that condition by appending extra tokens. The modalities are bridged by learnable meta-tokens injected into a frozen VLM, which contextualizes instructions such as 'add a cat to the sofa' or 'change the background to nighttime'; the resulting conditioning features are routed through a lightweight transformer connector to guide the diffusion process. This architecture supports a wide range of edits, from attribute adjustments and object removal to background replacement and targeted additions, while preserving anatomical accuracy and semantic fidelity in complex scenes.
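To make the channel-wise fusion concrete, here is a minimal PyTorch sketch of the idea: the reference latent is stacked on the channel axis and projected back to the backbone's channel width, so the spatial token count the attention layers see never grows. Module and dimension names (`ChannelWiseFusion`, `latent_channels=16`) are illustrative assumptions, not VIBE's released code.

```python
import torch
import torch.nn as nn

class ChannelWiseFusion(nn.Module):
    """Fuse a reference-image latent with the noisy diffusion latent along the
    channel axis, so the spatial token count seen by the backbone stays fixed.
    Names and sizes are illustrative, not taken from the released model."""

    def __init__(self, latent_channels: int = 16):
        super().__init__()
        # A 1x1 conv projects the doubled channel dimension back to what the
        # diffusion backbone expects, instead of appending extra tokens.
        self.proj = nn.Conv2d(2 * latent_channels, latent_channels, kernel_size=1)

    def forward(self, noisy_latent: torch.Tensor, ref_latent: torch.Tensor) -> torch.Tensor:
        # Both inputs are (B, C, H, W). Concatenating on channels, not on the
        # token/sequence axis, keeps attention cost independent of the reference.
        fused = torch.cat([noisy_latent, ref_latent], dim=1)  # (B, 2C, H, W)
        return self.proj(fused)                               # (B, C, H, W)

if __name__ == "__main__":
    B, C, H, W = 1, 16, 64, 64
    noisy, ref = torch.randn(B, C, H, W), torch.randn(B, C, H, W)
    print(ChannelWiseFusion(C)(noisy, ref).shape)  # torch.Size([1, 16, 64, 64])
```

By contrast, a sequence-based scheme would append the reference tokens to the latent sequence, doubling its length and inflating self-attention cost quadratically, which is the overhead this design avoids.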
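The meta-token bridge can be sketched in the same spirit: a small set of learnable vectors is prepended to the instruction embeddings of a frozen VLM, and the hidden states read back at those positions are refined by a lightweight transformer connector into conditioning features. The VLM interface, token count, and dimensions below are assumptions for illustration, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class MetaTokenBridge(nn.Module):
    """Prepend learnable meta-tokens to the instruction embeddings of a frozen
    VLM, read the VLM's hidden states back at those positions, and refine them
    with a small transformer 'connector' into conditioning features for the
    diffusion backbone. Interface and dimensions are assumptions."""

    def __init__(self, vlm: nn.Module, vlm_dim: int = 1024,
                 cond_dim: int = 768, num_meta_tokens: int = 64):
        super().__init__()
        self.vlm = vlm.eval()
        for p in self.vlm.parameters():            # VLM weights stay frozen
            p.requires_grad_(False)
        self.meta_tokens = nn.Parameter(torch.randn(num_meta_tokens, vlm_dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=vlm_dim, nhead=8, batch_first=True)
        self.connector = nn.TransformerEncoder(layer, num_layers=2)
        self.to_cond = nn.Linear(vlm_dim, cond_dim)

    def forward(self, instruction_embeds: torch.Tensor) -> torch.Tensor:
        # instruction_embeds: (B, T, vlm_dim), e.g. the embedded tokens of
        # "add a cat to the sofa". Only the meta-tokens and the connector are
        # updated during training; the VLM parameters never change.
        B = instruction_embeds.size(0)
        meta = self.meta_tokens.unsqueeze(0).expand(B, -1, -1)
        hidden = self.vlm(torch.cat([meta, instruction_embeds], dim=1))
        meta_out = hidden[:, : self.meta_tokens.size(0)]   # read back meta slots
        return self.to_cond(self.connector(meta_out))      # (B, N_meta, cond_dim)

if __name__ == "__main__":
    # Stand-in for the frozen VLM: any module mapping (B, T, D) -> (B, T, D).
    stub_vlm = nn.TransformerEncoder(
        nn.TransformerEncoderLayer(d_model=1024, nhead=8, batch_first=True),
        num_layers=1)
    cond = MetaTokenBridge(stub_vlm)(torch.randn(2, 12, 1024))
    print(cond.shape)  # torch.Size([2, 64, 768])
```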
VIBE is trained in four stages: connector alignment on text-to-image data; pre-training on mixed editing triplets with T2I injections at multiple resolutions up to 2048x2048; supervised fine-tuning on 6.8 million curated pairs; and Diffusion-DPO for preference alignment. The result is strong performance on the ImgEdit and GEdit benchmarks, where it outperforms larger models on preservation-heavy tasks. Training data is drawn from a remastered UltraEdit, real-world triplets, virtual try-on sets, and self-bootstrapped compositions, augmented with photometric perturbations and identity-preservation prompts, and rigorously filtered for artifacts and inconsistencies. Released openly, VIBE lets creators and developers deploy fast, consistent image editing in pipelines ranging from local workstations to edge devices.
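For the final stage, the standard Diffusion-DPO objective compares the denoising error of the trainable model against a frozen reference copy on preferred versus dispreferred edits, rewarding the model when it improves on the preferred sample more than on the rejected one. The sketch below shows that loss in PyTorch; whether VIBE uses exactly this weighting and beta value is an assumption.

```python
import torch
import torch.nn.functional as F

def diffusion_dpo_loss(eps_pred_w, eps_pred_l, eps_ref_w, eps_ref_l,
                       noise_w, noise_l, beta: float = 2500.0):
    """Preference loss over a (preferred, dispreferred) pair of edits.
    eps_pred_* come from the trainable model, eps_ref_* from a frozen copy,
    noise_* is the true added noise; beta is an illustrative strength."""
    # Per-sample denoising error, averaged over all non-batch dimensions.
    err = lambda pred, target: ((pred - target) ** 2).flatten(1).mean(dim=1)
    # How much better each model does on the preferred vs. dispreferred edit.
    model_margin = err(eps_pred_w, noise_w) - err(eps_pred_l, noise_l)
    ref_margin = err(eps_ref_w, noise_w) - err(eps_ref_l, noise_l)
    # Lowering the preferred-sample error relative to the frozen reference
    # pushes the logit positive and the loss toward zero.
    return -F.logsigmoid(-beta * (model_margin - ref_margin)).mean()

if __name__ == "__main__":
    shape = (4, 16, 64, 64)
    print(float(diffusion_dpo_loss(*[torch.randn(shape) for _ in range(6)])))
```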


