
Mobile-O: Unified Multimodal Understanding and Generation on Mobile Device

Abdelrahman Shaker, Ahmed Heakl, Jaseel Muhammad, Ritesh Thawkar, Omkar Thawakar, Senmao Li, Hisham Cholakkal, Ian Reid, Eric P. Xing, Salman Khan, Fahad Shahbaz Khan

2026-02-24

Summary

This paper introduces Mobile-O, a new artificial intelligence model that can both understand what's in images and create new images based on text prompts, all directly on a mobile phone.

What's the problem?

Current AI models that can both understand and generate images are very large and need huge amounts of training data, which makes them impractical to run on phones or other devices without a constant internet connection to powerful cloud servers. They are also slow at processing images.

What's the solution?

The researchers created Mobile-O, a smaller, more efficient model. Its key component is the Mobile Conditioning Projector, which combines information from images and text using lightweight operations that need very little computing power. They trained it on only a few million samples and then refined it with a "quadruplet" format, where each training example pairs a generation prompt, an image, a question, and an answer. This lets Mobile-O improve at both understanding images and creating new ones at the same time.
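To see why the lightweight operations matter, the abstract says the Mobile Conditioning Projector uses depthwise-separable convolutions. A rough sketch of the parameter savings this buys, with illustrative channel and kernel sizes (the actual sizes in Mobile-O are not stated in this summary):

```python
# Hedged sketch: comparing the parameter count of a standard convolution
# with a depthwise-separable one, the building block the abstract says
# the Mobile Conditioning Projector uses. The channel and kernel sizes
# below are assumptions for illustration, not values from the paper.

def conv_params(c_in, c_out, k):
    """Parameter count of a standard k x k convolution (no bias)."""
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k conv (one filter per input channel) followed by
    a 1 x 1 pointwise conv that mixes channels."""
    depthwise = c_in * k * k      # one k x k filter per input channel
    pointwise = c_in * c_out      # 1 x 1 conv across channels
    return depthwise + pointwise

c_in, c_out, k = 512, 512, 3      # assumed sizes, for illustration only
std = conv_params(c_in, c_out, k)
sep = depthwise_separable_params(c_in, c_out, k)
print(f"standard: {std:,}  separable: {sep:,}  ratio: {std / sep:.1f}x")
# -> standard: 2,359,296  separable: 266,752  ratio: 8.8x
```

At these assumed sizes the separable version uses almost 9x fewer parameters, which is the kind of saving that makes on-device cross-modal conditioning feasible.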

Why it matters?

Mobile-O is important because it's the first model that can perform both image understanding and generation in real-time directly on a mobile device, without needing to send data to the cloud. This opens up possibilities for new applications that don't rely on an internet connection and can process images quickly, like on-device photo editing or interactive educational tools.

Abstract

Unified multimodal models can both understand and generate visual content within a single architecture. Existing models, however, remain data-hungry and too heavy for deployment on edge devices. We present Mobile-O, a compact vision-language-diffusion model that brings unified multimodal intelligence to a mobile device. Its core module, the Mobile Conditioning Projector (MCP), fuses vision-language features with a diffusion generator using depthwise-separable convolutions and layerwise alignment. This design enables efficient cross-modal conditioning with minimal computational cost. Trained on only a few million samples and post-trained in a novel quadruplet format (generation prompt, image, question, answer), Mobile-O jointly enhances both visual understanding and generation capabilities. Despite its efficiency, Mobile-O attains competitive or superior performance compared to other unified models, achieving 74% on GenEval and outperforming Show-O and JanusFlow by 5% and 11%, while running 6x and 11x faster, respectively. For visual understanding, Mobile-O surpasses them by 15.3% and 5.1% averaged across seven benchmarks. Running in only ~3s per 512x512 image on an iPhone, Mobile-O establishes the first practical framework for real-time unified multimodal understanding and generation on edge devices. We hope Mobile-O will ease future research in real-time unified multimodal intelligence running entirely on-device with no cloud dependency. Our code, models, datasets, and mobile application are publicly available at https://amshaker.github.io/Mobile-O/
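The quadruplet post-training format from the abstract (generation prompt, image, question, answer) can be pictured with a hypothetical sample; the field names and contents below are illustrative assumptions, not taken from the released dataset:

```python
# Hypothetical example of the quadruplet post-training format
# (generation prompt, image, question, answer) described in the abstract.
# Field names and contents are illustrative assumptions, not from the
# actual Mobile-O dataset.

quadruplet = {
    "prompt":   "a red bicycle leaning against a brick wall",
    "image":    "sample_0001.png",   # image paired with the prompt
    "question": "What color is the bicycle?",
    "answer":   "Red",
}

# One sample supervises both directions: prompt -> image trains
# generation, while (image, question) -> answer trains understanding.
for key, value in quadruplet.items():
    print(f"{key}: {value}")
```

Pairing both tasks in a single sample is what lets the post-training stage improve understanding and generation jointly rather than with separate datasets.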