Pixtral 12B
Pravesh Agrawal, Szymon Antoniak, Emma Bou Hanna, Devendra Chaplot, Jessica Chudnovsky, Saurabh Garg, Theophile Gervet, Soham Ghosh, Amélie Héliou, Paul Jacob, Albert Q. Jiang, Timothée Lacroix, Guillaume Lample, Diego Las Casas, Thibaut Lavril, Teven Le Scao, Andy Lo, William Marshall, Louis Martin, Arthur Mensch, Pavankumar Muddireddy, Valera Nemychnikova
2024-10-10

Summary
This paper introduces Pixtral-12B, a multimodal language model that understands and processes both images and text, achieving leading performance on multimodal benchmarks without sacrificing its text-only capabilities.
What's the problem?
Many existing open models handle either text or images well but struggle to do both: adding strong image understanding often comes at the cost of text-only performance. The models that do excel at both tend to be very large, making them costly to deploy and forcing a trade-off between accuracy and resource use. This limits their application in real-world scenarios where both kinds of data matter.
What's the solution?
Pixtral-12B has 12 billion parameters and is trained to excel at both image and text tasks. It features a new vision encoder, trained from scratch, that processes images at their natural resolution and aspect ratio, giving users flexibility over how many tokens are spent on each image. The model supports a long context of up to 128,000 tokens, so it can process many images or long documents in a single prompt. The authors also release an open-source benchmark, MM-MT-Bench, to evaluate how well vision-language models perform in practical, multi-turn use.
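To make the token-flexibility point concrete, here is a rough back-of-the-envelope sketch of how the number of image tokens scales with resolution. It assumes a ViT-style 16x16 patch grid with one row-break token per patch row and a single end-of-image token, which matches the scheme described in the Pixtral release; the exact accounting inside the model may differ.

```python
import math

def approx_image_tokens(width: int, height: int, patch_size: int = 16) -> int:
    """Rough token count for one image at its native resolution.

    Assumes one token per patch_size x patch_size patch, one row-break
    token per patch row, and one end-of-image token (an approximation
    of the Pixtral tokenization scheme, not an exact implementation).
    """
    cols = math.ceil(width / patch_size)
    rows = math.ceil(height / patch_size)
    return rows * cols + rows + 1

# Larger images cost proportionally more tokens; smaller or narrower
# images cost fewer. This is the flexibility the variable-resolution
# encoder gives users when budgeting the 128K-token context.
print(approx_image_tokens(1024, 1024))  # 4161
print(approx_image_tokens(512, 256))    # 529
```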
Why it matters?
This research is significant because it advances the capabilities of AI models to work with multiple types of data simultaneously. By providing an efficient and effective solution for multimodal tasks, Pixtral-12B can improve applications in areas like image analysis, document understanding, and interactive AI systems. Its open-source nature also encourages further development and innovation in the field.
Abstract
We introduce Pixtral-12B, a 12-billion-parameter multimodal language model. Pixtral-12B is trained to understand both natural images and documents, achieving leading performance on various multimodal benchmarks, surpassing a number of larger models. Unlike many open-source models, Pixtral is also a cutting-edge text model for its size, and does not compromise on natural language performance to excel in multimodal tasks. Pixtral uses a new vision encoder trained from scratch, which allows it to ingest images at their natural resolution and aspect ratio. This gives users flexibility on the number of tokens used to process an image. Pixtral is also able to process any number of images in its long context window of 128K tokens. Pixtral-12B substantially outperforms other open models of similar sizes (Llama-3.2 11B & Qwen-2-VL 7B). It also outperforms much larger open models like Llama-3.2 90B while being 7x smaller. We further contribute an open-source benchmark, MM-MT-Bench, for evaluating vision-language models in practical scenarios, and provide detailed analysis and code for standardized evaluation protocols for multimodal LLMs. Pixtral-12B is released under the Apache 2.0 license.
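Since the weights are released under Apache 2.0, the model can be run with open-source inference stacks. Below is a minimal single-image chat sketch using vLLM's chat API, following the usage published alongside the Pixtral release; the model identifier, sampling settings, and image URL are illustrative assumptions, not part of the paper.

```python
# Minimal single-image inference sketch with vLLM (pip install vllm).
# Model name and options follow the public Pixtral-12B release; adjust as needed.
from vllm import LLM
from vllm.sampling_params import SamplingParams

llm = LLM(model="mistralai/Pixtral-12B-2409", tokenizer_mode="mistral")
sampling_params = SamplingParams(max_tokens=512)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe this image in one sentence."},
            # Any publicly reachable image URL works; this one is a placeholder.
            {"type": "image_url", "image_url": {"url": "https://picsum.photos/id/237/400/300"}},
        ],
    }
]

outputs = llm.chat(messages, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

Multiple images can be interleaved in the same user turn, which is how the 128K-token context window is used for multi-image prompts.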