NVILA: Efficient Frontier Visual Language Models
Zhijian Liu, Ligeng Zhu, Baifeng Shi, Zhuoyang Zhang, Yuming Lou, Shang Yang, Haocheng Xi, Shiyi Cao, Yuxian Gu, Dacheng Li, Xiuyu Li, Yunhao Fang, Yukang Chen, Cheng-Yu Hsieh, De-An Huang, An-Chieh Cheng, Vishwesh Nath, Jinyi Hu, Sifei Liu, Ranjay Krishna, Daguang Xu, Xiaolong Wang
2024-12-06

Summary
This paper introduces NVILA, a family of visual language models designed to be both efficient and accurate, allowing them to handle high-resolution images and long videos effectively.
What's the problem?
While visual language models have improved substantially in accuracy, their efficiency has received far less attention. They are often expensive to train, fine-tune, and deploy, which makes them slow and costly in practice and limits their real-world applications.
What's the solution?
The authors built NVILA on top of VILA and improved its architecture with a 'scale-then-compress' design. First, they scale up the spatial and temporal resolution of images and videos to capture more detail. Then, they compress the resulting visual tokens so the language model has far less data to process (see the sketch below). This approach improves efficiency while maintaining, or even boosting, accuracy compared to other models. They also systematically optimized the entire lifecycle, from training and fine-tuning to deployment, resulting in significant reductions in cost and latency.
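To make the 'scale-then-compress' idea concrete, here is a minimal, hypothetical sketch in PyTorch. The `vision_encoder` is a stand-in for any ViT-style encoder that returns a grid of visual tokens, and the pooling-based compression step is an illustration of the general idea rather than NVILA's exact implementation:

```python
import torch
import torch.nn.functional as F

def scale_then_compress(image, vision_encoder, scale=2, pool=2):
    """Illustrative 'scale-then-compress' pipeline.

    1. Scale: upsample the input so the encoder sees more detail.
    2. Encode: produce a grid of visual tokens.
    3. Compress: pool neighboring tokens so the language model
       processes far fewer of them.
    """
    # 1. Scale up spatial resolution (hypothetical factor).
    image = F.interpolate(image, scale_factor=scale, mode="bilinear",
                          align_corners=False)

    # 2. Encode into visual tokens of shape (batch, num_tokens, dim).
    tokens = vision_encoder(image)
    b, n, d = tokens.shape
    side = int(n ** 0.5)  # assume a square token grid

    # 3. Compress: 2x2 average pooling over the token grid cuts the
    #    token count by pool**2 while keeping high-resolution context.
    grid = tokens.transpose(1, 2).reshape(b, d, side, side)
    grid = F.avg_pool2d(grid, kernel_size=pool)
    return grid.flatten(2).transpose(1, 2)  # (batch, n // pool**2, dim)
```

With `scale=2` and `pool=2`, the encoder sees 4X more pixels, yet the language model receives the same number of visual tokens as before, which is the intuition behind gaining detail without a proportional increase in compute.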
Why it matters?
This research is important because it makes visual language models more accessible and practical for real-world use, such as in robotics or medical imaging. By reducing costs and improving speed while keeping accuracy high, NVILA can help advance technologies that jointly understand images and text. The authors plan to release their code and models for others to use and build upon.
Abstract
Visual language models (VLMs) have made significant advances in accuracy in recent years. However, their efficiency has received much less attention. This paper introduces NVILA, a family of open VLMs designed to optimize both efficiency and accuracy. Building on top of VILA, we improve its model architecture by first scaling up the spatial and temporal resolutions, and then compressing visual tokens. This "scale-then-compress" approach enables NVILA to efficiently process high-resolution images and long videos. We also conduct a systematic investigation to enhance the efficiency of NVILA throughout its entire lifecycle, from training and fine-tuning to deployment. NVILA matches or surpasses the accuracy of many leading open and proprietary VLMs across a wide range of image and video benchmarks. At the same time, it reduces training costs by 4.5X, fine-tuning memory usage by 3.4X, pre-filling latency by 1.6-2.2X, and decoding latency by 1.2-2.8X. We will soon make our code and models available to facilitate reproducibility.