
Phantom of Latent for Large Language and Vision Models

Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro

2024-09-24


Summary

This paper discusses Phantom, a new family of efficient large language and vision models (LLVMs) designed to perform well while using fewer resources. It focuses on improving how these models learn and understand information without needing to be as large as previous models.

What's the problem?

As LLVMs have grown larger, reaching sizes of up to 80 billion parameters, they have become more powerful but also demand far more computing power and memory to train and run. This puts these advanced models out of reach for many users. There is a need for smaller models that can still deliver high performance without the heavy resource demands.

What's the solution?

To address this issue, the researchers developed the Phantom model family, with sizes ranging from 0.5 billion to 7 billion parameters. The key idea is to temporarily enlarge the latent hidden dimension during multi-head self-attention (MHSA), so the model can take in more vision-language knowledge without permanently increasing its physical size. To make the most of this, they introduce Phantom Optimization (PO), which combines autoregressive supervised fine-tuning (SFT) with a direct preference optimization (DPO)-like objective, training the model to follow correct answers while steering away from incorrect and ambiguous ones.
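To give a concrete picture of the general idea (not the paper's actual implementation), the sketch below shows a self-attention block that projects tokens into a temporarily enlarged latent dimension, attends there, and then projects back down to the original width. The class name, dimensions, and layer layout are illustrative assumptions, written in PyTorch.

```python
import torch
import torch.nn as nn

class LatentExpandedMHSA(nn.Module):
    """Minimal sketch (assumed structure, not Phantom's code): attention runs in a
    temporarily enlarged latent dimension, then collapses back to the hidden size,
    so layers before and after keep the original, smaller width."""

    def __init__(self, hidden_dim: int = 512, latent_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        assert latent_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = latent_dim // num_heads
        # Project up into the larger latent space only for attention.
        self.qkv = nn.Linear(hidden_dim, 3 * latent_dim)
        # Project back down so downstream layers keep the original width.
        self.out = nn.Linear(latent_dim, hidden_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            # (batch, seq, latent) -> (batch, heads, seq, head_dim)
            return t.view(batch, seq_len, self.num_heads, self.head_dim).transpose(1, 2)

        q, k, v = map(split_heads, (q, k, v))
        # Scaled dot-product attention in the enlarged latent space.
        attn = torch.softmax(q @ k.transpose(-2, -1) / self.head_dim ** 0.5, dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(batch, seq_len, -1)
        # Collapse back to the original hidden size.
        return self.out(out)

if __name__ == "__main__":
    x = torch.randn(2, 16, 512)      # (batch, tokens, hidden_dim)
    print(LatentExpandedMHSA()(x).shape)  # torch.Size([2, 16, 512])
```

The point of the up-and-down projection is that the extra capacity exists only inside the attention computation, so the model's stored size between layers stays small.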

Why it matters?

This research matters because it provides a way to create powerful AI models that are accessible to more people and organizations. Because it is smaller yet still effective, Phantom can help advance applications in fields like healthcare, education, and entertainment, where efficient AI tools are crucial for innovation and productivity.

Abstract

The success of visual instruction tuning has accelerated the development of large language and vision models (LLVMs). Following the scaling laws of instruction-tuned large language models (LLMs), LLVMs have further increased their sizes, reaching 26B, 34B, and even 80B parameters. While this increase in model size has yielded significant performance gains, it demands substantially more hardware resources for both training and inference. Consequently, there naturally exists a strong need for efficient LLVMs that achieve the performance of larger models while being smaller in size. To meet this need, we present a new efficient LLVM family with model sizes of 0.5B, 1.8B, 3.8B, and 7B parameters, Phantom, which significantly enhances learning capabilities within limited structures. By temporarily increasing the latent hidden dimension during multi-head self-attention (MHSA), we prepare LLVMs to look at and understand much more vision-language knowledge in the latent space, without substantially increasing physical model size. To maximize this advantage, we introduce Phantom Optimization (PO), using both autoregressive supervised fine-tuning (SFT) and a direct preference optimization (DPO)-like concept, which effectively follows correct answers while eliminating incorrect and ambiguous ones. Phantom outperforms numerous larger open- and closed-source LLVMs, positioning itself as a leading solution in the landscape of efficient LLVMs.
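For readers who want a concrete sense of what an "SFT plus DPO-like" objective looks like, here is a minimal, hypothetical sketch in PyTorch. It is not Phantom Optimization as defined by the authors: the function name, weighting, and inputs are assumptions, and the preference term follows the standard DPO formulation as a stand-in for the paper's DPO-like concept.

```python
import torch
import torch.nn.functional as F

def sft_plus_dpo_like_loss(
    policy_chosen_logps: torch.Tensor,    # log p_theta(correct answer | prompt)
    policy_rejected_logps: torch.Tensor,  # log p_theta(incorrect/ambiguous answer | prompt)
    ref_chosen_logps: torch.Tensor,       # same quantities under a frozen reference model
    ref_rejected_logps: torch.Tensor,
    beta: float = 0.1,
    sft_weight: float = 1.0,
) -> torch.Tensor:
    """Assumed combination (not the paper's exact objective): an autoregressive SFT term
    that maximizes the correct answer's likelihood, plus a DPO-style preference term that
    prefers correct answers over incorrect or ambiguous ones relative to a reference model."""
    # DPO-style preference term on log-probability margins vs. the reference model.
    chosen_margin = policy_chosen_logps - ref_chosen_logps
    rejected_margin = policy_rejected_logps - ref_rejected_logps
    dpo_term = -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
    # Autoregressive SFT term: negative log-likelihood of the correct answer.
    sft_term = -policy_chosen_logps.mean()
    return sft_weight * sft_term + dpo_term

if __name__ == "__main__":
    b = torch.randn(4)
    loss = sft_plus_dpo_like_loss(b, b - 1.0, b.detach(), (b - 1.0).detach())
    print(loss.item())
```

The SFT term pulls the model toward correct answers, while the preference term pushes incorrect and ambiguous answers down relative to them, which matches the role the abstract describes for PO.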