TroL: Traversal of Layers for Large Language and Vision Models
Byung-Kwan Lee, Sangyun Chung, Chae Won Kim, Beomchan Park, Yong Man Ro
2024-06-19

Summary
This paper introduces TroL, a new family of efficient language and vision models that can perform complex tasks without requiring extremely large model sizes. TroL uses a technique called layer traversing to improve performance while keeping model sizes manageable.
What's the problem?
Large language models (LLMs) and large language and vision models (LLVMs) have become powerful tools for understanding and generating text and images. However, many existing models are very large (with billions of parameters), which makes them expensive to train and use. This limits their accessibility, since they require high-end computing resources that not everyone can afford. Additionally, these large models often yield only modest performance improvements over smaller ones.
What's the solution?
To solve this problem, the authors developed TroL, a family of models at smaller sizes (1.8 billion, 3.8 billion, and 7 billion parameters) that still achieve high performance. TroL uses a method called layer traversing, which lets the model reuse its existing layers in a token-wise manner. Instead of physically adding more layers, TroL retraces previous layers during processing, simulating the effect of looking back over the answering stream. This increases the number of forward-propagation steps the model performs without making it larger or more resource-intensive. In evaluations, TroL outperformed larger open-source models and performed comparably to some closed-source models.
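As a rough picture of how such token-wise layer reuse might look, here is a minimal PyTorch sketch. The class name LayerTraversalSketch, the gating design, and the two-pass scheme are illustrative assumptions, not the authors' released implementation:

```python
import torch
import torch.nn as nn

class LayerTraversalSketch(nn.Module):
    """Illustrative sketch, NOT the authors' code: run hidden states through
    the same transformer block twice and mix the two outputs with a learned,
    token-wise gate, so effective depth grows without adding new layers."""

    def __init__(self, block: nn.Module, hidden_dim: int):
        super().__init__()
        self.block = block                    # one shared transformer block
        self.gate = nn.Linear(hidden_dim, 1)  # produces a per-token mixing weight

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        first = self.block(hidden_states)     # ordinary forward pass
        second = self.block(first)            # reuse the SAME block: no new weights
        w = torch.sigmoid(self.gate(first))   # token-wise weight in (0, 1)
        # Each token decides how much of the re-traversed output to keep.
        return w * second + (1.0 - w) * first

# Example usage with a standard transformer block (shapes: batch, seq, hidden).
block = nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
trol_layer = LayerTraversalSketch(block, hidden_dim=64)
out = trol_layer(torch.randn(2, 16, 64))      # output shape matches the input
```

Because the gate is computed per token, each token can independently decide whether the re-traversed (deeper) representation or the single-pass one should dominate, which matches the intuition behind reusing layers "in a token-wise manner."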
Why it matters?
This research is important because it makes advanced language and vision models more accessible by reducing their size while maintaining performance. By developing TroL, the authors provide a way for more researchers and developers to use powerful AI tools without needing expensive hardware. This could lead to broader applications of AI in various fields, such as education, entertainment, and technology.
Abstract
Large language and vision models (LLVMs) have been driven by the generalization power of large language models (LLMs) and the advent of visual instruction tuning. Together with direct scaling, visual instruction tuning enables LLVMs to showcase powerful vision-language (VL) performance by covering diverse tasks via natural language instructions. However, existing open-source LLVMs that perform comparably to closed-source LLVMs such as GPT-4V are often considered too large (e.g., 26B, 34B, and 110B parameters), with a correspondingly large number of layers. These large models demand costly, high-end resources for both training and inference. To address this issue, we present a new efficient LLVM family with 1.8B, 3.8B, and 7B LLM model sizes, Traversal of Layers (TroL), which enables the reuse of layers in a token-wise manner. This layer traversing technique simulates the effect of looking back at and retracing the answering stream while increasing the number of forward-propagation layers without physically adding more layers. We demonstrate that TroL, despite its simple layer traversing approach, efficiently outperforms open-source LLVMs with larger model sizes and rivals the performance of closed-source LLVMs of substantial size.
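To make concrete the abstract's claim of "increasing the number of forward propagation layers without physically adding more layers," the loop below traverses each block of a stack twice. The function name traverse_stack and the fixed pass count are assumptions for illustration; the paper's actual traversal additionally mixes outputs token-wise (see the sketch above), which this loop omits:

```python
import torch
import torch.nn as nn

def traverse_stack(blocks: nn.ModuleList, x: torch.Tensor,
                   passes_per_block: int = 2) -> torch.Tensor:
    """Run every shared block multiple times: N physical blocks yield
    N * passes_per_block forward-propagation layers with no extra weights."""
    for block in blocks:
        for _ in range(passes_per_block):
            x = block(x)
    return x

# Four physical blocks behave like an eight-layer forward pass.
blocks = nn.ModuleList([
    nn.TransformerEncoderLayer(d_model=64, nhead=4, batch_first=True)
    for _ in range(4)
])
out = traverse_stack(blocks, torch.randn(2, 16, 64))
```

The trade-off is that compute per token grows with the number of traversals while the parameter count, and hence memory for weights, stays fixed.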