GoldFinch: High Performance RWKV/Transformer Hybrid with Linear Pre-Fill and Extreme KV-Cache Compression

Daniel Goldstein, Fares Obeid, Eric Alcaide, Guangyu Song, Eugene Cheah

2024-07-18

Summary

This paper presents GoldFinch, a hybrid language model that combines the RWKV linear-attention architecture with a traditional transformer to improve the efficiency and performance of text generation.

What's the problem?

Large language models often face challenges with memory usage and processing speed, especially when handling long sequences of text. Traditional transformer models can be slow and require a lot of memory for their key-value (KV) caches, which store the attention keys and values for every layer and every previous token. This makes it difficult to run these models effectively on devices with limited resources.
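To see why the KV cache becomes a bottleneck, here is an illustrative back-of-envelope estimate of a standard transformer's cache size. The formula (keys plus values, per layer, per head, per token) is the standard one; the model dimensions and byte width below are hypothetical, not taken from the paper.

```python
# Rough KV-cache size for a standard transformer.
# Model dimensions here are hypothetical, chosen for illustration.
def kv_cache_bytes(n_layers, n_heads, head_dim, seq_len, bytes_per_elem=2):
    # Each layer stores one key and one value vector per head per token,
    # hence the factor of 2. bytes_per_elem=2 assumes fp16 storage.
    return 2 * n_layers * n_heads * head_dim * seq_len * bytes_per_elem

# A hypothetical 1.5B-parameter-class model at a 32k-token context:
print(kv_cache_bytes(24, 32, 64, 32_768) / 2**30)  # -> 6.0 (GiB)
```

Even at modest model sizes, the cache grows linearly with both context length and layer count, which is exactly the cost GoldFinch targets.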

What's the solution?

GoldFinch addresses these issues with a hybrid model that generates a highly compressed, reusable KV-cache in linear time and space (linear pre-fill). The model stacks a new "GOLD" transformer on top of an enhanced version of the Finch (RWKV-6) architecture, and its cache is 756 to 2550 times smaller than a traditional transformer's at common model sizes, while still improving modeling performance. This lets it process very long contexts without requiring excessive hardware resources.
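A rough sketch of why the savings scale with layer count: a standard transformer stores keys and values for every layer, while a cache like GoldFinch's (per the paper's claim) stores one small compressed entry per token, shared across layers. The exact formulas and dimensions below are illustrative assumptions, not the paper's actual cache layout.

```python
# Hypothetical comparison of cache sizes; dimensions are assumptions.
def traditional_cache(n_layers, d_model, seq_len, bytes_per_elem=2):
    # Keys AND values (factor of 2) stored separately for every layer.
    return 2 * n_layers * d_model * seq_len * bytes_per_elem

def shared_compressed_cache(d_compressed, seq_len, bytes_per_elem=2):
    # One compressed entry per token, shared across all layers.
    return d_compressed * seq_len * bytes_per_elem

ratio = traditional_cache(24, 2048, 4096) / shared_compressed_cache(128, 4096)
print(round(ratio))  # -> 768
```

With these illustrative numbers the ratio lands at 768x, the same order as the 756-2550x range the paper reports; adding layers grows the numerator but not the denominator, so deeper models save more.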

Why it matters?

This research matters because it makes advanced language models more accessible and efficient, allowing them to be used in a wider range of applications, even on devices with limited memory. By improving how these models work, GoldFinch could enhance various technologies that rely on natural language understanding, such as chatbots, virtual assistants, and automated content generation.

Abstract

We introduce GoldFinch, a hybrid Linear Attention/Transformer sequence model that uses a new technique to efficiently generate a highly compressed and reusable KV-Cache in linear time and space with respect to sequence length. GoldFinch stacks our new GOLD transformer on top of an enhanced version of the Finch (RWKV-6) architecture. We train up to 1.5B parameter class models of the Finch, Llama, and GoldFinch architectures, and find dramatically improved modeling performance relative to both Finch and Llama. Our cache size savings increase linearly with model layer count, ranging from 756-2550 times smaller than the traditional transformer cache for common sizes, enabling inference of extremely large context lengths even on limited hardware. Although autoregressive generation has O(n) time complexity per token because of attention, pre-fill computation of the entire initial cache state for a submitted context costs only O(1) time per token due to the use of a recurrent neural network (RNN) to generate this cache. We release our trained weights and training code under the Apache 2.0 license for community use.
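The abstract's O(1)-per-token pre-fill claim follows from the nature of recurrence: an RNN's state update does a fixed amount of work at each position, regardless of how many tokens came before, so building a cache entry for every token of a submitted context takes one linear pass. The generic linear recurrence below is an illustration of that property only, not the actual RWKV-6 update rule.

```python
# Minimal sketch: O(1) work per token during pre-fill.
# The decay-based recurrence is a stand-in, not the RWKV-6 formula.
def rnn_prefill(tokens, decay=0.9):
    state = 0.0
    cache = []  # one cache entry per token, filled in a single pass
    for x in tokens:
        state = decay * state + x  # fixed work, independent of position
        cache.append(state)
    return cache

print([round(v, 2) for v in rnn_prefill([1.0, 1.0, 1.0])])  # -> [1.0, 1.9, 2.71]
```

Contrast this with attention, where producing the representation for token n requires comparing it against all n previous tokens, giving O(n) work per generated token.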