ARWKV: Pretrain is not what we need, an RNN-Attention-Based Language Model Born from Transformer
Lin Yueyu, Li Zhiyuan, Peter Yue, Liu Xiao
2025-01-28
Summary
This paper introduces ARWKV, a language model family built by distilling Qwen 2.5 into an architecture that replaces Transformer self-attention with RWKV-7's RNN-style attention. Instead of pretraining from scratch, the authors transfer the knowledge of an existing Transformer into an RNN-attention-based model.
What's the problem?
Pretraining large language models from scratch is extremely expensive, and standard Transformer attention scales quadratically with sequence length. Existing hybrid quadratic/subquadratic attention models focus mainly on reducing KV-cache complexity and improving efficiency, leaving open how expressive a pure RNN attention mechanism can be.
What's the solution?
The researchers distill Qwen 2.5 into a model based on pure native RWKV-7 attention, transferring the teacher's knowledge to the RNN-based student with far fewer tokens than pretraining would require, while aiming for state-tracking ability beyond Transformers. They also report QRWK 32B, based on the RWKV-6 architecture, which compresses the entire knowledge-transfer process to about 8 hours on 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. The distillation process can use any LLM as the teacher, not just Qwen.
Why it matters?
This research matters because it suggests that strong RNN-based foundation models can be obtained cheaply by distilling existing Transformers rather than pretraining from scratch. That could make expressive, efficient linear-complexity models far more practical to build, and it points toward a general recipe for transferring knowledge from large LLMs into smaller or architecturally different ones.
Abstract
As is known, hybrid quadratic and subquadratic attention models in multi-head architectures have surpassed both Transformer and Linear RNN models, with these works primarily focusing on reducing KV complexity and improving efficiency. For further research on expressiveness, we introduce our series of models distilled from Qwen 2.5, based on pure native RWKV-7 attention, which aims to make RNNs more expressive and demonstrates state-tracking ability beyond Transformers. We also work with QRWK 32B, based on the RWKV-6 architecture, another approach that reduces the entire knowledge-processing time to just 8 hours using 16 AMD MI300X GPUs while maintaining Qwen 2.5's performance. In fact, the distillation process can utilize any LLM, not just Qwen, and enables knowledge transfer from larger LLMs to smaller ones with far fewer tokens. We will explain the detailed process and share our insights on building more powerful foundation models. Please note that this is an ongoing work that will be updated continuously. The model checkpoints and source code are available at https://github.com/yynil/RWKVInside and https://huggingface.co/RWKV-Red-Team/ARWKV-7B-Preview-0.1.
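The abstract describes transferring knowledge from a larger teacher LLM (Qwen 2.5) into a student whose attention has been replaced by RWKV-7. A common way to do this is logit distillation: match the student's temperature-softened output distribution to the teacher's via KL divergence. The sketch below is a minimal, pure-Python illustration of that standard objective; the function names and temperature value are illustrative assumptions, not the paper's exact training recipe.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-softened probabilities; higher T flattens the distribution,
    # exposing more of the teacher's "dark knowledge" about wrong classes.
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    # KL(teacher || student) over softened distributions, scaled by T^2
    # so gradients keep a comparable magnitude across temperatures.
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In practice this per-token loss would be averaged over a corpus and combined with a standard cross-entropy term; the key point is that the student sees dense teacher distributions rather than one-hot labels, which is why far fewer tokens suffice than for pretraining.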