Cache-to-Cache: Direct Semantic Communication Between Large Language Models
Tianyu Fu, Zihan Min, Hanling Zhang, Jichao Yan, Guohao Dai, Wanli Ouyang, Yu Wang
2025-10-09
Summary
This paper introduces Cache-to-Cache (C2C), a new way for different Large Language Models (LLMs) to collaborate that improves both accuracy and speed.
What's the problem?
Currently, when multiple LLMs collaborate, they usually communicate by exchanging generated text. This isn't ideal: converting an LLM's internal representation of information into text, and then back again, loses important details and adds token-by-token generation latency. It's like trying to explain a complex idea using only simple words – you lose nuance and it takes longer.
What's the solution?
The researchers observe that LLMs maintain a 'KV-Cache', an internal memory that holds a richer representation of what the model 'knows' than the text it emits. C2C lets one LLM share its KV-Cache directly with another, bypassing text-based communication entirely. A small neural network projects and fuses the source model's cache with the target model's, and a learnable 'gate' decides which layers of the receiving LLM should use the incoming information. This is a more direct and efficient way to transfer knowledge.
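The project-fuse-gate idea can be sketched in a few lines. This is a minimal illustration, not the paper's actual architecture: the linear projection `W_proj`, the additive fusion, and the scalar per-layer `gate_logit` are all simplifying assumptions standing in for the learned neural fuser described above.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def c2c_fuse(src_kv, tgt_kv, W_proj, gate_logit):
    """Fuse one layer's source KV-cache into the target's (illustrative sketch).

    src_kv, tgt_kv: (seq_len, d) arrays for a single layer's keys (or values).
    W_proj: (d, d) projection from source to target space (hypothetical; the
            paper uses a learned neural network fuser).
    gate_logit: scalar; sigmoid(gate_logit) is a learnable per-layer gate
                controlling how much fused cache this target layer accepts.
    """
    projected = src_kv @ W_proj              # map source cache into target space
    fused = tgt_kv + projected               # simple additive fusion (assumption)
    g = sigmoid(gate_logit)                  # per-layer gate in (0, 1)
    return g * fused + (1.0 - g) * tgt_kv    # gated blend with original cache
```

With a strongly negative `gate_logit`, the layer keeps its original cache unchanged; with a strongly positive one, it fully accepts the fused cache — mirroring how the gate lets C2C skip layers that don't benefit from communication.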
Why it matters?
This research is important because it significantly improves the accuracy of LLM systems working together, boosting performance by 8.5-10.5% compared to individual models and 3.0-5.0% over text-based communication. It also makes these systems much faster, with a 2x speedup. This means we can build more powerful and responsive AI applications that leverage the strengths of multiple LLMs.
Abstract
Multi-LLM systems harness the complementary strengths of diverse Large Language Models, achieving performance and efficiency gains unattainable by a single model. In existing designs, LLMs communicate through text, forcing internal representations to be transformed into output token sequences. This process both loses rich semantic information and incurs token-by-token generation latency. Motivated by these limitations, we ask: Can LLMs communicate beyond text? Oracle experiments show that enriching the KV-Cache semantics can improve response quality without increasing cache size, supporting KV-Cache as an effective medium for inter-model communication. Thus, we propose Cache-to-Cache (C2C), a new paradigm for direct semantic communication between LLMs. C2C uses a neural network to project and fuse the source model's KV-cache with that of the target model to enable direct semantic transfer. A learnable gating mechanism selects the target layers that benefit from cache communication. Compared with text communication, C2C utilizes the deep, specialized semantics from both models, while avoiding explicit intermediate text generation. Experiments show that C2C achieves 8.5-10.5% higher average accuracy than individual models. It further outperforms the text communication paradigm by approximately 3.0-5.0%, while delivering an average 2.0x speedup in latency. Our code is available at https://github.com/thu-nics/C2C.