Key Features

- Compresses LLM KV caches with block-diagonal rotations.
- Improves decode speed for long-context inference workloads.
- Improves prefill speed compared with reported baselines.
- Reduces parameter overhead for cache compression.
- Targets drop-in integration with llama.cpp.
- Helps reduce memory pressure during autoregressive generation.
- Preserves output quality through a rotation-based cache representation.
- Provides public source code for inference-system experimentation.

Technically, Rotorquant compresses key-value cache representations through block-diagonal rotational transforms. KV cache compression matters because autoregressive decoding stores attention keys and values for every past token, and that cache grows linearly with context length, making long contexts expensive in memory and bandwidth. The block-diagonal structure also keeps parameter overhead low: each block stores only a small rotation matrix rather than one full head-dimension transform. By integrating as a drop-in llama.cpp path, Rotorquant targets real inference stacks rather than only theoretical compression benchmarks.
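
To make the general idea concrete, here is a minimal NumPy sketch: partition each cached key/value vector along the head dimension, rotate each block with a small orthogonal matrix, and quantize the rotated values. The block size, random orthogonal rotations, and symmetric int8 quantization here are illustrative assumptions, not Rotorquant's actual scheme.

```python
import numpy as np

def random_rotation(b: int, rng: np.random.Generator) -> np.ndarray:
    # Sample a b x b orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((b, b)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform rotation

def compress_kv(kv: np.ndarray, block: int, rng: np.random.Generator):
    # Rotate each `block`-sized slice of the last (head) dimension,
    # then quantize to int8 with a per-vector max-abs scale.
    d = kv.shape[-1]
    assert d % block == 0, "head dim must be divisible by block size"
    rots = [random_rotation(block, rng) for _ in range(d // block)]
    rotated = np.concatenate(
        [kv[..., i * block:(i + 1) * block] @ rots[i]
         for i in range(d // block)], axis=-1)
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    q8 = np.clip(np.round(rotated / scale), -127, 127).astype(np.int8)
    return q8, scale, rots

def decompress_kv(q8, scale, rots, block: int):
    # Dequantize, then undo each block rotation with its transpose
    # (the inverse, since each rotation is orthogonal).
    rotated = q8.astype(np.float32) * scale
    d = rotated.shape[-1]
    return np.concatenate(
        [rotated[..., i * block:(i + 1) * block] @ rots[i].T
         for i in range(d // block)], axis=-1)

# Toy usage: a (heads, tokens, head_dim) slab of cached keys.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 128, 64)).astype(np.float32)
q8, scale, rots = compress_kv(keys, block=16, rng=rng)
keys_hat = decompress_kv(q8, scale, rots, block=16)
print("max reconstruction error:", np.abs(keys - keys_hat).max())
```

Note that with a block size of 16 on a 64-dim head, the sketch stores four 16x16 rotation matrices instead of one 64x64 matrix, which is the kind of parameter saving the block-diagonal structure buys.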

Rotorquant is valuable for developers running local or hosted LLM inference where memory bandwidth, prefill speed, and decode latency matter. It can help make longer contexts more practical on constrained hardware while keeping output quality closer to the uncompressed baseline.
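
For a sense of scale, here is a back-of-the-envelope sizing sketch. The model dimensions below are for a hypothetical 7B-class transformer, and the 4x compression ratio is illustrative, not a measured Rotorquant figure.

```python
# Hypothetical 7B-class model; all numbers illustrative, not Rotorquant measurements.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2                    # fp16 baseline
ctx = 32_768                          # context length in tokens
kv_bytes = 2 * layers * heads * head_dim * bytes_per_elem * ctx  # K and V
print(f"fp16 KV cache: {kv_bytes / 2**30:.1f} GiB")       # 16.0 GiB
print(f"4x compressed: {kv_bytes / 4 / 2**30:.1f} GiB")   # 4.0 GiB
```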
