Key Features

- Compresses LLM KV caches with block-diagonal rotations.
- Improves decode speed for long-context inference workloads.
- Improves prefill speed compared with reported baselines.
- Reduces parameter overhead for cache compression.
- Targets drop-in integration with llama.cpp.
- Helps reduce memory pressure during autoregressive generation.
- Preserves output quality through a rotation-based cache representation.
- Provides public source code for inference-system experimentation.

Technically, Rotorquant compresses key-value cache representations through block-diagonal rotational transforms. KV cache compression matters because autoregressive decoding stores attention keys and values for every past token, and that cache grows linearly with context length, making long contexts expensive in memory and bandwidth. The block-diagonal structure also keeps parameter overhead low: each block stores only a small rotation matrix rather than one full head-dimension transform. By integrating as a drop-in llama.cpp path, Rotorquant targets real inference stacks rather than only theoretical compression benchmarks.
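
To make the general idea concrete, here is a minimal NumPy sketch: partition each cached key/value vector along the head dimension, rotate each block with a small orthogonal matrix, and quantize the rotated values. The block size, random orthogonal rotations, and symmetric int8 quantization here are illustrative assumptions, not Rotorquant's actual scheme.

```python
import numpy as np

def random_rotation(b: int, rng: np.random.Generator) -> np.ndarray:
    # Sample a b x b orthogonal matrix via QR of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((b, b)))
    return q * np.sign(np.diag(r))  # sign fix for a uniform rotation

def compress_kv(kv: np.ndarray, block: int, rng: np.random.Generator):
    # Rotate each `block`-sized slice of the last (head) dimension,
    # then quantize to int8 with a per-vector max-abs scale.
    d = kv.shape[-1]
    assert d % block == 0, "head dim must be divisible by block size"
    rots = [random_rotation(block, rng) for _ in range(d // block)]
    rotated = np.concatenate(
        [kv[..., i * block:(i + 1) * block] @ rots[i]
         for i in range(d // block)], axis=-1)
    scale = np.abs(rotated).max(axis=-1, keepdims=True) / 127.0 + 1e-8
    q8 = np.clip(np.round(rotated / scale), -127, 127).astype(np.int8)
    return q8, scale, rots

def decompress_kv(q8, scale, rots, block: int):
    # Dequantize, then undo each block rotation with its transpose
    # (the inverse, since each rotation is orthogonal).
    rotated = q8.astype(np.float32) * scale
    d = rotated.shape[-1]
    return np.concatenate(
        [rotated[..., i * block:(i + 1) * block] @ rots[i].T
         for i in range(d // block)], axis=-1)

# Toy usage: a (heads, tokens, head_dim) slab of cached keys.
rng = np.random.default_rng(0)
keys = rng.standard_normal((4, 128, 64)).astype(np.float32)
q8, scale, rots = compress_kv(keys, block=16, rng=rng)
keys_hat = decompress_kv(q8, scale, rots, block=16)
print("max reconstruction error:", np.abs(keys - keys_hat).max())
```

Note that with a block size of 16 on a 64-dim head, the sketch stores four 16x16 rotation matrices instead of one 64x64 matrix, which is the kind of parameter saving the block-diagonal structure buys.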

Rotorquant is valuable for developers running local or hosted LLM inference where memory bandwidth, prefill speed, and decode latency matter. It can help make longer contexts more practical on constrained hardware while keeping output quality closer to the uncompressed baseline.
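
For a sense of scale, here is a back-of-the-envelope sizing sketch. The model dimensions below are for a hypothetical 7B-class transformer, and the 4x compression ratio is illustrative, not a measured Rotorquant figure.

```python
# Hypothetical 7B-class model; all numbers illustrative, not Rotorquant measurements.
layers, heads, head_dim = 32, 32, 128
bytes_per_elem = 2                    # fp16 baseline
ctx = 32_768                          # context length in tokens
kv_bytes = 2 * layers * heads * head_dim * bytes_per_elem * ctx  # K and V
print(f"fp16 KV cache: {kv_bytes / 2**30:.1f} GiB")       # 16.0 GiB
print(f"4x compressed: {kv_bytes / 4 / 2**30:.1f} GiB")   # 4.0 GiB
```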
