Initialisation Determines the Basin: Efficient Codebook Optimisation for Extreme LLM Quantization
Ian W. Kennedy, Nafise Sadat Moosavi
2026-04-13
Summary
This paper investigates why compressing large language models (LLMs) to extremely low precision, specifically 2 bits per parameter, is so difficult, even when advanced search and fine-tuning techniques are used to find the best compressed version of the model.
What's the problem?
When you try to drastically shrink an LLM using a method called additive quantization, it often fails badly at 2-bit precision. The main issue isn't the fine-tuning process itself, but *how* the initial 'codebook' – essentially the lookup table used for compression – is created. The standard way of building this codebook, step-by-step, often places the model in a poor region of the optimisation landscape that even powerful search and tuning methods can't escape, leading to a huge drop in performance.
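To make 'step-by-step' concrete, here is a minimal Python sketch of greedy sequential initialisation in the residual style commonly used for additive quantization: each of the M codebooks is fit with plain k-means on whatever error the previous codebooks left behind. The function names and the k-means routine are illustrative assumptions, not the paper's code.

```python
import numpy as np

def kmeans(X, K, iters=10, seed=0):
    """Plain k-means: fit K centroids to the rows of X."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), K, replace=False)]
    for _ in range(iters):
        # Assign each row to its nearest centroid (Euclidean distance).
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for k in range(K):
            if (labels == k).any():
                centroids[k] = X[labels == k].mean(0)
    return centroids

def greedy_sequential_init(W, K, M):
    """Illustrative greedy init: fit M codebooks one at a time, each on
    the residual left by the previous ones. W: (N, d) weight groups."""
    residual = W.copy()
    codebooks = []
    for _ in range(M):
        C = kmeans(residual, K)
        codebooks.append(C)
        # Subtract each group's nearest codeword before fitting the next book.
        dists = ((residual[:, None, :] - C[None, :, :]) ** 2).sum(-1)
        residual = residual - C[dists.argmin(1)]
    return codebooks
```

Each codebook is chosen without knowing what later codebooks will do, which is exactly why this scheme can lock the model into a poor starting point.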
What's the solution?
The researchers developed a new method called OA-EM for initializing the codebook. Instead of building it sequentially, OA-EM takes the model's outputs into account: it clusters the weights using a 'Hessian-weighted Mahalanobis distance', which penalises quantization errors in proportion to how much they actually distort a layer's output. This gives the model a better starting point, helping it avoid those bad optimization regions and achieve much better compression results.
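As a rough illustration of how such an assignment step could look, the sketch below assumes the layer's Hessian is approximated by a single d×d matrix H (for example, built from calibration inputs, as in other Hessian-aware quantization methods) and picks, for each weight group, the codeword minimising (w − c)ᵀH(w − c). All names are illustrative assumptions; the paper's OA-EM may differ in detail.

```python
import numpy as np

def hessian_weighted_assign(W, C, H):
    """Assign each weight group (row of W) to the codeword in C that
    minimises the Hessian-weighted Mahalanobis distance (w-c)^T H (w-c)."""
    diff = W[:, None, :] - C[None, :, :]               # (N, K, d)
    return np.einsum('nkd,de,nke->nk', diff, H, diff).argmin(1)

def oa_em_init(W, K, H, iters=20, seed=0):
    """EM-style codebook initialisation under the H-weighted metric
    (a sketch of the idea, not the paper's implementation)."""
    rng = np.random.default_rng(seed)
    C = W[rng.choice(len(W), K, replace=False)]
    for _ in range(iters):
        labels = hessian_weighted_assign(W, C, H)
        for k in range(K):
            mask = labels == k
            if mask.any():
                # With one shared positive-definite H, the weighted
                # centroid minimising the assignment loss is the plain mean.
                C[k] = W[mask].mean(0)
    return C
```

Because the distance is weighted by H, directions that matter most for the layer's output dominate the clustering, which is what makes the initialisation 'output-aware'.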
Why does it matter?
This work shows that the initial setup is incredibly important when compressing LLMs. It's not just about how well you can *tune* a compressed model, but where you *start* the tuning process. This is especially true when compressing to very small sizes, and understanding this 'optimization geometry' is crucial for making LLMs small enough to run efficiently on devices like phones or embedded systems.
Abstract
Additive quantization enables extreme LLM compression with O(1) lookup-table dequantization, making it attractive for edge deployment. Yet at 2-bit precision, it often fails catastrophically, even with extensive search and fine-tuning. We show that the dominant bottleneck is codebook initialisation. Greedy sequential initialisation frequently places the model in poor optimisation regions that subsequent beam search and PV-tuning struggle to overcome. We analyse this behaviour through the representational ratio ρ = N/KM, which characterises the relationship between weight groups and codebook capacity, and propose OA-EM, an output-aware EM initialisation method using Hessian-weighted Mahalanobis distance. Across compression rates, search budgets, and three architectures (Llama 3.2 3B, Llama 3.1 8B, Qwen 2.5 3B), OA-EM consistently produces better solutions after PV-tuning and dominates the quality-compute frontier. The severity of the bottleneck scales with ρ: moderate at 3 bpp but extreme at 2 bpp, where poor initialisation can degrade perplexity by orders of magnitude. More broadly, our results highlight the importance of optimisation geometry in compressed model spaces, where initialisation can dominate subsequent search and fine-tuning.
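For readers new to additive quantization, the sketch below illustrates two ingredients the abstract leans on: 'O(1) lookup-table dequantization' (reconstructing a weight group takes M table lookups and a sum, independent of codebook size) and the representational ratio ρ = N/(K·M). It is an illustrative sketch under generic assumptions, not code from the paper.

```python
import numpy as np

def dequantize(codes, codebooks):
    """Additive-quantization decode: each weight group is the sum of one
    codeword per codebook, i.e. M table lookups per group.
    codes: (N, M) integer indices; codebooks: (M, K, d) floats."""
    M = codes.shape[1]
    return sum(codebooks[m][codes[:, m]] for m in range(M))

def representational_ratio(N, K, M):
    """rho = N / (K * M): weight groups per unit of codebook capacity.
    Per the abstract, the initialisation bottleneck worsens as rho grows,
    i.e. as each codeword must serve more weight groups."""
    return N / (K * M)
```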