ARMOR: High-Performance Semi-Structured Pruning via Adaptive Matrix Factorization

Lawrence Liu, Alexander Liu, Mengdi Wang, Tuo Zhao, Lin F. Yang

2025-10-13

Summary

This paper introduces a new method, called ARMOR, for making large language models smaller and faster without losing too much accuracy.

What's the problem?

Large language models are incredibly powerful, but they require a lot of computing power and memory to run, making them difficult to deploy and use widely. A common technique to reduce these requirements is 'pruning,' which removes unnecessary parts of the model, but existing pruning methods often significantly decrease the model's performance.
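The 2:4 semi-structured pattern that such methods target can be illustrated with a minimal NumPy sketch. Magnitude-based selection is used here purely for illustration; actual pruning methods differ in how they decide which two weights in each group to keep.

```python
import numpy as np

def prune_2_4(w: np.ndarray) -> np.ndarray:
    """2:4 semi-structured pruning: zero the 2 smallest-magnitude
    weights in every contiguous group of 4."""
    groups = w.reshape(-1, 4).copy()
    # indices of the 2 smallest |w| entries within each group of 4
    drop = np.argsort(np.abs(groups), axis=1)[:, :2]
    np.put_along_axis(groups, drop, 0.0, axis=1)
    return groups.reshape(w.shape)

w = np.array([0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.02, 0.6])
pruned = prune_2_4(w)
# exactly half the weights survive, in a pattern modern GPUs can accelerate
```

Because every group of four contains exactly two zeros, the sparse matrix can be stored compactly and multiplied with hardware support, which is where the speed and memory benefits come from.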

What's the solution?

ARMOR tackles this problem by restructuring the model's weights instead of simply deleting them. It factorizes each weight matrix into a core that is pruned with the hardware-friendly 2:4 pattern (at most two nonzero weights in every group of four) and wraps that core with two small block diagonal matrices applied before and after it. These wrappers correct errors introduced by the pruning, preserving the model's accuracy. The algorithm chooses the core and wrappers with a block coordinate descent procedure that refines them layer by layer, and it is mathematically proven to reach a layer-wise error at least as low as that of state-of-the-art pruning methods.
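The "sparse core plus wrappers" idea can be caricatured in NumPy. This is an illustrative toy, not the paper's algorithm: the alternating update rule, the 2x2 block size, the identity pre-wrapper, and the plain reconstruction objective (instead of ARMOR's layer-wise proxy loss) are all simplifying assumptions made for brevity.

```python
import numpy as np

def block_diag(blocks):
    """Assemble a block diagonal matrix from a list of square blocks."""
    n = sum(b.shape[0] for b in blocks)
    out = np.zeros((n, n))
    i = 0
    for b in blocks:
        k = b.shape[0]
        out[i:i + k, i:i + k] = b
        i += k
    return out

def mask_2_4(w: np.ndarray) -> np.ndarray:
    """Binary mask keeping the 2 largest-magnitude entries per group of 4."""
    flat = w.reshape(-1, 4)
    keep = np.argsort(np.abs(flat), axis=1)[:, 2:]  # 2 largest per group
    m = np.zeros_like(flat)
    np.put_along_axis(m, keep, 1.0, axis=1)
    return m.reshape(w.shape)

rng = np.random.default_rng(0)
d = 8
W = rng.normal(size=(d, d))   # dense layer weight to compress
A = np.eye(d)                 # post-wrapper (block diagonal)
B = np.eye(d)                 # pre-wrapper (left as identity in this toy)

for _ in range(10):           # alternating (block-coordinate-style) updates
    # 1) re-prune the core given the current wrappers
    core = np.linalg.pinv(A) @ W @ np.linalg.pinv(B)
    S = core * mask_2_4(core)
    # 2) refit the post-wrapper by least squares, then keep only its
    #    block diagonal part so the wrapper stays cheap to store and apply
    A_full = W @ np.linalg.pinv(S @ B)
    A = block_diag([A_full[i:i + 2, i:i + 2] for i in range(0, d, 2)])

# relative reconstruction error of W ~= A @ S @ B
err = np.linalg.norm(W - A @ S @ B) / np.linalg.norm(W)
```

The point of the wrappers is visible in the structure: S keeps the 2:4 sparsity that hardware accelerates, while the block diagonal A (and B) add only a small number of extra parameters with which to absorb pruning error.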

Why it matters?

This research is important because it offers a more effective way to compress large language models. ARMOR achieves better performance than existing methods while still providing the benefits of reduced memory usage and faster processing speeds, making these powerful models more accessible and practical for real-world applications.

Abstract

Large language models (LLMs) present significant deployment challenges due to their immense computational and memory requirements. While semi-structured pruning, particularly 2:4 sparsity, offers a path to practical hardware acceleration, existing methods often incur substantial performance degradation. To bridge this gap, we introduce ARMOR (Adaptive Representation with Matrix-factORization), a novel one-shot post-training pruning algorithm. Instead of directly pruning weights, ARMOR factorizes each weight matrix into a 2:4 sparse core wrapped by two low-overhead, block diagonal matrices. These wrappers act as efficient pre- and post-transformation error correctors, offering greater flexibility to preserve model quality compared to conventional 2:4 pruning techniques. The sparse core and block diagonal wrappers are chosen through a block coordinate descent algorithm that minimizes a layer-wise proxy loss. We theoretically prove this optimization is guaranteed to converge to a solution with a proxy loss less than or equal to state-of-the-art pruning algorithms. Experiments on Llama (Touvron et al., 2023; Dubey et al., 2024) and Qwen (Yang et al., 2025) model families demonstrate that ARMOR consistently and significantly outperforms state-of-the-art 2:4 pruning methods across a wide range of downstream tasks and perplexity evaluations. ARMOR achieves this superior performance while retaining the inference speedups and substantial memory usage reductions of 2:4 pruning, establishing a more effective trade-off between model compression and task accuracy.
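The abstract names a layer-wise proxy loss without defining it. One common formulation in one-shot post-training pruning, written here as an assumption in the spirit of the abstract rather than as ARMOR's exact objective, measures reconstruction error on calibration activations X:

```latex
\min_{A,\, S,\, B} \;\; \big\| X W^{\top} - X \,(A S B)^{\top} \big\|_F^2
\quad \text{s.t.}\;\; S \text{ is } 2{:}4 \text{ sparse},\;\; A, B \text{ block diagonal}
```

Under an objective of this form, conventional 2:4 pruning corresponds to fixing A and B to the identity, which is why a guarantee of a proxy loss no greater than that of such methods is plausible: the identity wrappers are always in ARMOR's feasible set.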