LUT-LLM: Efficient Large Language Model Inference with Memory-based Computations on FPGAs

Zifan He, Shengyu Ye, Rui Ma, Yang Wang, Jason Cong

2025-11-11

Summary

This paper introduces a new way to run large language models (LLMs) on specialized hardware called FPGAs by replacing most of the arithmetic with table lookups, making inference faster and more energy-efficient.

What's the problem?

LLMs keep getting better, but running them quickly and efficiently, especially on smaller devices, is still a challenge. FPGAs have traditionally been attractive for this because they give fine-grained control over data and use less power, but recent GPU optimizations have narrowed that advantage. The core issue is that LLM inference is normally built around arithmetic operations (large matrix multiplications), and that is exactly where modern GPUs are most competitive.

What's the solution?

The researchers built a system called LUT-LLM that changes how LLMs are processed on FPGAs. Instead of performing most of the arithmetic directly, it uses the FPGA's abundant on-chip memory to store pre-computed partial results in lookup tables, so the work shifts from multiplying numbers to finding the right entries in a table and adding them up. To make this practical, they compress both activations and weights into shared codebooks (activation-weight co-quantization), search those codebooks with a bandwidth-aware parallel centroid search, use efficient 2D table lookups, and adopt a spatial-temporal hybrid design that minimizes how much data has to be cached and moved around. They evaluated the system on a customized Qwen 3 1.7B model running on an AMD V80 FPGA; a simplified sketch of the lookup idea appears below.
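To make the lookup-table idea concrete, here is a minimal NumPy sketch of how a matrix-vector product can be approximated by quantizing the activation vector into per-group centroids and summing precomputed table entries. The layer sizes, group length, and codebook size below are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

in_dim, out_dim = 64, 32       # toy layer sizes (assumed for illustration)
d, K = 4, 16                   # sub-vector length and centroids per group (assumed)
groups = in_dim // d

W = rng.standard_normal((out_dim, in_dim))        # layer weights
centroids = rng.standard_normal((groups, K, d))   # activation codebook per group

# Offline: precompute every partial dot product between a weight slice and a centroid.
# table[g, k, o] = W[o, g*d:(g+1)*d] . centroids[g, k]
table = np.einsum('gkj,ogj->gko', centroids, W.reshape(out_dim, groups, d))

# Online: quantize the activation vector (nearest-centroid search per group) ...
x = rng.standard_normal(in_dim)
x_groups = x.reshape(groups, d)
codes = ((x_groups[:, None, :] - centroids) ** 2).sum(-1).argmin(axis=1)

# ... then replace the matrix-vector multiply with table lookups and additions.
y_lut = table[np.arange(groups), codes].sum(axis=0)

y_exact = W @ x   # reference; y_lut approximates this, with error set by codebook quality
```

In the FPGA setting described here, tables like this sit in on-chip memory, so the expensive multiply-accumulate work becomes index-and-add operations; the approximation quality depends on how well the centroids fit the model's activations.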

Why it matters?

This work is important because it shows a way to make LLMs more practical on devices where power and speed are limited, like phones or embedded systems. By achieving lower latency than an AMD MI210 GPU and higher energy efficiency than an NVIDIA A100, LUT-LLM demonstrates the potential of FPGAs for running even larger LLMs in the future, potentially enabling more advanced AI applications on a wider range of devices.

Abstract

The rapid progress of large language models (LLMs) has advanced numerous applications, yet efficient single-batch inference remains vital for on-device intelligence. While FPGAs offer fine-grained data control and high energy efficiency, recent GPU optimizations have narrowed their advantage, especially under arithmetic-based computation. To overcome this, we leverage FPGAs' abundant on-chip memory to shift LLM inference from arithmetic- to memory-based computation through table lookups. We present LUT-LLM, the first FPGA accelerator enabling 1B+ LLM inference via vector-quantized memory operations. Our analysis identifies activation-weight co-quantization as the most effective scheme, supported by (1) bandwidth-aware parallel centroid search, (2) efficient 2D table lookups, and (3) a spatial-temporal hybrid design minimizing data caching. Implemented on an AMD V80 FPGA for a customized Qwen 3 1.7B model, LUT-LLM achieves 1.66x lower latency than AMD MI210 and 1.72x higher energy efficiency than NVIDIA A100, scaling to 32B models with 2.16x efficiency gain over A100.
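For readers who want a more concrete picture of the "activation-weight co-quantization" and "2D table lookups" mentioned in the abstract, the rough sketch below (in NumPy, with made-up codebook sizes and group counts) shows the basic idea: when both activation and weight sub-vectors are encoded against codebooks, their partial dot products can be precomputed in a single 2D table indexed by the two codes.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 4                  # sub-vector length (assumed)
Ka, Kw = 16, 256       # activation / weight codebook sizes (assumed)
groups = 8             # number of sub-vectors in this toy example

act_centroids = rng.standard_normal((Ka, d))
wgt_centroids = rng.standard_normal((Kw, d))

# Offline: 2D table of partial dot products, indexed by (activation code, weight code).
table2d = act_centroids @ wgt_centroids.T          # shape (Ka, Kw)

# Weights are encoded ahead of time: one code per d-wide sub-vector.
w_codes = rng.integers(0, Kw, size=groups)

# Online: encode each activation sub-vector (centroid search), then accumulate
# table entries instead of multiplying.
x = rng.standard_normal((groups, d))
a_codes = ((x[:, None, :] - act_centroids) ** 2).sum(-1).argmin(axis=1)

y_approx = table2d[a_codes, w_codes].sum()

# Same value computed from the de-quantized sub-vectors, as a sanity check.
y_ref = (act_centroids[a_codes] * wgt_centroids[w_codes]).sum()
assert np.isclose(y_approx, y_ref)
```

This is only a schematic of the co-quantization idea; the paper's actual contribution is making the centroid search and table accesses fit the FPGA's on-chip memory and bandwidth budget.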