
FFN Fusion: Rethinking Sequential Computation in Large Language Models

Akhiad Bercovich, Mohammad Dabbah, Omri Puny, Ido Galil, Amnon Geifman, Yonatan Geifman, Izhak Golan, Ehud Karpas, Itay Levy, Zach Moshe, Najeeb Nabwani, Tomer Ronen, Itamar Schen, Elad Segal, Ido Shahaf, Oren Tropp, Ran Zilberstein, Ran El-Yaniv

2025-03-25


Summary

This paper introduces FFN Fusion, a way to make large language models faster by running some of their internal calculations at the same time instead of one after another.

What's the problem?

Large language models are powerful, but inference is slow because their many layers must run one after another, and that long chain of strictly ordered calculations leaves parallel hardware like GPUs underused.

What's the solution?

The researchers developed FFN Fusion, a technique that identifies sequences of feed-forward network (FFN) layers that can safely run in parallel and merges them into a single wider layer, cutting inference latency while keeping accuracy nearly unchanged.
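As a rough illustration of the idea (a minimal sketch, not the authors' code; the plain ReLU MLPs and layer sizes below are assumptions, since Llama-style models use gated SwiGLU FFNs), two FFN blocks that normally run one after the other through residual connections can instead both read the same input, which lets their weight matrices be concatenated into one wider FFN computed in a single pass:

```python
import torch
import torch.nn as nn

# Minimal sketch of the fusion idea (illustrative assumptions: plain ReLU MLPs
# and small sizes; Llama-style models use gated SwiGLU FFNs).
d_model, d_ff = 16, 64

class FFN(nn.Module):
    def __init__(self):
        super().__init__()
        self.up = nn.Linear(d_model, d_ff, bias=False)
        self.down = nn.Linear(d_ff, d_model, bias=False)

    def forward(self, x):
        return self.down(torch.relu(self.up(x)))

ffn1, ffn2 = FFN(), FFN()
x = torch.randn(2, d_model)

# Sequential (original): the second FFN sees the first FFN's output.
y_seq = x + ffn1(x)
y_seq = y_seq + ffn2(y_seq)

# Fused (parallel): both FFNs read the same input, so their weights can be
# concatenated into one wider FFN evaluated in a single pass.
fused_up = torch.cat([ffn1.up.weight, ffn2.up.weight], dim=0)        # (2*d_ff, d_model)
fused_down = torch.cat([ffn1.down.weight, ffn2.down.weight], dim=1)  # (d_model, 2*d_ff)
y_fused = x + torch.relu(x @ fused_up.T) @ fused_down.T

# The two outputs differ slightly because the second FFN no longer sees the
# first FFN's contribution; the paper's observation is that for many FFN
# sequences this gap is small.
print((y_seq - y_fused).abs().max())
```

The fused version replaces several small, strictly ordered matrix multiplications with one wider multiplication, which means fewer sequential steps and better GPU utilization; that is where the latency savings come from.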

Why it matters?

This work matters because it makes very large models cheaper and faster to serve: applied to Llama-3.1-405B-Instruct, FFN Fusion helps produce a 253B-parameter model with a 1.71x inference speedup and far lower per-token cost, which makes such models practical in more applications.

Abstract

We introduce FFN Fusion, an architectural optimization technique that reduces sequential computation in large language models by identifying and exploiting natural opportunities for parallelization. Our key insight is that sequences of Feed-Forward Network (FFN) layers, particularly those remaining after the removal of specific attention layers, can often be parallelized with minimal accuracy impact. We develop a principled methodology for identifying and fusing such sequences, transforming them into parallel operations that significantly reduce inference latency while preserving model behavior. Applying these techniques to Llama-3.1-405B-Instruct, we create Llama-Nemotron-Ultra-253B-Base (Ultra-253B-Base), an efficient and soon-to-be publicly available model that achieves a 1.71X speedup in inference latency and 35X lower per-token cost while maintaining strong performance across benchmarks. Through extensive experiments on models from 49B to 253B parameters, we demonstrate that FFN Fusion becomes increasingly effective at larger scales and can complement existing optimization techniques like quantization and pruning. Most intriguingly, we find that even full transformer blocks containing both attention and FFN layers can sometimes be parallelized, suggesting new directions for neural architecture design.
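The abstract refers to a "principled methodology for identifying and fusing such sequences." Below is a hedged sketch of how such screening could look: measure, on a few calibration inputs, how much a candidate FFN sequence's output changes when it is parallelized, and fuse the least sensitive sequences first. The helper names and the cosine-distance criterion are assumptions for illustration, not the paper's exact procedure.

```python
import torch

# Hedged sketch of one way to screen which FFN sequences tolerate fusion:
# compare each candidate's sequential output with its parallelized substitute
# on calibration inputs and keep the candidates with the smallest gap.

def run_sequential(ffn_blocks, x):
    """Original computation: each FFN sees the previous block's output."""
    for ffn in ffn_blocks:
        x = x + ffn(x)
    return x

def run_parallel(ffn_blocks, x):
    """Parallelized substitute: every FFN in the sequence reads the same input."""
    return x + sum(ffn(x) for ffn in ffn_blocks)

def fusion_error(ffn_blocks, calib_inputs):
    """Mean cosine distance between the sequential and parallel outputs."""
    dists = []
    for x in calib_inputs:
        seq = run_sequential(ffn_blocks, x)
        par = run_parallel(ffn_blocks, x)
        cos = torch.nn.functional.cosine_similarity(seq, par, dim=-1)
        dists.append((1.0 - cos).mean())
    return torch.stack(dists).mean()

# Candidate sequences with low fusion_error are the natural ones to fuse first.
```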