KernelEvolve: Scaling Agentic Kernel Coding for Heterogeneous AI Accelerators at Meta
Gang Liao, Hongsen Qin, Ying Wang, Alicia Golden, Michael Kuchnik, Yavuz Yetim, Jia Jiunn Ang, Chunli Fu, Yihan He, Samuel Hsia, Zewei Jiang, Dianshi Li, Uladzimir Pashkevich, Varna Puvvada, Feng Shi, Matt Steiner, Ruichao Xiao, Nathan Yan, Xiayu Yu, Zhou Fang, Abdul Zainul-Abedin, Ketan Singh
2025-12-30
Summary
This paper introduces KernelEvolve, a system that automatically writes and optimizes the low-level code (kernels) that lets deep learning recommendation models run faster and more efficiently on many different types of AI hardware.
What's the problem?
Training and serving deep learning recommendation models is challenging for three main reasons. First, the models themselves come in many different architectures. Second, their basic building blocks (called kernels) are also diverse. Third, the hardware used to run these models, such as GPUs from NVIDIA and AMD as well as custom-built accelerators, varies greatly across vendors and generations. This makes it hard to write code that performs well across all of these combinations, and hand-optimizing each one can take weeks.
What's the solution?
KernelEvolve solves this by acting as an automated code generator and optimizer. You give it a specification of what a kernel needs to do, and it automatically writes and tunes the code to run efficiently on whatever hardware you're using. It works at multiple levels of programming abstraction, from high-level languages like Triton and CuTe DSL down to lower-level languages, covering the full hardware-software stack. It uses an adaptive search process that continually refines candidate kernels based on how the generated code actually runs, steering toward the best measured performance. It was tested on real-world production recommendation models across several generations of GPUs and AI accelerators.
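The adaptive search described above can be pictured as a small evolutionary loop. The sketch below is illustrative only, not KernelEvolve's actual implementation: the function names, the greedy-with-exploration selection policy, the fixed-budget termination rule, and latency-as-fitness are all assumptions, and the LLM-driven rewrite operator is stubbed out as a caller-supplied `mutate` function.

```python
import random

# Hypothetical sketch of an adaptive kernel search loop. In the real
# system, candidates would be kernel source code and measure_latency
# would compile and profile them on the target hardware; here both are
# abstract stand-ins supplied by the caller.

def search(seed_kernel, mutate, measure_latency, budget=20, explore=0.2):
    """Evolve a population of kernel candidates toward lower latency."""
    population = [(seed_kernel, measure_latency(seed_kernel))]
    for _ in range(budget):  # termination rule: fixed evaluation budget
        # Selection policy: usually exploit the fastest candidate so far,
        # occasionally explore a random one to escape local optima.
        if random.random() < explore:
            parent, _ = random.choice(population)
        else:
            parent, _ = min(population, key=lambda kv: kv[1])
        child = mutate(parent)  # rewrite operator (an LLM in the real system)
        population.append((child, measure_latency(child)))
    # Return the fastest candidate found and its measured latency.
    return min(population, key=lambda kv: kv[1])
```

As a toy usage, a `mutate` that perturbs an integer tile-size parameter and a synthetic latency curve with a single minimum will drive the loop toward the fastest configuration.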
Why it matters?
KernelEvolve is important because it dramatically speeds up the process of getting these models running efficiently, reducing development time from weeks to hours. It also makes it easier to use new and specialized AI hardware, even if it’s custom-built, by automating the creation of the necessary code. This means faster recommendations, lower costs, and the ability to take advantage of the latest hardware innovations.
Abstract
Making deep learning recommendation model (DLRM) training and inference fast and efficient is important. However, this presents three key system challenges: model architecture diversity, kernel primitive diversity, and hardware generation and architecture heterogeneity. This paper presents KernelEvolve, an agentic kernel coding framework, to tackle heterogeneity at scale for DLRM. KernelEvolve takes kernel specifications as input and automates kernel generation and optimization for recommendation models across heterogeneous hardware architectures. It does so by operating at multiple programming abstractions, from Triton and CuTe DSL to low-level hardware-agnostic languages, spanning the full hardware-software optimization stack. The kernel optimization process is formulated as a graph-based search with a selection policy, universal operators, a fitness function, and a termination rule, and it dynamically adapts to the runtime execution context through retrieval-augmented prompt synthesis. We designed, implemented, and deployed KernelEvolve to optimize a wide variety of production recommendation models across generations of NVIDIA and AMD GPUs, as well as Meta's AI accelerators. We validate KernelEvolve on the publicly available KernelBench suite, achieving a 100% pass rate on all 250 problems across three difficulty levels, and on 160 PyTorch ATen operators across three heterogeneous hardware platforms, demonstrating 100% correctness. KernelEvolve reduces development time from weeks to hours and achieves substantial performance improvements over PyTorch baselines across diverse production use cases and heterogeneous AI systems at scale. Beyond performance efficiency improvements, KernelEvolve significantly lowers the programmability barrier for new AI hardware by enabling automated kernel generation for in-house developed AI hardware.
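The abstract's retrieval-augmented prompt synthesis step can be illustrated with a small sketch: retrieve the prior kernels most similar to the current specification and splice them, together with the latest runtime feedback, into the next generation prompt. Everything here (the token-overlap retriever, the corpus format, and the prompt template) is a hypothetical stand-in for illustration, not the paper's implementation.

```python
# Hypothetical sketch of retrieval-augmented prompt synthesis: retrieve
# similar prior kernels plus recent runtime feedback, then assemble them
# into the prompt for the next kernel-generation attempt.

def token_overlap(a: str, b: str) -> float:
    """Jaccard similarity over whitespace tokens, a toy retriever."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 0.0

def synthesize_prompt(spec: str, corpus: list[str], feedback: str, k: int = 2) -> str:
    """Build a generation prompt from the kernel specification, the top-k
    most similar prior kernels, and the latest runtime execution context
    (e.g. compiler or profiler output from the previous attempt)."""
    examples = sorted(corpus, key=lambda doc: token_overlap(spec, doc), reverse=True)[:k]
    parts = ["# Kernel specification:", spec, "# Similar optimized kernels:"]
    parts += examples
    parts += ["# Runtime feedback from last attempt:", feedback]
    return "\n".join(parts)
```

In the real system the retrieved documents would be complete optimized kernel sources and the similarity metric would be far richer; the point is only that the prompt is rebuilt each iteration from retrieval results and fresh execution feedback rather than being static.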