LLM-Powered Fully Automated Chaos Engineering: Towards Enabling Anyone to Build Resilient Software Systems at Low Cost

Daisuke Kikuta, Hiroki Ikeuchi, Kengo Tajiri

2025-11-19


Summary

This paper introduces ChaosEater, a new system that uses artificial intelligence, specifically large language models, to automatically improve the reliability of complex software systems, particularly those built using Kubernetes.

What's the problem?

Currently, testing how well a software system handles failures, a practice called Chaos Engineering, is still largely manual. Existing tools can run predefined experiments automatically, but planning those experiments and fixing the weaknesses they reveal takes a lot of time and effort and requires expertise in many different areas. This makes building truly reliable systems expensive and difficult for many teams.

What's the solution?

ChaosEater addresses this by automating the entire Chaos Engineering cycle. It uses large language models to handle every step, from figuring out *what* kinds of failures to test for, to actually *causing* those failures in a controlled way, to analyzing the results and suggesting changes that fix the weaknesses found. It is designed specifically for software running on Kubernetes, and the LLMs carry out the cycle through software engineering tasks: defining requirements, generating code, testing, and debugging.
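To make the cycle described above more concrete, here is a minimal Python sketch of one hypothesize, inject, analyze, and improve loop. It is not ChaosEater's implementation: the `System` model, the function names, the simulated pod failure, and the "add a replica" fix are all hypothetical stand-ins, and the step that an LLM would perform is replaced by a hard-coded rule.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class System:
    """Toy stand-in for a Kubernetes application (hypothetical model)."""
    replicas: int = 1


@dataclass
class CycleReport:
    steady_state_held: bool
    iterations: int
    changes: List[str] = field(default_factory=list)


def steady_state(system: System, failed_pods: int) -> bool:
    """Hypothesis: at least one replica keeps serving during the fault."""
    return system.replicas - failed_pods >= 1


def inject_pod_failure(system: System) -> int:
    """Simulated fault: one pod is killed. Returns the number of failed pods."""
    return 1


def propose_fix(system: System) -> str:
    """Placeholder for the improvement step (an LLM's job in the paper).

    Here it is a hard-coded rule: add one replica.
    """
    system.replicas += 1
    return f"set replicas to {system.replicas}"


def run_ce_cycle(system: System, max_iterations: int = 3) -> CycleReport:
    """Repeat experiment and improvement until the hypothesis holds."""
    changes: List[str] = []
    for i in range(1, max_iterations + 1):
        failed = inject_pod_failure(system)    # run the experiment
        if steady_state(system, failed):       # analyze the result
            return CycleReport(True, i, changes)
        changes.append(propose_fix(system))    # improve, then retry
    return CycleReport(False, max_iterations, changes)


if __name__ == "__main__":
    print(run_ce_cycle(System(replicas=1)))
```

Starting from a single replica, the first experiment violates the hypothesis, the fix raises the replica count to two, and the second experiment passes. In a real setup the fault would be injected by a chaos tool against a live cluster and the fix would be a change to the system's manifests.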

Why does it matter?

This work matters because it aims to make building resilient systems more accessible and affordable. If the Chaos Engineering cycle is automated with LLMs, teams no longer need to be large or have specialized expertise to proactively find and fix weaknesses in their software. In the paper's case studies on small- and large-scale Kubernetes systems, ChaosEater consistently completed reasonable cycles at low time and monetary cost.

Abstract

Chaos Engineering (CE) is an engineering technique aimed at improving the resilience of distributed systems. It involves intentionally injecting faults into a system to test its resilience, uncover weaknesses, and address them before they cause failures in production. Recent CE tools automate the execution of predefined CE experiments. However, planning such experiments and improving the system based on the experimental results still remain manual. These processes are labor-intensive and require multi-domain expertise. To address these challenges and enable anyone to build resilient systems at low cost, this paper proposes ChaosEater, a system that automates the entire CE cycle with Large Language Models (LLMs). It predefines an agentic workflow according to a systematic CE cycle and assigns subdivided processes within the workflow to LLMs. ChaosEater targets CE for software systems built on Kubernetes. Therefore, the LLMs in ChaosEater complete CE cycles through software engineering tasks, including requirement definition, code generation, testing, and debugging. We evaluate ChaosEater through case studies on small- and large-scale Kubernetes systems. The results demonstrate that it consistently completes reasonable CE cycles with significantly low time and monetary costs. Its cycles are also qualitatively validated by human engineers and LLMs.
