Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim

2026-05-04

Summary

This paper focuses on improving a technique called 'red-teaming' for large language models (LLMs). Red-teaming is like stress-testing an LLM to find its weaknesses and make sure it's safe and doesn't produce harmful outputs.

What's the problem?

Finding effective ways to 'attack' LLMs during red-teaming is difficult. You want attacks that are both good at exposing vulnerabilities *and* varied, so you don't just find the same problem over and over. Generative Flow Networks (GFNs), a promising approach that learns to sample attacks in proportion to how well they work, suffer from unstable training: they often fail to learn effectively and collapse into producing repetitive or nonsensical outputs, especially when the reward feedback is noisy, as it typically is in red-teaming.
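
For readers who want the technical core: GFNs are usually trained with the 'trajectory balance' objective, which requires learning a scalar estimate of the partition function Z alongside the policy. A minimal sketch of that objective, in the standard notation from the GFlowNet literature (the textbook form, not necessarily this paper's exact variant):

```latex
% Trajectory balance (TB) loss for a trajectory tau = (s_0 -> ... -> s_n = x).
% Z_theta: learned scalar estimate of the partition function.
% P_F / P_B: forward / backward policies; R(x): reward of the generated attack x.
\mathcal{L}_{\mathrm{TB}}(\tau) =
\left(
  \log \frac{Z_\theta \prod_{t=0}^{n-1} P_F(s_{t+1} \mid s_t; \theta)}
            {R(x) \prod_{t=0}^{n-1} P_B(s_t \mid s_{t+1}; \theta)}
\right)^2
```

Because Z_theta must track the global reward scale, a handful of noisy or mislabeled rewards can swing its estimate and destabilize the entire policy, which is the failure mode this paper targets.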

What's the solution?

The researchers developed a more stable variant of GFN called Stable-GFN (S-GFN). S-GFN removes the need to estimate a quantity called the partition function (Z), a calculation that was a major source of instability; instead, it compares pairs of candidate attacks against each other, which makes the Z term cancel out (see the sketch below). It also applies a masking technique that filters out unreliable reward feedback, making training more robust. Finally, it includes a 'fluency stabilizer' that keeps the model from getting stuck in local optima where it only generates gibberish.
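
How dropping Z can work: if you write the trajectory-balance residual for two sampled attacks and subtract one from the other, the log Z term cancels. A hedged sketch of this pairwise idea, reconstructed from the abstract (the paper's actual 'contrastive trajectory balance' loss may differ in its details):

```latex
% Per-trajectory residual with the learned log Z split out:
\delta(\tau) = \sum_t \log P_F(s_{t+1} \mid s_t; \theta)
             - \sum_t \log P_B(s_t \mid s_{t+1}; \theta)
             - \log R(x_\tau)
% TB drives  log Z_theta + delta(tau) -> 0  for every trajectory.
% Comparing two trajectories tau and tau' makes log Z cancel,
% so no Z estimate is needed at all:
\mathcal{L}_{\mathrm{pair}}(\tau, \tau') =
\bigl( \delta(\tau) - \delta(\tau') \bigr)^2
% At the GFN optimum, delta(tau) = -log Z is the same constant for all
% trajectories, so the pairwise loss is already zero there: the optimal
% policy (sampling attacks in proportion to reward) is preserved.
```

This is consistent with the abstract's claim that S-GFN 'maintains the optimal policy of GFN' while avoiding Z estimation.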

Why it matters?

This work is important because it makes red-teaming LLMs more effective. By creating a more stable and reliable method for finding vulnerabilities, we can build safer and more trustworthy AI systems. S-GFN consistently finds better and more diverse attacks than previous methods, meaning it can help developers identify and fix more problems in their LLMs.

Abstract

Large Language Model (LLM) Red-Teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding attacks that are both effective and diverse is important, but achieving both is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising approach, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function Z estimation in GFNs and reduces training instability. S-GFN avoids Z estimation through pairwise comparisons and employs a masking methodology that is robust to noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFNs. We demonstrate the superior attack performance and diversity of S-GFN across various settings.