
Attributes as Textual Genes: Leveraging LLMs as Genetic Algorithm Simulators for Conditional Synthetic Data Generation

Guangzeng Han, Weisi Liu, Xiaolei Huang

2025-09-03

Summary

This paper introduces a method called Genetic Prompt for generating realistic, varied synthetic data with large language models (LLMs), AI systems that are good at producing text.

What's the problem?

Large language models can generate synthetic data, but the results are often not good enough. The generated data may lack diversity, meaning it doesn't represent the variety of the real world, or it may lack quality, making it a poor substitute for real data when training other AI models. Producing synthetic data that is actually useful remains a challenge.

What's the solution?

The researchers borrowed an idea from biology, genetic algorithms, and combined it with the power of large language models. They treat the key attributes of a text as 'genes'. The language model then 'breeds' these attributes: mixing pairs from two parent texts (crossover) and occasionally tweaking them (mutation) to create new, diverse data. They also added an active learning scheme that picks the most promising 'parent' attributes, widening the search for strong offspring. This process produces synthetic data that more closely resembles real-world data.
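The loop described above can be sketched in code. This is a minimal, hypothetical illustration of the genetic-algorithm skeleton, not the authors' implementation: the two `llm_*` functions stand in for prompting an actual LLM to blend or rewrite attributes, and plain fitness-based selection replaces the paper's active learning scheme for brevity.

```python
import random

def llm_crossover(parent_a, parent_b, rng):
    """Stand-in for prompting an LLM to combine two parents' attributes.
    Here we simply pick each 'gene' from one parent or the other."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]

def llm_mutate(genes, attribute_pool, rng, rate=0.2):
    """Stand-in for an LLM rewriting an attribute into a novel variant.
    Here a gene is randomly swapped for another attribute from the pool."""
    return [rng.choice(attribute_pool) if rng.random() < rate else g
            for g in genes]

def genetic_prompt(population, attribute_pool, fitness,
                   generations=5, seed=0):
    """Evolve attribute sets ('gene sequences') over several generations.
    In the real method, each offspring's attributes would condition an LLM
    prompt that generates a synthetic text example."""
    rng = random.Random(seed)
    for _ in range(generations):
        # Select the fitter half as parents (the paper instead uses an
        # active learning scheme to expand the offspring search space).
        population = sorted(population, key=fitness, reverse=True)
        parents = population[: max(2, len(population) // 2)]
        offspring = []
        while len(offspring) < len(population):
            a, b = rng.sample(parents, 2)
            child = llm_mutate(llm_crossover(a, b, rng),
                               attribute_pool, rng)
            offspring.append(child)
        population = offspring
    return max(population, key=fitness)

# Toy usage: evolve 3-attribute style descriptions toward a target style.
pool = ["formal", "casual", "angry", "joyful", "terse", "verbose"]
start = [["formal", "terse", "angry"],
         ["casual", "angry", "terse"],
         ["formal", "angry", "terse"],
         ["casual", "terse", "formal"]]
score = lambda genes: sum(a in {"joyful", "verbose"} for a in genes)
best = genetic_prompt(start, pool, score)
```

The names `llm_crossover`, `llm_mutate`, and `genetic_prompt` are illustrative; in practice the crossover and mutation steps would each be a prompt to the generator LLM rather than a random choice.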

Why it matters?

This is important because good synthetic data can help train AI models when real data is limited or biased. The Genetic Prompt method outperforms existing techniques and improves downstream model performance, particularly in class-imbalanced settings where some categories have far fewer examples than others. That makes it a useful tool for building more reliable and fair AI systems across a wide range of language-based tasks.

Abstract

Large Language Models (LLMs) excel at generating synthetic data, but ensuring its quality and diversity remains challenging. We propose Genetic Prompt, a novel framework that combines genetic algorithms with LLMs to augment synthetic data generation. Our approach treats semantic text attributes as gene sequences and leverages the LLM to simulate crossover and mutation operations. This genetic process enhances data quality and diversity by creating novel attribute combinations, yielding synthetic distributions closer to real-world data. To optimize parent selection, we also integrate an active learning scheme that expands the offspring search space. Our experiments on multiple NLP tasks reveal several key findings: Genetic Prompt not only significantly outperforms state-of-the-art baselines but also shows robust performance across various generator model sizes and scales. Moreover, we demonstrate that fusing our synthetic data with the original training set significantly boosts downstream model performance, particularly for class-imbalanced scenarios. Our findings validate that Genetic Prompt is an effective method for producing high-quality synthetic data for a wide range of NLP applications.