Think Inside the JSON: Reinforcement Strategy for Strict LLM Schema Adherence
Bhavik Agarwal, Ishan Joshi, Viktoria Rojkova
2025-02-24
Summary
This paper presents a new way to teach AI language models to follow strict rules when creating structured data, like JSON, using a learning method called reinforcement learning.
What's the problem?
AI language models are great at generating text, but they often struggle to follow specific formats or structures consistently. This is a big issue when we need the AI to create data in a particular format, like JSON, which is commonly used for storing and exchanging information between computer systems.
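To make the problem concrete, here is a minimal sketch (the `name`/`age` fields are hypothetical, chosen only for illustration) of how a downstream system might check whether model output actually conforms to a strict JSON schema, and how near-miss outputs fail:

```python
import json

def conforms(text: str) -> bool:
    """Return True only if `text` parses as JSON and matches a
    hypothetical schema: an object with a string "name" and an
    integer "age"."""
    try:
        obj = json.loads(text)
    except json.JSONDecodeError:
        return False  # not valid JSON at all
    if not isinstance(obj, dict):
        return False  # valid JSON, but not an object
    return isinstance(obj.get("name"), str) and isinstance(obj.get("age"), int)

# Language models often produce near-misses that break downstream parsers:
print(conforms('{"name": "Ada", "age": 36}'))    # correct → True
print(conforms('{"name": "Ada", "age": "36"}'))  # wrong type → False
print(conforms("{'name': 'Ada', 'age': 36}"))    # single quotes → False
```

Each failing case looks "almost right" to a human but is unusable by software that expects schema-valid JSON, which is why consistent adherence matters.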
What's the solution?
The researchers created a new approach called ThinkJSON. They started with a smaller AI model (1.5 billion parameters) and trained it using reinforcement learning, which is like teaching through trial and error, with rewards for good performance. They first taught the model core reasoning skills on a 20,000-example dataset, then fine-tuned it on a further 10,000 examples to follow specific data structures. The whole process was relatively quick, taking only about 23 hours in total on high-powered computers.
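The "rewards for good performance" idea can be sketched as a shaped reward function for schema adherence. This is a hypothetical illustration, not the paper's actual reward: the credit values and the `required_keys` specification are assumptions made for the example.

```python
import json

def schema_reward(completion: str, required_keys: dict) -> float:
    """Hypothetical shaped reward: partial credit for producing valid
    JSON, full credit only when every required key has the right type."""
    try:
        obj = json.loads(completion)
    except json.JSONDecodeError:
        return 0.0  # unparseable output earns nothing
    if not isinstance(obj, dict):
        return 0.1  # valid JSON, but not an object
    reward = 0.5    # a valid JSON object earns partial credit
    matched = sum(
        1 for key, typ in required_keys.items()
        if isinstance(obj.get(key), typ)
    )
    # remaining credit scales with the fraction of schema fields satisfied
    reward += 0.5 * matched / max(len(required_keys), 1)
    return reward

spec = {"name": str, "age": int}
print(schema_reward('{"name": "Ada", "age": 36}', spec))  # 1.0
print(schema_reward('{"name": "Ada"}', spec))             # 0.75
print(schema_reward('not json', spec))                    # 0.0
```

Shaping the reward this way gives the trial-and-error loop a gradient to climb: outputs that are closer to the target schema score higher, instead of everything short of perfection scoring zero.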
Why it matters?
This matters because it shows we can teach AI to follow strict data formats without needing huge models or long training runs. That makes it easier and cheaper to build AI systems that reliably produce structured data for real-world applications, helping businesses and developers who need consistent, properly formatted output for their systems.
Abstract
In this paper, we address the challenge of enforcing strict schema adherence in large language model (LLM) generation by leveraging LLM reasoning capabilities. Building on the DeepSeek R1 reinforcement learning framework, our approach trains the structured reasoning skills of a 1.5B-parameter model through a novel pipeline that combines synthetic reasoning dataset construction with custom reward functions under Group Relative Policy Optimization (GRPO). Specifically, we first perform R1 reinforcement learning on a 20K-sample unstructured-to-structured dataset, mirroring the original DeepSeek R1 methods, to establish core reasoning abilities. Subsequently, we perform supervised fine-tuning on a separate 10K-sample reasoning dataset, refining schema adherence for downstream tasks. Despite the relatively modest training scope, requiring approximately 20 hours on an 8xH100 GPU cluster for GRPO training and 3 hours on 1xA100 for SFT, our model demonstrates robust performance in enforcing schema consistency. We compare our ThinkJSON approach against the original DeepSeek R1 (671B), distilled versions of DeepSeek R1 (Qwen-1.5B and Qwen-7B), and Gemini 2.0 Flash (70B), showcasing its effectiveness in real-world applications. Our results underscore the practical utility of a resource-efficient framework for schema-constrained text generation.
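The GRPO step named in the abstract can be sketched in its core mechanism: rather than learning a separate value model, GRPO samples a group of completions per prompt and baselines each completion's reward against the group's own statistics. This is a minimal illustration of that normalization, not the paper's training code; group size and reward values are invented for the example.

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO's group-relative baseline: standardize each completion's
    reward against the mean and standard deviation of its own group,
    so no learned value (critic) model is required."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# Rewards for four completions sampled from the same prompt:
advs = group_relative_advantages([1.0, 0.75, 0.5, 0.75])
print(advs)  # above-average completions get positive advantage
```

Completions scoring above their group's mean receive positive advantage (and are reinforced); those below receive negative advantage, so the policy shifts toward schema-conforming generations without the cost of training a critic.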