Easy Dataset: A Unified and Extensible Framework for Synthesizing LLM Fine-Tuning Data from Unstructured Documents

Ziyang Miao, Qiyu Sun, Jingyuan Wang, Yuchen Gong, Yaowei Zheng, Shiqi Li, Richong Zhang

2025-07-08

Summary

This paper introduces Easy Dataset, a framework that uses a graphical interface and large language models to convert unstructured documents into labeled data for fine-tuning large language models.

What's the problem?

Preparing high-quality fine-tuning data for language models is often time-consuming and difficult because many source documents are unstructured and unlabeled, especially in specialized domains.

What's the solution?

The researchers built Easy Dataset, which lets users generate and organize training data from unstructured documents through a user-friendly graphical interface. It uses large language models to automatically extract and structure data for fine-tuning. Models fine-tuned on this data improve in specialized domains without losing general knowledge.
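The general idea of turning a document into fine-tuning records can be illustrated with a minimal sketch. This is not Easy Dataset's actual code or API: the function names (`split_into_chunks`, `synthesize_records`), the prompts, the `ask_llm` callable, and the instruction/input/output record layout are all assumptions made for illustration, and a stub stands in for a real model call.

```python
import json
from typing import Callable, Dict, List


def split_into_chunks(text: str, max_chars: int = 200) -> List[str]:
    """Group blank-line-separated paragraphs into chunks of roughly max_chars."""
    chunks: List[str] = []
    current = ""
    for para in (p.strip() for p in text.split("\n\n")):
        if not para:
            continue
        # Start a new chunk when adding this paragraph would exceed the limit.
        if current and len(current) + len(para) + 1 > max_chars:
            chunks.append(current)
            current = para
        else:
            current = f"{current}\n{para}" if current else para
    if current:
        chunks.append(current)
    return chunks


def synthesize_records(
    text: str, ask_llm: Callable[[str], str], max_chars: int = 200
) -> List[Dict[str, str]]:
    """Ask the model for one question and one answer per chunk (hypothetical prompts)."""
    records: List[Dict[str, str]] = []
    for chunk in split_into_chunks(text, max_chars):
        question = ask_llm(f"Write one question that this passage answers:\n{chunk}")
        answer = ask_llm(
            "Answer the question using only this passage.\n"
            f"Passage:\n{chunk}\nQuestion: {question}"
        )
        # Assumed record layout; the framework's real export format may differ.
        records.append({"instruction": question, "input": "", "output": answer})
    return records


def fake_llm(prompt: str) -> str:
    """Stub standing in for a real LLM call, so the sketch runs offline."""
    if prompt.startswith("Write one question"):
        return "What does the passage describe?"
    return "A placeholder answer."


if __name__ == "__main__":
    document = "First paragraph of a manual.\n\nSecond paragraph of a manual."
    for record in synthesize_records(document, fake_llm, max_chars=30):
        print(json.dumps(record))
```

In practice the stub would be replaced by a call to whichever model is available, and a human would review the generated pairs before using them for fine-tuning.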

Why does it matter?

This matters because it makes it easier and faster to customize language models for specific tasks or industries, helping AI work better in real-world applications like healthcare, law, and customer service.

Abstract

A framework called Easy Dataset synthesizes fine-tuning data from unstructured documents using a GUI and LLMs, improving domain-specific performance of LLMs while maintaining general knowledge.
