MoC: Mixtures of Text Chunking Learners for Retrieval-Augmented Generation System

Jihao Zhao, Zhiyuan Ji, Zhaoxin Fan, Hanyu Wang, Simin Niu, Bo Tang, Feiyu Xiong, Zhiyu Li

2025-03-13


Summary

This paper introduces MoC, a smarter way to split text into chunks for AI systems that retrieve and use information from large documents, such as study guides or research papers.

What's the problem?

Current methods for splitting text into chunks either miss important details or take too much computer power, making AI answers less accurate or too slow.

What's the solution?

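MoC uses a mixture of LLM-based chunkers that learn to produce a list of regular expressions, which are then used to extract chunks from the original text. The paper also introduces two metrics, Boundary Clarity and Chunk Stickiness, that measure chunk quality directly.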

Why does it matter?

This helps AI tools give better answers faster, especially for tasks like homework help or finding specific info in long documents, without needing tons of computer resources.

Abstract

Retrieval-Augmented Generation (RAG), while serving as a viable complement to large language models (LLMs), often overlooks the crucial aspect of text chunking within its pipeline. This paper first introduces a dual-metric evaluation method, comprising Boundary Clarity and Chunk Stickiness, to enable the direct quantification of chunking quality. Leveraging this assessment method, we highlight the inherent limitations of traditional and semantic chunking in handling complex contextual nuances, thereby substantiating the necessity of integrating LLMs into the chunking process. To address the inherent trade-off between computational efficiency and chunking precision in LLM-based approaches, we devise the granularity-aware Mixture-of-Chunkers (MoC) framework, which consists of a three-stage processing mechanism. Notably, our objective is to guide the chunker towards generating a structured list of chunking regular expressions, which are subsequently employed to extract chunks from the original text. Extensive experiments demonstrate that both our proposed metrics and the MoC framework effectively address the challenges of the chunking task, revealing the chunking kernel while enhancing the performance of the RAG system.
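
The idea of extracting chunks with a list of regular expressions can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function name `extract_chunks`, the pattern format, and the sample text are all assumptions made for illustration, and the abstract does not specify how the generated expressions are actually structured.

```python
import re


def extract_chunks(text, patterns):
    """Apply each regex in order, searching from where the previous chunk ended.

    Hypothetical helper: the pattern format is an assumption, not the paper's.
    Text skipped between two matches is dropped in this simplified sketch.
    """
    chunks = []
    pos = 0
    for pattern in patterns:
        # DOTALL lets '.' span newlines so a chunk may cover several lines.
        match = re.compile(pattern, re.DOTALL).search(text, pos)
        if match is None:
            continue  # pattern did not match; skip it rather than guess
        chunks.append(match.group(0))
        pos = match.end()
    # Keep any non-empty tail that no pattern covered.
    if text[pos:].strip():
        chunks.append(text[pos:])
    return chunks


text = (
    "RAG retrieves passages. It then generates an answer. "
    "Chunking splits documents. Good chunks help retrieval."
)
# Each assumed pattern names the start and end of one chunk.
patterns = [
    r"RAG retrieves.*?an answer\.",
    r"Chunking splits.*?help retrieval\.",
]
chunks = extract_chunks(text, patterns)
```

One reason such a design might be attractive is that the chunker only has to emit short patterns rather than reproduce every chunk in full, while the chunks themselves are always copied verbatim from the original text.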
