Potential and Perils of Large Language Models as Judges of Unstructured Textual Data

Rewina Bedemariam, Natalie Perez, Sreyoshi Bhaduri, Satya Kapoor, Alex Gil, Elizabeth Conjar, Ikkei Itoku, David Theil, Aman Chadha, Naumaan Nayyar

2025-01-15

Summary

This paper looks at using large language models (LLMs) to analyze and summarize large amounts of text data, like survey responses. It explores whether these models can be trusted to judge and summarize the main ideas in the text as accurately as human evaluators can.

What's the problem?

As more organizations use AI to make sense of text data from surveys or feedback, there's a worry that the AI might not always capture the true meaning of what people are saying. If the AI misses important points or misunderstands the context, organizations could end up making decisions based on a distorted picture of that feedback.

What's the solution?

The researchers tested how well AI models could summarize and judge the main themes in survey responses, comparing the models' judgments to those of human evaluators using statistical agreement measures (Cohen's kappa, Spearman's rho, and Krippendorff's alpha). They used one AI model (Anthropic's Claude) to create the thematic summaries and other AI models (Amazon's Titan Express, Amazon's Nova Pro, and Meta's Llama) to judge those summaries.
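
To make the setup concrete, the judging step can be pictured as a prompt-and-score loop: a judge model reads the original survey responses and a candidate summary, then returns a thematic-alignment rating. The sketch below is not the authors' code; it assumes the judge models are reached through Amazon Bedrock's Converse API, and the model ID, prompt wording, and 1-5 rating scale are illustrative choices.

```python
# A hedged sketch of an LLM-as-judge call (not the paper's implementation).
# Assumes judge models are available via Amazon Bedrock's Converse API;
# the model ID, prompt wording, and 1-5 scale are illustrative assumptions.
import re
import boto3

bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")

JUDGE_PROMPT = (
    "You are evaluating how well a summary captures the themes of the "
    "survey responses below.\n\n"
    "Survey responses:\n{responses}\n\n"
    "Candidate summary:\n{summary}\n\n"
    "Rate the thematic alignment on a scale of 1 (poor) to 5 (excellent). "
    "Reply with the number only."
)

def judge_summary(responses: str, summary: str,
                  model_id: str = "amazon.nova-pro-v1:0") -> int:
    """Ask a judge model for a 1-5 thematic-alignment score."""
    reply = bedrock.converse(
        modelId=model_id,
        messages=[{
            "role": "user",
            "content": [{"text": JUDGE_PROMPT.format(responses=responses,
                                                      summary=summary)}],
        }],
        inferenceConfig={"maxTokens": 10, "temperature": 0.0},
    )
    text = reply["output"]["message"]["content"][0]["text"]
    match = re.search(r"[1-5]", text)  # pull out the first 1-5 digit
    return int(match.group()) if match else 0
```

In this kind of setup, the same summary would be scored by each judge model (Titan Express, Nova Pro, Llama) and by human raters, so the score sets can later be compared for agreement.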

Why it matters?

This research matters because it helps us understand how well we can trust AI to handle important text data. If AI can do this job well, it could save a lot of time and money for organizations that need to analyze large amounts of text. However, the study also shows that humans might still be better at catching subtle meanings that AI might miss. This information can help companies and researchers make better decisions about when and how to use AI for analyzing text data.

Abstract

Rapid advancements in large language models have unlocked remarkable capabilities when it comes to processing and summarizing unstructured text data. This has implications for the analysis of rich, open-ended datasets, such as survey responses, where LLMs hold the promise of efficiently distilling key themes and sentiments. However, as organizations increasingly turn to these powerful AI systems to make sense of textual feedback, a critical question arises: can we trust LLMs to accurately represent the perspectives contained within these text-based datasets? While LLMs excel at generating human-like summaries, there is a risk that their outputs may inadvertently diverge from the true substance of the original responses. Discrepancies between the LLM-generated outputs and the actual themes present in the data could lead to flawed decision-making, with far-reaching consequences for organizations. This research investigates the effectiveness of LLMs as judge models to evaluate the thematic alignment of summaries generated by other LLMs. We utilized an Anthropic Claude model to generate thematic summaries from open-ended survey responses, with Amazon's Titan Express, Nova Pro, and Meta's Llama serving as LLM judges. The LLM-as-judge approach was compared to human evaluations using Cohen's kappa, Spearman's rho, and Krippendorff's alpha, validating a scalable alternative to traditional human-centric evaluation methods. Our findings reveal that while LLMs as judges offer a scalable solution comparable to human raters, humans may still excel at detecting subtle, context-specific nuances. This research contributes to the growing body of knowledge on AI assisted text analysis. We discuss limitations and provide recommendations for future research, emphasizing the need for careful consideration when generalizing LLM judge models across various contexts and use cases.
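
To illustrate how agreement between an LLM judge and human raters could be quantified with the three statistics named above, here is a minimal sketch. It is not the paper's code: the rating values are made up, the 1-5 scale is an assumption, and it relies on scikit-learn, SciPy, and the third-party krippendorff package.

```python
# Minimal sketch of inter-rater agreement between a human rater and an LLM judge.
# The scores are hypothetical; the paper's actual data and scale may differ.
# Requires: numpy, scikit-learn, scipy, and the `krippendorff` package.
import numpy as np
from sklearn.metrics import cohen_kappa_score
from scipy.stats import spearmanr
import krippendorff

# Hypothetical 1-5 thematic-alignment scores for ten summaries.
human_scores = np.array([5, 4, 3, 4, 5, 2, 4, 3, 5, 4])
llm_judge_scores = np.array([5, 4, 4, 4, 5, 3, 4, 3, 4, 4])

# Cohen's kappa: chance-corrected agreement between the two raters.
kappa = cohen_kappa_score(human_scores, llm_judge_scores, weights="quadratic")

# Spearman's rho: rank correlation between the two sets of scores.
rho, p_value = spearmanr(human_scores, llm_judge_scores)

# Krippendorff's alpha: reliability across raters (rows) and items (columns),
# treating the 1-5 scale as ordinal.
alpha = krippendorff.alpha(
    reliability_data=np.vstack([human_scores, llm_judge_scores]),
    level_of_measurement="ordinal",
)

print(f"Cohen's kappa:        {kappa:.3f}")
print(f"Spearman's rho:       {rho:.3f} (p={p_value:.3f})")
print(f"Krippendorff's alpha: {alpha:.3f}")
```

High values on all three statistics would indicate that the LLM judge tracks human judgments closely; lower values would flag the kind of subtle, context-specific disagreements the paper cautions about.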