Multi-LLM Thematic Analysis with Dual Reliability Metrics: Combining Cohen's Kappa and Semantic Similarity for Qualitative Research Validation
Nilesh Jain, Seyi Adeyinka, Leor Roseman, Aza Allsop
2025-12-24
Summary
This paper explores a new way to reliably analyze qualitative data, such as interviews, using large language models (LLMs), the AI technology behind tools like ChatGPT. It focuses on making sure the AI's analysis is consistent and trustworthy.
What's the problem?
Traditionally, checking if qualitative data analysis is reliable means having multiple people independently code the same data and then comparing their results. This is slow, expensive, and often doesn't show perfect agreement between the coders. Using AI could speed things up, but how do you know if the AI is giving you consistent and trustworthy results if you only run it once?
What's the solution?
The researchers developed a system that runs three different LLMs (Gemini, GPT-4o, and Claude) multiple times on the same interview transcript. Rather than checking only for exact agreement between the AI's analyses, they used two complementary measures: Cohen's Kappa, which quantifies inter-rater agreement, and cosine similarity, which quantifies how semantically close the interpretations are. They also built a flexible system that lets researchers adjust how the AI analyzes the data and that works with different data formats.
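To make the two measures concrete, here is a minimal sketch of how they could be computed for a pair of analysis runs. This is an illustration, not the paper's implementation: the kappa function assumes both runs coded the same list of excerpts with one label each, and the cosine function uses simple term-frequency vectors where the actual framework would more plausibly use sentence embeddings.

```python
from collections import Counter
import math

def cohens_kappa(codes_a, codes_b):
    """Cohen's kappa for two raters' code labels over the same excerpts."""
    assert len(codes_a) == len(codes_b) and codes_a
    n = len(codes_a)
    observed = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    expected = sum(freq_a[c] * freq_b.get(c, 0) for c in freq_a) / n ** 2
    return (observed - expected) / (1 - expected)

def cosine_similarity(text_a, text_b):
    """Cosine similarity over bag-of-words term-frequency vectors.
    (A stand-in for the embedding-based similarity the paper implies.)"""
    va, vb = Counter(text_a.lower().split()), Counter(text_b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = (math.sqrt(sum(v * v for v in va.values()))
            * math.sqrt(sum(v * v for v in vb.values())))
    return dot / norm if norm else 0.0
```

Perfect label agreement yields kappa = 1.0, and values above 0.80 are conventionally read as "almost perfect" agreement, which is the threshold the paper's results clear.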
Why does it matter?
This work is important because it provides a way to use AI to analyze qualitative data more reliably and efficiently. By showing that these LLMs can achieve high levels of agreement and semantic consistency, and by providing an open-source tool for others to use, it opens the door for more researchers to leverage the power of AI in their qualitative research without sacrificing the trustworthiness of their findings.
Abstract
Qualitative research faces a critical reliability challenge: traditional inter-rater agreement methods require multiple human coders, are time-intensive, and often yield only moderate consistency. We present a multi-perspective validation framework for LLM-based thematic analysis that combines ensemble validation with dual reliability metrics: Cohen's Kappa (κ) for inter-rater agreement and cosine similarity for semantic consistency. Our framework enables configurable analysis parameters (1-6 seeds, temperature 0.0-2.0), supports custom prompt structures with variable substitution, and provides consensus theme extraction across any JSON format. As a proof of concept, we evaluate three leading LLMs (Gemini 2.5 Pro, GPT-4o, Claude 3.5 Sonnet) on a psychedelic art therapy interview transcript, conducting six independent runs per model. Results show that Gemini achieves the highest reliability (κ = 0.907, cosine = 95.3%), followed by GPT-4o (κ = 0.853, cosine = 92.6%) and Claude (κ = 0.842, cosine = 92.1%). All three models achieve high agreement (κ > 0.80), validating the multi-run ensemble approach. The framework successfully extracts consensus themes across runs, with Gemini identifying 6 consensus themes (50-83% consistency), GPT-4o identifying 5 themes, and Claude 4 themes. Our open-source implementation provides researchers with transparent reliability metrics, flexible configuration, and structure-agnostic consensus extraction, establishing methodological foundations for reliable AI-assisted qualitative research.
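The consensus extraction the abstract describes, keeping themes that recur across a configurable fraction of independent runs, can be sketched as follows. This is a hedged simplification: it assumes theme labels have already been normalized and matched across runs, whereas the actual framework operates on arbitrary JSON output structures.

```python
from collections import Counter

def consensus_themes(runs, min_fraction=0.5):
    """Return {theme: consistency} for themes present in at least
    `min_fraction` of independent runs.

    `runs` is a list of per-run theme-label lists; labels are assumed
    to be pre-normalized so identical themes share one string.
    """
    n_runs = len(runs)
    # Count each theme once per run, even if a run lists it twice.
    counts = Counter(label for run in runs for label in set(run))
    threshold = min_fraction * n_runs
    return {label: count / n_runs
            for label, count in counts.items() if count >= threshold}
```

With six runs and `min_fraction=0.5`, a theme must surface in at least three runs to survive, matching the 50-83% consistency range reported for Gemini's six consensus themes.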