PlainQAFact: Automatic Factuality Evaluation Metric for Biomedical Plain Language Summaries Generation

Zhiwen You, Yue Guo

2025-03-12

Summary

This paper introduces PlainQAFact, a tool that checks whether AI-generated medical summaries written for everyday readers are factually accurate, especially when those summaries add extra explanations that were not in the original text.

What's the problem?

Current factuality checkers can’t tell whether easy-to-read medical summaries are correct, especially when the AI adds helpful background information that wasn’t in the original documents.

What's the solution?

PlainQAFact works in two steps: it first classifies what kind of information each summary sentence adds, then verifies the facts using a question-answering method that can pull in outside knowledge. It is trained on PlainFact, a fine-grained dataset annotated by humans for this task.

Why does it matter?

This helps patients and families trust medical summaries by catching AI mistakes, making sure health advice is safe and reliable for everyday decisions.

Abstract

Hallucinated outputs from language models pose risks in the medical domain, especially for lay audiences making health-related decisions. Existing factuality evaluation methods, such as entailment- and question-answering-based (QA), struggle with plain language summary (PLS) generation due to elaborative explanation phenomenon, which introduces external content (e.g., definitions, background, examples) absent from the source document to enhance comprehension. To address this, we introduce PlainQAFact, a framework trained on a fine-grained, human-annotated dataset PlainFact, to evaluate the factuality of both source-simplified and elaboratively explained sentences. PlainQAFact first classifies factuality type and then assesses factuality using a retrieval-augmented QA-based scoring method. Our approach is lightweight and computationally efficient. Empirical results show that existing factuality metrics fail to effectively evaluate factuality in PLS, especially for elaborative explanations, whereas PlainQAFact achieves state-of-the-art performance. We further analyze its effectiveness across external knowledge sources, answer extraction strategies, overlap measures, and document granularity levels, refining its overall factuality assessment.
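The abstract describes a two-stage pipeline: classify each summary sentence by factuality type (source-simplified vs. elaborative explanation), then score it with retrieval-augmented QA against the appropriate evidence. The sketch below illustrates that control flow only; the function names, the word-overlap classifier, and the overlap-based scoring stand in for the paper's trained classifier and QA scorer, and are not the authors' actual implementation.

```python
# Illustrative sketch of a PlainQAFact-style two-stage pipeline.
# All names and heuristics here are hypothetical stand-ins, not the real system.

def classify_sentence(sentence: str, source_doc: str) -> str:
    """Stage 1 (stand-in): label a sentence 'simplification' if its content is
    largely traceable to the source document, else 'elaboration' (added content).
    The real system uses a trained classifier; we use token overlap."""
    sent_tokens = set(sentence.lower().split())
    src_tokens = set(source_doc.lower().split())
    overlap = len(sent_tokens & src_tokens) / max(len(sent_tokens), 1)
    return "simplification" if overlap > 0.5 else "elaboration"

def qa_factuality_score(sentence: str, evidence: str) -> float:
    """Stage 2 (stand-in): the real method generates questions from the summary
    sentence and answers them against the evidence; we approximate the resulting
    answer-overlap measure with simple token overlap."""
    sent_tokens = set(sentence.lower().split())
    ev_tokens = set(evidence.lower().split())
    return len(sent_tokens & ev_tokens) / max(len(sent_tokens), 1)

def plainqafact_score(summary_sentences, source_doc, knowledge_base) -> float:
    """Score each sentence against the right evidence: the source document for
    simplifications, retrieved external knowledge for elaborations."""
    scores = []
    for sent in summary_sentences:
        label = classify_sentence(sent, source_doc)
        # knowledge_base stands in for a retrieval step over external sources.
        evidence = source_doc if label == "simplification" else knowledge_base.get(sent, "")
        scores.append(qa_factuality_score(sent, evidence))
    return sum(scores) / max(len(scores), 1)
```

The key design point mirrored here is that elaborative sentences are not penalized for being absent from the source; they are checked against retrieved external knowledge instead, which is what entailment-based metrics miss.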