RealHarm: A Collection of Real-World Language Model Application Failures

Pierre Le Jeune, Jiaen Liu, Luca Rossi, Matteo Dora

2025-04-16

Summary

This paper introduces RealHarm, a dataset that collects real examples of language model failures from actual AI deployments, showing what can go wrong when these systems are used in the real world.

What's the problem?

The problem is that most research about AI risks is based on theories or rules, not on what actually happens when language models are used by people. This means that some real dangers, like damage to a company’s reputation or spreading false information, might be missed or underestimated, and current safety systems might not be good enough to catch these failures.

What's the solution?

The researchers created RealHarm by reviewing hundreds of real-world incidents where language models caused problems, such as giving out wrong information or making companies look bad. They organized these cases into categories of harm and tested how well current guardrails and content moderation tools could have stopped these failures. They found that most of these tools missed important issues, especially subtle ones like misinformation or reputation damage.
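To make the guardrail-testing step concrete, here is a minimal sketch of how one might measure whether a moderation tool would have caught the harmful cases in a RealHarm-style dataset. The record fields (conversation, harm_category, is_harmful), the file name realharm_samples.json, and the keyword_filter stand-in are all hypothetical illustrations, not the paper's actual schema or evaluation code.

```python
import json
from typing import Callable

# Hypothetical record format for a RealHarm-style sample; the real dataset's
# schema may differ. Each record holds a conversation and a harm label, e.g.:
# {"conversation": [{"role": "assistant", "content": "..."}],
#  "harm_category": "misinformation", "is_harmful": true}

def evaluate_guardrail(samples: list[dict], flag_fn: Callable[[str], bool]) -> dict:
    """Count how many labeled-harmful conversations a moderation check flags.

    flag_fn is any guardrail or content-moderation check that takes the
    conversation text and returns True if it would have flagged it.
    """
    detected = 0
    missed_by_category: dict[str, int] = {}
    harmful = [s for s in samples if s["is_harmful"]]

    for sample in harmful:
        text = "\n".join(turn["content"] for turn in sample["conversation"])
        if flag_fn(text):
            detected += 1
        else:
            cat = sample["harm_category"]
            missed_by_category[cat] = missed_by_category.get(cat, 0) + 1

    return {
        "detection_rate": detected / len(harmful) if harmful else 0.0,
        "missed_by_category": missed_by_category,
    }


if __name__ == "__main__":
    # Toy stand-in for a real guardrail: a naive keyword filter, which will
    # miss subtle failures such as misinformation or reputational damage.
    def keyword_filter(text: str) -> bool:
        banned = ["bomb", "attack", "kill"]
        return any(word in text.lower() for word in banned)

    with open("realharm_samples.json") as f:  # hypothetical local file
        samples = json.load(f)

    print(evaluate_guardrail(samples, keyword_filter))
```

A breakdown of misses by harm category, as in the sketch above, is what makes the paper's finding visible: blunt filters catch overtly unsafe content but let subtler failures like misinformation and reputational damage pass through.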

Why it matters?

This matters because it gives companies and researchers a clearer picture of the real risks when deploying AI systems and shows that current safety measures aren’t enough. By learning from actual failures, the AI community can build better protections and make language models safer and more trustworthy for everyone.

Abstract

RealHarm, a dataset of real-world AI failures, reveals reputational damage and misinformation as primary risks in language model deployments, highlighting vulnerabilities in existing guardrails and content moderation systems.