X-MuTeST: A Multilingual Benchmark for Explainable Hate Speech Detection and A Novel LLM-consulted Explanation Framework

Mohammad Zia Ur Rehman, Sai Kartheek Reddy Kasu, Shashivardhan Reddy Koppula, Sai Rithwik Reddy Chirra, Shwetank Shekhar Singh, Nagendra Kumar

2026-01-07

Summary

This research focuses on improving the detection of hate speech online, particularly in languages that don't get as much attention as English, like Hindi and Telugu. It also aims to make these detection systems more transparent, so we can understand *why* they flag something as hateful.

What's the problem?

Detecting hate speech is hard because it's often subtle and depends on context. Current systems aren't very accurate, especially when dealing with languages other than English. Even when they *are* accurate, it's often a 'black box' – we don't know *why* the system made a certain decision, making it hard to trust or improve. There's a lack of good datasets with explanations for why something is considered hate speech in these under-represented languages.

What's the solution?

The researchers created a new system called X-MuTeST. This system combines the power of large language models (think of them as really smart AI that understands language) with techniques that highlight important words in a sentence. Crucially, they also created a dataset with human explanations – people actually labeled *which* words in a sentence led them to believe it was hateful. The system learns from these human explanations and uses them to refine its own understanding. It then explains its own decisions by looking at the difference in how it predicts the sentence with and without key phrases, and combines this with the human-provided reasoning.
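The "with and without key phrases" scoring described above can be sketched as a simple occlusion loop: remove each n-gram, re-run the classifier, and record how much the hate probability drops. This is a minimal illustration, not the paper's implementation; the function names and the toy classifier are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

def occlusion_scores(tokens: List[str],
                     predict_hate_prob: Callable[[str], float],
                     max_n: int = 3) -> Dict[Tuple[str, ...], float]:
    """Score each unigram/bigram/trigram by how much removing it
    lowers the model's hate probability for the full sentence."""
    base = predict_hate_prob(" ".join(tokens))
    scores: Dict[Tuple[str, ...], float] = {}
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngram = tuple(tokens[i:i + n])
            ablated = tokens[:i] + tokens[i + n:]
            drop = base - predict_hate_prob(" ".join(ablated))
            # keep the largest drop if an n-gram occurs more than once
            scores[ngram] = max(scores.get(ngram, float("-inf")), drop)
    return scores
```

High-scoring n-grams would then be merged (a set union, per the abstract) with the phrases an LLM highlights to form the final explanation.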

Why it matters?

This work is important because it improves hate speech detection in a wider range of languages, making the internet a safer place for more people. By making the system explainable, it builds trust and allows researchers to identify and fix biases. The new dataset with human explanations is a valuable resource for future research in this area, especially for languages that haven't been well-studied.

Abstract

Hate speech detection on social media faces challenges in both accuracy and explainability, especially for underexplored Indic languages. We propose a novel explainability-guided training framework, X-MuTeST (eXplainable Multilingual haTe Speech deTection), for hate speech detection that combines high-level semantic reasoning from large language models (LLMs) with traditional attention-enhancing techniques. We extend this research to Hindi and Telugu alongside English by providing benchmark human-annotated rationales for each word to justify the assigned class label. The X-MuTeST explainability method computes the difference between the prediction probabilities of the original text and those of unigrams, bigrams, and trigrams. Final explanations are computed as the union between LLM explanations and X-MuTeST explanations. We show that leveraging human rationales during training enhances both classification performance and explainability. Moreover, combining human rationales with our explainability method to refine the model attention yields further improvements. We evaluate explainability using Plausibility metrics such as Token-F1 and IOU-F1 and Faithfulness metrics such as Comprehensiveness and Sufficiency. By focusing on under-resourced languages, our work advances hate speech detection across diverse linguistic contexts. Our dataset includes token-level rationale annotations for 6,004 Hindi, 4,492 Telugu, and 6,334 English samples. Data and code are available at https://github.com/ziarehman30/X-MuTeST
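The plausibility metrics named in the abstract compare a model's highlighted tokens against the human-annotated rationale. A simplified token-set version (the benchmark itself may use span-level, ERASER-style definitions) looks like this; the function names are illustrative:

```python
from typing import Set

def token_f1(pred: Set[str], gold: Set[str]) -> float:
    """F1 overlap between predicted rationale tokens and human rationale tokens."""
    if not pred and not gold:
        return 1.0
    tp = len(pred & gold)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(gold)
    return 2 * precision * recall / (precision + recall)

def iou(pred: Set[str], gold: Set[str]) -> float:
    """Intersection over union; IOU-F1 typically counts an example
    as a match when this exceeds a threshold (e.g. 0.5)."""
    union = pred | gold
    return 1.0 if not union else len(pred & gold) / len(union)
```

The faithfulness metrics work differently: Comprehensiveness removes the rationale tokens and checks that the prediction probability falls, while Sufficiency keeps only the rationale tokens and checks that the probability is largely preserved.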