I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
2025-08-08
Summary
This paper introduces a benchmark that tests how large language models respond to subtle linguistic cues, called linguistic shibboleths, which can reveal a speaker's background, such as gender or social class.
What's the problem?
The problem is that these models give lower scores to candidates who use hesitant or indirect (hedging) language, even when their answers are of equivalent quality, revealing hidden biases against certain ways of speaking.
What's the solution?
The solution is a controlled set of interview questions and paired answers that vary only specific linguistic features while keeping the meaning fixed, allowing precise measurement of how language style alone affects a model's evaluation.
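The controlled-pair idea can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and hedge list are hypothetical, and the real benchmark presumably uses curated question/answer sets rather than string templates.

```python
# Sketch: build matched answer pairs that differ ONLY in surface hedging,
# so any score gap between them is attributable to the hedge alone.

HEDGES = ["I think", "perhaps", "it seems that"]  # illustrative hedge markers

def make_pair(base_answer: str, hedge: str = "I think") -> tuple[str, str]:
    """Return (confident, hedged) variants with identical content."""
    confident = base_answer
    # Prepend the hedge and lowercase the original first letter.
    hedged = f"{hedge} {base_answer[0].lower()}{base_answer[1:]}"
    return confident, hedged

confident, hedged = make_pair("The main bottleneck is the database query.")
# Both variants would then be scored by the same LLM under the same rubric;
# a systematic score difference indicates style-based penalization.
```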
Why does it matter?
This matters because it shows that AI systems used in hiring may unfairly judge candidates by how they speak rather than by what they know, underscoring the need to make these systems fairer and free of such biases.
Abstract
A benchmark evaluates Large Language Models' responses to linguistic markers that reveal demographic attributes, demonstrating systematic penalization of hedging language despite equivalent content quality.