I Think, Therefore I Am Under-Qualified? A Benchmark for Evaluating Linguistic Shibboleth Detection in LLM Hiring Evaluations
Julia Kharchenko, Tanya Roosta, Aman Chadha, Chirag Shah
2025-08-08
Summary
This paper introduces a benchmark that tests how large language models respond to subtle linguistic cues, called linguistic shibboleths, which can reveal a speaker's background, such as gender or social class.
What's the problem?
The problem is that these models give lower scores to candidates who use hesitant or indirect (hedging) language, even when their answers are of equivalent quality, revealing hidden biases against certain ways of speaking.
What's the solution?
The solution is a controlled set of interview questions and paired answers that vary only specific linguistic features while keeping the meaning fixed, allowing precise measurement of how language style alone affects a model's evaluation.
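The controlled-pair idea can be sketched as follows. This is an illustrative sketch, not the authors' code: the function names and hedge list are hypothetical, and the real benchmark presumably uses curated question/answer sets rather than string templates.

```python
# Sketch: build matched answer pairs that differ ONLY in surface hedging,
# so any score gap between them is attributable to the hedge alone.

HEDGES = ["I think", "perhaps", "it seems that"]  # illustrative hedge markers

def make_pair(base_answer: str, hedge: str = "I think") -> tuple[str, str]:
    """Return (confident, hedged) variants with identical content."""
    confident = base_answer
    # Prepend the hedge and lowercase the original first letter.
    hedged = f"{hedge} {base_answer[0].lower()}{base_answer[1:]}"
    return confident, hedged

confident, hedged = make_pair("The main bottleneck is the database query.")
# Both variants would then be scored by the same LLM under the same rubric;
# a systematic score difference indicates style-based penalization.
```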
Why does it matter?
This matters because it shows that AI systems used in hiring may unfairly judge candidates by how they speak rather than by what they know, underscoring the need to make these systems fairer and free of such biases.
Abstract
A benchmark evaluates Large Language Models' responses to linguistic markers that reveal demographic attributes, demonstrating systematic penalization of hedging language despite equivalent content quality.