Challenges in Trustworthy Human Evaluation of Chatbots

Wenting Zhao, Alexander M. Rush, Tanya Goyal

2024-12-06

Summary

This paper discusses the difficulty of obtaining reliable human evaluations for chatbots, especially on platforms that collect user preference votes to rank chatbots against one another.

What's the problem?

While platforms like Chatbot Arena are considered trustworthy for evaluating chatbots, they face challenges in ensuring that the feedback collected from users is of high quality. Poor-quality annotations can come from users who are not motivated to vote accurately or from malicious users trying to unfairly boost a specific chatbot's ranking. Even a small percentage of these bad votes can significantly affect the overall rankings.

What's the solution?

The authors identify three sources of bad annotations, including apathetic voters (site visitors with no incentive to vote correctly) and adversarial voters (bad actors trying to inflate a target model's ranking), and demonstrate that just 10% of low-quality votes can shift a model's ranking by up to five places on the leaderboard. They then discuss open challenges in ensuring that human feedback remains reliable for evaluating chatbot performance.
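The headline finding is easiest to see with a small simulation. Below is a minimal, hypothetical sketch (not the authors' code): it assumes a toy leaderboard of ten models ranked with a Bradley-Terry model, as Chatbot Arena does, and injects adversarial votes that always favor one weak target model. Every specific in it (the number of models, the true_strength values, the 20,000 votes, the choice of target) is invented purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_models = 10
# hypothetical "true" model strengths; model 0 is the strongest
true_strength = np.sort(rng.normal(0.0, 1.0, n_models))[::-1]

def sample_votes(n_votes, adversarial_frac=0.0, target=None):
    """Sample pairwise preference votes; a fraction may be adversarial."""
    votes = []
    for _ in range(n_votes):
        if target is not None and rng.random() < adversarial_frac:
            # adversarial annotator: always casts a vote in favor of the target model
            opponent = rng.choice([m for m in range(n_models) if m != target])
            votes.append((target, opponent))
        else:
            # honest annotator: winner drawn from the Bradley-Terry win probability
            i, j = rng.choice(n_models, size=2, replace=False)
            p_win = 1.0 / (1.0 + np.exp(-(true_strength[i] - true_strength[j])))
            votes.append((i, j) if rng.random() < p_win else (j, i))
    return votes

def bradley_terry_ranks(votes, iters=200):
    """Fit Bradley-Terry scores with simple MM updates; return each model's rank (0 = best)."""
    wins = np.zeros(n_models)
    n_pairs = np.zeros((n_models, n_models))
    for w, l in votes:
        wins[w] += 1
        n_pairs[w, l] += 1
        n_pairs[l, w] += 1
    p = np.ones(n_models)
    for _ in range(iters):
        denom = (n_pairs / (p[:, None] + p[None, :])).sum(axis=1)
        p = np.maximum(wins, 1e-6) / np.maximum(denom, 1e-9)
        p /= p.sum()
    return np.argsort(np.argsort(-p))

target = 7  # a weak model an attacker wants to inflate (arbitrary choice)
clean = bradley_terry_ranks(sample_votes(20_000))
attacked = bradley_terry_ranks(sample_votes(20_000, adversarial_frac=0.10, target=target))
print(f"target model rank: {clean[target]} (clean) -> {attacked[target]} (10% adversarial votes)")
```

Running the script with and without the adversarial fraction shows how a small share of corrupted votes can move the target model up this toy leaderboard; the paper makes the analogous point for real leaderboards, reporting shifts of up to five places.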

Why it matters?

This research is important because it highlights the challenges of using human evaluations to assess AI systems like chatbots. Ensuring high-quality feedback is crucial for accurately ranking and improving chatbot performance, which ultimately affects user experience and trust in these technologies.

Abstract

Open community-driven platforms like Chatbot Arena that collect user preference data from site visitors have gained a reputation as one of the most trustworthy publicly available benchmarks for LLM performance. While now standard, it is tricky to implement effective guardrails to collect high-quality annotations from humans. In this paper, we demonstrate that three sources of bad annotations, both malicious and otherwise, can corrupt the reliability of open leaderboard rankings. In particular, we show that only 10% of poor quality votes by apathetic (site visitors not appropriately incentivized to give correct votes) or adversarial (bad actors seeking to inflate the ranking of a target model) annotators can change the rankings of models by up to 5 places on the leaderboard. Finally, we discuss open challenges in ensuring high-quality human annotations.