
The Quest for Reliable Metrics of Responsible AI

Theresia Veronika Rampisela, Maria Maistro, Tuukka Ruotsalo, Christina Lioma

2025-10-30

Summary

This paper argues that Artificial Intelligence, including AI used in scientific fields, should be developed responsibly, and focuses on making sure the tools we use to *measure* that responsibility are themselves trustworthy.

What's the problem?

As AI becomes more powerful, people are trying to figure out how to make sure it's fair and doesn't cause harm. We use 'metrics,' or scores, to check whether AI is behaving responsibly. However, relatively little work has examined whether those metrics themselves are reliable: can they be easily gamed, or give misleading results in edge cases? If the tools we use to measure responsibility are flawed, then we can't really trust that the AI *is* actually responsible.

What's the solution?

The authors looked at previous research on fairness metrics used in recommendation systems (like those used by Netflix or Amazon) to see how those metrics could be unreliable. They then took the lessons learned from that research and created a set of guidelines to help developers build more robust and trustworthy metrics for responsible AI in *any* field, including science.
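To make this concrete, here is a minimal sketch (hypothetical, not taken from the paper) of the kind of pitfall such guidelines guard against: a simple item-exposure fairness metric for a recommender that reports a perfect score not only for genuinely balanced recommendations, but also in a degenerate edge case. The metric and item names are invented for illustration.

```python
# Hypothetical example: a toy fairness metric for a recommender that
# compares how much exposure two item groups ("A" and "B") receive,
# and an edge case where the metric gives a misleading answer.

def exposure_fairness(recommendations, group_of):
    """Return 1 - |share_A - share_B| over all recommended items.

    1.0 means the two groups get equal exposure; 0.0 means one group
    gets all of it.
    """
    counts = {"A": 0, "B": 0}
    total = 0
    for rec_list in recommendations:
        for item in rec_list:
            counts[group_of[item]] += 1
            total += 1
    if total == 0:
        return 1.0  # vacuously "fair": nothing was recommended at all
    share_a = counts["A"] / total
    share_b = counts["B"] / total
    return 1.0 - abs(share_a - share_b)

group_of = {"i1": "A", "i2": "A", "i3": "B", "i4": "B"}

# A system recommending one item from each group scores perfectly fair...
balanced = [["i1", "i3"], ["i2", "i4"]]
print(exposure_fairness(balanced, group_of))  # 1.0

# ...but so does a useless system that recommends nothing, which is the
# kind of unreliable metric behaviour the guidelines aim to catch.
empty = [[], []]
print(exposure_fairness(empty, group_of))  # 1.0
```

The point of the sketch is not this particular formula but the pattern: a metric can attain its "best" value for reasons unrelated to what it is supposed to measure, so its edge cases need scrutiny before we rely on it.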

Why does it matter?

This work is important because if we want AI to be a force for good, we need to be able to accurately assess its impact. If we can't trust the metrics we use to evaluate AI, we risk building systems that seem responsible on the surface but are actually unfair, biased, or harmful. These guidelines help ensure we're measuring the right things and doing so in a reliable way, which is crucial for the safe and ethical development of AI.

Abstract

The development of Artificial Intelligence (AI), including AI in Science (AIS), should be done following the principles of responsible AI. Progress in responsible AI is often quantified through evaluation metrics, yet there has been less work on assessing the robustness and reliability of the metrics themselves. We reflect on prior work that examines the robustness of fairness metrics for recommender systems as a type of AI application and summarise their key takeaways into a set of non-exhaustive guidelines for developing reliable metrics of responsible AI. Our guidelines apply to a broad spectrum of AI applications, including AIS.