Forbidden Science: Dual-Use AI Challenge Benchmark and Scientific Refusal Tests
David Noever, Forrest McKee
2025-02-11
Summary
This paper introduces a new way to test how safely and responsibly AI language models handle sensitive topics like controlled substances. The researchers created a set of questions to measure how different AI models balance refusing potentially dangerous queries against allowing legitimate scientific discussion.
What's the problem?
It's hard to build AI models that can tell the difference between harmful requests and important scientific questions. Current methods for testing AI safety may miss problems, or may accidentally block useful information.
What's the solution?
The researchers built a benchmark of questions about controlled substances and ran it against four major AI models, measuring how often each one refused to answer and how consistent its answers stayed when the same question was phrased in different ways. They released the benchmark for anyone to use, so people can track how AI safety improves over time.
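The two measurements described above, a model's refusal rate and its consistency across paraphrased prompts, can be sketched in a few lines. This is a minimal illustration with hypothetical helper names, not the paper's actual scoring code; it assumes each response has already been labeled "refuse" or "allow".

```python
def refusal_rate(labels):
    """Fraction of responses labeled as refusals ('refuse' vs 'allow')."""
    return sum(1 for label in labels if label == "refuse") / len(labels)

def consistency(labels_per_question):
    """Share of questions whose paraphrased variants all received the
    same refuse/allow label (the paper's 85% -> 65% style metric)."""
    agree = sum(1 for labels in labels_per_question if len(set(labels)) == 1)
    return agree / len(labels_per_question)

# Toy example: 3 questions, each asked with 2 paraphrased variants.
labels = [["refuse", "refuse"], ["allow", "refuse"], ["allow", "allow"]]
flat = [label for question in labels for label in question]
print(refusal_rate(flat))   # 0.5  (3 refusals out of 6 responses)
print(consistency(labels))  # 0.666...  (2 of 3 questions fully agree)
```

Under this definition, adding more paraphrased variants per question can only hold or lower consistency, matching the paper's observed drop from single prompts to five variations.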
Why it matters
This matters because as AI gets smarter, we need to make sure it's safe to use without stopping it from helping with important science. This test helps developers create AI that can avoid dangerous topics while still being useful for research. It could lead to AI that's both safer and better at handling complex scientific discussions in the future.
Abstract
The development of robust safety benchmarks for large language models requires open, reproducible datasets that can measure both appropriate refusal of harmful content and potential over-restriction of legitimate scientific discourse. We present an open-source dataset and testing framework for evaluating LLM safety mechanisms on queries focused primarily on controlled substances, analyzing four major models' responses to systematically varied prompts. Our results reveal distinct safety profiles: Claude-3.5-sonnet demonstrated the most conservative approach with 73% refusals and 27% allowances, while Mistral attempted to answer 100% of queries. GPT-3.5-turbo showed moderate restriction with 10% refusals and 90% allowances, and Grok-2 registered 20% refusals and 80% allowances. Testing prompt variation strategies revealed decreasing response consistency, from 85% with single prompts to 65% with five variations. This publicly available benchmark enables systematic evaluation of the critical balance between necessary safety restrictions and potential over-censorship of legitimate scientific inquiry, while providing a foundation for measuring progress in AI safety implementation. Chain-of-thought analysis reveals potential vulnerabilities in safety mechanisms, highlighting the complexity of implementing robust safeguards without unduly restricting desirable and valid scientific discourse.