oMeBench: Towards Robust Benchmarking of LLMs in Organic Mechanism Elucidation and Reasoning
Ruiling Xu, Yifan Zhang, Qingyun Wang, Carl Edwards, Heng Ji
2025-10-14
Summary
This paper introduces a new way to test if artificial intelligence (AI) can actually *understand* how chemical reactions happen, not just memorize information about them.
What's the problem?
While AI models are getting good at suggesting how to make molecules, it's unclear if they truly grasp the underlying chemistry. They might suggest things that don't follow the rules of how reactions actually work, like creating unstable intermediate molecules or skipping logical steps. There wasn't a good way to specifically test this kind of 'chemical reasoning' ability in AI.
What's the solution?
The researchers created a large dataset called oMeBench, containing over 10,000 individual steps of organic reaction mechanisms, complete with details about the molecules involved and how difficult each step is. They also developed a system, oMeS, to score AI responses not just on whether the final answer is correct, but also on whether each step makes chemical sense and is consistent with known chemistry. They then tested several AI models with this new benchmark and found that even the best ones struggled with complex, multi-step reactions, but performance improved significantly when the AI was specifically trained on the oMeBench data.
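The combined scoring idea behind oMeS can be illustrated with a minimal sketch. This is not the paper's actual formula; the function names, weights, and fingerprint representation below are illustrative assumptions, using set-based Tanimoto similarity as a stand-in for chemical similarity:

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Jaccard/Tanimoto similarity between two fingerprint sets."""
    if not fp_a and not fp_b:
        return 1.0
    return len(fp_a & fp_b) / len(fp_a | fp_b)

def score_mechanism(predicted_steps, reference_steps,
                    w_logic=0.5, w_chem=0.5):
    """Hypothetical step-level score in the spirit of oMeS: each predicted
    step earns credit for matching the reference step's type (logic) and
    for the chemical similarity of its proposed intermediate."""
    if not predicted_steps or not reference_steps:
        return 0.0
    total = 0.0
    for pred, ref in zip(predicted_steps, reference_steps):
        logic = 1.0 if pred["step_type"] == ref["step_type"] else 0.0
        chem = tanimoto(pred["fingerprint"], ref["fingerprint"])
        total += w_logic * logic + w_chem * chem
    # Normalize over the longer pathway so skipped or extra steps cost points.
    return total / max(len(predicted_steps), len(reference_steps))

# Example: one correct step type, a near-miss intermediate.
pred = [{"step_type": "proton_transfer", "fingerprint": {1, 2, 3}}]
ref = [{"step_type": "proton_transfer", "fingerprint": {1, 2, 4}}]
print(round(score_mechanism(pred, ref), 3))  # partial credit, not 0 or 1
```

The point of such a score is that an AI pathway with one chemically implausible intermediate is penalized at that step rather than being marked entirely wrong, which is what makes fine-grained evaluation possible.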
Why it matters?
This work is important because it provides a rigorous test for AI in chemistry. It helps us understand the limitations of current AI models and guides the development of AI that can truly assist chemists in designing new molecules and reactions, rather than just making educated guesses.
Abstract
Organic reaction mechanisms are the stepwise elementary reactions by which reactants form intermediates and products, and are fundamental to understanding chemical reactivity and designing new molecules and reactions. Although large language models (LLMs) have shown promise in chemical tasks such as synthesis design, it is unclear to what extent this reflects genuine chemical reasoning capabilities, i.e., the ability to generate valid intermediates, maintain chemical consistency, and follow logically coherent multi-step pathways. We address this by introducing oMeBench, the first large-scale, expert-curated benchmark for mechanism reasoning in organic chemistry. It comprises over 10,000 annotated mechanistic steps with intermediates, type labels, and difficulty ratings. Furthermore, to evaluate LLM capability more precisely and enable fine-grained scoring, we propose oMeS, a dynamic evaluation framework that combines step-level logic and chemical similarity. We analyze the performance of state-of-the-art LLMs, and our results show that although current models display promising chemical intuition, they struggle with correct and consistent multi-step reasoning. Notably, we find that using prompting strategies and fine-tuning a specialist model on our proposed dataset increases performance by 50% over the leading closed-source model. We hope that oMeBench will serve as a rigorous foundation for advancing AI systems toward genuine chemical reasoning.