Themis: Training Robust Multilingual Code Reward Models for Flexible Multi-Criteria Scoring

Indraneil Paul, Goran Glavaš, Iryna Gurevych

2026-05-04

Summary

This paper studies how to evaluate AI-generated code more thoroughly by building 'reward models' that judge code quality on more than whether the code runs correctly.

What's the problem?

Efforts to improve AI code generation mostly optimize for whether the code *works*, typically by relying on execution feedback. That is a limited view, because good code also needs to be readable, efficient, and consistent with good programming practices. Existing reward models judge these other aspects poorly, and they don't work well across different programming languages.

What's the solution?

The researchers built Themis-CodeRewardBench, a benchmark that tests code reward models across five judging criteria (such as readability and efficiency) and eight programming languages, and used it to profile more than 50 existing reward models. These models proved weak at judging anything beyond functional correctness, so the researchers assembled a much larger collection of code preferences, Themis-CodePreference (more than 350k preference pairs), and used it to train a new suite of reward models called Themis-RM. The new models range from 600M to 32B parameters and are designed to score code on multiple criteria across different programming languages.
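
Reward models are commonly trained on preference pairs with a pairwise (Bradley-Terry) objective, which pushes the score of the preferred response above the score of the rejected one. This summary does not state which objective Themis-RM uses, so the snippet below is only a minimal sketch of that common approach; the function names `bradley_terry_loss` and `mean_pairwise_loss` are hypothetical, and the scores are plain floats standing in for model outputs.

```python
import math
from typing import Iterable, Tuple


def bradley_terry_loss(score_chosen: float, score_rejected: float) -> float:
    """Negative log-likelihood that the chosen response beats the rejected one.

    Equals -log(sigmoid(score_chosen - score_rejected)), written in a form
    that stays finite for large positive or negative margins.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


def mean_pairwise_loss(score_pairs: Iterable[Tuple[float, float]]) -> float:
    """Average the pairwise loss over (chosen_score, rejected_score) tuples."""
    losses = [bradley_terry_loss(chosen, rejected) for chosen, rejected in score_pairs]
    if not losses:
        raise ValueError("score_pairs must contain at least one pair")
    return sum(losses) / len(losses)
```

The loss is small when the preferred response already scores well above the rejected one and grows roughly linearly when the ordering is reversed, which is what drives the model toward the annotated preferences.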

Why does it matter?

This work is important because better reward models make it possible to train AI to generate code that is not just functional but also well-written and efficient. That could lead to more reliable and useful AI tools for programmers and help automate more complex coding tasks.

Abstract

Reward models (RMs) have become an indispensable fixture of the language model (LM) post-training playbook, enabling policy alignment and test-time scaling. Research on the application of RMs in code generation, however, has been comparatively sparse, with existing work largely focusing on execution feedback. This choice constrains post-training to optimizing functional correctness over self-contained executable code. In this work, we examine the training and evaluation of multilingual, multi-criteria code RMs. To this end, we first compile Themis-CodeRewardBench, a benchmark to evaluate code RMs across five preference dimensions (i.e., criteria) and eight programming languages, on which we profile 50+ code, math, and general-purpose RMs. Observing the limited proficiency of current RMs beyond scoring for functional correctness, we develop Themis-CodePreference, the largest open-source collection of code preferences to date (more than 350k preference pairs), and use it to train Themis-RM, a suite of multilingual code reward models for flexible multi-criteria scoring, ranging in size from 600M to 32B parameters. Our experiments and ablations demonstrate positive scaling trends, strong cross-lingual transfer when training on diverse preferences, and the importance of multi-criteria training for reliable code reward modeling.
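
One way to picture a multi-criteria, multilingual evaluation of this kind is as preference accuracy broken down by language and criterion: a reward model is counted as correct on a pair when it scores the preferred response above the rejected one. The sketch below assumes that setup. The `PreferencePair` record, the `ScoreFn` signature, and the length-based `toy_score` are all hypothetical stand-ins, and the two example pairs are invented toy data, not taken from Themis-CodeRewardBench.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, Tuple


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical record: one prompt with a preferred and a rejected response."""
    prompt: str
    chosen: str
    rejected: str
    language: str
    criterion: str


# Hypothetical scorer signature: (prompt, response, criterion) -> scalar reward.
ScoreFn = Callable[[str, str, str], float]


def preference_accuracy(
    pairs: Iterable[PreferencePair], score_fn: ScoreFn
) -> Dict[Tuple[str, str], float]:
    """Fraction of pairs per (language, criterion) where chosen outscores rejected."""
    correct: Dict[Tuple[str, str], int] = defaultdict(int)
    total: Dict[Tuple[str, str], int] = defaultdict(int)
    for pair in pairs:
        key = (pair.language, pair.criterion)
        total[key] += 1
        chosen_score = score_fn(pair.prompt, pair.chosen, pair.criterion)
        rejected_score = score_fn(pair.prompt, pair.rejected, pair.criterion)
        if chosen_score > rejected_score:
            correct[key] += 1
    return {key: correct[key] / total[key] for key in total}


def toy_score(prompt: str, response: str, criterion: str) -> float:
    """Stand-in for a real reward model: simply prefers shorter responses."""
    return -float(len(response))


# Invented toy pairs, for illustration only.
pairs = [
    PreferencePair(
        prompt="Sum the integers from 0 to n.",
        chosen="return n * (n + 1) // 2",
        rejected="return sum(i for i in range(n + 1))",
        language="python",
        criterion="efficiency",
    ),
    PreferencePair(
        prompt="Compute the total price.",
        chosen="const totalPrice = unitPrice * quantity;",
        rejected="const t = u * q;",
        language="javascript",
        criterion="readability",
    ),
]

results = preference_accuracy(pairs, toy_score)
```

With these toy pairs the length heuristic gets the efficiency pair right and the readability pair wrong, which illustrates why a single aggregate number can hide weak spots that a per-criterion, per-language breakdown exposes.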
