InfiMed-ORBIT: Aligning LLMs on Open-Ended Complex Tasks via Rubric-Based Incremental Training

Pengkai Wang, Qi Zuo, Pengwei Liu, Zhijie Sang, Congkai Xie, Hongxia Yang

2025-10-20

Summary

This paper introduces a new method, called ORBIT, to improve how large language models (LLMs) perform in complex, open-ended tasks like medical consultations. It focuses on overcoming the difficulty of providing useful feedback to these models when there isn't a clear 'right' or 'wrong' answer.

What's the problem?

Large language models are getting really good at things like math and coding because they can be easily checked for correctness. However, when it comes to tasks requiring judgment or creativity, like giving medical advice or writing stories, it's hard to create a reward system that accurately reflects quality. Without good feedback, these models struggle to improve in these areas because it's difficult to define what a 'good' response even looks like.

What's the solution?

The researchers developed ORBIT, a system that creates its own evaluation guidelines, called 'rubrics', to score the LLM's responses during a medical dialogue. Instead of relying on pre-existing medical knowledge or human-written rules, ORBIT generates synthetic conversations and dynamically creates rubrics for them, then uses rubric-guided feedback as the reward signal in an incremental reinforcement learning process. Applied to the Qwen3-4B-Instruct model, this raised its score on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k training samples.
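The rubric-as-reward idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `RubricItem` structure, the weights, and the keyword-matching judge are all hypothetical stand-ins (ORBIT would use an LLM grader and dynamically generated rubrics), but the shape of the reward computation, a weighted fraction of satisfied rubric criteria, is the same.

```python
# Hypothetical sketch of rubric-based reward scoring in the style of ORBIT.
# RubricItem, item_satisfied, and the keyword judge are illustrative
# assumptions; the paper uses dynamically generated rubrics and an LLM judge.
from dataclasses import dataclass

@dataclass
class RubricItem:
    criterion: str        # what a good response should do
    weight: float         # relative importance of this criterion
    keywords: tuple       # toy proxy for an LLM-based grader

def item_satisfied(response: str, item: RubricItem) -> bool:
    """Toy judge: treat the item as met if any keyword appears."""
    return any(k in response.lower() for k in item.keywords)

def rubric_reward(response: str, rubric: list[RubricItem]) -> float:
    """Weighted fraction of rubric items satisfied, in [0, 1].

    This scalar would serve as the RL reward for one dialogue turn.
    """
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric
                 if item_satisfied(response, item))
    return earned / total if total else 0.0

# Example: a (made-up) rubric for one synthetic consultation turn.
rubric = [
    RubricItem("asks about symptom duration", 2.0, ("how long", "duration")),
    RubricItem("flags emergency red flags", 3.0, ("chest pain", "emergency")),
    RubricItem("avoids a definitive diagnosis", 1.0, ("could", "consult")),
]

response = ("How long have you had this pain? It could be several things; "
            "please consult a doctor.")
print(rubric_reward(response, rubric))  # satisfies items 1 and 3: 3/6 = 0.5
```

In the actual framework this score would replace the programmatic correctness check available in math or code tasks, letting an RL algorithm optimize open-ended responses against criteria that are regenerated per dialogue rather than fixed in advance.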

Why it matters?

This work is important because it offers a way to train LLMs to handle complex, real-world tasks where clear-cut answers don't exist. By allowing the model to learn from dynamically created rubrics, it avoids the limitations of needing extensive pre-defined knowledge or subjective human evaluation, making it a scalable approach to improving LLMs in fields like healthcare and scientific reasoning.

Abstract

Large Language Models (LLMs) have shown substantial advances through reinforcement learning (RL), particularly in domains where rewards can be programmatically verified, such as mathematics and code. In these areas, models benefit from a well-defined operational base guided by explicit rule-based objectives. However, this progress reveals a significant limitation: in open-ended domains where rewards are ambiguous, subjective, or context-dependent, such as creative writing, scientific reasoning, and notably medical consultation, robust reward functions are lacking, making these areas challenging for current RL strategies. To bridge this gap, we introduce ORBIT, an open-ended rubric-based incremental training framework specifically designed for high-stakes medical dialogue. ORBIT integrates synthetic dialogue generation with the dynamic creation of rubrics, employing these rubrics to direct an incremental RL process. In particular, this approach does not depend on external medical knowledge or manual rules, instead utilizing rubric-guided feedback to shape learning. When implemented on the Qwen3-4B-Instruct model, our method can greatly enhance its performance on the HealthBench-Hard benchmark from 7.0 to 27.2 using only 2k samples, thus achieving state-of-the-art results for models of this scale. Our analysis confirms that rubric-driven RL fosters consistent performance gains across diverse consultation scenarios, going beyond simple numerical improvements. These findings underscore rubric-based feedback as a scalable strategy for advancing LLMs in intricate, open-ended tasks.