PhyCritic: Multimodal Critic Models for Physical AI
Tianyi Xiong, Shihao Wang, Guilin Liu, Yi Dong, Ming Li, Heng Huang, Jan Kautz, Zhiding Yu
2026-02-12
Summary
This paper introduces a new AI model called PhyCritic, designed to be a better judge of the responses produced by other AI systems, especially those handling physical tasks such as robotics or reasoning about how things behave in the real world.
What's the problem?
Current AI 'critics' – programs that evaluate the quality of other AI’s responses – are mostly trained on simple tasks like describing pictures or answering questions about them. They aren’t very good at judging AI that needs to understand physics, plan actions, or reason about the physical world. This is a problem because as AI gets more sophisticated and starts controlling robots or simulating real-world scenarios, we need reliable ways to assess its performance in these complex areas.
What's the solution?
The researchers trained PhyCritic in two stages, using reinforcement learning with verifiable rewards. First, they 'warmed up' the model by having it practice physically oriented perception and reasoning skills. Then they applied 'self-referential critic finetuning', in which PhyCritic first produces its *own* answer to a question and uses it as an internal reference before judging another AI's response. This makes its evaluations more stable and accurate, especially with respect to physical correctness. Essentially, the critic checks its own work before grading someone else's.
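To make the self-referential idea concrete, here is a minimal sketch assuming a generic `generate` call to the critic model; the prompts, function names, and two-pass structure are illustrative and not the authors' actual implementation.

```python
# Hedged sketch of self-referential judging: the critic first answers the question
# itself, then uses that answer as a reference when comparing two candidates.
# `generate` is a placeholder for any call to the underlying multimodal model.

def generate(prompt: str) -> str:
    """Placeholder for a call to the critic model (e.g. via an inference API)."""
    raise NotImplementedError

def self_referential_judge(question: str, response_a: str, response_b: str) -> str:
    # Step 1: the critic answers the question itself, producing an internal reference.
    own_answer = generate(
        f"Question: {question}\n"
        "Answer it yourself, paying attention to physical plausibility."
    )
    # Step 2: the critic compares the two candidates against its own reference answer.
    return generate(
        f"Question: {question}\n"
        f"Your own reference answer: {own_answer}\n"
        f"Candidate A: {response_a}\n"
        f"Candidate B: {response_b}\n"
        "Which candidate is more physically correct? Reply 'A' or 'B' with a brief justification."
    )
```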
Why it matters?
This work is important because it provides a better way to evaluate AI systems designed for physical tasks. A strong critic makes it possible to improve those systems, making them more reliable and capable of solving real-world problems. It also shows that training a critic specifically for a domain such as physical AI leads to better results than using a general-purpose critic.
Abstract
With the rapid development of large multimodal models, reliable judge and critic models have become essential for open-ended evaluation and preference alignment, providing pairwise preferences, numerical scores, and explanatory justifications for assessing model-generated responses. However, existing critics are primarily trained in general visual domains such as captioning or image question answering, leaving physical AI tasks involving perception, causal reasoning, and planning largely underexplored. We introduce PhyCritic, a multimodal critic model optimized for physical AI through a two-stage RLVR pipeline: a physical skill warmup stage that enhances physically oriented perception and reasoning, followed by self-referential critic finetuning, where the critic generates its own prediction as an internal reference before judging candidate responses, improving judgment stability and physical correctness. Across both physical and general-purpose multimodal judge benchmarks, PhyCritic achieves strong performance gains over open-source baselines and, when applied as a policy model, further improves perception and reasoning in physically grounded tasks.
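The abstract describes training through RLVR (reinforcement learning with verifiable rewards). One plausible form of such a verifiable reward for pairwise judging is a simple match between the critic's final verdict and the known preference label; the sketch below assumes the verdict appears as a standalone 'A' or 'B', and its parsing rule and reward values are illustrative assumptions rather than the paper's exact formulation.

```python
# Hedged sketch of a verifiable reward for critic training: reward the critic's
# transcript only if its final standalone verdict matches the gold preference label.
import re

def preference_reward(critic_output: str, gold_preference: str) -> float:
    """Return 1.0 if the critic's last standalone verdict ('A' or 'B') matches the gold label."""
    verdicts = re.findall(r"\b([AB])\b", critic_output)
    if not verdicts:
        return 0.0  # no parseable verdict, no reward
    return 1.0 if verdicts[-1] == gold_preference else 0.0

# Example: the last standalone letter in the transcript is taken as the verdict.
assert preference_reward("Candidate B ignores gravity, so the better answer is A.", "A") == 1.0
```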