R2-T2: Re-Routing in Test-Time for Multimodal Mixture-of-Experts

Zhongyang Li, Ziyue Li, Tianyi Zhou

2025-02-28

Summary

This paper introduces a new method called R2-T2 (Re-Routing in Test-Time) that improves how AI systems that use both text and images (multimodal models) work, especially when dealing with visual information.

What's the problem?

Current AI models that handle both text and images are really good at understanding language, but not as good at understanding visual information. This makes it hard for them to perform well on complex tasks that involve both. Even when using a system called mixture-of-experts (MoE) to help with visual understanding, the part that decides which expert to use (the router) doesn't always make the best choices.

What's the solution?

The researchers created R2-T2, which adjusts how the router chooses experts while the AI is working on a task, without needing to retrain the whole system. For each new problem, R2-T2 finds similar examples where the AI already got the right answer and nudges the router's choices toward the choices that worked for those examples. They came up with three different strategies for doing this, each with its own way of finding similar examples and making the adjustment.
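The core idea above can be sketched in a few lines. This is a simplified, hypothetical illustration (not the paper's exact algorithm): the function `rerouting_sketch`, its kernel weighting, and the single-step update size are all assumptions made for clarity. It shows a test sample's routing-weight vector being pulled toward the routing weights of its nearest correctly-predicted reference samples.

```python
import numpy as np

def rerouting_sketch(test_embed, test_weights, ref_embeds, ref_weights,
                     step=0.5, k=3):
    """Hypothetical sketch of the R2-T2 idea: move a test sample's
    routing-weight vector toward the routing weights of its k nearest
    correctly-predicted reference samples."""
    # Distance from the test sample to each reference sample in embedding space.
    dists = np.linalg.norm(ref_embeds - test_embed, axis=1)
    nearest = np.argsort(dists)[:k]

    # Kernel similarities: closer neighbors get more influence (assumed kernel).
    sims = np.exp(-dists[nearest])
    sims /= sims.sum()

    # Target vector = similarity-weighted average of the neighbors' routing weights.
    target = sims @ ref_weights[nearest]

    # Take one step from the current weights toward the target, then renormalize.
    new_weights = (1 - step) * test_weights + step * target
    return new_weights / new_weights.sum()

# Toy example: two experts, three reference samples with known-good routings.
ref_embeds = np.array([[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]])
ref_weights = np.array([[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]])
test_embed = np.array([0.1, 0.1])       # close to the first two references
test_weights = np.array([0.5, 0.5])     # router's original, suboptimal output
updated = rerouting_sketch(test_embed, test_weights,
                           ref_embeds, ref_weights, step=0.5, k=2)
print(updated)  # weight on expert 0 increases, matching the near neighbors
```

In the paper, this update is done iteratively as a local optimization over the routing-weight vector; the one-step version here only conveys the direction of the adjustment.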

Why it matters?

This matters because it can make AI systems that work with both text and images much better at handling complex tasks without needing expensive retraining. It could lead to more accurate and versatile AI assistants, better image recognition systems, and improved performance in fields like medical imaging or autonomous vehicles where understanding both visual and textual information is crucial.

Abstract

In large multimodal models (LMMs), the perception of non-language modalities (e.g., visual representations) is usually not on par with the large language models' (LLMs') powerful reasoning capabilities, deterring LMMs' performance on challenging downstream tasks. This weakness has been recently mitigated by replacing the vision encoder with a mixture-of-experts (MoE), which provides rich, multi-granularity, and diverse representations required by diverse downstream tasks. The performance of multimodal MoE largely depends on its router, which reweights and mixes the representations of different experts for each input. However, we find that the end-to-end trained router does not always produce the optimal routing weights for every test sample. To bridge the gap, we propose a novel and efficient method, "Re-Routing in Test-Time" (R2-T2), that locally optimizes the vector of routing weights in test-time by moving it toward those vectors of the correctly predicted samples in a neighborhood of the test sample. We propose three R2-T2 strategies with different optimization objectives and neighbor-search spaces. R2-T2 consistently and greatly improves state-of-the-art LMMs' performance on challenging benchmarks of diverse tasks, without training any base-model parameters.