
Router-Suggest: Dynamic Routing for Multimodal Auto-Completion in Visually-Grounded Dialogs

Sandeep Mishra, Devichand Budagam, Anubhab Mandal, Bishal Santra, Pawan Goyal, Manish Gupta

2026-01-12


Summary

This paper introduces a new auto-completion task called Multimodal Auto-Completion (MAC), which predicts what you're going to type next in a chat or other digital interface. Unlike traditional auto-completion, it grounds its predictions not just in the text you've already typed, but also in any images or visual information being shared in the conversation.

What's the problem?

Traditional auto-completion systems only look at the text you've already written to guess what you'll say next. This can be inaccurate because your meaning is often shaped by what you *see* in the conversation, such as an image someone sent or the design you're working on. Existing systems don't effectively use this visual context, leading to less helpful and sometimes frustrating auto-completion suggestions.

What's the solution?

The researchers created new benchmark datasets for this multimodal auto-completion task by adapting existing chat datasets (MMDialog and ImageChat) to include visual information. They then tested several vision-language models (VLMs), AI systems that can understand both images and text, against strong text-only baselines. To improve speed and efficiency, they developed Router-Suggest, a framework that dynamically chooses between a faster text-only model and a more accurate but slower vision-language model based on the dialog context. They also built a lightweight variant of Router-Suggest for devices with limited processing power.
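The core idea of routing can be illustrated with a toy sketch. The paper does not specify Router-Suggest's actual routing features or models, so everything below (the `Turn` structure, the heuristic of checking recent images and prefix length, the backend names) is a hypothetical stand-in for a learned router, not the authors' method:

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class Turn:
    """One dialog turn; `has_image` marks whether visual content was shared."""
    text: str
    has_image: bool = False

def route(dialog: List[Turn], prefix: str) -> str:
    """Toy routing policy (illustrative only): fall back to the cheap
    text-only model unless the recent context contains an image and the
    typed prefix is short, i.e. when visual grounding is most likely to
    change the completion."""
    recent_has_image = any(turn.has_image for turn in dialog[-3:])
    if recent_has_image and len(prefix) < 20:
        return "vlm"   # slower but visually grounded model
    return "text"      # faster text-only model

def complete(dialog: List[Turn], prefix: str,
             backends: Dict[str, Callable[[List[Turn], str], str]]) -> str:
    """Dispatch the completion request to whichever backend the router picks."""
    return backends[route(dialog, prefix)](dialog, prefix)
```

In this sketch the router is a fixed heuristic; a practical system would more likely train a small classifier on dialog features to decide when the extra latency of the VLM pays off, which is what yields the reported 2.3x to 10x speedup over always calling the VLM.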

Why does it matter?

This work is important because it shows that incorporating visual information significantly improves auto-completion quality and user satisfaction. By understanding the visual context, the system can provide more relevant and helpful suggestions, ultimately saving users time and effort. This is especially valuable in applications like digital assistants, chatbots, and healthcare where understanding the full context is crucial for effective communication.

Abstract

Real-time multimodal auto-completion is essential for digital assistants, chatbots, design tools, and healthcare consultations, where user inputs rely on shared visual context. We introduce Multimodal Auto-Completion (MAC), a task that predicts upcoming characters in live chats using partially typed text and visual cues. Unlike traditional text-only auto-completion (TAC), MAC grounds predictions in multimodal context to better capture user intent. To enable this task, we adapt MMDialog and ImageChat to create benchmark datasets. We evaluate leading vision-language models (VLMs) against strong textual baselines, highlighting trade-offs in accuracy and efficiency. We present Router-Suggest, a router framework that dynamically selects between textual models and VLMs based on dialog context, along with a lightweight variant for resource-constrained environments. Router-Suggest achieves a 2.3x to 10x speedup over the best-performing VLM. A user study shows that VLMs significantly excel over textual models on user satisfaction, notably saving user typing effort and improving the quality of completions in multi-turn conversations. These findings underscore the need for multimodal context in auto-completions, leading to smarter, user-aware assistants.