Teaching Metric Distance to Autoregressive Multimodal Foundational Models
Jiwan Chung, Saejin Kim, Yongrae Jo, Jaewoo Park, Dongjun Min, Youngjae Yu
2025-03-04
Summary
This paper introduces a new way to teach AI language models to understand and use distance relationships between different pieces of information, especially in domains like math, images, and robotics.
What's the problem?
As AI language models are being used for more than just text, they need to understand how different pieces of information relate to each other in terms of distance or similarity. Current models don't handle this well, which limits their ability to work with things like numbers, images, and physical actions.
What's the solution?
The researchers created DIST2Loss, a new method that helps AI models learn distance relationships between different pieces of information. It works by turning continuous distance information into discrete training targets that existing AI models can already consume. This lets a model learn and preserve important distance relationships while it generates new tokens, without any change to its architecture.
Why it matters?
This matters because it helps AI models perform better in a wide range of tasks, like understanding images, controlling robots, and generating new images. It's especially helpful when there isn't a lot of training data available, which could make AI more useful in situations where data or computing power is limited. This could lead to smarter, more versatile AI systems that can handle complex tasks in the real world more effectively.
Abstract
As large language models expand beyond natural language to domains such as mathematics, multimodal understanding, and embodied agents, tokens increasingly reflect metric relationships rather than purely linguistic meaning. We introduce DIST2Loss, a distance-aware framework designed to train autoregressive discrete models by leveraging predefined distance relationships among output tokens. At its core, DIST2Loss transforms continuous exponential family distributions derived from inherent distance metrics into discrete, categorical optimization targets compatible with the models' architectures. This approach enables the models to learn and preserve meaningful distance relationships during token generation while maintaining compatibility with existing architectures. Empirical evaluations show consistent performance gains in diverse multimodal applications, including visual grounding, robotic manipulation, generative reward modeling, and image generation using vector-quantized features. These improvements are pronounced in cases of limited training data, highlighting DIST2Loss's effectiveness in resource-constrained settings.