
MARVIS: Modality Adaptive Reasoning over VISualizations

Benjamin Feuer, Lennart Purucker, Oussama Elachqar, Chinmay Hegde

2025-07-03


Summary

This paper introduces MARVIS, a method that lets vision-language models handle many different types of data, such as images, audio, biological data, and tabular data, without any extra training. It works by converting complex data into visualizations that the models can interpret and reason about using their existing visual skills.

What's the problem?

The problem is that AI models usually need to be specially trained for each kind of data, which is costly and makes them inflexible. Non-image data in particular, such as audio signals or tabular records, is hard for general-purpose models to interpret directly.

What's the solution?

The researchers designed MARVIS to convert any kind of data into visual representations, so that even small vision-language models can use their natural ability to interpret images to make predictions. The method requires no extra training and helps preserve privacy, since the model sees only abstract visualizations rather than the raw, potentially sensitive data.
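To make the idea concrete, here is a minimal sketch of this kind of pipeline, not the authors' actual implementation: embed non-visual data into a low-dimensional space, render the embedding as an image, and hand that image to a vision-language model for classification. The specific choices here (PCA for embedding, matplotlib for rendering, synthetic tabular data) are illustrative assumptions.

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless rendering
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic "tabular" data: two classes of 10-dimensional points
X = np.vstack([rng.normal(0, 1, (50, 10)), rng.normal(3, 1, (50, 10))])
y = np.array([0] * 50 + [1] * 50)
query = rng.normal(3, 1, 10)  # unlabeled point we want the VLM to classify

# Project the labeled data and the query into a shared 2D embedding
pca = PCA(n_components=2).fit(X)
X2 = pca.transform(X)
q2 = pca.transform(query[None, :])[0]

# Render the embedding as an image; this picture is the VLM's input
fig, ax = plt.subplots(figsize=(4, 4))
for label, color in [(0, "tab:blue"), (1, "tab:orange")]:
    ax.scatter(*X2[y == label].T, c=color, label=f"class {label}", s=15)
ax.scatter(*q2, c="red", marker="*", s=200, label="query")
ax.legend()
fig.savefig("embedding.png", dpi=100)
```

A vision-language model would then be prompted with the saved image and a question such as "Which class cluster is the red star inside?", turning a tabular-classification problem into a visual-reasoning one, with no model fine-tuning involved.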

Why it matters?

This matters because one model can now handle many different tasks and data types, making AI more flexible. It also protects privacy and lowers the cost and effort of training, which makes AI easier to apply in fields like healthcare, finance, and science.

Abstract

MARVIS is a training-free method that enables small vision-language models to predict accurately across diverse data modalities by rendering latent embeddings as visualizations and leveraging the models' spatial reasoning.