Robusto-1 Dataset: Comparing Humans and VLMs on real out-of-distribution Autonomous Driving VQA from Peru

Dunant Cusipuma, David Ortega, Victor Flores-Benites, Arturo Deza

2025-03-12

Summary

This paper introduces the Robusto-1 dataset, which compares how humans and AI systems understand tricky driving situations in Peru to see whether self-driving cars reason the way people do.

What's the problem?

AI in self-driving cars struggles with unusual driving scenarios that humans handle easily, like aggressive drivers or weird street objects, which can make the AI act unpredictably.

What's the solution?

The Robusto-1 dataset uses dashcam videos from Peru’s chaotic roads and asks both humans and AI the same questions about driving situations, then checks where their answers match or differ.

Why does it matter?

This helps improve self-driving AI by showing where it needs to think more like humans, making it safer for places with unpredictable driving conditions.

Abstract

As multimodal foundational models start being deployed experimentally in self-driving cars, a reasonable question to ask is how similarly to humans these systems respond in certain driving situations -- especially those that are out-of-distribution. To study this, we create the Robusto-1 dataset, which uses dashcam video data from Peru, a country with some of the most aggressive drivers in the world, a high traffic index, and a high ratio of bizarre to non-bizarre street objects likely never seen in training. In particular, to preliminarily test at a cognitive level how well foundational Visual Language Models (VLMs) compare to humans in driving, we move away from bounding boxes, segmentation maps, occupancy maps, and trajectory estimation to multimodal Visual Question Answering (VQA), comparing humans and machines through a popular method in systems neuroscience known as Representational Similarity Analysis (RSA). Depending on the type of questions we ask and the answers these systems give, we show in which cases VLMs and humans converge or diverge, allowing us to probe their cognitive alignment. We find that the degree of alignment varies significantly depending on the type of questions asked of each type of system (humans vs. VLMs), highlighting a gap in their alignment.
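The core of RSA is to compare two systems not answer-by-answer, but by the *structure* of their responses: build a representational dissimilarity matrix (RDM) of pairwise distances between stimuli for each system, then correlate the two RDMs. A minimal sketch of this idea, using toy random embeddings as hypothetical stand-ins for the embedded VQA answers (the variable names and data here are illustrative, not the paper's actual pipeline):

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

# Toy stand-ins for answer embeddings: rows = driving-scene questions,
# columns = embedding dimensions. Hypothetical data, not from the paper.
rng = np.random.default_rng(0)
human_embeddings = rng.normal(size=(20, 64))
# Simulate a VLM whose answers partly track the human ones (added noise).
vlm_embeddings = human_embeddings + rng.normal(scale=0.3, size=(20, 64))

def rdm(embeddings):
    """Representational dissimilarity matrix: condensed vector of
    pairwise correlation distances between stimuli (rows)."""
    return pdist(embeddings, metric="correlation")

# RSA score: rank correlation between the two systems' RDMs.
rho, _ = spearmanr(rdm(human_embeddings), rdm(vlm_embeddings))
print(f"RSA alignment (Spearman rho): {rho:.2f}")
```

A higher rho means the two systems organize the driving scenarios similarly, even if their literal answers differ; comparing rho across question types is how one would probe where alignment breaks down.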