FineVision: Open Data Is All You Need
Luis Wiedmann, Orr Zohar, Amir Mahla, Xiaohan Wang, Rui Li, Thibaud Frere, Leandro von Werra, Aritra Roy Gosthipaty, Andrés Marafioti
2025-10-21
Summary
This paper introduces FineVision, a massive new dataset designed to improve how well computers can understand both images and language at the same time.
What's the problem?
Training these 'vision-language models' is currently difficult because the datasets used are often messy and inconsistent, and sometimes even contain information that 'leaks' from evaluation benchmarks into the training data, giving models an unfairly inflated score. There is no single, reliable, large dataset that researchers can use to build and compare these models properly.
What's the solution?
The researchers created FineVision by combining data from over 200 different sources, yielding 24 million samples. They didn't just automatically merge everything: they used a combination of automated programs and human reviewers to ensure the data was accurate, properly formatted, diverse, and safe. They also removed duplicate entries, both within and across sources, and checked for contamination against 66 public benchmarks. They even included data for tasks where a computer needs to interact with a graphical user interface, describing those actions in a single unified format and making sure they were valid.
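The de-duplication and decontamination steps described above can be sketched in a few lines. This is a minimal illustration, assuming exact-match hashing over image bytes plus normalized text; the paper's actual pipeline is not specified here and may use more sophisticated (e.g. fuzzy or perceptual) matching. The function names are hypothetical.

```python
import hashlib

def sample_key(image_bytes: bytes, text: str) -> str:
    # Hash the raw image bytes together with whitespace/case-normalized
    # text so that trivially reformatted duplicates collide.
    h = hashlib.sha256()
    h.update(image_bytes)
    h.update(" ".join(text.lower().split()).encode())
    return h.hexdigest()

def deduplicate(samples, benchmark_keys):
    # Drop exact duplicates (within and across sources) and any sample
    # whose key also appears in a public benchmark (decontamination).
    seen, kept = set(), []
    for image_bytes, text in samples:
        key = sample_key(image_bytes, text)
        if key in seen or key in benchmark_keys:
            continue
        seen.add(key)
        kept.append((image_bytes, text))
    return kept
```

In this sketch, `benchmark_keys` would be precomputed once from the 66 evaluation benchmarks, and every training sample is checked against both that set and the samples already kept.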
Why it matters?
FineVision is important because models trained on it consistently outperform those trained on existing open datasets. This shows that a large, clean, and well-maintained dataset is crucial for advancing vision-language models. The researchers are also releasing both the dataset and the tools they used to create it publicly, so that other researchers can build on them.
Abstract
The advancement of vision-language models (VLMs) is hampered by a fragmented landscape of inconsistent and contaminated public datasets. We introduce FineVision, a meticulously collected, curated, and unified corpus of 24 million samples - the largest open resource of its kind. We unify more than 200 sources into 185 subsets via a semi-automated, human-in-the-loop pipeline: automation performs bulk ingestion and schema mapping, while reviewers audit mappings and spot-check outputs to verify faithful consumption of annotations, appropriate formatting and diversity, and safety; issues trigger targeted fixes and re-runs. The workflow further applies rigorous de-duplication within and across sources and decontamination against 66 public benchmarks. FineVision also encompasses agentic/GUI tasks with a unified action space; reviewers validate schemas and inspect a sample of trajectories to confirm executable fidelity. Models trained on FineVision consistently outperform those trained on existing open mixtures across a broad evaluation suite, underscoring the benefits of scale, data hygiene, and balanced automation with human oversight. We release the corpus and curation tools to accelerate data-centric VLM research.