RoMa v2: Harder Better Faster Denser Feature Matching
Johan Edstedt, David Nordström, Yushan Zhang, Georg Bökman, Jonathan Astermark, Viktor Larsson, Anders Heyden, Fredrik Kahl, Mårten Wadenbäck, Michael Felsberg
2025-11-20
Summary
This paper introduces a new and improved system for finding corresponding points between two images of the same scene, a process called dense feature matching.
What's the problem?
Even the best dense feature matching systems still fail or perform poorly on many hard, real-world scenes. The most precise models are also often slow, which limits where they can be used in practice.
What's the solution?
The researchers tackled this problem by making improvements in several areas. They designed a new matching architecture and loss function, and trained the model on a curated, diverse set of data. They sped up training by splitting it into two decoupled stages, matching first and then refinement, and reduced the memory used during refinement with a custom CUDA kernel. Finally, they built on a powerful pre-trained image model, DINOv3, together with several other insights, to make the matcher more robust and less biased.
Why does it matter?
This work is important because the resulting matcher sets a new state of the art and is significantly more accurate than its predecessors, while training faster and using less memory during refinement. More reliable correspondences between images can benefit applications that depend on them, such as robotics, self-driving cars, and 3D reconstruction.
Abstract
Dense feature matching aims to estimate all correspondences between two images of a 3D scene and has recently been established as the gold standard due to its high accuracy and robustness. However, existing dense matchers still fail or perform poorly for many hard real-world scenarios, and high-precision models are often slow, limiting their applicability. In this paper, we attack these weaknesses on a wide front through a series of systematic improvements that together yield a significantly better model. In particular, we construct a novel matching architecture and loss, which, combined with a curated diverse training distribution, enables our model to solve many complex matching tasks. We further make training faster through a decoupled two-stage matching-then-refinement pipeline, and at the same time, significantly reduce refinement memory usage through a custom CUDA kernel. Finally, we leverage the recent DINOv3 foundation model along with multiple other insights to make the model more robust and unbiased. In our extensive set of experiments we show that the resulting novel matcher sets a new state of the art, being significantly more accurate than its predecessors. Code is available at https://github.com/Parskatt/romav2
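
To make "all correspondences between two images" concrete, here is a minimal NumPy sketch of one common way to represent a dense matcher's output: a warp that stores, for every pixel in image A, its matched location in image B, plus a per-pixel confidence. This is an illustration only and not the RoMa v2 API. The function name `warp_to_matches`, the array layout, and the threshold are all assumptions made for this example.

```python
import numpy as np

def warp_to_matches(warp, confidence, threshold=0.5):
    """Turn a dense warp into a list of pixel correspondences.

    Assumed (hypothetical) layout, not taken from the paper:
      warp:       (H, W, 2) array, warp[y, x] = (x', y') in image B
                  matched to pixel (x, y) in image A.
      confidence: (H, W) array in [0, 1], trust in each match.
    """
    # Keep only pixels whose match is trusted above the threshold.
    ys, xs = np.nonzero(confidence > threshold)
    pts_a = np.stack([xs, ys], axis=1).astype(np.float64)
    pts_b = warp[ys, xs].astype(np.float64)
    return pts_a, pts_b, confidence[ys, xs]

# Toy data: image B is image A shifted 3 px right and 1 px down.
h, w = 4, 6
xs, ys = np.meshgrid(np.arange(w), np.arange(h))
warp = np.stack([xs + 3, ys + 1], axis=-1).astype(np.float64)

# Pixels whose match would fall outside image B get zero confidence.
confidence = ((warp[..., 0] < w) & (warp[..., 1] < h)).astype(np.float64)

pts_a, pts_b, conf = warp_to_matches(warp, confidence)
print(len(pts_a), "confident matches")
```

A sparse matcher would return only a handful of keypoint pairs. The dense representation instead gives a candidate match for every pixel, and downstream code decides which ones to keep, here with a simple confidence threshold.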