AnomalyVFM -- Transforming Vision Foundation Models into Zero-Shot Anomaly Detectors

Matic Fučka, Vitjan Zavrtanik, Danijel Skočaj

2026-04-10

Summary

This paper focuses on finding unusual or 'anomalous' parts of images without access to any training images from the target domain, a task called zero-shot anomaly detection.

What's the problem?

Current methods for zero-shot anomaly detection often rely on models that understand both images and text, like CLIP. However, models that *only* work with images, like DINOv2, haven't performed as well. The researchers argue this is because the auxiliary datasets used to teach these image-only models what anomalies look like aren't diverse enough, and the ways the models have been adapted for the task are too shallow.

What's the solution?

The researchers created a new framework called AnomalyVFM. It first builds a more varied synthetic training set, using a three-stage generation scheme that gives the image-only models a wide range of images and simulated defects to learn from. It then fine-tunes these models in a parameter-efficient way: small low-rank adapters are added on top of the frozen features, and a pixel-level loss weights each location by how confident the model is. Together, these changes make the image-only models much better at spotting anomalies.
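
To make the adaptation idea concrete, here is a minimal PyTorch sketch of a low-rank feature adapter and a confidence-weighted pixel loss. The class names, tensor shapes, and the exact weighting formula are illustrative assumptions for this summary, not the authors' implementation.

```python
# Sketch of parameter-efficient adaptation: a low-rank residual adapter on
# frozen VFM patch features plus a confidence-weighted pixel loss.
# Shapes, names, and the weighting scheme are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LowRankAdapter(nn.Module):
    """Adds a small trainable low-rank residual to frozen backbone features."""

    def __init__(self, dim: int, rank: int = 16):
        super().__init__()
        self.down = nn.Linear(dim, rank, bias=False)  # project to low rank
        self.up = nn.Linear(rank, dim, bias=False)    # project back to full dim
        nn.init.zeros_(self.up.weight)                # start as an identity mapping

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, num_patches, dim) features from the frozen VFM
        return feats + self.up(self.down(feats))


def confidence_weighted_pixel_loss(pred: torch.Tensor,
                                   target: torch.Tensor) -> torch.Tensor:
    """Pixel-wise BCE where each pixel is weighted by prediction confidence.

    pred:   (batch, H, W) anomaly probabilities in [0, 1] (after sigmoid)
    target: (batch, H, W) synthetic anomaly masks in {0, 1}
    The confidence measure below (distance from 0.5) is one plausible choice,
    used here purely for illustration.
    """
    bce = F.binary_cross_entropy(pred, target, reduction="none")
    confidence = (pred - 0.5).abs() * 2.0  # 0 = maximally unsure, 1 = confident
    weights = confidence.detach()
    return (weights * bce).sum() / weights.sum().clamp(min=1e-6)
```

In this sketch only the adapter's two small linear layers would be trained, while the backbone stays frozen, which is the general pattern behind parameter-efficient fine-tuning.
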

Why it matters?

This work is important because it shows that image-only models can be just as good, or even better, at finding anomalies than models that also use text. That matters because purely visual models don't need text prompts or paired image-text training data, which can make them simpler and more efficient to deploy. By improving their performance, this research opens up possibilities for more efficient and effective anomaly detection in applications like medical imaging or industrial inspection.

Abstract

Zero-shot anomaly detection aims to detect and localise abnormal regions in the image without access to any in-domain training images. While recent approaches leverage vision-language models (VLMs), such as CLIP, to transfer high-level concept knowledge, methods based on purely vision foundation models (VFMs), like DINOv2, have lagged behind in performance. We argue that this gap stems from two practical issues: (i) limited diversity in existing auxiliary anomaly detection datasets and (ii) overly shallow VFM adaptation strategies. To address both challenges, we propose AnomalyVFM, a general and effective framework that turns any pretrained VFM into a strong zero-shot anomaly detector. Our approach combines a robust three-stage synthetic dataset generation scheme with a parameter-efficient adaptation mechanism, utilising low-rank feature adapters and a confidence-weighted pixel loss. Together, these components enable modern VFMs to substantially outperform current state-of-the-art methods. More specifically, with RADIO as a backbone, AnomalyVFM achieves an average image-level AUROC of 94.1% across 9 diverse datasets, surpassing previous methods by a significant 3.3 percentage points. Project Page: https://maticfuc.github.io/anomaly_vfm/
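
For reference, the image-level AUROC quoted above can be computed from per-image anomaly scores as sketched below; the labels and scores here are placeholders, not results from the paper.

```python
# Small sketch of image-level AUROC computation with scikit-learn.
# The arrays are hypothetical examples, not data from the paper.
import numpy as np
from sklearn.metrics import roc_auc_score

# 1 = anomalous image, 0 = normal image (placeholder ground-truth labels)
labels = np.array([0, 0, 1, 1, 0, 1])
# Higher score = more anomalous, e.g. the max of a predicted anomaly map
scores = np.array([0.10, 0.25, 0.80, 0.65, 0.30, 0.90])

image_auroc = roc_auc_score(labels, scores)
print(f"image-level AUROC: {image_auroc:.3f}")
# The paper reports this metric per dataset and averages it over 9 datasets.
```
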