
ViTNT-FIQA: Training-Free Face Image Quality Assessment with Vision Transformers

Guray Ozgur, Eduarda Caldeira, Tahar Chettaoui, Jan Niklas Kolf, Marco Huber, Naser Damer, Fadi Boutros

2026-01-12


Summary

This paper introduces a new way to automatically judge the quality of face images, which is important for making sure face recognition systems work well.

What's the problem?

Current methods for checking face image quality typically examine only the features from the final layer of the recognition model, ignoring everything the model computes along the way. The training-free alternatives are expensive: they require multiple forward passes through the model or backpropagation, and some demand changes to the model's internal architecture. This makes them slow to run or hard to apply to an existing system.

What's the solution?

The researchers developed a method called ViTNT-FIQA that measures how stably an image's features evolve as they pass through the blocks of a Vision Transformer, a common neural network architecture for image recognition. High-quality images show smooth, consistent feature refinement from one block to the next, while degraded images show erratic jumps. The method needs only a single forward pass through the model and requires no training, backpropagation, or changes to the existing network.
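The core computation described in the abstract can be sketched in a few lines: L2-normalize the patch embeddings from each transformer block, take Euclidean distances between corresponding patches of consecutive blocks, and aggregate into one score per image. This is a minimal NumPy sketch, not the authors' implementation; the aggregation function (a plain mean here) and the sign convention (negating instability so that higher means better quality) are assumptions for illustration.

```python
import numpy as np

def vitnt_fiqa_score(block_embeddings):
    """Quality score from the stability of patch-embedding trajectories.

    block_embeddings: list of arrays, one per transformer block,
    each of shape (num_patches, dim), assumed to be collected from
    a pre-trained ViT in a single forward pass.
    """
    # L2-normalize patch embeddings within each block
    normed = [e / np.linalg.norm(e, axis=-1, keepdims=True)
              for e in block_embeddings]
    # Euclidean distance between corresponding patches of consecutive blocks
    dists = [np.linalg.norm(b - a, axis=-1)   # shape: (num_patches,)
             for a, b in zip(normed[:-1], normed[1:])]
    # Aggregate patch-level distances into one instability value per image
    # (a simple mean here; the paper's exact aggregation may differ)
    instability = float(np.mean(dists))
    # Stable trajectories -> low instability -> high quality
    return -instability
```

On this sketch, an image whose embeddings barely change between blocks scores near zero, while one with erratic block-to-block jumps scores more negative, matching the paper's intuition that stability correlates with quality.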

Why it matters?

This method is fast, requires no training, and works with any existing ViT-based face recognition system out of the box. It can improve the reliability of face recognition in real-world applications by quickly flagging, and potentially rejecting, low-quality images before they cause recognition errors.

Abstract

Face Image Quality Assessment (FIQA) is essential for reliable face recognition systems. Current approaches primarily exploit only final-layer representations, while training-free methods require multiple forward passes or backpropagation. We propose ViTNT-FIQA, a training-free approach that measures the stability of patch embedding evolution across intermediate Vision Transformer (ViT) blocks. We demonstrate that high-quality face images exhibit stable feature refinement trajectories across blocks, while degraded images show erratic transformations. Our method computes Euclidean distances between L2-normalized patch embeddings from consecutive transformer blocks and aggregates them into image-level quality scores. We empirically validate this correlation on a quality-labeled synthetic dataset with controlled degradation levels. Unlike existing training-free approaches, ViTNT-FIQA requires only a single forward pass without backpropagation or architectural modifications. Through extensive evaluation on eight benchmarks (LFW, AgeDB-30, CFP-FP, CALFW, Adience, CPLFW, XQLFW, IJB-C), we show that ViTNT-FIQA achieves competitive performance with state-of-the-art methods while maintaining computational efficiency and immediate applicability to any pre-trained ViT-based face recognition model.