Local Scale Equivariance with Latent Deep Equilibrium Canonicalizer

Md Ashiqur Rahman, Chiao-An Yang, Michael N. Cheng, Lim Jun Hao, Jeremiah Jiang, Teck-Yian Lim, Raymond A. Yeh

2025-08-21

Summary

This paper introduces a new technique called the deep equilibrium canonicalizer (DEC) to help computer vision models better understand objects in images, even when those objects are different sizes or at different distances from the camera.

What's the problem?

A central difficulty in computer vision is that objects of the same class can look very different across pictures: they may be inherently bigger or smaller, or appear at different distances from the camera. Crucially, these size changes are local, meaning different objects within the same picture can vary in scale independently, which makes it hard for models to recognize them consistently.

What's the solution?

The researchers created a "deep equilibrium canonicalizer" (DEC), a module that can be added to existing computer vision networks. The DEC makes these networks more "scale equivariant," meaning their outputs change predictably, and their predictions stay consistent, when objects change size. Notably, it can even be attached to models that have already been trained.
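This summary doesn't describe the DEC's exact architecture, but a "deep equilibrium" module generally computes its output as the fixed point of a layer: the same update is iterated until the output stops changing. A minimal sketch of that idea, assuming a simple tanh update rule (the weights, dimensions, and update form here are illustrative, not taken from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

def deq_canonicalize(h, W, U, tol=1e-6, max_iter=200):
    # Solve z* = tanh(W @ z* + U @ h) by plain fixed-point iteration.
    # h is the incoming latent feature; z* is the equilibrium output.
    z = np.zeros(W.shape[0])
    for _ in range(max_iter):
        z_next = np.tanh(W @ z + U @ h)
        if np.linalg.norm(z_next - z) < tol:
            return z_next
        z = z_next
    return z

d = 8
# W is scaled down so the update is contractive and the iteration converges.
W = 0.1 * rng.standard_normal((d, d))
U = rng.standard_normal((d, d))
h = rng.standard_normal(d)

z_star = deq_canonicalize(h, W, U)
# At equilibrium, z* satisfies the fixed-point equation up to tolerance.
residual = np.linalg.norm(z_star - np.tanh(W @ z_star + U @ h))
print(residual < 1e-5)
```

Because the module is defined by an equilibrium condition rather than a fixed number of layers, it can in principle be bolted onto a pretrained network's latent features, which matches the summary's claim that DEC works with already-trained models.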

Why it matters?

This method matters because it improves how well computer vision models perform on tasks like image recognition, and it makes them more consistent when identifying objects of different sizes. The authors showed that on the popular ImageNet benchmark, DEC improves both accuracy and local scale consistency for four widely used models: ViT, DeiT, Swin, and BEiT.
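The summary doesn't define how "local scale consistency" is measured in the paper, but the general idea of checking consistency under rescaling can be sketched as follows, using a toy pooled "model" and nearest-neighbor upsampling as hypothetical stand-ins:

```python
import numpy as np

def rescale_nn(img, factor):
    # Nearest-neighbor upsampling of a 2D array by an integer factor:
    # each pixel is replicated into a factor x factor block.
    return np.kron(img, np.ones((factor, factor)))

def toy_model(img):
    # A toy "feature extractor" (not the paper's model): global mean
    # and standard deviation, which happen to be scale-invariant here.
    return np.array([img.mean(), img.std()])

def scale_consistency(img, factor):
    # Distance between predictions on the original image and a
    # rescaled copy; zero means perfectly consistent under this rescale.
    return np.linalg.norm(toy_model(img) - toy_model(rescale_nn(img, factor)))

rng = np.random.default_rng(1)
img = rng.random((8, 8))
gap = scale_consistency(img, 2)
print(gap)
```

A model that handles scale well should keep this kind of gap small; the paper's contribution is making real networks like ViT behave more this way, rather than relying on a trivially invariant pooled feature as in this sketch.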

Abstract

Scale variation is a fundamental challenge in computer vision. Objects of the same class can have different sizes, and their perceived size is further affected by the distance from the camera. These variations are local to the objects, i.e., different object sizes may change differently within the same image. To effectively handle scale variations, we present a deep equilibrium canonicalizer (DEC) to improve the local scale equivariance of a model. DEC can be easily incorporated into existing network architectures and can be adapted to a pre-trained model. Notably, we show that on the competitive ImageNet benchmark, DEC improves both model performance and local scale consistency across four popular pre-trained deep-nets, e.g., ViT, DeiT, Swin, and BEiT. Our code is available at https://github.com/ashiq24/local-scale-equivariance.