SHIC: Shape-Image Correspondences with no Keypoint Supervision

Aleksandar Shtedritski, Christian Rupprecht, Andrea Vedaldi

2024-07-29

SHIC: Shape-Image Correspondences with no Keypoint Supervision

Summary

This paper introduces SHIC, a new method for matching images of objects to a 3D template without needing manual supervision. It aims to improve how we understand the shapes of objects in images by using advanced computer vision techniques.

What's the problem?

Traditionally, matching pixels in images to corresponding points in a 3D model (known as canonical surface mapping) requires a lot of manual work to identify key points. This process can be expensive and time-consuming, making it difficult to apply to various objects beyond humans. The challenge is to find a way to automate this process effectively without relying on extensive human input.

What's the solution?

SHIC solves this problem by using large computer vision models that have already been trained on many types of images. Instead of matching images directly to the 3D template, SHIC first matches images to non-photorealistic versions of the template, which simplifies the task. It uses features from these advanced models to create correspondences between the images and the template. This approach allows SHIC to learn how to create accurate mappings without needing manual annotations, achieving better results than previous supervised methods.

Why it matters?

This research is significant because it makes it easier and cheaper to analyze and understand various objects in images, which has applications in fields like robotics, biology, and computer graphics. By automating the process of shape mapping, SHIC can help researchers and developers create better tools for object recognition and analysis.

Abstract

Canonical surface mapping generalizes keypoint detection by assigning each pixel of an object to a corresponding point in a 3D template. Popularised by DensePose for the analysis of humans, authors have since attempted to apply the concept to more categories, but with limited success due to the high cost of manual supervision. In this work, we introduce SHIC, a method to learn canonical maps without manual supervision which achieves better results than supervised methods for most categories. Our idea is to leverage foundation computer vision models such as DINO and Stable Diffusion that are open-ended and thus possess excellent priors over natural categories. SHIC reduces the problem of estimating image-to-template correspondences to predicting image-to-image correspondences using features from the foundation models. The reduction works by matching images of the object to non-photorealistic renders of the template, which emulates the process of collecting manual annotations for this task. These correspondences are then used to supervise high-quality canonical maps for any object of interest. We also show that image generators can further improve the realism of the template views, which provide an additional source of supervision for the model.

View Paper