An accurate detection is not all you need to combat label noise in web-noisy datasets

Paul Albert, Jack Valmadre, Eric Arazo, Tarun Krishna, Noel E. O'Connor, Kevin McGuinness

2024-07-11

Summary

This paper examines label noise in datasets collected from the web and shows that accurately detecting noisy samples is not, on its own, enough to train a better classifier. It explores why current detection methods fall short and proposes a hybrid approach that improves classification accuracy despite the noisy data.

What's the problem?

When training classifiers on data collected from the web, the labels (annotations) often contain errors, and many examples are irrelevant to the task. Some methods can accurately detect these out-of-distribution (OOD) samples, which do not belong to any of the intended categories, yet this accurate detection does not necessarily translate into better classification performance. In fact, the detection can mistakenly discard valuable clean examples, particularly visually simple images, that the classifier still needs for learning.

What's the solution?

The authors propose a hybrid solution that alternates between two approaches: one that uses linear separation over unsupervised contrastive features to detect noise, and a state-of-the-art small-loss method, which treats samples with low training loss as likely clean. Alternating between the two ensures that the classifier not only filters out noisy data but also retains important clean examples that are visually simple yet still useful for learning. Combined with the state-of-the-art PLS algorithm, this hybrid significantly improves classification results on real-world noisy web data. A sketch of the alternation follows.
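The sketch below illustrates this alternation in a minimal form. It is an assumption-laden illustration, not the authors' released implementation: the function names, the even/odd epoch schedule, and the idea of seeding the linear probe with loss-based picks are all hypothetical stand-ins for the details of the actual LSA/PLS code.

```python
# Minimal sketch of the alternating hybrid selection described above.
# Hypothetical throughout: names, schedule, and seeding heuristic are
# illustrative, not the authors' released LSA/PLS implementation.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.mixture import GaussianMixture

def select_small_loss(losses: np.ndarray) -> np.ndarray:
    """Small-loss selection: fit a 2-component GMM to per-sample losses
    and keep samples assigned to the low-loss (likely-clean) component."""
    losses = losses.reshape(-1, 1)
    gmm = GaussianMixture(n_components=2, random_state=0).fit(losses)
    clean_component = int(np.argmin(gmm.means_.ravel()))  # low-loss mode
    return gmm.predict_proba(losses)[:, clean_component] > 0.5

def select_linear(features: np.ndarray, seed_clean: np.ndarray) -> np.ndarray:
    """Linear-separation selection: estimate the ID/OOD hyperplane in the
    contrastive feature space from seed picks, then score every sample."""
    labels = seed_clean.astype(int)  # 1 = presumed ID, 0 = presumed OOD
    probe = LogisticRegression(max_iter=1000).fit(features, labels)
    return probe.predict(features).astype(bool)

def clean_mask_for_epoch(epoch: int, losses: np.ndarray,
                         features: np.ndarray) -> np.ndarray:
    """Alternate detectors: linear separation catches OOD web noise, while
    the small-loss criterion recovers easy-but-clean images it misses."""
    loss_mask = select_small_loss(losses)
    if epoch % 2 == 0:
        return select_linear(features, seed_clean=loss_mask)
    return loss_mask
```

In training, a mask like this would typically gate which samples contribute to the supervised loss at each epoch, with the remaining samples handled by the noise-robust machinery of the base algorithm.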

Why it matters?

This research is important because it highlights the challenges of using web-crawled datasets for training machine learning models. By developing a more effective way to manage label noise, the findings can lead to better-performing classifiers, which is crucial for applications that rely on accurate data interpretation, such as image recognition and automated systems.

Abstract

Training a classifier on web-crawled data demands learning algorithms that are robust to annotation errors and irrelevant examples. This paper builds upon the recent empirical observation that applying unsupervised contrastive learning to noisy, web-crawled datasets yields a feature representation under which the in-distribution (ID) and out-of-distribution (OOD) samples are linearly separable. We show that direct estimation of the separating hyperplane can indeed offer an accurate detection of OOD samples, and yet, surprisingly, this detection does not translate into gains in classification accuracy. Digging deeper into this phenomenon, we discover that the near-perfect detection misses a type of clean examples that are valuable for supervised learning. These examples often represent visually simple images, which are relatively easy to identify as clean examples using standard loss- or distance-based methods despite being poorly separated from the OOD distribution using unsupervised learning. Because we further observe a low correlation with SOTA metrics, we propose a hybrid solution that alternates between noise detection using linear separation and a state-of-the-art (SOTA) small-loss approach. When combined with the SOTA algorithm PLS, we substantially improve SOTA results for real-world image classification in the presence of web noise. Code: github.com/PaulAlbert31/LSA
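As a complement, here is a self-contained sketch of the abstract's first claim: that a directly estimated hyperplane can detect OOD samples in a contrastive feature space. Everything here is synthetic and hypothetical; the random arrays stand in for real contrastive embeddings and ground-truth ID/OOD flags, and a cross-validated linear probe is one standard way to test linear separability, not necessarily the paper's exact procedure.

```python
# Sketch: test whether a single hyperplane separates ID from OOD samples.
# Random data stands in for contrastive embeddings and ID/OOD labels.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)
feats = rng.normal(size=(1000, 128))   # stand-in for contrastive embeddings
is_id = rng.integers(0, 2, size=1000)  # stand-in: 1 = in-distribution, 0 = OOD

# Score each sample with a linear probe trained on the other folds,
# so the probe is never evaluated on its own training data.
scores = cross_val_predict(
    LogisticRegression(max_iter=1000), feats, is_id,
    cv=5, method="predict_proba",
)[:, 1]
print(f"OOD-detection AUROC: {roc_auc_score(is_id, scores):.3f}")
```

On random features this score hovers near 0.5; on real contrastive features the paper reports near-perfect separation, which is exactly what makes the missing downstream accuracy gains surprising.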