Large Language Models Meet Extreme Multi-label Classification: Scaling and Multi-modal Framework
Diego Ortego, Marlon Rodríguez, Mario Almagro, Kunal Dahiya, David Jiménez, Juan C. SanMiguel
2025-11-19
Summary
This paper explores how to improve Extreme Multi-label Classification (XMC), a challenging artificial intelligence task, by using powerful AI models, focusing in particular on incorporating both text and image information.
What's the problem?
XMC is the task of assigning relevant labels to each query from an extremely large set of candidates, which is computationally demanding. Current methods often rely on smaller AI models to stay efficient, but these may not be powerful enough. The challenge is to leverage the capabilities of larger, more advanced AI models, especially those that are good at understanding language and vision, without making the process too slow or resource-intensive.
What's the solution?
The researchers developed a new framework called ViXML that tackles this problem in two ways. First, they showed that a relatively large 'decoder' model (a type of AI architecture with a few billion parameters) can significantly improve performance without drastically increasing computing costs. Second, they found an efficient way to include visual information by summarizing each image with a single embedding, which keeps the extra computation small. They also extended existing text-only datasets with image data so that multi-modal methods can be tested and compared, and made these versions available for future benchmarking.
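To give a rough sense of the 'single embedding per image' idea, the sketch below shows one way such a scheme could work: a vision model's per-patch features are pooled into one vector, fused with the text embedding, and scored against all label embeddings by inner product. The function names, mean pooling, and additive fusion are illustrative assumptions made for this summary, not ViXML's actual implementation (see the linked repository for that).

    import torch
    import torch.nn.functional as F

    def pool_image_embedding(patch_features: torch.Tensor) -> torch.Tensor:
        # Collapse a vision model's patch features (num_patches x dim) into a
        # single vector, so each image adds only one embedding's worth of cost.
        return patch_features.mean(dim=0)

    def score_labels(text_emb: torch.Tensor,
                     image_emb: torch.Tensor,
                     label_embs: torch.Tensor) -> torch.Tensor:
        # Fuse the query's text and pooled image embeddings (simple additive
        # fusion, purely illustrative) and score every label by inner product.
        query = F.normalize(text_emb + image_emb, dim=-1)
        labels = F.normalize(label_embs, dim=-1)  # shape: (num_labels, dim)
        return labels @ query                     # shape: (num_labels,)

Because the image contributes just one vector, the scoring cost grows with the embedding dimension rather than with the number of image patches, which is what keeps the multi-modal extension cheap.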
Why it matters?
This work is important because it demonstrates how to effectively apply the latest AI advancements to a challenging real-world problem. By combining text and image understanding, and by finding ways to use larger models efficiently, the authors achieved significant improvements in accuracy, showing that images can provide substantial additional information even when powerful language models are used. This could lead to better AI systems for tasks like tagging articles, categorizing products, or understanding complex scenes.
Abstract
Foundation models have revolutionized artificial intelligence across numerous domains, yet their transformative potential remains largely untapped in Extreme Multi-label Classification (XMC). Queries in XMC are associated with relevant labels from extremely large label spaces, where it is critical to strike a balance between efficiency and performance. Therefore, many recent approaches efficiently pose XMC as a maximum inner product search between embeddings learned from small encoder-only transformer architectures. In this paper, we address two important aspects in XMC: how to effectively harness larger decoder-only models, and how to exploit visual information while maintaining computational efficiency. We demonstrate that both play a critical role in XMC separately and can be combined for improved performance. We show that a few-billion-parameter decoder can deliver substantial improvements while keeping computational overhead manageable. Furthermore, our Vision-enhanced eXtreme Multi-label Learning framework (ViXML) efficiently integrates foundation vision models by pooling a single embedding per image. This limits computational growth while unlocking multi-modal capabilities. Remarkably, ViXML with small encoders outperforms text-only decoders in most cases, showing that an image is worth billions of parameters. Finally, we present an extension of existing text-only datasets to exploit visual metadata and make them available for future benchmarking. Comprehensive experiments across four public text-only datasets and their corresponding image-enhanced versions validate our proposals' effectiveness, surpassing previous state-of-the-art by up to +8.21% in P@1 on the largest dataset. ViXML's code is available at https://github.com/DiegoOrtego/vixml.
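To illustrate what the abstract means by posing XMC as a maximum inner product search between embeddings, here is a minimal, self-contained sketch of that retrieval step. The function name, tensor shapes, and sizes are illustrative assumptions for this summary, not the paper's actual code.

    import torch

    def top_k_labels(query_emb: torch.Tensor,
                     label_embs: torch.Tensor,
                     k: int = 5) -> torch.Tensor:
        # Rank an extremely large label set by inner product with the query
        # embedding and return the indices of the k highest-scoring labels.
        scores = label_embs @ query_emb  # (num_labels,) inner products
        return torch.topk(scores, k).indices

    # Toy usage with made-up sizes: one million labels in a 64-d embedding space.
    labels = torch.randn(1_000_000, 64)
    query = torch.randn(64)
    print(top_k_labels(query, labels, k=5))

At real XMC scales the exhaustive scan is typically replaced by approximate nearest-neighbour search for speed, but the scoring rule remains the same inner product between query and label embeddings.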