LLM Teacher-Student Framework for Text Classification With No Manually Annotated Data: A Case Study in IPTC News Topic Classification
Taja Kuzman, Nikola Ljubešić
2024-12-02

Summary
This paper presents a teacher-student framework that uses large language models (LLMs) to classify news articles by topic in multiple languages without requiring any manually labeled training data.
What's the problem?
As the number of news stories available online continues to grow, it becomes increasingly important to classify them by topic so that readers can easily find relevant information. Doing this accurately across different languages is challenging, however, especially when no labeled data exists for training the models.
What's the solution?
The authors propose a framework in which a large language model, the teacher model, automatically annotates news articles in four languages (Slovenian, Croatian, Greek, and Catalan) with top-level IPTC Media Topic labels, producing a training dataset without any manual annotation. The teacher model achieves strong zero-shot performance on all four languages, that is, without any task-specific fine-tuning. Smaller BERT-like models, the student models, are then fine-tuned on this automatically annotated dataset and reach accuracy comparable to the teacher while being far cheaper to run at scale. The study also examines how much training data the student models need for good performance and how well they transfer across languages.
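A rough sketch of the teacher-annotation step is shown below. The model name, prompt wording, and abbreviated label list are illustrative assumptions for this summary, not the paper's exact setup.

```python
# Hypothetical sketch: a teacher LLM assigns one top-level IPTC Media Topic
# label to each unlabelled news article. Model, prompt, and label list are
# placeholders, not the authors' exact configuration.
from openai import OpenAI

LABELS = [
    "politics", "economy, business and finance", "sport",
    "science and technology", "health", "crime, law and justice",
]  # abbreviated; the full IPTC Media Topic schema has 17 top-level categories

client = OpenAI()

def annotate(article_text: str) -> str:
    """Ask the teacher LLM for a single top-level topic label."""
    prompt = (
        "Classify the following news article into exactly one of these "
        f"topics: {', '.join(LABELS)}.\n\nArticle:\n{article_text}\n\n"
        "Answer with the topic label only."
    )
    response = client.chat.completions.create(
        model="gpt-4o",  # placeholder teacher model
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip()

# The predicted labels are stored alongside the articles and later serve as
# (noisy) training targets for the smaller student models.
```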
Why it matters?
This research is significant because it provides a way to effectively classify news topics in multiple languages without requiring extensive manual work. By making it easier to organize and access news content, this framework can help improve how people find information and stay informed about current events, which is especially valuable in our increasingly globalized world.
Abstract
With the ever-increasing number of news stories available online, classifying them by topic, regardless of the language they are written in, has become crucial for enhancing readers' access to relevant content. To address this challenge, we propose a teacher-student framework based on large language models (LLMs) for developing multilingual news classification models of reasonable size with no need for manual data annotation. The framework employs a Generative Pretrained Transformer (GPT) model as the teacher model to develop an IPTC Media Topic training dataset through automatic annotation of news articles in Slovenian, Croatian, Greek, and Catalan. The teacher model exhibits a high zero-shot performance on all four languages. Its agreement with human annotators is comparable to that between the human annotators themselves. To mitigate the computational limitations associated with the requirement of processing millions of texts daily, smaller BERT-like student models are fine-tuned on the GPT-annotated dataset. These student models achieve high performance comparable to the teacher model. Furthermore, we explore the impact of the training data size on the performance of the student models and investigate their monolingual, multilingual and zero-shot cross-lingual capabilities. The findings indicate that student models can achieve high performance with a relatively small number of training instances, and demonstrate strong zero-shot cross-lingual abilities. Finally, we publish the best-performing news topic classifier, enabling multilingual classification with the top-level categories of the IPTC Media Topic schema.
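To make the student side of the framework concrete, the following is a minimal sketch of fine-tuning a multilingual BERT-like encoder on the GPT-annotated articles. The choice of xlm-roberta-base, the toy examples, the reduced label set, and the hyperparameters are assumptions for illustration, not the paper's reported configuration.

```python
# Minimal sketch: fine-tune a multilingual encoder as the student model on
# teacher-annotated data. Data, labels, and hyperparameters are placeholders.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

labels = ["politics", "sport", "economy, business and finance"]  # illustrative subset
label2id = {l: i for i, l in enumerate(labels)}

# In practice the GPT-annotated dataset would contain thousands of articles
# per language; two toy examples stand in for it here.
data = Dataset.from_dict({
    "text": ["Vlada je sprejela nov proračun ...", "Dinamo je osvojio prvenstvo ..."],
    "label": [label2id["politics"], label2id["sport"]],
})

tokenizer = AutoTokenizer.from_pretrained("xlm-roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "xlm-roberta-base", num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()}, label2id=label2id,
)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="student-iptc", num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=data.map(tokenize, batched=True),
    tokenizer=tokenizer,  # enables padding via the default data collator
)
trainer.train()
```

Once trained, such a student model can classify millions of articles per day far more cheaply than querying the teacher LLM, which is the computational motivation given in the abstract.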