CLASS-IT: Conversational and Lecture-Aligned Small-Scale Instruction Tuning for BabyLMs

Luca Capone, Alessandro Bondielli, Alessandro Lenci

2025-10-31

Summary

This research explores whether smaller language models can be improved by a training technique called instruction tuning, which teaches a model to follow instructions the way a human assistant would.

What's the problem?

Large language models are powerful, but they require a lot of computing power and data. The researchers wanted to know whether smaller, more manageable language models could also benefit from instruction tuning, and if so, what the best way to do it is. Specifically, they wondered whether it's better to train a model on a mix of different instruction types all at once, or to teach it one type at a time in a specific order.

What's the solution?

The researchers took small language models with 100 million and 140 million parameters and trained them using different sets of instructions focused on conversations and answering questions. They tried two approaches: merging all the instruction data together for simultaneous training, and using a sequential curriculum where the model learns one type of instruction before moving on to the next. They then tested how well these models performed on both tasks they were specifically trained for, and on completely new, unseen tasks to see how well they generalized.
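The difference between the two approaches comes down to how the training examples are ordered. The sketch below is purely illustrative (it is not the authors' code, and the function and dataset names are invented for the example): "merged" shuffles conversational and question-answering examples together, while "sequential" presents one instruction type in full before the other.

```python
import random

def build_schedule(conversational, qa, mode="merged", seed=0):
    """Order instruction-tuning examples from two datasets.

    'merged'     -> all examples shuffled together (simultaneous training)
    'sequential' -> one instruction type first, then the other (curriculum)
    """
    if mode == "merged":
        schedule = conversational + qa
        random.Random(seed).shuffle(schedule)
        return schedule
    if mode == "sequential":
        # Curriculum order: conversational phase, then QA phase.
        return conversational + qa
    raise ValueError(f"unknown mode: {mode}")

# Toy stand-ins for the two instruction datasets.
conv = [("conv", i) for i in range(3)]
qa = [("qa", i) for i in range(3)]

merged = build_schedule(conv, qa, mode="merged")
sequential = build_schedule(conv, qa, mode="sequential")
```

Both schedules contain the same examples; only the order differs, which is exactly the variable the paper's comparison isolates.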

Why it matters?

This work shows that instruction tuning *can* help smaller language models, but there's a trade-off. While it improves performance on tasks similar to the training data, it doesn't always translate to better performance on entirely new types of problems. This is important because it suggests that simply copying how humans learn isn't enough for these smaller models, and that a more balanced approach – like a carefully designed curriculum – is needed to make them truly versatile without needing massive resources.

Abstract

This work investigates whether small-scale LMs can benefit from instruction tuning. We compare conversational and question-answering instruction tuning datasets, applied either in a merged or sequential curriculum, using decoder-only models with 100M and 140M parameters. Evaluation spans both fine-tuning (SuperGLUE) and zero-shot (BLiMP, EWoK, WUGs, entity tracking, and psycholinguistic correlation) settings. Results show that instruction tuning yields small but consistent gains in fine-tuning scenarios, with sequential curricula outperforming merged data; however, improvements do not consistently transfer to zero-shot tasks, suggesting a trade-off between interaction-focused adaptation and broad linguistic generalization. These results highlight both the potential and the constraints of adapting human-inspired learning strategies to low-resource LMs, and point toward hybrid, curriculum-based approaches for enhancing generalization under ecological training limits.