SmolTulu: Higher Learning Rate to Batch Size Ratios Can Lead to Better Reasoning in SLMs

Sultan Alrashed

2024-12-16

Summary

This paper introduces SmolTulu, a new language model that improves how smaller models learn by tuning the balance between the learning rate (how fast the model updates) and the batch size (how much data it processes per update).

What's the problem?

Smaller language models often struggle on complex reasoning tasks because of their limited capacity. Traditional training methods usually pair larger batch sizes with lower learning rates, a combination that may not suit smaller models and leaves them performing worse than their larger counterparts.

What's the solution?

SmolTulu addresses this issue by training with a higher ratio of learning rate to batch size, so the model takes larger update steps relative to the amount of data it sees at once, much as a student might benefit more from focused, intensive study sessions than from large lectures. The researchers' experiments show that this approach helps smaller models perform better on reasoning tasks, achieving state-of-the-art results among models of similar size; a minimal illustration of the ratio is sketched below.
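To make the trade-off concrete, here is a minimal sketch, not the authors' actual recipe, of how the learning-rate-to-batch-size ratio shows up in an ordinary PyTorch training setup. All hyperparameter values and the toy model are hypothetical, chosen only to contrast a "high ratio" and a "low ratio" configuration.

```python
# Minimal sketch (not the paper's released training recipe).
# All hyperparameter values below are hypothetical illustrations.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Hypothetical configurations contrasting the two regimes discussed above.
configs = {
    "high_ratio (reasoning-style tuning)": {"lr": 3e-3, "batch_size": 8},
    "low_ratio (pattern-recognition-style tuning)": {"lr": 5e-4, "batch_size": 64},
}

# Toy regression data standing in for a real fine-tuning corpus.
data = TensorDataset(torch.randn(256, 128), torch.randn(256, 128))
loss_fn = nn.MSELoss()

for name, cfg in configs.items():
    ratio = cfg["lr"] / cfg["batch_size"]
    print(f"{name}: lr={cfg['lr']}, batch_size={cfg['batch_size']}, ratio={ratio:.2e}")

    # Fresh toy model per configuration so the two runs are independent.
    model = nn.Linear(128, 128)
    optimizer = torch.optim.AdamW(model.parameters(), lr=cfg["lr"])
    loader = DataLoader(data, batch_size=cfg["batch_size"], shuffle=True)

    # One illustrative pass: small batches with a large lr take more, noisier
    # steps per epoch; large batches with a small lr take fewer, smoother steps.
    for x, y in loader:
        optimizer.zero_grad()
        loss = loss_fn(model(x), y)
        loss.backward()
        optimizer.step()
```

Per the paper's findings, one would push this ratio higher when tuning for reasoning tasks and lower for pattern-recognition tasks; the specific numbers above are placeholders, not the values used for SmolTulu.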

Why it matters?

This research is important because it demonstrates that smaller language models can be optimized for better performance without needing to increase their size. By focusing on efficient training strategies, SmolTulu could make advanced AI capabilities more accessible and less resource-intensive, allowing more organizations to develop effective language models without requiring expensive hardware.

Abstract

We present SmolTulu-1.7b-Instruct, referenced in this report as SmolTulu-DPO-1130, an instruction-tuned language model that adapts AllenAI's Tulu 3 post-training pipeline to enhance Huggingface's SmolLM2-1.7B base model. Through comprehensive empirical analysis using a 135M parameter model, we demonstrate that the relationship between learning rate and batch size significantly impacts model performance in a task-dependent manner. Our findings reveal a clear split: reasoning tasks like ARC and GSM8K benefit from higher learning rate to batch size ratios, while pattern recognition tasks such as HellaSwag and IFEval show optimal performance with lower ratios. These insights informed the development of SmolTulu, which achieves state-of-the-art performance among sub-2B parameter models on instruction following, scoring 67.7% on IFEval (Δ11%), and on mathematical reasoning, scoring 51.6% on GSM8K (Δ3.4%), with an alternate version scoring 57.1% on ARC (Δ5.4%). We release our model, training recipes, and ablation studies to facilitate further research in efficient model alignment, demonstrating that careful adaptation of optimization dynamics can help bridge the capability gap between small and large language models.
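For readers who want the quantity in the abstract written out, the ratio under study can be stated as below. This is a notational sketch only; the symbols are ours, not taken from the paper.

```latex
% Notational sketch; \eta and B are our labels, not the paper's notation.
% \eta = learning rate, B = batch size.
r = \frac{\eta}{B},
\qquad
\text{reasoning tasks (ARC, GSM8K): larger } r \text{ preferred},
\qquad
\text{pattern recognition (HellaSwag, IFEval): smaller } r \text{ preferred}
```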