Enhancing Linguistic Competence of Language Models through Pre-training with Language Learning Tasks
Atsuki Yamaguchi, Maggie Mi, Nikolaos Aletras
2026-01-08
Summary
This paper introduces a new way to train language models, aiming to make them better at understanding and using language itself, not just memorizing facts.
What's the problem?
Current language models are really good at learning information from text and even doing some reasoning, but they aren't specifically trained to *understand* how language works – things like grammar and sentence structure. They learn this implicitly, but it's not their main focus, and it can take a long time for them to become truly linguistically competent.
What's the solution?
The researchers developed a pre-training framework called L2T that adds explicit Language Learning Tasks alongside regular next-token training. It's like giving the language model language lessons alongside its regular reading: L2T takes raw text and turns it into structured exercises, such as fill-in-the-blanks or word reordering, to directly teach the model about language rules. They then trained models using both regular text *and* these L2T exercises.
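To make the idea concrete, here is a minimal sketch of how raw text could be converted into the two kinds of exercises mentioned above. The function names (`make_cloze`, `make_reorder`) and details are illustrative assumptions, not the paper's actual implementation:

```python
import random

def make_cloze(sentence, mask_token="<mask>", seed=0):
    # Fill-in-the-blank: hide one word and ask the model to recover it.
    # (Illustrative; the paper's actual masking scheme may differ.)
    words = sentence.split()
    idx = random.Random(seed).randrange(len(words))
    target = words[idx]
    words[idx] = mask_token
    return {"input": " ".join(words), "output": target}

def make_reorder(sentence, seed=0):
    # Word reordering: shuffle the words and ask the model to
    # reconstruct the original sentence.
    words = sentence.split()
    shuffled = words[:]
    random.Random(seed).shuffle(shuffled)
    return {"input": " ".join(shuffled), "output": sentence}

sentence = "the cat sat on the mat"
cloze = make_cloze(sentence)
reorder = make_reorder(sentence)
```

Each helper turns one raw sentence into a structured input-output pair; mixing many such pairs into the pre-training stream is what gives the model its "language lessons."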
Why does it matter?
This is important because improving a language model's linguistic competence not only makes it better at tasks that specifically test language skills, but also helps it learn faster overall. And it doesn't hurt its ability to do other things like reasoning, meaning we can build more powerful and efficient AI systems.
Abstract
Language models (LMs) are pre-trained on raw text datasets to generate text sequences token-by-token. While this approach facilitates the learning of world knowledge and reasoning, it does not explicitly optimize for linguistic competence. To bridge this gap, we propose L2T, a pre-training framework integrating Language Learning Tasks alongside standard next-token prediction. Inspired by human language acquisition, L2T transforms raw text into structured input-output pairs to provide explicit linguistic stimulation. Pre-training LMs on a mixture of raw text and L2T data not only improves overall performance on linguistic competence benchmarks but also accelerates its acquisition, while maintaining competitive performance on general reasoning tasks.