
LLM Reasoning for Machine Translation: Synthetic Data Generation over Thinking Tokens

Armel Zebaze, Rachel Bawden, Benoît Sagot

2025-10-15

Summary

This paper investigates whether making large language models 'think out loud' before producing a translation actually improves translation quality, much as a human translator might plan before speaking.

What's the problem?

Large language models excel at tasks like math and coding when they are allowed to generate a step-by-step thought process first, but it is unclear whether this 'thinking' helps with machine translation. The core question is whether forcing these models to produce intermediate steps, as a human translator might, leads to better translations, especially for languages with limited training data.

What's the solution?

The researchers tested this in two ways: by having models generate intermediate 'thinking tokens' at inference time, and by fine-tuning models to reason step-by-step using synthetic chain-of-thought examples of how to translate. They found that simply adding 'thinking tokens' did not improve translation quality, and neither did fine-tuning on step-by-step explanations. However, they *did* see improvements when the intermediate steps were constructed by combining the outputs of modular, translation-specific prompting strategies, which contain actual translation attempts. They also compared this with using a teacher model to provide better target translations or additional translation examples, and found those methods more effective.
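The contrast between the two fine-tuning setups can be illustrated with a minimal sketch. The prompt templates, the `<think>` delimiter, and the field names below are illustrative assumptions, not the paper's actual formats; the point is only that the CoT variant prepends synthetic reasoning to the target translation in the completion.

```python
# Hypothetical sketch of the two fine-tuning data formats compared in the paper:
# standard input-output pairs vs. pairs augmented with synthetic "thinking" text.
# Templates, delimiters, and field names are assumptions for illustration.

def make_io_example(source: str, target: str) -> dict:
    """Standard input-output fine-tuning: source in, translation out."""
    return {
        "prompt": f"Translate from French to English:\n{source}\n",
        "completion": target,
    }

def make_cot_example(source: str, target: str, thinking: str) -> dict:
    """CoT fine-tuning: the completion begins with synthetic reasoning
    (e.g. distilled from a teacher model) before the final translation."""
    return {
        "prompt": f"Translate from French to English:\n{source}\n",
        "completion": f"<think>\n{thinking}\n</think>\n{target}",
    }

io_ex = make_io_example("Le chat dort.", "The cat is sleeping.")
cot_ex = make_cot_example(
    "Le chat dort.",
    "The cat is sleeping.",
    "'Le chat' is 'the cat'; 'dort' is the present tense of 'dormir' (to sleep).",
)
print(io_ex["completion"])
print(cot_ex["completion"])
```

Per the paper's findings, training on the second format did not beat the first; gains appeared only when the text before the final answer itself contained translation attempts (e.g. drafts produced by modular prompting strategies) rather than free-form explanation.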

Why it matters?

This research shows that merely mimicking the *form* of human thought processes is not enough to improve machine translation. It suggests that giving models better training data, such as more accurate target translations or a larger collection of parallel text, is a more effective way to build translation systems than trying to force them to 'think' like humans.

Abstract

Large reasoning models (LRMs) have led to new possibilities in terms of problem-solving, through the devising of a natural language thought process prior to answering a query. While their capabilities are well known across mathematics and coding tasks, their impact on the task of machine translation (MT) remains underexplored. In this work, we explore the benefits of the generation of intermediate tokens when performing MT across multiple language pairs of different levels of resourcedness and multiple setups. We find that "thinking tokens" do not help LRMs better perform MT. This result generalizes to models fine-tuned to reason before translating using distilled chain of thought (CoT) inspired by human translators' practices. Specifically, fine-tuning a model with synthetic CoT explanations detailing how to translate step-by-step does not outperform standard input-output fine-tuning. However, constructing the intermediate tokens by combining the outputs of modular translation-specific prompting strategies results in improvements. Our findings underscore that the contribution of intermediate tokens during fine-tuning highly depends on the presence of translation attempts within them. More broadly, our results suggest that using a teacher to refine target translations or to expand parallel corpora is more impactful than distilling their CoT explanations into "thinking" MT models.