
Beyond Decoder-only: Large Language Models Can be Good Encoders for Machine Translation

Yingfeng Luo, Tong Zheng, Yongyu Mu, Bei Li, Qinghong Zhang, Yongqi Gao, Ziqiang Xu, Peinan Feng, Xiaoqian Liu, Tong Xiao, Jingbo Zhu

2025-03-12


Summary

This paper describes a smarter way to use big language AI models for translation: the large model focuses on understanding the input text, while a smaller AI handles the actual translating, which makes the process faster and better.

What's the problem?

Current translation AI models that use only one part of the system (the decoder) are slow and use lots of computer power, while older two-part models (encoder-decoder) aren’t as good at handling complex language.

What's the solution?

The new method pairs a powerful language model, which reads and understands the source text, with a smaller specialized translator AI that writes the output, using a small adapter as a bridge so the two can work together efficiently without slowing things down.
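The paper's released code is not shown here, so the following is only a minimal sketch of the general idea, with hypothetical module names and dimensions: hidden states from an LLM used as the encoder are mapped by a small adapter (the "bridge") into the width expected by a lightweight NMT decoder, which then attends to them as cross-attention memory while generating the translation.

```python
# Minimal sketch (not the authors' implementation): LLM-as-encoder states
# -> adapter -> lightweight Transformer decoder. All sizes are illustrative.
import torch
import torch.nn as nn

class LLMEncoderNMTDecoder(nn.Module):
    def __init__(self, llm_hidden=4096, dec_hidden=512, dec_layers=6,
                 dec_heads=8, vocab_size=32000):
        super().__init__()
        # Adapter ("bridge") projecting LLM hidden states to the decoder width.
        self.adapter = nn.Sequential(
            nn.Linear(llm_hidden, dec_hidden),
            nn.GELU(),
            nn.Linear(dec_hidden, dec_hidden),
        )
        layer = nn.TransformerDecoderLayer(
            d_model=dec_hidden, nhead=dec_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(layer, num_layers=dec_layers)
        self.tgt_embed = nn.Embedding(vocab_size, dec_hidden)
        self.out_proj = nn.Linear(dec_hidden, vocab_size)

    def forward(self, llm_hidden_states, tgt_tokens):
        # llm_hidden_states: (batch, src_len, llm_hidden), produced once
        # by the LLM encoding the source sentence.
        memory = self.adapter(llm_hidden_states)
        tgt = self.tgt_embed(tgt_tokens)
        # Standard causal mask so each target position only sees its past.
        t = tgt_tokens.size(1)
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        dec_out = self.decoder(tgt, memory, tgt_mask=causal)
        return self.out_proj(dec_out)

# Toy usage with random tensors standing in for real LLM encoder outputs.
model = LLMEncoderNMTDecoder()
fake_llm_states = torch.randn(2, 16, 4096)   # encoded source sentences
tgt = torch.randint(0, 32000, (2, 10))       # partial target token ids
logits = model(fake_llm_states, tgt)         # shape: (2, 10, 32000)
print(logits.shape)
```

The key point of the design is that the expensive LLM runs only once over the source, while step-by-step generation happens in the much smaller decoder.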

Why it matters?

This makes translation tools faster, cheaper to run, and better at handling tricky languages or specialized texts, helping people communicate across languages more easily.

Abstract

The field of neural machine translation (NMT) has changed with the advent of large language models (LLMs). Much of the recent emphasis in natural language processing (NLP) has been on modeling machine translation and many other problems using a single pre-trained Transformer decoder, while encoder-decoder architectures, which were the standard in earlier NMT models, have received relatively less attention. In this paper, we explore translation models that are universal, efficient, and easy to optimize, by marrying the world of LLMs with the world of NMT. We apply LLMs to NMT encoding and leave the NMT decoder unchanged. We also develop methods for adapting LLMs to work better with the NMT decoder. Furthermore, we construct a new dataset involving multiple tasks to assess how well the machine translation system generalizes across various tasks. Evaluations on the WMT and our datasets show that results using our method match or surpass a range of baselines in terms of translation quality, but achieve 2.4x to 6.5x inference speedups and a 75% reduction in the memory footprint of the KV cache. It also demonstrates strong generalization across a variety of translation-related tasks.
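For intuition about why the KV-cache footprint shrinks, here is a back-of-the-envelope calculation. The dimensions below are illustrative assumptions, not the paper's reported configuration, and they only compare the target-side cache; the actual 75% figure depends on the specific models and on what memory is counted.

```python
# KV-cache size per sequence: 2 (keys + values) * layers * hidden size
# * cached tokens * bytes per element. Numbers are illustrative only.
def kv_cache_bytes(num_layers, hidden_size, num_tokens, bytes_per_elem=2):
    return 2 * num_layers * hidden_size * num_tokens * bytes_per_elem

# A 7B-class decoder-only LLM generating 256 target tokens...
llm_decoder = kv_cache_bytes(num_layers=32, hidden_size=4096, num_tokens=256)
# ...versus a small 6-layer, width-512 NMT decoder doing the same generation.
nmt_decoder = kv_cache_bytes(num_layers=6, hidden_size=512, num_tokens=256)

print(f"decoder-only LLM:  {llm_decoder / 2**20:.1f} MiB")
print(f"small NMT decoder: {nmt_decoder / 2**20:.1f} MiB")
print(f"target-side cache reduction: {1 - nmt_decoder / llm_decoder:.1%}")
```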