Expanding FLORES+ Benchmark for more Low-Resource Settings: Portuguese-Emakhuwa Machine Translation Evaluation
Felermino D. M. Antonio Ali, Henrique Lopes Cardoso, Rui Sousa-Silva
2024-08-22
Summary
This paper discusses the expansion of the FLORES+ benchmark to include Emakhuwa, a low-resource language from Mozambique, and evaluates machine translation from Portuguese to Emakhuwa.
What's the problem?
Many languages, Emakhuwa among them, lack the data and resources needed for effective machine translation. With so little to learn from, translation models perform poorly, especially on language pairs that are rarely used or studied.
What's the solution?
The authors created new datasets by translating the FLORES+ dev and devtest sets from Portuguese into Emakhuwa, ensuring quality through post-editing and adequacy checks. They then report baseline results from a Neural Machine Translation system trained from scratch and from fine-tuned multilingual translation models. The results show that spelling inconsistencies in Emakhuwa remain a challenge, but the work provides a foundation for improving translation into this language.
Why it matters?
This research is important because it helps improve machine translation for low-resource languages, which can lead to better communication and understanding in diverse communities. By expanding resources for Emakhuwa, it opens up opportunities for more people to access information and services in their native language.
Abstract
As part of the Open Language Data Initiative shared tasks, we have expanded the FLORES+ evaluation set to include Emakhuwa, a low-resource language widely spoken in Mozambique. We translated the dev and devtest sets from Portuguese into Emakhuwa, and we detail the translation process and quality assurance measures used. Our methodology involved various quality checks, including post-editing and adequacy assessments. The resulting datasets consist of multiple reference sentences for each source. We present baseline results from training a Neural Machine Translation system and fine-tuning existing multilingual translation models. Our findings suggest that spelling inconsistencies remain a challenge in Emakhuwa. Additionally, the baseline models underperformed on this evaluation set, underscoring the necessity for further research to enhance machine translation quality for Emakhuwa. The data is publicly available at https://huggingface.co/datasets/LIACC/Emakhuwa-FLORES.
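The abstract notes that the dataset provides multiple reference sentences per source, which matters for evaluation: character-level metrics such as chrF are more forgiving of the spelling variation the authors observe in Emakhuwa than word-level metrics like BLEU. The sketch below is not the paper's evaluation pipeline (which would typically use a standard tool such as sacreBLEU); it is a minimal, standard-library-only illustration of a chrF-style character n-gram F-score that takes the best match over multiple references. The function names and the simplified averaging are assumptions for illustration.

```python
from collections import Counter

def char_ngrams(text, n):
    """Return a Counter of character n-grams (spaces removed, as chrF does)."""
    text = text.replace(" ", "")
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def chrf_like(hypothesis, references, max_n=6, beta=2.0):
    """Simplified chrF-style score against the best-matching reference.

    For each reference, average character n-gram precision and recall over
    n = 1..max_n, combine them with an F-beta score (beta=2 weights recall
    twice as heavily, as in chrF), and return the maximum over references,
    i.e. multi-reference evaluation.
    """
    best = 0.0
    for ref in references:
        precisions, recalls = [], []
        for n in range(1, max_n + 1):
            hyp_ngrams = char_ngrams(hypothesis, n)
            ref_ngrams = char_ngrams(ref, n)
            overlap = sum((hyp_ngrams & ref_ngrams).values())
            hyp_total = sum(hyp_ngrams.values())
            ref_total = sum(ref_ngrams.values())
            if hyp_total and ref_total:
                precisions.append(overlap / hyp_total)
                recalls.append(overlap / ref_total)
        if not precisions:
            continue
        p = sum(precisions) / len(precisions)
        r = sum(recalls) / len(recalls)
        if p + r > 0:
            f = (1 + beta ** 2) * p * r / (beta ** 2 * p + r)
            best = max(best, f)
    return best
```

With two references, a hypothesis that matches either one closely scores well, which is exactly why multiple references help when a language's orthography is not yet standardized: `chrf_like("masikini", ["masikini", "masikiini"])` returns 1.0 because the first reference matches exactly.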