MaskLID: Code-Switching Language Identification through Iterative Masking

Amir Hossein Kargaran, François Yvon, Hinrich Schütze

2024-06-17

Summary

This paper introduces MaskLID, a new method for identifying languages in sentences that switch between two or more languages, known as code-switching. MaskLID is unique because it doesn't require any training and works alongside existing language identification systems.

What's the problem?

Many language identification systems are designed to classify sentences written in only one language. When a sentence includes words from multiple languages, these systems often only recognize the dominant language and ignore the others. This can lead to inaccurate results, especially in multilingual contexts where people frequently mix languages.

What's the solution?

To solve this problem, MaskLID uses a technique called iterative masking. It uses the LID classifier itself to identify the text features associated with the dominant language and temporarily hides (masks) them. The classifier is then run again on the remaining text, allowing it to surface the other languages present in the sentence. By repeating this process, MaskLID can recognize multiple languages in a single sentence without any additional training data or external resources.
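The loop of "classify, mask the dominant language's features, classify again" can be sketched as follows. This is a minimal illustration, not the authors' implementation: the real MaskLID reuses a FastText-based sentence-level LID (GlotLID or OpenLID) to score features, whereas here a toy per-word lookup table (`WORD_LANG`) stands in for the trained classifier.

```python
# Toy per-word language scores standing in for a trained LID classifier
# (an assumption for illustration; MaskLID uses GlotLID or OpenLID).
WORD_LANG = {
    "the": "eng", "weather": "eng", "is": "eng", "nice": "eng",
    "aber": "deu", "sehr": "deu", "kalt": "deu",
}

def toy_lid(words):
    """Return the dominant language and the words attributed to it."""
    votes = {}
    for w in words:
        lang = WORD_LANG.get(w)
        if lang:
            votes.setdefault(lang, []).append(w)
    if not votes:
        return None, []
    dominant = max(votes, key=lambda lang: len(votes[lang]))
    return dominant, votes[dominant]

def mask_lid(sentence, max_rounds=3):
    """Iteratively classify, then mask the detected language's features."""
    words = sentence.lower().split()
    detected = []
    for _ in range(max_rounds):
        lang, evidence = toy_lid(words)
        if lang is None:
            break
        detected.append(lang)
        # Mask (remove) the features attributed to the detected language,
        # so the next round can surface the remaining language(s).
        masked = set(evidence)
        words = [w for w in words if w not in masked]
    return detected

print(mask_lid("the weather is nice aber sehr kalt"))  # → ['eng', 'deu']
```

In the first round the English words dominate the vote, so the sentence is labeled `eng` and those words are masked; the second round then sees only the German words and returns `deu`. A plain sentence-level classifier would have stopped after the first label.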

Why it matters?

This research is important because it improves how we can understand and process mixed-language sentences, which are common in many communities around the world. By enhancing language identification in code-switching situations, MaskLID can help develop better tools for translation, communication, and understanding in multilingual environments.

Abstract

We present MaskLID, a simple, yet effective, code-switching (CS) language identification (LID) method. MaskLID does not require any training and is designed to complement current high-performance sentence-level LIDs. Sentence-level LIDs are classifiers trained on monolingual texts to provide single labels, typically using a softmax layer to turn scores into probabilities. However, in cases where a sentence is composed in both L1 and L2 languages, the LID classifier often only returns the dominant label L1. To address this limitation, MaskLID employs a strategy to mask text features associated with L1, allowing the LID to classify the text as L2 in the next round. This method uses the LID itself to identify the features that require masking and does not rely on any external resource. In this work, we explore the use of MaskLID for two open-source LIDs (GlotLID and OpenLID), that are both based on the FastText architecture. Code and demo are available at https://github.com/cisnlp/MaskLID.