Beyond Monolingual Assumptions: A Survey of Code-Switched NLP in the Era of Large Language Models
Rajvee Sheth, Samridhi Raj Sinha, Mahavir Patil, Himanshu Beniwal, Mayank Singh
2025-10-09
Summary
This paper is a broad overview of how well large language models – the powerful AI systems behind tools like ChatGPT – handle code-switching: when people naturally mix different languages within a single conversation or piece of writing.
What's the problem?
Even though large language models have become very good at understanding language, they still struggle with code-switching. They see too few mixed-language examples during training, the datasets used to evaluate them are biased against code-switched text, and the models often miss the nuances of how languages interact when mixed together. This limits their usefulness in places where people routinely use multiple languages.
What's the solution?
The researchers carried out a thorough review of existing work on code-switching and large language models, covering studies that span five research areas, 12 NLP tasks, more than 30 datasets, and over 80 languages. They categorized the different approaches researchers are taking to improve these models – grouping them by how the models are built, how they are trained, and how they are evaluated – and highlighted what is working and what still needs improvement.
Why it matters?
This work is important because it points the way towards building AI that can truly understand and interact with people in all their linguistic diversity. It emphasizes the need for better datasets that represent real-world language use, fairer ways to evaluate these models, and a deeper understanding of language itself to create AI that is genuinely multilingual and inclusive.
Abstract
Code-switching (CSW), the alternation of languages and scripts within a single utterance, remains a fundamental challenge for multilingual NLP, even amidst the rapid advances of large language models (LLMs). Most LLMs still struggle with mixed-language inputs, limited CSW datasets, and evaluation biases, hindering deployment in multilingual societies. This survey provides the first comprehensive analysis of CSW-aware LLM research, reviewing studies spanning five research areas, 12 NLP tasks, 30+ datasets, and 80+ languages. We classify recent advances by architecture, training strategy, and evaluation methodology, outlining how LLMs have reshaped CSW modeling and what challenges persist. The paper concludes with a roadmap emphasizing the need for inclusive datasets, fair evaluation, and linguistically grounded models to achieve truly multilingual intelligence. A curated collection of all resources is maintained at https://github.com/lingo-iitgn/awesome-code-mixing/.