IndicParam: Benchmark to evaluate LLMs on low-resource Indic Languages
Ayush Maheshwari, Kaushal Sharma, Vivek Patel, Aditya Maheshwari
2025-12-02
Summary
This paper introduces a new benchmark, called IndicParam, designed to measure how well large language models (LLMs) such as ChatGPT can understand and answer questions in a variety of Indian languages that have few digital resources available.
What's the problem?
Currently, LLMs are really good at languages with tons of online text and data, but they struggle with languages like Nepali, Gujarati, and others spoken in India that haven't been widely used to train these models. There wasn't a good way to specifically measure how well these models performed on these less common languages, making it hard to know where they needed improvement.
What's the solution?
The researchers created IndicParam, a collection of over 13,000 multiple-choice questions in 11 different Indian languages, ranging from those with some resources to those with very few. They then tested 19 different LLMs, including powerful ones like GPT-5, on these questions. They also categorized the questions to see if the models were better at recalling facts or understanding grammar, and tested different question types beyond simple multiple choice.
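At its core, this kind of evaluation reduces to per-language (or per-category) accuracy over multiple-choice answers. Below is a minimal sketch of such a scorer; the field names and toy records are illustrative assumptions, not the dataset's actual schema:

```python
from collections import defaultdict

def accuracy_by_group(records):
    """Compute per-group accuracy for multiple-choice predictions.

    Each record is a dict with 'group' (e.g. a language or a question
    category such as knowledge vs. linguistic), 'answer' (the gold
    option label), and 'prediction' (the model's chosen option label).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["group"]] += 1
        if r["prediction"] == r["answer"]:
            correct[r["group"]] += 1
    # Accuracy = correct answers / total questions, per group.
    return {g: correct[g] / total[g] for g in total}

# Hypothetical toy results, not actual IndicParam data.
records = [
    {"group": "Nepali", "answer": "B", "prediction": "B"},
    {"group": "Nepali", "answer": "C", "prediction": "A"},
    {"group": "Bodo",   "answer": "A", "prediction": "A"},
]
print(accuracy_by_group(records))  # {'Nepali': 0.5, 'Bodo': 1.0}
```

Averaging these per-language accuracies over the 11 languages is what yields headline numbers like the 45.0% reported for GPT-5.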
Why does it matter?
This work is important because it highlights the limitations of current LLMs when it comes to less-represented languages. IndicParam provides a challenging benchmark for developers to improve these models and ensure they work well for a wider range of users, not just those who speak commonly used languages. It also helps researchers understand how well knowledge transfers from well-resourced languages to those with fewer resources.
Abstract
While large language models excel on high-resource multilingual tasks, low- and extremely low-resource Indic languages remain severely under-evaluated. We present IndicParam, a human-curated benchmark of over 13,000 multiple-choice questions covering 11 such languages (Nepali, Gujarati, Marathi, and Odia as low-resource; Dogri, Maithili, Rajasthani, Sanskrit, Bodo, Santali, and Konkani as extremely low-resource), plus a Sanskrit-English code-mixed set. We evaluate 19 LLMs, both proprietary and open-weight; the results reveal that even the top-performing GPT-5 reaches only 45.0% average accuracy, followed by DeepSeek-3.2 (43.1%) and Claude-4.5 (42.7%). We additionally label each question as knowledge-oriented or purely linguistic to discriminate factual recall from grammatical proficiency. Further, we assess the ability of LLMs to handle diverse question formats, such as list-based matching, assertion-reason pairs, and sequence ordering, alongside conventional multiple-choice questions. IndicParam provides insights into the limitations of cross-lingual transfer and establishes a challenging benchmark for Indic languages. The dataset is available at https://huggingface.co/datasets/bharatgenai/IndicParam. Scripts to run the benchmark are available at https://github.com/ayushbits/IndicParam.