BhashaBench V1: A Comprehensive Benchmark for the Quadrant of Indic Domains
Vijay Devane, Mohd Nauman, Bhargav Patel, Aniket Mahendra Wakchoure, Yogeshkumar Sant, Shyam Pawar, Viraj Thakur, Ananya Godse, Sunil Patra, Neha Maurya, Suraj Racha, Nitish Kamal Singh, Ajay Nagpal, Piyush Sawarkar, Kundeshwar Vijayrao Pundalik, Rohit Saluja, Ganesh Ramakrishnan
2025-10-30
Summary
This paper introduces a new way to test how well large language models, like ChatGPT, understand information specific to India, in both English and Hindi.
What's the problem?
Currently, the tests used to evaluate these language models are mostly built on Western knowledge, are biased towards English, and don't focus on areas important to India. This means we don't really know how well models perform on topics like Indian agriculture, law, finance, or traditional medicine such as Ayurveda. Existing tests simply aren't designed to check whether these models can handle the nuances of Indian culture and specialized knowledge.
What's the solution?
The researchers created a benchmark called BhashaBench V1. It's a large collection of over 74,000 questions and answers, split between English and Hindi, covering four key areas (agriculture, law, finance, and Ayurveda) with many specific subtopics within each. They then tested 29 different language models against this benchmark, looking both at overall performance and at how well the models did in narrow areas, such as cyber law versus traditional Ayurvedic practices.
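The evaluation described above boils down to scoring multiple-choice answers and breaking accuracy out by domain and language. A minimal sketch of that scoring step is below; the record fields (`domain`, `language`, `answer`, `prediction`) and the `score_benchmark` function are hypothetical illustrations, not the paper's actual evaluation code.

```python
from collections import defaultdict

def score_benchmark(records):
    """Compute overall and per-(domain, language) accuracy for MCQ records.

    Each record is a dict with hypothetical keys: 'domain', 'language',
    'answer' (gold option letter), and 'prediction' (model's option letter).
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = (r["domain"], r["language"])
        totals[key] += 1
        if r["prediction"] == r["answer"]:
            correct[key] += 1
    # Percentage accuracy per domain/language slice, plus an overall figure
    report = {
        f"{dom}/{lang}": round(100.0 * correct[(dom, lang)] / totals[(dom, lang)], 2)
        for (dom, lang) in totals
    }
    report["overall"] = round(
        100.0 * sum(correct.values()) / sum(totals.values()), 2
    )
    return report

# Tiny worked example with made-up records
records = [
    {"domain": "Legal", "language": "en", "answer": "B", "prediction": "B"},
    {"domain": "Legal", "language": "hi", "answer": "C", "prediction": "A"},
    {"domain": "Ayurveda", "language": "en", "answer": "D", "prediction": "D"},
    {"domain": "Ayurveda", "language": "hi", "answer": "A", "prediction": "A"},
]
print(score_benchmark(records))
```

Slicing accuracy this way is what surfaces the gaps the paper reports, e.g. strong English performance alongside weaker Hindi performance in the same domain.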
Why does it matter?
This work is important because it provides a much more accurate way to assess how useful these language models are for people in India. It highlights where the models are strong and, more importantly, where they struggle, especially with topics specific to India and when using Hindi. This information can help developers improve the models and make them more relevant and reliable for a wider range of users and applications within the Indian context.
Abstract
The rapid advancement of large language models (LLMs) has intensified the need for domain- and culture-specific evaluation. Existing benchmarks are largely Anglocentric and domain-agnostic, limiting their applicability to India-centric contexts. To address this gap, we introduce BhashaBench V1, the first domain-specific, multi-task, bilingual benchmark focusing on critical Indic knowledge systems. BhashaBench V1 contains 74,166 meticulously curated question-answer pairs, with 52,494 in English and 21,672 in Hindi, sourced from authentic government and domain-specific exams. It spans four major domains: Agriculture, Legal, Finance, and Ayurveda, comprising 90+ subdomains and covering 500+ topics, enabling fine-grained evaluation. Evaluation of 29+ LLMs reveals significant domain- and language-specific performance gaps, with especially large disparities in low-resource domains. For instance, GPT-4o achieves 76.49% overall accuracy in Legal but only 59.74% in Ayurveda. Models consistently perform better on English content than on Hindi across all domains. Subdomain-level analysis shows that areas such as Cyber Law and International Finance perform relatively well, while Panchakarma, Seed Science, and Human Rights remain notably weak. BhashaBench V1 provides a comprehensive dataset for evaluating large language models across India's diverse knowledge domains. It enables assessment of models' ability to integrate domain-specific knowledge with bilingual understanding. All code, benchmarks, and resources are publicly available to support open research.