Introducing TrGLUE and SentiTurca: A Comprehensive Benchmark for Turkish General Language Understanding and Sentiment Analysis

Duygu Altinok

2025-12-30

Summary

This paper introduces TrGLUE, a new benchmark designed to test how well computer models understand the Turkish language. It also includes SentiTurca, a benchmark specifically for analyzing opinions in Turkish text.

What's the problem?

Evaluating how well AI systems understand language, a field known as Natural Language Understanding (NLU), is essential for improving those systems. While strong benchmarks exist for languages like English, Chinese, French, and Japanese, there was no comparable benchmark for Turkish. This made it hard to track progress and to compare different AI models working with Turkish.

What's the solution?

The researchers created TrGLUE, a collection of tasks that test different aspects of Turkish language understanding. They built these tasks from existing Turkish texts and used a semi-automated system for creating labels: they started with labels generated by large language models, checked them for agreement between different models, and then had humans review and confirm them. They also created SentiTurca for sentiment analysis. To help others use these benchmarks, they released code for fine-tuning and evaluating popular transformer models.
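The core idea of the labeling pipeline can be sketched in a few lines: examples where independent model annotators agree are accepted automatically, and disagreements are routed to human reviewers. This is a minimal illustration of the general technique, not the paper's actual code; the label values and model outputs below are hypothetical.

```python
def cross_model_agreement(labels_a, labels_b):
    """Split examples into auto-accepted (annotator models agree)
    and human-review (they disagree) sets, in the spirit of the
    paper's semi-automated annotation pipeline."""
    accepted, needs_review = [], []
    for i, (a, b) in enumerate(zip(labels_a, labels_b)):
        if a == b:
            accepted.append((i, a))          # agreement: keep the label
        else:
            needs_review.append((i, (a, b))) # disagreement: ask a human
    return accepted, needs_review

# Hypothetical sentiment labels from two LLM annotators on five sentences.
model_a = ["pos", "neg", "pos", "neu", "neg"]
model_b = ["pos", "neg", "neu", "neu", "neg"]

accepted, needs_review = cross_model_agreement(model_a, model_b)
print(len(accepted), len(needs_review))  # 4 auto-accepted, 1 for human review
```

The appeal of this design is that human effort is spent only on the hard cases where the models disagree, which is what makes the workflow scalable while still ending with human-validated labels.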

Why it matters?

TrGLUE provides a much-needed standard for evaluating NLU in Turkish. This will help researchers develop better AI models for the Turkish language, and it offers a way to create high-quality datasets using a combination of AI and human effort. Ultimately, it pushes the field forward by providing the tools to measure and improve Turkish language AI.

Abstract

Evaluating the performance of various model architectures, such as transformers, large language models (LLMs), and other NLP systems, requires comprehensive benchmarks that measure performance across multiple dimensions. Among these, the evaluation of natural language understanding (NLU) is particularly critical as it serves as a fundamental criterion for assessing model capabilities. Thus, it is essential to establish benchmarks that enable thorough evaluation and analysis of NLU abilities from diverse perspectives. While the GLUE benchmark has set a standard for evaluating English NLU, similar benchmarks have been developed for other languages, such as CLUE for Chinese, FLUE for French, and JGLUE for Japanese. However, no comparable benchmark currently exists for the Turkish language. To address this gap, we introduce TrGLUE, a comprehensive benchmark encompassing a variety of NLU tasks for Turkish. In addition, we present SentiTurca, a specialized benchmark for sentiment analysis. To support researchers, we also provide fine-tuning and evaluation code for transformer-based models, facilitating the effective use of these benchmarks. TrGLUE comprises Turkish-native corpora curated to mirror the domains and task formulations of GLUE-style evaluations, with labels obtained through a semi-automated pipeline that combines strong LLM-based annotation, cross-model agreement checks, and subsequent human validation. This design prioritizes linguistic naturalness, minimizes direct translation artifacts, and yields a scalable, reproducible workflow. With TrGLUE, our goal is to establish a robust evaluation framework for Turkish NLU, empower researchers with valuable resources, and provide insights into generating high-quality semi-automated datasets.
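GLUE-style benchmarks like TrGLUE typically report a single headline number by macro-averaging each task's metric. A minimal sketch of that aggregation is below; the task names and scores are illustrative placeholders, not results from the paper.

```python
def benchmark_score(task_metrics):
    """Unweighted macro-average over per-task metrics, as GLUE-style
    leaderboards commonly report. `task_metrics` maps task name to a
    score in [0, 1]."""
    if not task_metrics:
        raise ValueError("no task metrics provided")
    return sum(task_metrics.values()) / len(task_metrics)

# Hypothetical per-task scores for a fine-tuned Turkish model.
scores = {"nli": 0.82, "paraphrase": 0.88, "sentiment": 0.90, "sts": 0.80}
print(round(benchmark_score(scores), 2))  # 0.85
```

A single averaged score makes models easy to rank, though per-task numbers remain important since tasks differ in difficulty and metric.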