CL4SE: A Context Learning Benchmark For Software Engineering Tasks
Haichuan Hu, Ye Shang, Guoqing Xie, Congqing He, Quanjun Zhang
2026-03-02
Summary
This paper investigates how to best use 'context' to improve the performance of large language models (LLMs) when applied to software engineering tasks, like writing code or reviewing it, without actually retraining the models themselves.
What's the problem?
While it's been shown that giving LLMs helpful 'context' improves their work on software engineering problems, there wasn't a clear understanding of *what kinds* of context are most useful for *different* tasks. There also wasn't a standard way to measure how much these different context types actually helped. Essentially, researchers needed a way to systematically study and compare context strategies for software engineering.
What's the solution?
The researchers created a new benchmark called CL4SE, which includes a detailed categorization of four types of context relevant to software engineering: interpretable examples, project-specific information, step-by-step procedural guidance, and positive/negative examples. They then built datasets with over 13,000 samples from more than 30 open-source projects and tested five popular LLMs on four tasks: code generation, code summarization, code review, and patch correctness assessment (checking whether a code fix actually works). They measured performance using nine metrics.
Why it matters?
This work is important because it provides a standardized way to evaluate how well different context strategies work with LLMs for software engineering. The results show that context learning can significantly improve performance, on average by almost 25%, and highlight which types of context are most effective for specific tasks. This helps developers and researchers design better prompts and use LLMs more effectively in real-world software development scenarios, and it provides a dataset for others to build upon.
Abstract
Context engineering has emerged as a pivotal paradigm for unlocking the potential of Large Language Models (LLMs) in Software Engineering (SE) tasks, enabling performance gains at test time without model fine-tuning. Despite its success, existing research lacks a systematic taxonomy of SE-specific context types and a dedicated benchmark to quantify the heterogeneous effects of different contexts across core SE workflows. To address this gap, we propose CL4SE (Context Learning for Software Engineering), a comprehensive benchmark featuring a fine-grained taxonomy of four SE-oriented context types (interpretable examples, project-specific context, procedural decision-making context, and positive & negative context), each mapped to a representative task (code generation, code summarization, code review, and patch correctness assessment). We construct high-quality datasets comprising over 13,000 samples from more than 30 open-source projects and evaluate five mainstream LLMs across nine metrics. Extensive experiments demonstrate that context learning yields an average performance improvement of 24.7% across all tasks. Specifically, procedural context boosts code review performance by up to 33% (Qwen3-Max), mixed positive-negative context improves patch assessment by 30% (DeepSeek-V3), project-specific context increases code summarization BLEU by 14.78% (GPT-Oss-120B), and interpretable examples enhance code generation PASS@1 by 5.72% (DeepSeek-V3). CL4SE establishes the first standardized evaluation framework for SE context learning, provides actionable empirical insights into task-specific context design, and releases a large-scale dataset to facilitate reproducible research in this domain.
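As a rough illustration of the taxonomy, the sketch below pairs each context type with its representative task, as the abstract does, and shows one way a prompt could be assembled with or without the context block. The `Sample` fields, the `build_prompt` helper, and the prompt layout are assumptions made for this example; they are not the benchmark's released schema or the authors' prompting code.

```python
from dataclasses import dataclass

# Context type -> representative task, as paired in the abstract.
CONTEXT_TO_TASK = {
    "interpretable examples": "code generation",
    "project-specific context": "code summarization",
    "procedural decision-making context": "code review",
    "positive & negative context": "patch correctness assessment",
}


@dataclass
class Sample:
    """One benchmark item (hypothetical shape, not the released schema)."""
    context_type: str  # one of the keys in CONTEXT_TO_TASK
    context: str       # the context text supplied at test time
    query: str         # the code or description the model must handle


def build_prompt(sample: Sample, use_context: bool = True) -> str:
    """Assemble a prompt, optionally prepending the context block.

    Calling this with use_context=False gives the no-context baseline
    that a context-learning comparison would be measured against.
    """
    task = CONTEXT_TO_TASK[sample.context_type]
    parts = [f"Task: {task}"]
    if use_context:
        parts.append(f"Context ({sample.context_type}):\n{sample.context}")
    parts.append(f"Input:\n{sample.query}")
    return "\n\n".join(parts)
```

Building the same sample twice, once with `use_context=True` and once with `use_context=False`, yields the two prompts whose scores would be compared to estimate how much a given context type helps on its task.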