
HackerRank-ASTRA: Evaluating Correctness & Consistency of Large Language Models on cross-domain multi-file project problems

Jun Xing, Mayur Bhatia, Sahil Phulwani, Darshan Suresh, Rafik Matta

2025-02-06


Summary

This paper introduces HackerRank-ASTRA, a new benchmark for testing how well AI language models handle real-world coding tasks. It evaluates both the correctness and the consistency of these models on complex, multi-file programming projects.

What's the problem?

Current benchmarks for AI coding skills usually focus on standalone, single-file problems or specific programming libraries. This doesn't reflect the complexity of real-world software development, where projects typically span multiple files and combine different technologies. These benchmarks also fail to check how consistently a model performs when given the same task multiple times.

What's the solution?

The researchers created HackerRank-ASTRA, which uses project-based coding problems that mimic real-world scenarios. They run each AI model 32 times on each problem and measure consistency using the median standard deviation of scores across problems. The benchmark also includes a taxonomy-level analysis that breaks performance down by coding sub-skill, showing where each model is strong or weak. On an initial set of 65 problems, the top three models (o1, o1-preview, and Claude-3.5-Sonnet-1022) achieved comparable average scores of about 75%, with no statistically significant differences between them; Claude-3.5-Sonnet-1022 was the most consistent, with the lowest score variability (SD = 0.0497).
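The consistency metric described above can be sketched in a few lines. This is an illustrative reconstruction, not the paper's actual evaluation harness: it assumes each model is run k times per problem, each run yields a score in [0, 1], and consistency is summarized as the median of the per-problem standard deviations (lower means steadier behavior).

```python
# Hypothetical sketch of the k-run consistency summary (names and data are
# illustrative; the paper uses 65 problems and k = 32 runs per problem).
from statistics import mean, median, stdev

def summarize(scores_per_problem):
    """scores_per_problem: one inner list of k run scores per problem."""
    per_problem_mean = [mean(runs) for runs in scores_per_problem]
    per_problem_sd = [stdev(runs) for runs in scores_per_problem]
    return {
        "average_score": mean(per_problem_mean),  # headline correctness
        "median_sd": median(per_problem_sd),      # headline consistency
    }

# Toy example: 3 problems, 4 runs each
demo = [
    [0.8, 0.8, 0.7, 0.9],
    [1.0, 1.0, 1.0, 1.0],
    [0.5, 0.6, 0.4, 0.5],
]
result = summarize(demo)
```

Using the median (rather than the mean) of the per-problem standard deviations makes the consistency summary robust to a few unusually flaky problems.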

Why it matters?

This matters because as AI becomes more involved in software development, we need to know how well it handles realistic tasks. By measuring both accuracy and consistency, HackerRank-ASTRA helps developers and companies judge which AI models are most reliable for actual coding projects, which can lead to better AI programming tools and better human-AI collaboration in software development.

Abstract

Evaluating the real-world applicability of large language models (LLMs) provides valuable insights for their development and use in software development tasks. Existing benchmarks often focus on standalone coding problems or specific libraries, overlooking multi-file, project-based scenarios and lacking a rigorous evaluation of consistency. The HackerRank-ASTRA Benchmark introduces project-based coding problems that mirror real-world scenarios. It evaluates model consistency through 32 runs (k = 32) and median standard deviation while incorporating taxonomy-level analysis to assess sub-skill capabilities. Initial evaluations on 65 problems show that the top three models -- o1, o1-preview, and Claude-3.5-Sonnet-1022 -- achieved comparable average scores of 75%, with no statistically significant differences in performance. Notably, Claude-3.5-Sonnet-1022 demonstrated the highest consistency across problems, with low variability (SD = 0.0497), which was statistically significant compared to other models, highlighting its reliability for real-world software development tasks.