AthenaBench: A Dynamic Benchmark for Evaluating LLMs in Cyber Threat Intelligence

Md Tanvirul Alam, Dipkamal Bhusal, Salman Ahmad, Nidhi Rastogi, Peter Worth

2025-11-04

Summary

This paper investigates how well large language models, systems that are good at understanding and generating language, can help with cybersecurity, specifically the analysis of cyber threats. It builds on prior work (CTIBench) to create a better way to test these models in this area.

What's the problem?

Cybersecurity analysts have to sift through tons of reports to understand threats and protect systems. This is a lot of work, and language models could potentially help automate some of it. However, current language models haven't been specifically tested or optimized for these kinds of cybersecurity tasks, so it's unclear how useful they really are. Existing benchmarks weren't comprehensive enough to truly evaluate their abilities.

What's the solution?

The researchers created a new, improved benchmark called AthenaBench. It includes a better data-collection pipeline, removes duplicate items, uses more precise ways to measure performance, and adds a new task focused on figuring out how to reduce risks from cyber threats (risk mitigation). They then tested twelve language models on this benchmark: powerful proprietary ones such as GPT-5 and Gemini-2.5 Pro, alongside seven freely available open-source models from the LLaMA and Qwen families.
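Two of the construction steps mentioned above, removing duplicate benchmark items and scoring answers with a strict matching metric, can be illustrated with a small sketch. This is not AthenaBench's actual code; all function names and the toy questions below are hypothetical:

```python
def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so near-identical strings compare equal."""
    return " ".join(text.lower().split())

def deduplicate(items):
    """Drop benchmark questions whose normalized text has already been seen."""
    seen, unique = set(), []
    for item in items:
        key = normalize(item["question"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique

def exact_match_accuracy(items, predictions):
    """Fraction of items where the model's answer matches the gold answer."""
    if not items:
        return 0.0
    correct = sum(
        normalize(pred) == normalize(item["answer"])
        for item, pred in zip(items, predictions)
    )
    return correct / len(items)

# Toy benchmark: two unique questions plus one duplicate (same text, different case).
bench = [
    {"question": "Which tactic does T1566 (Phishing) fall under?",
     "answer": "Initial Access"},
    {"question": "which tactic does t1566 (phishing) fall under?",
     "answer": "Initial Access"},
    {"question": "Name the APT group linked to the Lazarus toolset.",
     "answer": "Lazarus Group"},
]

unique = deduplicate(bench)                   # duplicate question removed
preds = ["initial access", "Lazarus Group"]   # hypothetical model outputs
score = exact_match_accuracy(unique, preds)   # 1.0 for this toy data
```

Real CTI benchmarks typically need fuzzier matching (e.g. semantic similarity for free-form answers), but exact match after normalization is a common baseline for multiple-choice and short-answer tasks.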

Why it matters?

The results showed that even the best language models still struggle with the more reasoning-intensive parts of cybersecurity analysis, such as figuring out who is behind an attack (threat actor attribution) or how best to prevent future attacks (risk mitigation). Open-source models performed even worse. This suggests we need language models specifically tailored to cybersecurity workflows to realize their potential and help analysts stay ahead of threats.

Abstract

Large Language Models (LLMs) have demonstrated strong capabilities in natural language reasoning, yet their application to Cyber Threat Intelligence (CTI) remains limited. CTI analysis involves distilling large volumes of unstructured reports into actionable knowledge, a process where LLMs could substantially reduce analyst workload. CTIBench introduced a comprehensive benchmark for evaluating LLMs across multiple CTI tasks. In this work, we extend CTIBench by developing AthenaBench, an enhanced benchmark that includes an improved dataset creation pipeline, duplicate removal, refined evaluation metrics, and a new task focused on risk mitigation strategies. We evaluate twelve LLMs, including state-of-the-art proprietary models such as GPT-5 and Gemini-2.5 Pro, alongside seven open-source models from the LLaMA and Qwen families. While proprietary LLMs achieve stronger results overall, their performance remains subpar on reasoning-intensive tasks, such as threat actor attribution and risk mitigation, with open-source models trailing even further behind. These findings highlight fundamental limitations in the reasoning capabilities of current LLMs and underscore the need for models explicitly tailored to CTI workflows and automation.