DFIR-Metric: A Benchmark Dataset for Evaluating Large Language Models in Digital Forensics and Incident Response
Bilel Cherif, Tamas Bisztray, Richard A. Dubniczky, Aaesha Aldahmani, Saeed Alshehhi, Norbert Tihanyi
2025-05-28
Summary
This paper introduces DFIR-Metric, a new benchmark designed to measure how well large language models can help with digital forensics and incident response (investigating computer security incidents).
What's the problem?
The problem is that while AI models are being used more and more in cybersecurity, there hasn't been a reliable way to measure how well they actually understand and solve realistic digital forensics problems, especially on hard tasks where a model rarely gets the final answer exactly right.
What's the solution?
To fix this, the researchers created DFIR-Metric, which combines three kinds of challenges: knowledge quizzes, realistic investigation scenarios, and hands-on forensic analysis tasks. They also introduced a new metric, the Task Understanding Score, which gives a model credit for understanding the task even when it doesn't produce the exact right answer (a sketch of such a metric is shown below).
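To make the idea concrete, here is a minimal sketch of how a partial-credit metric in the spirit of the Task Understanding Score could be computed, assuming each task is graded against a checklist of criteria. The names (TaskResult, task_understanding_score) and the averaging scheme are illustrative assumptions, not the paper's actual formula.

```python
from dataclasses import dataclass

@dataclass
class TaskResult:
    """One model response graded against a forensic task."""
    exact_match: bool    # did the model produce the exact expected answer?
    criteria_met: int    # how many grading criteria the response satisfied
    criteria_total: int  # total grading criteria for the task (assumed >= 1)

def task_understanding_score(results: list[TaskResult]) -> float:
    """Average partial credit across tasks.

    A fully correct answer earns 1.0; otherwise the task earns the
    fraction of grading criteria the response satisfied, so a model
    gets credit for understanding the task even when the final answer
    is wrong. (Hypothetical formulation, not the paper's exact formula.)
    """
    if not results:
        return 0.0
    scores = [
        1.0 if r.exact_match else r.criteria_met / r.criteria_total
        for r in results
    ]
    return sum(scores) / len(scores)

# Example: three tasks, only one solved exactly, partial progress on the rest.
results = [
    TaskResult(exact_match=True,  criteria_met=4, criteria_total=4),
    TaskResult(exact_match=False, criteria_met=2, criteria_total=4),
    TaskResult(exact_match=False, criteria_met=1, criteria_total=4),
]
print(f"TUS = {task_understanding_score(results):.2f}")  # TUS = 0.58
```

The point of a metric like this is that a model scoring 0% on exact-match accuracy can still be distinguished from one that makes no progress at all.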
Why does it matter?
This is important because it helps practitioners see which AI models are actually useful for digital forensics work, making it easier to trust AI tools in cybersecurity and to choose the right models for real investigations.
Abstract
DFIR-Metric is a comprehensive benchmark for evaluating Large Language Models in digital forensics and incident response. It combines knowledge assessments, realistic forensic challenges, and practical analysis cases, and introduces a Task Understanding Score to measure partial progress in near-zero-accuracy scenarios.