
Project Ariadne: A Structural Causal Framework for Auditing Faithfulness in LLM Agents

Sourena Khanzadeh

2026-01-06


Summary

This paper investigates whether the explanations given by AI agents, specifically large language models, actually reflect *why* a decision was made, or are just justifications constructed *after* the fact. It introduces a new method to test the honesty of these explanations.

What's the problem?

As AI agents become more powerful and make important decisions on their own, we need to be able to understand *how* they're thinking. Current AI systems use 'Chain-of-Thought' prompting, where they show their reasoning step-by-step, but it's unclear if these steps truly reflect the agent's decision-making process or are just made up to sound logical. Essentially, are they actually reasoning, or just putting on a 'Reasoning Theater'?

What's the solution?

The researchers developed a framework called Project Ariadne that represents an agent's reasoning chain as a 'Structural Causal Model'. This lets them test the AI's reasoning by intervening on the information it's given – flipping a premise or negating a fact – and checking whether the final answer changes accordingly. If the answer *doesn't* change even when the reasoning is altered, the reasoning isn't actually driving the decision. They quantify this with a metric called 'Causal Sensitivity'.
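The audit loop described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function names (`causal_sensitivity`, `negate`) and the toy agents are hypothetical, and the "intervention" here is a simple string negation standing in for the paper's do-calculus operations on reasoning nodes.

```python
def negate(step):
    """Toy intervention: invert a factual premise in one reasoning step."""
    return step.replace(" is ", " is not ") if " is " in step else "not " + step

def causal_sensitivity(agent, question, reasoning_steps):
    """Fraction of intervened reasoning steps that flip the final answer.

    A faithful agent should score near 1.0; a score near 0.0 suggests the
    trace is 'Reasoning Theater' and the answer comes from latent priors.
    """
    baseline = agent(question, reasoning_steps)
    changed = 0
    for i in range(len(reasoning_steps)):
        intervened = list(reasoning_steps)
        intervened[i] = negate(intervened[i])  # hard intervention on node i
        if agent(question, intervened) != baseline:
            changed += 1
    return changed / len(reasoning_steps)

# Two toy agents: one that actually consults its reasoning trace,
# and one whose answer is fixed regardless of the trace.
def faithful_agent(question, steps):
    return "no" if any("not" in s for s in steps) else "yes"

def theater_agent(question, steps):
    return "yes"  # decision governed by a fixed prior, not the trace

steps = ["water is wet", "rain is water"]
print(causal_sensitivity(faithful_agent, "Is rain wet?", steps))  # 1.0
print(causal_sensitivity(theater_agent, "Is rain wet?", steps))   # 0.0
```

Under this framing, the faithful agent's answer flips whenever a premise is negated, while the theater agent's never does – the contrast the paper's Faithfulness Gap is meant to expose.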

Why does it matter?

The study found that many AI agents exhibit a 'Faithfulness Gap', meaning their explanations aren't reliable. They can arrive at the same conclusion even with contradictory reasoning. This is a big problem because if we can't trust the explanations, we can't trust the AI's decisions, especially in high-stakes situations. The Ariadne Score provides a new way to evaluate and improve the alignment between an AI's logic and its actions.

Abstract

As Large Language Model (LLM) agents are increasingly tasked with high-stakes autonomous decision-making, the transparency of their reasoning processes has become a critical safety concern. While Chain-of-Thought (CoT) prompting allows agents to generate human-readable reasoning traces, it remains unclear whether these traces are faithful generative drivers of the model's output or merely post-hoc rationalizations. We introduce Project Ariadne, a novel XAI framework that utilizes Structural Causal Models (SCMs) and counterfactual logic to audit the causal integrity of agentic reasoning. Unlike existing interpretability methods that rely on surface-level textual similarity, Project Ariadne performs hard interventions (do-calculus) on intermediate reasoning nodes -- systematically inverting logic, negating premises, and reversing factual claims -- to measure the Causal Sensitivity (φ) of the terminal answer. Our empirical evaluation of state-of-the-art models reveals a persistent Faithfulness Gap. We define and detect a widespread failure mode termed Causal Decoupling, where agents exhibit a violation density (ρ) of up to 0.77 in factual and scientific domains. In these instances, agents arrive at identical conclusions despite contradictory internal logic, proving that their reasoning traces function as "Reasoning Theater" while decision-making is governed by latent parametric priors. Our findings suggest that current agentic architectures are inherently prone to unfaithful explanation, and we propose the Ariadne Score as a new benchmark for aligning stated logic with model action.