Stochastic CHAOS: Why Deterministic Inference Kills, and Distributional Variability Is the Heartbeat of Artificial Cognition
Tanmay Joshi, Shourya Aggarwal, Anusa Saha, Aadi Pandey, Shreyash Dhoot, Vighnesh Rai, Raxit Goswami, Aman Chadha, Vinija Jain, Amitava Das
2026-01-13
Summary
This paper challenges the idea that large language model (LLM) inference should always produce the exact same output given the same input. Traditional computer science favors 'deterministic' programs, whose behavior is fully predictable, but the authors argue that forcing LLMs to be deterministic actually harms their performance and safety.
What's the problem?
The current push for 'deterministic inference' in LLMs, where every run gives the same answer, is based on the desire for reliability and reproducibility. However, the authors contend that LLMs aren't like traditional programs. They work by predicting probabilities of different outputs, and forcing them to pick just *one* answer hides important information about their uncertainty and potential for errors. This makes it seem like the model is more reliable than it actually is, and it can prevent the model from exhibiting its full range of capabilities.
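The contrast between collapsing to one answer and sampling from the model's distribution can be sketched with a toy next-token distribution. The numbers and token names below are illustrative assumptions, not values from the paper:

```python
import numpy as np

# Hypothetical next-token distribution over four candidate answers
# (illustrative numbers only, not taken from the paper).
tokens = ["A", "B", "C", "D"]
probs = np.array([0.40, 0.35, 0.15, 0.10])

# Deterministic (greedy) decoding always returns the argmax token,
# discarding the fact that the model places 60% of its probability
# mass on other answers.
greedy = tokens[int(np.argmax(probs))]

# Stochastic decoding samples from the full distribution, so repeated
# runs expose the model's uncertainty instead of hiding it.
rng = np.random.default_rng(0)
samples = [tokens[i] for i in rng.choice(len(tokens), size=1000, p=probs)]

print("greedy pick:", greedy)  # always "A", no matter how uncertain
print("empirical mix:", {t: samples.count(t) / 1000 for t in tokens})
```

Every greedy run reports "A" with apparent total confidence, while the sampled mix recovers an estimate of the underlying distribution, which is the information the authors argue deterministic inference conceals.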
What's the solution?
Instead of trying to eliminate variability, the authors propose embracing it with a framework they call 'Stochastic CHAOS'. This means acknowledging that LLMs naturally produce a range of possible outputs and treating that variability as valuable data. They demonstrate that evaluating an LLM based on a single, deterministic output can be misleading, underestimating both its strengths and weaknesses. They advocate for evaluating LLMs by looking at *multiple* possible outputs to get a more complete picture of how they behave.
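A minimal sketch of what multi-sample evaluation looks like in practice, assuming a pass@k-style protocol (the paper's exact evaluation procedure may differ): draw several completions per prompt, judge each one, and estimate the success distribution rather than scoring a single greedy run.

```python
import math

def success_rate(outcomes):
    """Fraction of sampled completions judged correct for one prompt."""
    return sum(outcomes) / len(outcomes)

def pass_at_k(n, c, k):
    """Standard unbiased pass@k estimator: probability that at least one
    of k completions drawn from n samples (c of them correct) is correct."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Hypothetical judged outcomes for 10 sampled completions of the same
# prompt (1 = correct). A single greedy run could have landed on either
# side of this 60/40 split and reported false certainty.
outcomes = [1, 0, 1, 1, 0, 1, 0, 1, 1, 0]

print("per-prompt success rate:", success_rate(outcomes))   # 0.6
print("pass@1 estimate:", pass_at_k(10, sum(outcomes), 1))
print("pass@5 estimate:", pass_at_k(10, sum(outcomes), 5))
```

The per-prompt success rate is exactly the failure-probability signal that, per the abstract, single-sample deterministic evaluation masks.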
Why does it matter?
This research is important because it shifts the conversation around LLM reliability. It suggests that striving for perfect determinism isn't the right goal. Understanding and controlling the natural randomness of LLMs is crucial for building safer, more capable, and more trustworthy AI systems. By acknowledging the inherent uncertainty, we can better identify potential risks and unlock the full potential of these models.
Abstract
Deterministic inference is a comforting ideal in classical software: the same program on the same input should always produce the same output. As large language models move into real-world deployment, this ideal has been imported wholesale into inference stacks. Recent work from the Thinking Machines Lab has presented a detailed analysis of nondeterminism in LLM inference, showing how batch-invariant kernels and deterministic attention can enforce bitwise-identical outputs, positioning deterministic inference as a prerequisite for reproducibility and enterprise reliability. In this paper, we take the opposite stance. We argue that, for LLMs, deterministic inference kills. It kills the ability to model uncertainty, suppresses emergent abilities, collapses reasoning into a single brittle path, and weakens safety alignment by hiding tail risks. LLMs implement conditional distributions over outputs, not fixed functions. Collapsing these distributions to a single canonical completion may appear reassuring, but it systematically conceals properties central to artificial cognition. We instead advocate Stochastic CHAOS, treating distributional variability as a signal to be measured and controlled. Empirically, we show that deterministic inference is systematically misleading. Single-sample deterministic evaluation underestimates both capability and fragility, masking failure probability under paraphrases and noise. Phase-like transitions associated with emergent abilities disappear under greedy decoding. Multi-path reasoning degrades when forced onto deterministic backbones, reducing accuracy and diagnostic insight. Finally, deterministic evaluation underestimates safety risk by hiding rare but dangerous behaviors that appear only under multi-sample evaluation.