BEAVER: An Efficient Deterministic LLM Verifier

Tarun Suresh, Nalin Wadhwa, Debangshu Banerjee, Gagandeep Singh

2025-12-12

Summary

This paper introduces BEAVER, a system for proving, rather than merely estimating, how likely a large language model (LLM) is to follow specified rules or constraints.

What's the problem?

When using LLMs in real-world applications, it's crucial to be confident they won't produce harmful or incorrect outputs. Simply testing them on a handful of examples can't guarantee they'll *always* behave correctly: a rare failure mode may simply never show up in the test set. Existing sampling-based methods for estimating how well an LLM follows rules are therefore unreliable and provide no sound guarantees, as the toy simulation below illustrates.
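To see why sampling alone falls short, here is a minimal sketch with made-up numbers (not from the paper): if a rule-violating output has a true probability of 1 in 100,000, even ten thousand clean samples will usually report a violation rate of exactly zero.

```python
import random

# Toy illustration with hypothetical numbers: a model emits a rule-violating
# output with true probability 1e-5. A modest sampling budget will usually
# observe zero violations, yielding a misleadingly optimistic estimate.
random.seed(0)
true_violation_prob = 1e-5
num_samples = 10_000
violations = sum(random.random() < true_violation_prob
                 for _ in range(num_samples))
print(f"Empirical violation rate: {violations / num_samples}")  # typically 0.0
# Zero observed violations is not evidence of zero risk; the sample is
# simply too small to see a rare but real failure mode.
```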

What's the solution?

BEAVER systematically explores the space of token sequences an LLM can generate, using token trie and frontier data structures to track probability mass efficiently. At every step it maintains provably sound lower and upper bounds on the probability that the model violates a given rule or constraint. This means it doesn't just *guess* whether the model is safe, it *proves* how risky it can be, with the bounds tightening as more of the generation space is explored.
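The sketch below illustrates the flavor of this approach (the toy model, constraint, and function names are hypothetical stand-ins, not BEAVER's actual data structures or API): prefixes are expanded in order of probability mass, each is classified as violating, satisfying, or still open, and the violation probability is provably bracketed between the proven-violating mass and that mass plus whatever remains unexplored.

```python
import heapq

def toy_next_token_probs(prefix):
    """Stand-in for an LLM's next-token distribution over a tiny vocabulary."""
    return {"a": 0.6, "b": 0.3, "<eos>": 0.1}

def violates(prefix):
    """Stand-in prefix-closed constraint: once a prefix violates,
    every continuation violates too (here: the token 'b' ever appears)."""
    return "b" in prefix

def verify(max_expansions=100):
    lower = 0.0                      # probability mass proven to violate
    frontier = [(-1.0, ())]          # max-heap (via negation) on prefix mass
    frontier_mass = 1.0              # unclassified mass still on the frontier
    for _ in range(max_expansions):
        if not frontier:
            break
        neg_p, prefix = heapq.heappop(frontier)
        p = -neg_p
        frontier_mass -= p
        if violates(prefix):
            lower += p               # prefix-closed: the whole subtree violates
            continue
        for token, q in toy_next_token_probs(prefix).items():
            if token == "<eos>":
                continue             # finished sequence that satisfies the rule
            heapq.heappush(frontier, (-(p * q), prefix + (token,)))
            frontier_mass += p * q
    # Sound bounds: the true violation probability lies in [lower, upper].
    return lower, lower + frontier_mass

print(verify())                      # bounds tighten toward ~0.75 in this toy
```

Because the constraint is prefix-closed, a violating prefix's entire subtree can be credited to the lower bound without further expansion, which is what lets the bounds tighten monotonically as the exploration budget grows.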

Why does it matter?

BEAVER is a significant step forward because it provides a way to confidently assess the risks of deploying LLMs. Under the same computational budget it produces much tighter bounds than previous methods and flags several times more high-risk cases, giving a clearer picture of how likely problems are to occur. This is vital for building trustworthy AI systems, especially in areas like privacy and security where mistakes can have serious consequences.

Abstract

As large language models (LLMs) transition from research prototypes to production systems, practitioners often need reliable methods to verify that model outputs satisfy required constraints. While sampling-based estimates provide an intuition of model behavior, they offer no sound guarantees. We present BEAVER, the first practical framework for computing deterministic, sound probability bounds on LLM constraint satisfaction. Given any prefix-closed semantic constraint, BEAVER systematically explores the generation space using novel token trie and frontier data structures, maintaining provably sound bounds at every iteration. We formalize the verification problem, prove soundness of our approach, and evaluate BEAVER on correctness verification, privacy verification, and secure code generation tasks across multiple state-of-the-art LLMs. BEAVER achieves 6 to 8 times tighter probability bounds and identifies 3 to 4 times more high-risk instances than baseline methods under identical computational budgets, enabling precise characterization and risk assessment that loose bounds or empirical evaluation cannot provide.