How Confident are Video Models? Empowering Video Models to Express their Uncertainty

Zhiting Mei, Ola Shorinwa, Anirudha Majumdar

2025-10-06

Summary

This paper tackles the problem of 'hallucinations' in AI-generated videos: the model sometimes makes up things that aren't true, even though the video looks realistic.

What's the problem?

AI video generators have become very good at turning text descriptions into videos, but they often confidently produce videos containing incorrect information. We have established ways to check how sure language models are about their answers, but no comparable methods exist for video models. That is a safety concern, because these videos could spread misinformation; in short, we don't know when to trust the videos an AI makes.

What's the solution?

The researchers created a new method called S-QUBED to measure how uncertain a video AI is about its creations. It works in the model's latent space (the 'hidden code' the AI uses to generate the video), separating uncertainty caused by a vague or ambiguous request (aleatoric uncertainty) from uncertainty caused by the model simply not knowing the answer (epistemic uncertainty). They also developed a way to evaluate how well such uncertainty estimates are calibrated, along with a dataset that lets others benchmark similar methods.
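To make the decomposition concrete, here is a minimal sketch, assuming a standard entropy-based split of total predictive uncertainty into aleatoric and epistemic parts. The function name and the discrete-outcome setup are illustrative assumptions, not the paper's actual latent-space formulation for video models.

    # Hypothetical sketch (not the paper's exact algorithm): given per-latent-sample
    # predictive distributions p(y | z_i) over a discrete outcome, split total
    # predictive entropy into aleatoric and epistemic parts.
    import numpy as np

    def decompose_uncertainty(probs):
        """probs: shape (num_latent_samples, num_outcomes); each row is a
        probability distribution conditioned on one latent sample."""
        probs = np.asarray(probs, dtype=float)
        eps = 1e-12
        # Total uncertainty: entropy of the marginal (averaged) predictive distribution.
        marginal = probs.mean(axis=0)
        total = -np.sum(marginal * np.log(marginal + eps))
        # Aleatoric: expected entropy of the per-sample conditional distributions.
        aleatoric = -np.mean(np.sum(probs * np.log(probs + eps), axis=1))
        # Epistemic: the gap (mutual information between outcome and latent).
        epistemic = total - aleatoric
        return total, aleatoric, epistemic

    # Example: two latent samples that disagree -> nonzero epistemic uncertainty.
    total, aleatoric, epistemic = decompose_uncertainty([[0.9, 0.1], [0.2, 0.8]])
    print(total, aleatoric, epistemic)

The intuition: if different latent samples lead to confident but conflicting predictions, the gap between total and aleatoric uncertainty grows, signalling that the model lacks knowledge rather than that the prompt was ambiguous.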

Why it matters?

This research is important because it's the first step towards building safer and more reliable AI video generators. By being able to quantify uncertainty, we can identify potentially misleading videos and improve the AI's ability to create accurate content, which is crucial as these tools become more widespread.

Abstract

Generative video models demonstrate impressive text-to-video capabilities, spurring widespread adoption in many real-world applications. However, like large language models (LLMs), video generation models tend to hallucinate, producing plausible videos even when they are factually wrong. Although uncertainty quantification (UQ) of LLMs has been extensively studied in prior work, no UQ method for video models exists, raising critical safety concerns. To our knowledge, this paper represents the first work towards quantifying the uncertainty of video models. We present a framework for uncertainty quantification of generative video models, consisting of: (i) a metric for evaluating the calibration of video models based on robust rank correlation estimation with no stringent modeling assumptions; (ii) a black-box UQ method for video models (termed S-QUBED), which leverages latent modeling to rigorously decompose predictive uncertainty into its aleatoric and epistemic components; and (iii) a UQ dataset to facilitate benchmarking calibration in video models. By conditioning the generation task in the latent space, we disentangle uncertainty arising due to vague task specifications from that arising from lack of knowledge. Through extensive experiments on benchmark video datasets, we demonstrate that S-QUBED computes calibrated total uncertainty estimates that are negatively correlated with the task accuracy and effectively computes the aleatoric and epistemic constituents.
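For readers curious what the calibration evaluation might look like in practice, here is a minimal sketch, assuming a plain Spearman rank correlation between per-prompt uncertainty and task accuracy. The paper uses a robust rank correlation estimator, and the numbers below are placeholders, not the authors' data or implementation.

    # Hypothetical calibration check: uncertainty estimates should be negatively
    # rank-correlated with task accuracy if the model is well calibrated.
    import numpy as np
    from scipy.stats import spearmanr

    uncertainty = np.array([0.9, 0.2, 0.7, 0.1, 0.5])   # higher = less confident
    accuracy    = np.array([0.1, 0.95, 0.3, 0.9, 0.6])  # per-prompt task accuracy

    rho, p_value = spearmanr(uncertainty, accuracy)
    print(f"Spearman rho = {rho:.2f} (negative indicates calibrated uncertainty)")

Because rank correlation only depends on the ordering of the scores, this kind of check avoids strong assumptions about the scale or distribution of the uncertainty estimates, which is in the spirit of the calibration metric described in the abstract.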