Mitigating Object and Action Hallucinations in Multimodal LLMs via Self-Augmented Contrastive Alignment

Kai-Po Chang, Wei-Yuan Cheng, Chi-Pin Huang, Fu-En Yang, Yu-Chiang Frank Wang

2025-12-05

Summary

This paper focuses on improving how well large language models that can understand both images and videos (called multimodal LLMs) describe what's happening in videos. Specifically, it addresses the issue of these models 'hallucinating' – making up details that aren't actually present in the video.

What's the problem?

Current multimodal LLMs are really good at *sounding* like they understand videos, but their descriptions often get facts wrong. They might say there's a cat when there's a dog, or claim someone is running when they're walking. Previous attempts to fix this problem focused on still images, but correcting these errors in videos is much harder because you have to consider both what objects are present *and* what actions unfold over time. The challenge is making sure the descriptions accurately reflect both the visual elements and the events as they happen.

What's the solution?

The researchers developed a system called SANTA, which stands for Self-Augmented Contrastive Alignment. SANTA works in two main ways. First, it tries to predict what kinds of mistakes the model might make and creates 'negative examples' to help the model learn what *not* to say. Second, it carefully compares specific objects and actions in the video with the words used to describe them, making sure they align correctly. It focuses on linking visual parts of the video with the phrases used to describe them, and also connects actions with phrases that explain how things are changing over time.
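The core idea of contrasting a faithful caption against self-generated hallucinated negatives can be illustrated with an InfoNCE-style loss. The sketch below is a minimal, hypothetical illustration, not the authors' implementation: the `swap_hallucination` helper and its confusion map are invented for demonstration, and embeddings are plain Python vectors rather than model features.

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def swap_hallucination(caption, confusions):
    """Toy self-augmentation: build a 'hallucinated' negative caption by
    swapping an object/action word for a plausible confusion (hypothetical)."""
    return " ".join(confusions.get(w, w) for w in caption.split())

def contrastive_alignment_loss(visual_emb, pos_emb, neg_embs, temperature=0.1):
    """InfoNCE-style alignment: pull the faithful phrase embedding toward
    the visual (tracklet/action) feature, push hallucinated negatives away."""
    sims = [cosine(visual_emb, pos_emb)] + [cosine(visual_emb, n) for n in neg_embs]
    logits = [s / temperature for s in sims]
    m = max(logits)                         # stabilize the softmax
    exps = [math.exp(x - m) for x in logits]
    return -math.log(exps[0] / sum(exps))   # -log p(faithful caption)

# Toy usage: a matched phrase yields a lower loss than a mismatched one.
visual = [1.0, 0.0]                 # stand-in for a tracklet feature
faithful = [0.9, 0.1]               # embedding of the correct phrase
hallucinated = [[0.1, 0.9]]         # embedding of the swapped negative
loss_good = contrastive_alignment_loss(visual, faithful, hallucinated)
loss_bad = contrastive_alignment_loss(visual, hallucinated[0], [faithful])
print(swap_hallucination("a dog is walking", {"dog": "cat", "walking": "running"}))
```

In this toy setup `loss_good` comes out smaller than `loss_bad`, which is exactly the pressure the alignment objective applies: captions grounded in the actual visual content are rewarded over plausible-sounding hallucinations.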

Why it matters?

This research is important because it makes these multimodal LLMs more reliable. If we want to use these models for things like video analysis, automated reporting, or helping people with visual impairments, we need to be sure the information they provide is accurate. By reducing hallucinations, SANTA helps build trust in these powerful AI systems and opens up possibilities for more practical applications.

Abstract

Recent advances in multimodal LLMs (MLLMs) have demonstrated their remarkable capability to generate descriptive captions for input videos. However, these models suffer from factual inaccuracies in the generated descriptions, causing severe hallucination issues. While prior works have explored alleviating hallucinations for static images, jointly mitigating visual object and temporal action hallucinations for dynamic videos remains a challenging and unsolved task. To tackle this challenge, we propose a Self-Augmented Contrastive Alignment (SANTA) framework that enables object and action faithfulness by suppressing spurious correlations and enforcing emphasis on visual facts. SANTA employs a hallucinative self-augmentation scheme to identify the potential hallucinations that lie in the MLLM and transform the original captions into contrasted negatives. Furthermore, we develop a tracklet-phrase contrastive alignment to match the regional objects and relation-guided actions with their corresponding visual and temporal phrases. Extensive experiments demonstrate that SANTA outperforms existing methods in alleviating object and action hallucinations, yielding superior performance on hallucination examination benchmarks.