HAIC: Improving Human Action Understanding and Generation with Better Captions for Multi-modal Large Language Models
Xiao Wang, Jingyun Hua, Weihong Lin, Yuanxing Zhang, Fuzheng Zhang, Jianlong Wu, Di Zhang, Liqiang Nie
2025-03-03
Summary
This paper introduces HAIC, a system designed to help AI models better understand videos of human actions by providing detailed captions and high-quality training data.
What's the problem?
AI models struggle to understand videos involving human actions because the captions they are trained on are often too simple and lack important details. This limits their ability to accurately describe or analyze what is happening in a video.
What's the solution?
The researchers created a two-stage process to collect and annotate videos with high-quality captions. These captions include specific details about human attributes, actions, and interactions in chronological order. They used this process to build two datasets: HAICTrain for training AI models and HAICBench for evaluating them. Training AI models on HAICTrain significantly improved their ability to understand human actions in videos and even improved text-to-video generation results.
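The standardized caption format can be illustrated with a small sketch. The class and field names below are assumptions for illustration only; the paper specifies just that captions use human attributes to distinguish individuals and describe each person's actions and interactions in chronological order.

```python
from dataclasses import dataclass, field

@dataclass
class Subject:
    """One person in the video, identified by visual attributes."""
    attributes: str  # e.g. "A man in a red jacket" (distinguishes individuals)
    actions: list[str] = field(default_factory=list)  # chronological actions

def render_caption(subjects: list[Subject]) -> str:
    """Flatten a structured annotation into a single caption string,
    keeping each subject's actions in chronological order."""
    parts = [
        f"{s.attributes} {', then '.join(s.actions)}"
        for s in subjects
    ]
    return ". ".join(parts) + "."

# Hypothetical example (not drawn from the actual datasets):
caption = render_caption([
    Subject("A man in a red jacket", ["picks up a ball", "throws it"]),
    Subject("A woman in a blue dress", ["catches the ball"]),
])
```

A structured representation like this makes it easy to check annotations automatically (e.g., that every subject has distinguishing attributes and at least one action) before rendering them into natural-language captions.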
Why it matters?
This matters because it makes AI systems better at analyzing and describing human actions in videos, which could be useful for applications like security monitoring, sports analysis, or helping people with disabilities. The detailed captions also set a new standard for training AI on human action understanding, pushing the technology forward.
Abstract
Recent Multi-modal Large Language Models (MLLMs) have made great progress in video understanding. However, their performance on videos involving human actions is still limited by the lack of high-quality data. To address this, we introduce a two-stage data annotation pipeline. First, we design strategies to accumulate videos featuring clear human actions from the Internet. Second, videos are annotated in a standardized caption format that uses human attributes to distinguish individuals and chronologically details their actions and interactions. Through this pipeline, we curate two datasets, namely HAICTrain and HAICBench. HAICTrain comprises 126K video-caption pairs generated by Gemini-Pro and verified for training purposes. Meanwhile, HAICBench includes 500 manually annotated video-caption pairs and 1,400 QA pairs, enabling a comprehensive evaluation of human action understanding. Experimental results demonstrate that training with HAICTrain not only significantly enhances human action understanding abilities across 4 benchmarks, but also improves text-to-video generation results. Both HAICTrain and HAICBench are released at https://huggingface.co/datasets/KuaishouHAIC/HAIC.