The platform introduces a comprehensive data production pipeline for processing large-scale video sources, such as YouTube videos and their closed captions, into training resources: the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for supervised fine-tuning. LiveCC’s architecture is built on the Qwen2-VL-7B-Base model and is further enhanced through streaming pre-training and fine-tuning. This enables the model to perform competitively on general video question answering (QA) tasks and to deliver real-time, context-aware commentary. Notably, the LiveCC-7B-Instruct model can surpass much larger models in commentary quality, even when operating in real time.
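Although the serving stack is not described here, a minimal sketch can illustrate how a Qwen2-VL-derived checkpoint like this is typically queried for offline video QA with Hugging Face Transformers. The repository id, file path, and sampling settings below are illustrative assumptions rather than confirmed details of the LiveCC release:

```python
# Minimal inference sketch, assuming the released checkpoint follows the
# standard Qwen2-VL interface in Hugging Face Transformers.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

MODEL_ID = "chenjoya/LiveCC-7B-Instruct"  # assumed Hugging Face repository id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One offline video-QA request; real-time commentary would instead feed
# frames incrementally as they are decoded from the stream.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 2.0},
        {"type": "text", "text": "What is happening in this clip?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

For true streaming commentary, the same model would instead receive frames as they arrive and emit text chunk by chunk, rather than processing the whole clip at once.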


LiveCC’s capabilities have been rigorously evaluated using benchmarks like LiveSports-3K, which measures the quality and relevance of real-time commentary in sports videos, as well as established video QA benchmarks such as VideoMME and OVOBench. The results show that LiveCC achieves state-of-the-art performance at the 7B/8B parameter scale, making it a highly efficient and generalizable solution for both streaming and offline video understanding. Its open release of models, datasets, and evaluation tools empowers researchers and developers to build, test, and deploy advanced video-language applications without the constraints of proprietary systems.


Key features include:


  • Real-time video commentary with streaming speech transcription
  • Temporally aligned vision-language modeling using ASR and video frames (see the sketch after this list)
  • Large-scale data pipeline for processing videos and closed captions
  • State-of-the-art performance on video QA and commentary benchmarks
  • Open-source release of models, datasets, and evaluation tools
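
To make the temporal-alignment idea concrete, the sketch below shows one way ASR words and video frames could be merged into a single time-ordered stream, so each word is conditioned only on the frames already seen. The data classes and function are illustrative and not the project's actual pipeline code:

```python
# Simplified sketch of interleaving ASR words with video frames by timestamp,
# the core idea behind temporally aligned streaming training data.
# Data layout and function names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class AsrWord:
    text: str
    start: float  # seconds


@dataclass
class Frame:
    index: int
    timestamp: float  # seconds


def interleave(frames: List[Frame], words: List[AsrWord]) -> List[Union[Frame, AsrWord]]:
    """Merge frames and ASR words into one stream ordered by time."""
    sequence: List[Union[Frame, AsrWord]] = []
    wi = 0
    for frame in frames:
        # Emit every word spoken before this frame appears.
        while wi < len(words) and words[wi].start < frame.timestamp:
            sequence.append(words[wi])
            wi += 1
        sequence.append(frame)
    sequence.extend(words[wi:])  # trailing words after the last frame
    return sequence


if __name__ == "__main__":
    frames = [Frame(i, i * 0.5) for i in range(4)]  # 2 fps
    words = [AsrWord("nice", 0.2), AsrWord("pass", 0.7), AsrWord("goal", 1.6)]
    for item in interleave(frames, words):
        print(item)
```

The released Live-CC-5M and Live-WhisperX-526K datasets are presumably produced by applying this kind of alignment at much larger scale to YouTube videos and their closed captions.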
