The platform introduces a comprehensive data production pipeline for processing large-scale video sources, such as YouTube videos and their closed captions, into training resources: the Live-CC-5M dataset for pre-training and the Live-WhisperX-526K dataset for supervised fine-tuning. LiveCC’s architecture is built on the Qwen2-VL-7B-Base model and is further enhanced through streaming pre-training and fine-tuning. This enables the model to perform competitively on general video question answering (QA) tasks and to deliver real-time, context-aware commentary. Notably, the LiveCC-7B-Instruct model can surpass much larger models in commentary quality, even when operating in real time.
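Although the serving stack is not described here, a minimal sketch can illustrate how a Qwen2-VL-derived checkpoint like this is typically queried for offline video QA with Hugging Face Transformers. The repository id, file path, and sampling settings below are illustrative assumptions rather than confirmed details of the LiveCC release:

```python
# Minimal inference sketch, assuming the released checkpoint follows the
# standard Qwen2-VL interface in Hugging Face Transformers.
import torch
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from qwen_vl_utils import process_vision_info  # helper used in Qwen2-VL examples

MODEL_ID = "chenjoya/LiveCC-7B-Instruct"  # assumed Hugging Face repository id

processor = AutoProcessor.from_pretrained(MODEL_ID)
model = Qwen2VLForConditionalGeneration.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="auto"
)

# One offline video-QA request; real-time commentary would instead feed
# frames incrementally as they are decoded from the stream.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "file:///path/to/clip.mp4", "fps": 2.0},
        {"type": "text", "text": "What is happening in this clip?"},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=128)
answer = processor.batch_decode(
    output_ids[:, inputs.input_ids.shape[1]:], skip_special_tokens=True
)[0]
print(answer)
```

For true streaming commentary, the same model would instead receive frames as they arrive and emit text chunk by chunk, rather than processing the whole clip at once.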


LiveCC’s capabilities have been rigorously evaluated using benchmarks like LiveSports-3K, which measures the quality and relevance of real-time commentary in sports videos, as well as established video QA benchmarks such as VideoMME and OVOBench. The results show that LiveCC achieves state-of-the-art performance at the 7B/8B parameter scale, making it a highly efficient and generalizable solution for both streaming and offline video understanding. Its open release of models, datasets, and evaluation tools empowers researchers and developers to build, test, and deploy advanced video-language applications without the constraints of proprietary systems.


Key features include:


  • Real-time video commentary with streaming speech transcription
  • Temporally aligned vision-language modeling using ASR and video frames (see the sketch after this list)
  • Large-scale data pipeline for processing videos and closed captions
  • State-of-the-art performance on video QA and commentary benchmarks
  • Open-source release of models, datasets, and evaluation tools
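
To make the temporal-alignment idea concrete, the sketch below shows one way ASR words and video frames could be merged into a single time-ordered stream, so each word is conditioned only on the frames already seen. The data classes and function are illustrative and not the project's actual pipeline code:

```python
# Simplified sketch of interleaving ASR words with video frames by timestamp,
# the core idea behind temporally aligned streaming training data.
# Data layout and function names are illustrative assumptions.
from dataclasses import dataclass
from typing import List, Union


@dataclass
class AsrWord:
    text: str
    start: float  # seconds


@dataclass
class Frame:
    index: int
    timestamp: float  # seconds


def interleave(frames: List[Frame], words: List[AsrWord]) -> List[Union[Frame, AsrWord]]:
    """Merge frames and ASR words into one stream ordered by time."""
    sequence: List[Union[Frame, AsrWord]] = []
    wi = 0
    for frame in frames:
        # Emit every word spoken before this frame appears.
        while wi < len(words) and words[wi].start < frame.timestamp:
            sequence.append(words[wi])
            wi += 1
        sequence.append(frame)
    sequence.extend(words[wi:])  # trailing words after the last frame
    return sequence


if __name__ == "__main__":
    frames = [Frame(i, i * 0.5) for i in range(4)]  # 2 fps
    words = [AsrWord("nice", 0.2), AsrWord("pass", 0.7), AsrWord("goal", 1.6)]
    for item in interleave(frames, words):
        print(item)
```

The released Live-CC-5M and Live-WhisperX-526K datasets are presumably produced by applying this kind of alignment at much larger scale to YouTube videos and their closed captions.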
