Stepping VLMs onto the Court: Benchmarking Spatial Intelligence in Sports

Yuchen Yang, Yuqing Shao, Duxiu Huang, Linfeng Dong, Yifei Liu, Suixin Tang, Xiang Zhou, Yuanyuan Gao, Wei Wang, Yue Zhou, Xue Yang, Yanfeng Wang, Xiao Sun, Zhihang Zhong

2026-03-11

Summary

This paper introduces CourtSI, a large-scale dataset designed to test how well vision-language models (AI systems that process both images and text) understand spatial relationships in net sports such as badminton, tennis, and table tennis.

What's the problem?

Current AI models struggle with understanding the complex spatial reasoning needed to analyze sports scenes, like figuring out distances between players, counting objects on the court, or understanding their positions relative to each other. Existing datasets aren't specifically designed to challenge these skills in the dynamic and fast-paced environment of sports, so it's hard to tell how well AI *really* understands spatial concepts.

What's the solution?

The researchers created CourtSI, a collection of over one million question-answer pairs about spatial elements in sports videos, covering counting, distance measurement, localization, and relational reasoning. Because court lines have standardized, known dimensions, they can serve as metric anchors: a semi-automatic data engine uses them to reconstruct each scene and generate questions at scale. The researchers also built CourtSI-Bench, a smaller, human-verified set of 3,686 questions for reliably measuring AI performance. Finally, they fine-tuned an existing model (Qwen3-VL-8B) on CourtSI, which substantially improved its spatial understanding.
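The metric-anchor idea can be illustrated with a small sketch. The paper does not publish its data engine, so everything below is an assumption: the corner pixel positions are made up, and the helper names (`fit_homography`, `to_court`) are hypothetical. The core trick, though, is standard: four known court corners fix a pixel-to-court homography, after which any on-court pixel (e.g. a player's feet) gets real-world coordinates, and distance questions can be answered in metres.

```python
import numpy as np

def fit_homography(px, court):
    """Estimate the 3x3 homography mapping pixel coords -> metric court
    coords from four point correspondences (direct linear transform)."""
    A = []
    for (x, y), (X, Y) in zip(px, court):
        A.append([-x, -y, -1, 0, 0, 0, X * x, X * y, X])
        A.append([0, 0, 0, -x, -y, -1, Y * x, Y * y, Y])
    # The homography is the null-space vector of A (last row of V^T).
    _, _, Vt = np.linalg.svd(np.asarray(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def to_court(H, pt):
    """Project a pixel point into metric court coordinates."""
    v = H @ np.array([pt[0], pt[1], 1.0])
    return v[:2] / v[2]

# Badminton singles court: 5.18 m wide, 13.4 m long (corners in metres).
court_corners = [(0, 0), (5.18, 0), (5.18, 13.4), (0, 13.4)]
# Hypothetical pixel positions of the same corners, e.g. from a line detector.
pixel_corners = [(210, 620), (870, 615), (1020, 150), (95, 158)]

H = fit_homography(pixel_corners, court_corners)

# A distance question: how far apart are two players, given their foot pixels?
p1 = to_court(H, (400, 500))
p2 = to_court(H, (700, 300))
dist = float(np.linalg.norm(p1 - p2))
print(f"Players are {dist:.2f} m apart")
```

With the ground-truth scene reconstructed this way, question templates ("How far is player A from the net?", "How many players are in the left service box?") can be filled in automatically, which is what makes curating a million-scale QA set feasible.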

Why it matters?

This work is important because it provides a better way to evaluate and improve the 'spatial intelligence' of AI. By focusing on the challenging domain of sports, the researchers have shown that current AI models still have a long way to go in understanding space and relationships, and that specialized training data can significantly boost their performance. This could lead to AI that can not only *watch* sports but also *understand* what's happening and even provide insightful commentary.

Abstract

Sports have long attracted broad attention as they push the limits of human physical and cognitive capabilities. Amid growing interest in spatial intelligence for vision-language models (VLMs), sports provide a natural testbed for understanding high-intensity human motion and dynamic object interactions. To this end, we present CourtSI, the first large-scale spatial intelligence dataset tailored to sports scenarios. CourtSI contains over 1M QA pairs, organized under a holistic taxonomy that systematically covers spatial counting, distance measurement, localization, and relational reasoning, across representative net sports including badminton, tennis, and table tennis. Leveraging well-defined court geometry as metric anchors, we develop a semi-automatic data engine to reconstruct sports scenes, enabling scalable curation of CourtSI. In addition, we introduce CourtSI-Bench, a high-quality evaluation benchmark comprising 3,686 QA pairs with rigorous human verification. We evaluate 25 proprietary and open-source VLMs on CourtSI-Bench, revealing a remaining human-AI performance gap and limited generalization from existing spatial intelligence benchmarks. These findings indicate that sports scenarios expose limitations in spatial intelligence capabilities captured by existing benchmarks. Further, fine-tuning Qwen3-VL-8B on CourtSI improves accuracy on CourtSI-Bench by 23.5 percentage points. The adapted model also generalizes effectively to CourtSI-Ext, an evaluation set built on a similar but unseen sport, and demonstrates enhanced spatial-aware commentary generation. Together, these findings demonstrate that CourtSI provides a scalable pathway toward advancing spatial intelligence of VLMs in sports.