EgoNight: Towards Egocentric Vision Understanding at Night with a Challenging Benchmark
Deheng Zhang, Yuqian Fu, Runyi Yang, Yang Miao, Tianwen Qian, Xu Zheng, Guolei Sun, Ajad Chhatkuli, Xuanjing Huang, Yu-Gang Jiang, Luc Van Gool, Danda Pani Paudel
2025-10-08
Summary
This paper introduces EgoNight, a new dataset designed to test how well computers 'see' and understand the world from a first-person perspective, specifically at night. It focuses on visual question answering, where a computer is shown a video and asked questions about it.
What's the problem?
Current tests for egocentric (first-person) vision mostly focus on daytime scenes. However, real-world applications like self-driving cars or assistive technology need to work well in all lighting conditions, including at night. There wasn't a good way to measure how well these systems performed in low-light situations, making it hard to improve them.
What's the solution?
The researchers created EgoNight by collecting both real recordings and computer-generated (Blender-rendered) videos of people performing actions in various environments, during both day and night. By aligning the day and night videos of the same scenes and actions, they could use the clearer daytime footage to generate accurate questions and answers for the nighttime videos. A combination of automatic labeling and human double-checking ensured the quality of the dataset, resulting in 3,658 question-answer pairs across 90 videos, spanning 12 question types. They also added two auxiliary tasks: retrieving matching day and night scenes, and estimating depth in nighttime videos.
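As a rough illustration of how such a benchmark reveals the day-night gap (this is a hypothetical sketch, not the authors' actual evaluation code, and all field names are invented), one could store each QA pair with its lighting condition and compare per-condition accuracy:

```python
# Hypothetical sketch: computing a day-vs-night accuracy gap on VQA records.
# The record fields (condition, answer, prediction) are illustrative only,
# not EgoNight's actual schema.
from collections import defaultdict

def accuracy_by_condition(records):
    """Return {condition: exact-match accuracy} for a list of QA records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["condition"]] += 1
        if r["prediction"].strip().lower() == r["answer"].strip().lower():
            correct[r["condition"]] += 1
    return {c: correct[c] / total[c] for c in total}

# Toy example: a model that is perfect in daylight but errs at night.
records = [
    {"condition": "day", "answer": "kitchen", "prediction": "kitchen"},
    {"condition": "day", "answer": "cup", "prediction": "cup"},
    {"condition": "night", "answer": "kitchen", "prediction": "hallway"},
    {"condition": "night", "answer": "cup", "prediction": "cup"},
]
acc = accuracy_by_condition(records)
print(acc["day"] - acc["night"])  # day-night gap: 0.5
```

Reporting the gap per condition, rather than a single pooled accuracy, is what makes the paper's central finding (substantial drops from day to night) visible.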
Why it matters?
EgoNight matters because it provides a challenging, realistic benchmark for evaluating computer vision systems that must operate in low-light conditions. By showing that current systems struggle with nighttime vision, it pushes researchers to develop more robust and reliable models that handle a wider range of real-world scenarios, ultimately enabling better applications such as safer self-driving cars and more helpful assistive devices.
Abstract
Most existing benchmarks for egocentric vision understanding focus primarily on daytime scenarios, overlooking the low-light conditions that are inevitable in real-world applications. To investigate this gap, we present EgoNight, the first comprehensive benchmark for nighttime egocentric vision, with visual question answering (VQA) as the core task. A key feature of EgoNight is the introduction of day-night aligned videos, which enhance night annotation quality using the daytime data and reveal clear performance gaps between lighting conditions. To achieve this, we collect both synthetic videos rendered by Blender and real-world recordings, ensuring that scenes and actions are visually and temporally aligned. Leveraging these paired videos, we construct EgoNight-VQA, supported by a novel day-augmented night auto-labeling engine and refinement through extensive human verification. Each QA pair is double-checked by annotators for reliability. In total, EgoNight-VQA contains 3658 QA pairs across 90 videos, spanning 12 diverse QA types, with more than 300 hours of human work. Evaluations of state-of-the-art multimodal large language models (MLLMs) reveal substantial performance drops when transferring from day to night, underscoring the challenges of reasoning under low-light conditions. Beyond VQA, EgoNight also introduces two auxiliary tasks, day-night correspondence retrieval and egocentric depth estimation at night, that further explore the boundaries of existing models. We believe EgoNight-VQA provides a strong foundation for advancing application-driven egocentric vision research and for developing models that generalize across illumination domains. All the data and code will be made available upon acceptance.