LuxDiT: Lighting Estimation with Video Diffusion Transformer
Ruofan Liang, Kai He, Zan Gojcic, Igor Gilitschenski, Sanja Fidler, Nandita Vijaykumar, Zian Wang
2025-09-08
Summary
This paper introduces a new method, called LuxDiT, for estimating the lighting in a scene from just a single image or video.
What's the problem?
Determining the lighting in a scene from an image is really hard. Existing computer programs struggle because they need lots of example images paired with *perfect* lighting information (so-called ground-truth HDR environment maps), which are expensive and difficult to capture. And while new AI models are good at creating images, they still have trouble with lighting, because lighting must be inferred from subtle, indirect clues, requires understanding the whole scene, and spans a far wider range of brightness levels than a normal photo can represent.
What's the solution?
The researchers trained an AI model, specifically a video diffusion transformer, to *generate* the lighting information (represented as an HDR environment map) conditioned on the image or video it sees. They first trained it on a huge collection of artificially created scenes with diverse lighting setups, so it could learn the connection between how things look and how they are lit. Then they refined the model on real-world panoramic images so its lighting predictions match what we would expect in real scenes. This refinement uses a technique called low-rank adaptation (LoRA), which adjusts only a small number of added weights, helping the model keep its predicted lighting semantically consistent with the input.
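The low-rank adaptation idea mentioned above can be sketched in a few lines. This is a minimal, framework-free illustration of how LoRA works in general, not the authors' code: the paper applies it inside a video diffusion transformer, whereas the class name, shapes, and hyperparameters below are hypothetical.

```python
import numpy as np

class LoRALinear:
    """Illustrative LoRA layer: a frozen pretrained weight W plus a
    trainable low-rank update (alpha / rank) * B @ A. Names and shapes
    are hypothetical, for explanation only."""

    def __init__(self, weight, rank=4, alpha=1.0):
        d_out, d_in = weight.shape
        self.weight = weight                          # frozen pretrained weight
        self.A = 0.01 * np.random.randn(rank, d_in)   # trainable, small random init
        self.B = np.zeros((d_out, rank))              # trainable, zero init
        self.scale = alpha / rank

    def __call__(self, x):
        # Output = W x + scale * B (A x); only A and B are updated
        # during fine-tuning, so far fewer parameters change than in
        # full fine-tuning.
        return self.weight @ x + self.scale * (self.B @ (self.A @ x))
```

Because `B` starts at zero, the adapted layer initially reproduces the frozen layer's output exactly, and fine-tuning only gradually shifts the model's behavior, which is why LoRA is a popular way to specialize a large pretrained model on a smaller dataset (here, the collected HDR panoramas).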
Why it matters?
This work is important because accurately estimating lighting is crucial for things like realistic computer graphics, photo editing, and helping robots understand their surroundings. LuxDiT outperforms previous methods, producing more accurate lighting predictions with finer detail, which means more realistic and immersive visual experiences.
Abstract
Estimating scene lighting from a single image or video remains a longstanding challenge in computer vision and graphics. Learning-based approaches are constrained by the scarcity of ground-truth HDR environment maps, which are expensive to capture and limited in diversity. While recent generative models offer strong priors for image synthesis, lighting estimation remains difficult due to its reliance on indirect visual cues, the need to infer global (non-local) context, and the recovery of high-dynamic-range outputs. We propose LuxDiT, a novel data-driven approach that fine-tunes a video diffusion transformer to generate HDR environment maps conditioned on visual input. Trained on a large synthetic dataset with diverse lighting conditions, our model learns to infer illumination from indirect visual cues and generalizes effectively to real-world scenes. To improve semantic alignment between the input and the predicted environment map, we introduce a low-rank adaptation finetuning strategy using a collected dataset of HDR panoramas. Our method produces accurate lighting predictions with realistic angular high-frequency details, outperforming existing state-of-the-art techniques in both quantitative and qualitative evaluations.