The SynthHuman dataset used to train DAViD contains 300K images at 384×512 resolution, covering face, upper-body, and full-body scenarios in equal proportion. The dataset is designed to be diverse in poses, environments, lighting, and appearances, and is not tailored to any specific evaluation set, which allows DAViD to generalize across a range of benchmark datasets as well as to in-the-wild data. Along with the rendered RGB image, each sample includes ground-truth annotations for a soft foreground mask, surface normals, and depth, which are used to train the models.
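To make the per-sample annotation layout concrete, here is a minimal loading sketch. The directory structure, file names, and file formats are assumptions for illustration only, not the actual SynthHuman release format:

```python
import numpy as np
from pathlib import Path
from PIL import Image
from torch.utils.data import Dataset

class SynthHumanDataset(Dataset):
    """Illustrative loader: one RGB image plus its three dense annotations.

    The on-disk layout below (rgb/, mask/, normals/, depth/ subfolders,
    .png and .npy files) is hypothetical.
    """

    def __init__(self, root: str):
        self.root = Path(root)
        self.ids = sorted(p.stem for p in (self.root / "rgb").glob("*.png"))

    def __len__(self):
        return len(self.ids)

    def __getitem__(self, i):
        sid = self.ids[i]
        rgb = np.asarray(Image.open(self.root / "rgb" / f"{sid}.png"))  # H×W×3 uint8
        mask = np.load(self.root / "mask" / f"{sid}.npy")               # H×W float, soft alpha in [0, 1]
        normals = np.load(self.root / "normals" / f"{sid}.npy")         # H×W×3 float, unit vectors
        depth = np.load(self.root / "depth" / f"{sid}.npy")             # H×W float
        return {"rgb": rgb, "mask": mask, "normals": normals, "depth": depth}
```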
DAViD delivers high-quality, detailed results while remaining remarkably efficient, running orders of magnitude faster than competing methods. The model reliably captures a wide range of human characteristics under diverse lighting conditions, preserving fine-grained details such as hair strands and subtle facial features, which demonstrates its robustness and accuracy in complex, real-world scenarios. DAViD uses a single model architecture to tackle all three dense prediction tasks, making it a versatile and efficient solution for a variety of computer vision applications.
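As an illustration of the single-architecture, multi-task design, the sketch below reuses one toy encoder-decoder for all three tasks, changing only the number of output channels per task. The network itself is a hypothetical stand-in, not the actual DAViD architecture:

```python
import torch
import torch.nn as nn

class DensePredictor(nn.Module):
    """One architecture shared across tasks; only the output width differs.
    A toy encoder-decoder stand-in, not the actual DAViD network."""

    def __init__(self, out_channels: int):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, out_channels, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Same architecture instantiated for the three dense prediction tasks.
depth_net = DensePredictor(out_channels=1)    # per-pixel depth
normals_net = DensePredictor(out_channels=3)  # per-pixel surface normal
matting_net = DensePredictor(out_channels=1)  # soft foreground alpha

# Input matching the paper's 384×512 image resolution.
x = torch.randn(1, 3, 512, 384)
assert depth_net(x).shape == (1, 1, 512, 384)
assert normals_net(x).shape == (1, 3, 512, 384)
```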