Style-NeRF2NeRF: 3D Style Transfer From Style-Aligned Multi-View Images
Haruo Fujiwara, Yusuke Mukuta, Tatsuya Harada
2024-06-24

Summary
This paper introduces Style-NeRF2NeRF, a method for stylizing 3D scenes represented as NeRF models. It combines 2D image diffusion models with 3D scene reconstruction to render real-world scenes in a wide range of artistic styles specified by text prompts.
What's the problem?
Applying an artistic style to a 3D scene is harder than stylizing a single image: the style has to look consistent from every viewpoint, and traditional approaches often require substantial manual effort or carefully prepared input images. This makes it difficult to experiment with different styles on 3D scenes efficiently.
What's the solution?
The researchers developed a pipeline built on a NeRF (Neural Radiance Fields) model, a 3D scene representation reconstructed from multi-view images. Given a target style prompt, a depth-conditioned image-to-image diffusion model with an attention-sharing mechanism first generates stylized versions of the training views that remain perceptually consistent with one another. The source NeRF is then fine-tuned toward these style-aligned images, guided by a sliced Wasserstein loss computed on feature maps from a pre-trained CNN, which helps preserve quality during style transfer. Because the image generation and NeRF fine-tuning steps are decoupled, users can test different prompt ideas and preview the stylized result before committing to fine-tuning.
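As an illustration of how the guidance loss works, below is a minimal PyTorch sketch of a sliced Wasserstein style loss computed on VGG-19 feature maps. It is a sketch of the general technique under stated assumptions (layer cutoff, projection count, a single feature level), not the authors' implementation.

```python
# Minimal sketch of a sliced Wasserstein style loss on VGG-19 feature maps.
# Illustrative only: the layer cutoff, number of projections, and use of a
# single feature level are assumptions made for brevity.
import torch
import torch.nn.functional as F
from torchvision.models import vgg19, VGG19_Weights

# Pre-trained CNN feature extractor, truncated at a mid-level conv block.
vgg = vgg19(weights=VGG19_Weights.DEFAULT).features[:21].eval()
for p in vgg.parameters():
    p.requires_grad_(False)

_MEAN = torch.tensor([0.485, 0.456, 0.406]).view(1, 3, 1, 1)
_STD = torch.tensor([0.229, 0.224, 0.225]).view(1, 3, 1, 1)

def sliced_wasserstein(feat_a, feat_b, num_projections=64):
    """1D sliced Wasserstein distance between two feature maps of shape (B, C, H, W)."""
    b, c, h, w = feat_a.shape
    a = feat_a.reshape(b, c, h * w)   # each spatial location is a C-dim sample
    bb = feat_b.reshape(b, c, h * w)
    # Project the C-dimensional features onto random unit directions.
    dirs = F.normalize(torch.randn(num_projections, c, device=feat_a.device), dim=1)
    proj_a = torch.einsum("pc,bcn->bpn", dirs, a)
    proj_b = torch.einsum("pc,bcn->bpn", dirs, bb)
    # Sorting solves the 1D optimal transport problem along each projection.
    return F.mse_loss(proj_a.sort(dim=-1).values, proj_b.sort(dim=-1).values)

def style_loss(rendered, stylized):
    """Loss between NeRF renders and style-aligned targets, both (B, 3, H, W) in [0, 1]."""
    mean, std = _MEAN.to(rendered.device), _STD.to(rendered.device)
    return sliced_wasserstein(vgg((rendered - mean) / std), vgg((stylized - mean) / std))
```

In use, this loss would be backpropagated through the NeRF renderer so that rendered views gradually match the feature statistics of the stylized target images.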
Why it matters?
This work is important because it simplifies the process of applying artistic styles to 3D scenes, making it more accessible to artists and developers. By allowing quick previews and adjustments before the more expensive NeRF fine-tuning step, it supports a faster and more creative workflow in digital art and design.
Abstract
We propose a simple yet effective pipeline for stylizing a 3D scene, harnessing the power of 2D image diffusion models. Given a NeRF model reconstructed from a set of multi-view images, we perform 3D style transfer by refining the source NeRF model using stylized images generated by a style-aligned image-to-image diffusion model. Given a target style prompt, we first generate perceptually similar multi-view images by leveraging a depth-conditioned diffusion model with an attention-sharing mechanism. Next, based on the stylized multi-view images, we propose to guide the style transfer process with the sliced Wasserstein loss based on the feature maps extracted from a pre-trained CNN model. Our pipeline consists of decoupled steps, allowing users to test various prompt ideas and preview the stylized 3D result before proceeding to the NeRF fine-tuning stage. We demonstrate that our method can transfer diverse artistic styles to real-world 3D scenes with competitive quality.
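To make the first stage of the pipeline concrete, here is a rough sketch of how stylized training views might be generated with a depth-conditioned image-to-image diffusion model using the diffusers library. The model ID, prompt, strength, and file paths are placeholders, and the attention-sharing mechanism that aligns styles across views is omitted; this is not the authors' released code.

```python
# Rough sketch of stylizing source training views with a depth-conditioned
# image-to-image diffusion model. Model ID, prompt, strength, and paths are
# illustrative; the paper's attention-sharing (style alignment) step that keeps
# the generated views perceptually consistent across viewpoints is omitted.
import torch
from PIL import Image
from diffusers import StableDiffusionDepth2ImgPipeline

pipe = StableDiffusionDepth2ImgPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-depth", torch_dtype=torch.float16
).to("cuda")

style_prompt = "a scene in the style of a watercolor painting"  # example target style
source_views = [Image.open(f"views/{i:03d}.png") for i in range(100)]  # hypothetical paths

stylized_views = []
for view in source_views:
    # The pipeline estimates a depth map from each source view and conditions
    # generation on it, preserving scene geometry while changing appearance.
    result = pipe(prompt=style_prompt, image=view, strength=0.7)
    stylized_views.append(result.images[0])
```

The resulting stylized views would then serve as the targets for fine-tuning the source NeRF with the style loss sketched earlier.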