TEDRA: Text-based Editing of Dynamic and Photoreal Actors
Basavaraj Sunagad, Heming Zhu, Mohit Mendiratta, Adam Kortylewski, Christian Theobalt, Marc Habermann
2024-08-29

Summary
This paper introduces TEDRA, a new method that allows users to edit 3D avatars based on text descriptions while keeping the avatars looking realistic and dynamic.
What's the problem?
Creating lifelike 3D avatars from videos of real people has become easier, but editing these avatars, especially changing their clothing styles with simple text commands, remains a challenge. Users want a way to make detailed changes without needing advanced technical skills.
What's the solution?
TEDRA solves this problem in two stages: it first trains a controllable, high-fidelity digital replica of the real actor, and then personalizes a pretrained generative diffusion model by fine-tuning it on frames of that actor captured from different camera angles. With this personalized diffusion model as a prior, the avatar can be edited from a text prompt while keeping its realism and movement intact. The editing itself relies on a personalized score distillation sampling technique and a time step annealing strategy to ensure high-quality results.
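To make the second stage concrete, below is a minimal, hedged sketch of personalizing a pretrained text-conditioned diffusion denoiser on captured frames of the actor. All names here (unet, encode_text, actor_frames, the identity token "sks", and the noise schedule) are hypothetical placeholders chosen for illustration, not TEDRA's actual interfaces; the paper's personalization procedure may differ in its prompt design, schedule, and which weights are tuned.

import torch
import torch.nn.functional as F

def personalize(unet, encode_text, actor_frames, prompt="a photo of sks person",
                steps=1000, lr=1e-5, num_train_timesteps=1000):
    """Fine-tune a pretrained diffusion denoiser on multi-view frames of the actor.

    unet(x_t, t, text_emb) is assumed to predict the noise added at timestep t;
    encode_text maps a prompt to a conditioning embedding; actor_frames is an
    indexable collection of CHW image tensors of the captured actor.
    """
    opt = torch.optim.AdamW(unet.parameters(), lr=lr)
    text_emb = encode_text(prompt)                      # identity-bound prompt embedding
    betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)  # standard DDPM noise schedule

    for step in range(steps):
        x0 = actor_frames[step % len(actor_frames)].unsqueeze(0)   # one captured frame
        t = torch.randint(0, num_train_timesteps, (1,))
        noise = torch.randn_like(x0)
        a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
        x_t = a_t.sqrt() * x0 + (1 - a_t).sqrt() * noise           # forward diffusion
        pred = unet(x_t, t, text_emb)                               # predict the added noise
        loss = F.mse_loss(pred, noise)                              # plain denoising objective
        opt.zero_grad(); loss.backward(); opt.step()

After this fine-tuning, the diffusion model "knows" the specific actor, so text-driven edits stay anchored to that person's appearance rather than drifting toward a generic identity.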
Why it matters?
This development is important because it makes it easier for anyone to customize digital avatars, which can be used in gaming, virtual reality, and online interactions. By simplifying the editing process, TEDRA opens up new possibilities for creativity and personalization in digital environments.
Abstract
Over the past years, significant progress has been made in creating photorealistic and drivable 3D avatars solely from videos of real humans. However, a core remaining challenge is the fine-grained and user-friendly editing of clothing styles by means of textual descriptions. To this end, we present TEDRA, the first method allowing text-based edits of an avatar, which maintains the avatar's high fidelity, space-time coherency, as well as dynamics, and enables skeletal pose and view control. We begin by training a model to create a controllable and high-fidelity digital replica of the real actor. Next, we personalize a pretrained generative diffusion model by fine-tuning it on various frames of the real character captured from different camera angles, ensuring the digital representation faithfully captures the dynamics and movements of the real person. This two-stage process lays the foundation for our approach to dynamic human avatar editing. Utilizing this personalized diffusion model, we modify the dynamic avatar based on a provided text prompt using our Personalized Normal Aligned Score Distillation Sampling (PNA-SDS) within a model-based guidance framework. Additionally, we propose a time step annealing strategy to ensure high-quality edits. Our results demonstrate a clear improvement over prior work in functionality and visual quality.
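As a rough illustration of how the abstract's editing loop and time step annealing fit together, here is a hedged sketch of a single score-distillation update with a linearly annealed timestep range. It shows only a generic SDS gradient with classifier-free guidance; the paper's full PNA-SDS (personalized, normal-aligned, within model-based guidance) adds components not reproduced here. unet, encode_text, render_avatar, and the annealing bounds are assumed placeholders, not the authors' implementation.

import torch

def sds_edit_step(unet, encode_text, render_avatar, avatar_params, prompt,
                  step, max_steps, t_max_hi=980, t_max_lo=200, t_min=20,
                  guidance_scale=7.5, num_train_timesteps=1000):
    """One generic SDS-style editing step on a differentiably rendered avatar."""
    betas = torch.linspace(1e-4, 2e-2, num_train_timesteps)
    alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

    # Time step annealing: shrink the upper bound of the sampled timestep so
    # early iterations make coarse structural edits and later iterations refine detail.
    frac = step / max_steps
    t_max = int(t_max_hi - frac * (t_max_hi - t_max_lo))
    t = torch.randint(t_min, t_max, (1,))

    x = render_avatar(avatar_params)                    # differentiable render of the avatar
    noise = torch.randn_like(x)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x + (1 - a_t).sqrt() * noise

    with torch.no_grad():                               # the diffusion prior stays frozen
        eps_cond = unet(x_t, t, encode_text(prompt))
        eps_uncond = unet(x_t, t, encode_text(""))
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # SDS gradient: weighted noise residual, pushed back through the renderer
    # via a surrogate loss whose gradient w.r.t. x equals w * (eps - noise).
    w = 1.0 - a_t
    grad = w * (eps - noise)
    loss = (grad.detach() * x).sum()
    return loss

In practice this loss would be backpropagated into the avatar's appearance parameters each iteration; the annealed t_max is what the abstract's time step annealing strategy controls to keep late-stage edits high quality.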