
MakeAnything: Harnessing Diffusion Transformers for Multi-Domain Procedural Sequence Generation

Yiren Song, Cheng Liu, Mike Zheng Shou

2025-02-04


Summary

This paper introduces MakeAnything, an AI system that generates step-by-step visual instructions for tasks such as crafting, cooking, and building. It uses diffusion-based generative models to keep these instructions logical and visually consistent across many different domains.

What's the problem?

Creating detailed step-by-step tutorials with AI is hard for three reasons: there is little training data for many procedural tasks, it is difficult to keep the steps logically ordered and visually consistent with one another, and models struggle to generalize across multiple domains, such as switching from crafting to cooking while staying accurate.

What's the solution?

The researchers built MakeAnything, which is based on a diffusion transformer (DiT). They also created a large dataset with over 24,000 examples of step-by-step processes covering 21 different tasks. MakeAnything uses a technique called asymmetric LoRA, which balances generalization and task-specific performance by freezing the encoder parameters while tuning the decoder layers, keeping the generated instructions consistent. Additionally, they developed ReCraft, a model that can take a single finished image and break it down into a plausible sequence of steps needed to create it.
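The asymmetric LoRA idea can be illustrated in code. Below is a minimal PyTorch sketch, not the authors' implementation: the low-rank down-projection (`lora_A`, the "encoder" side) is kept frozen so it can be shared across tasks, while only the up-projection (`lora_B`, the "decoder" side) is trained per task. The class name, rank, and layer sizes are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AsymmetricLoRALinear(nn.Module):
    """Hypothetical sketch of asymmetric LoRA: the down-projection A is
    frozen (shared, preserves generalization) while the up-projection B
    is trainable (adapts to a specific task)."""

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weights stay frozen
        # Frozen shared "encoder" side of the adapter.
        self.lora_A = nn.Parameter(
            torch.randn(rank, base.in_features) * 0.01,
            requires_grad=False)
        # Trainable task-specific "decoder" side, zero-initialized so the
        # adapter starts as a no-op.
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.lora_A.T @ self.lora_B.T

layer = AsymmetricLoRALinear(nn.Linear(16, 16), rank=4)
trainable = [n for n, p in layer.named_parameters() if p.requires_grad]
print(trainable)  # ['lora_B'] — only the decoder side receives gradients
```

Freezing `lora_A` means every task shares the same low-rank subspace, so per-task storage and tuning cost shrink to the `lora_B` matrices alone.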

Why it matters?

This research is important because it makes AI better at generating tutorials for a wide range of tasks, helping people learn new skills or complete projects more easily. It sets new standards for how AI can handle complex step-by-step processes and opens up possibilities for applications in education, design, and creative industries.

Abstract

A hallmark of human intelligence is the ability to create complex artifacts through structured multi-step processes. Generating procedural tutorials with AI is a longstanding but challenging goal, facing three key obstacles: (1) scarcity of multi-task procedural datasets, (2) maintaining logical continuity and visual consistency between steps, and (3) generalizing across multiple domains. To address these challenges, we propose a multi-domain dataset covering 21 tasks with over 24,000 procedural sequences. Building upon this foundation, we introduce MakeAnything, a framework based on the diffusion transformer (DIT), which leverages fine-tuning to activate the in-context capabilities of DIT for generating consistent procedural sequences. We introduce asymmetric low-rank adaptation (LoRA) for image generation, which balances generalization capabilities and task-specific performance by freezing encoder parameters while adaptively tuning decoder layers. Additionally, our ReCraft model enables image-to-process generation through spatiotemporal consistency constraints, allowing static images to be decomposed into plausible creation sequences. Extensive experiments demonstrate that MakeAnything surpasses existing methods, setting new performance benchmarks for procedural generation tasks.
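One common way to let a diffusion transformer attend across the steps of a procedure "in context" is to tile the step frames into a single grid image; the abstract does not specify the layout, so the 3x3 arrangement and frame size below are assumptions for illustration, not details from the paper.

```python
import numpy as np

def frames_to_grid(frames, rows, cols):
    """Tile equally sized step frames of shape (H, W, C) into one
    rows x cols grid image, in reading order (left-to-right, top-down)."""
    h, w, c = frames[0].shape
    grid = np.zeros((rows * h, cols * w, c), dtype=frames[0].dtype)
    for i, frame in enumerate(frames):
        r, col = divmod(i, cols)
        grid[r * h:(r + 1) * h, col * w:(col + 1) * w] = frame
    return grid

# Nine 64x64 RGB step frames become one 192x192 sequence image.
frames = [np.full((64, 64, 3), i, dtype=np.uint8) for i in range(9)]
grid = frames_to_grid(frames, rows=3, cols=3)
print(grid.shape)  # (192, 192, 3)
```

Treating the whole sequence as one image lets the model's self-attention enforce consistency between steps without any architectural change.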