CookAnything: A Framework for Flexible and Consistent Multi-Step Recipe Image Generation

Ruoxuan Zhang, Bin Wen, Hongxia Xie, Yi Yao, Songhan Zuo, Jian-Yu Jiang-Lin, Hong-Han Shuai, Wen-Huang Cheng

2025-12-04

Summary

This paper introduces CookAnything, a new system that creates a series of images to visually show how to follow a recipe, no matter how long or complex it is.

What's the problem?

Current AI image generators are very good at making pictures from text, but they struggle with tasks that require multiple steps in a specific order, such as illustrating a recipe. Existing methods also generate a fixed number of images, which doesn't work well for recipes of varying length: some recipes need more steps shown than others.

What's the solution?

The researchers developed CookAnything, which builds on a type of AI called a diffusion model. It works in three main ways: first, it connects each step of the recipe to a specific region of the generated image (Step-wise Regional Control); second, it uses a step-aware positional encoding (Flexible RoPE) to make sure the images flow together logically while still looking different enough from each other; and third, it keeps ingredients looking consistent across the different steps of the recipe (Cross-Step Consistency Control). In short, it carefully controls the image generation process to match the recipe's instructions.
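The first idea, tying each recipe step to its own image region, can be sketched as a block-diagonal cross-attention mask: image tokens in region i may attend only to the text tokens of step i. This is a minimal illustrative sketch of the concept; the function name and masking scheme are assumptions, not the paper's actual implementation.

```python
import numpy as np

def stepwise_region_mask(num_steps, tokens_per_region, tokens_per_step):
    """Build a boolean cross-attention mask so that image tokens in
    region i attend only to the text tokens of recipe step i.
    (Illustrative sketch of the Step-wise Regional Control idea.)"""
    n_img = num_steps * tokens_per_region   # total image tokens
    n_txt = num_steps * tokens_per_step     # total text tokens
    mask = np.zeros((n_img, n_txt), dtype=bool)
    for i in range(num_steps):
        r0, r1 = i * tokens_per_region, (i + 1) * tokens_per_region
        t0, t1 = i * tokens_per_step, (i + 1) * tokens_per_step
        mask[r0:r1, t0:t1] = True  # region i <- step i's text only
    return mask

# Three steps, each with a 4-token image region and a 2-token caption:
mask = stepwise_region_mask(num_steps=3, tokens_per_region=4, tokens_per_step=2)
```

In a real diffusion model this mask would be applied inside the cross-attention layers during a single denoising pass, so all step images are generated together while each region stays grounded in its own instruction.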

Why it matters?

This work is important because it allows for the automatic creation of clear, step-by-step visual guides for recipes and other instructions. This could be used to make instructional videos, educational materials, or tools that help people learn new skills. It is also a step forward in getting AI to understand and visually represent complex, ordered processes.

Abstract

Cooking is a sequential and visually grounded activity, where each step, such as chopping, mixing, or frying, carries both procedural logic and visual semantics. While recent diffusion models have shown strong capabilities in text-to-image generation, they struggle to handle structured multi-step scenarios like recipe illustration. Additionally, current recipe illustration methods are unable to adjust to the natural variability in recipe length, generating a fixed number of images regardless of the actual instruction structure. To address these limitations, we present CookAnything, a flexible and consistent diffusion-based framework that generates coherent, semantically distinct image sequences from textual cooking instructions of arbitrary length. The framework introduces three key components: (1) Step-wise Regional Control (SRC), which aligns textual steps with corresponding image regions within a single denoising process; (2) Flexible RoPE, a step-aware positional encoding mechanism that enhances both temporal coherence and spatial diversity; and (3) Cross-Step Consistency Control (CSCC), which maintains fine-grained ingredient consistency across steps. Experimental results on recipe illustration benchmarks show that CookAnything outperforms existing methods in both training-based and training-free settings. The proposed framework supports scalable, high-quality visual synthesis of complex multi-step instructions and holds significant potential for broad applications in instructional media and procedural content creation.