
WiT: Waypoint Diffusion Transformers via Trajectory Conflict Navigation

Hainuo Wang, Mingjia Li, Xiaojie Guo

2026-03-18


Summary

This paper introduces Waypoint Diffusion Transformers (WiT), a new method for generating images directly in pixel space. It improves on existing pixel-space techniques by untangling the conflicting generation trajectories that slow down training and degrade results.

What's the problem?

Image generation models that operate directly on pixels, rather than on a compressed latent representation, struggle because the paths the model follows from noise to image become tangled and inefficient. Imagine navigating a crowded intersection: there are many possible routes, and they conflict with one another. Because pixel space lacks the semantic structure of latent spaces, the model has trouble maintaining a consistent 'understanding' of what it is creating as it moves from initial noise to the final image, leading to slower training and weaker results.

What's the solution?

WiT solves this by introducing 'waypoints': intermediate landmarks that guide the image generation process. These waypoints are semantic representations projected from a pre-trained vision model, and they break image creation into two segments: going from the starting noise to the waypoint, and then from the waypoint to the final pixel image. During the iterative denoising process, a lightweight generator infers the waypoints on the fly from the current noisy state, and a conditioning mechanism called 'Just-Pixel AdaLN' continuously steers the main diffusion transformer based on them, making generation more focused and efficient.
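The loop described above can be sketched in a toy form. This is a minimal illustration, not the paper's implementation: the waypoint generator, the AdaLN modulation weights, and the stand-in velocity network are all placeholders with random weights, and the dimensions are made up for readability.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: a 64-value "image" and a 16-dim waypoint embedding.
D_PIX, D_EMB = 64, 16

# Lightweight waypoint generator: a single linear layer standing in for
# the paper's waypoint predictor (weights are random placeholders).
W_gen = rng.normal(scale=0.1, size=(D_PIX + 1, D_EMB))

def infer_waypoint(x_t, t):
    """Predict a semantic waypoint embedding from the current noisy state."""
    inp = np.concatenate([x_t, [t]])
    return inp @ W_gen

# AdaLN-style conditioning: derive a per-feature scale and shift from the
# waypoint embedding and apply them after layer normalization.
W_mod = rng.normal(scale=0.1, size=(D_EMB, 2 * D_PIX))

def adaln(h, w_emb):
    h_norm = (h - h.mean()) / (h.std() + 1e-6)
    scale, shift = np.split(w_emb @ W_mod, 2)
    return h_norm * (1.0 + scale) + shift

def velocity(h):
    """Placeholder for the main diffusion transformer's velocity output."""
    return -0.1 * h

# Toy denoising loop: at every step the waypoint is re-inferred from the
# current state and steers the velocity network via AdaLN conditioning.
x = rng.normal(size=D_PIX)          # start from pure noise
for step in range(10):
    t = step / 10.0
    w = infer_waypoint(x, t)        # dynamically inferred waypoint
    h = adaln(x, w)                 # condition the state on the waypoint
    x = x + 0.1 * velocity(h)       # Euler step along the vector field

print(x.shape)
```

The key structural point this sketch captures is that the waypoint is recomputed at every denoising step from the current noisy state, rather than fixed once at the start.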

Why it matters?

This research is important because it significantly speeds up the training of pixel-space image generators. WiT is shown to accelerate the training convergence of the JiT pixel-space baseline by 2.2x while still producing high-quality images on ImageNet 256x256. Faster, more efficient pixel-space generation benefits many applications, such as art generation, image editing, and scientific visualization.

Abstract

While recent Flow Matching models avoid the reconstruction bottlenecks of latent autoencoders by operating directly in pixel space, the lack of semantic continuity in the pixel manifold severely intertwines optimal transport paths. This induces severe trajectory conflicts near intersections, yielding sub-optimal solutions. Rather than bypassing this issue via information-lossy latent representations, we directly untangle the pixel-space trajectories by proposing Waypoint Diffusion Transformers (WiT). WiT factorizes the continuous vector field via intermediate semantic waypoints projected from pre-trained vision models. It effectively disentangles the generation trajectories by breaking the optimal transport into prior-to-waypoint and waypoint-to-pixel segments. Specifically, during the iterative denoising process, a lightweight generator dynamically infers these intermediate waypoints from the current noisy state. They then continuously condition the primary diffusion transformer via the Just-Pixel AdaLN mechanism, steering the evolution towards the next state, ultimately yielding the final RGB pixels. Evaluated on ImageNet 256x256, WiT beats strong pixel-space baselines, accelerating JiT training convergence by 2.2x. Code will be publicly released at https://github.com/hainuo-wang/WiT.git.
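One way to read the factorization the abstract describes, in flow-matching terms. The notation here is ours, not the paper's: $g_\phi$ is the lightweight waypoint generator, $f_\theta$ the primary diffusion transformer, and $P$ the projection onto features of the frozen pre-trained vision model.

```latex
% Illustrative notation only, not taken from the paper.
\begin{aligned}
  w_t &= g_\phi(x_t, t) \in \operatorname{range}(P)
      && \text{(waypoint inferred from the current noisy state)} \\[2pt]
  \frac{\mathrm{d}x_t}{\mathrm{d}t} &= f_\theta(x_t, t \mid w_t)
      && \text{(waypoint-conditioned velocity field)} \\[2pt]
  x_0 &\;\longrightarrow\; w \;\longrightarrow\; x_1
      && \text{(prior-to-waypoint, then waypoint-to-pixel)}
\end{aligned}
```

Conditioning the velocity field on $w_t$ is what lets the two transport segments be learned separately, which is the mechanism the abstract credits for disentangling trajectories near intersections.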