The method first derives an appearance-debiased temporal representation by measuring the distance between the latents of consecutive frames, which exposes the implicit temporal structure predicted by the model. FlowMo then estimates motion coherence as the patch-wise variance of this representation across the temporal dimension and guides the model to reduce that variance dynamically during sampling. This approach has been shown to significantly improve motion coherence without sacrificing visual quality or prompt alignment.
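To make the mechanism concrete, below is a minimal sketch of the variance-based guidance idea in PyTorch, assuming latents shaped (batch, channels, frames, height, width). The function names (`temporal_variance_loss`, `guided_step`) and parameters such as `patch_size` and `guidance_scale` are illustrative assumptions, not FlowMo's actual API; in the real method the loss would be wired into the sampler's denoising loop.

```python
# Illustrative sketch only: names and shapes are assumptions, not FlowMo's API.
import torch

def temporal_variance_loss(latents: torch.Tensor, patch_size: int = 2) -> torch.Tensor:
    """Patch-wise temporal variance of consecutive-frame latent differences."""
    # Appearance-debiased representation: distance between consecutive frames.
    diffs = latents[:, :, 1:] - latents[:, :, :-1]            # (B, C, T-1, H, W)
    # Split each frame difference into non-overlapping spatial patches.
    patches = diffs.unfold(3, patch_size, patch_size).unfold(4, patch_size, patch_size)
    # -> (B, C, T-1, H/p, W/p, p, p); pool each patch to a single magnitude.
    patch_mag = patches.abs().mean(dim=(-1, -2))              # (B, C, T-1, H/p, W/p)
    # Variance across the temporal dimension, averaged over all patches.
    return patch_mag.var(dim=2).mean()

def guided_step(latents: torch.Tensor, guidance_scale: float = 1.0) -> torch.Tensor:
    """One gradient step nudging the latents toward lower temporal variance."""
    latents = latents.detach().requires_grad_(True)
    loss = temporal_variance_loss(latents)
    grad, = torch.autograd.grad(loss, latents)
    return (latents - guidance_scale * grad).detach()

# Example usage on dummy latents: 1 sample, 4 channels, 16 frames, 32x32 spatial.
latents = torch.randn(1, 4, 16, 32, 32)
latents = guided_step(latents, guidance_scale=0.5)
```

In a real sampler this step would presumably run at selected denoising steps on the model's intermediate latents, so the variance signal steers generation dynamically rather than being applied once after the fact.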
FlowMo has been evaluated on multiple text-to-video models, demonstrating its effectiveness in enhancing motion coherence. Qualitative comparisons against the base models and other methods, such as FreeInit, show that FlowMo produces more coherent and realistic motion. The method is easy to implement and works as a plug-and-play solution for enhancing the temporal fidelity of pre-trained video diffusion models, making it a practical tool for video generation applications.