Built on a novel Asymmetric Diffusion Transformer architecture, Mochi 1 is, at 10 billion parameters trained from scratch, the largest openly released video generative model. It employs an efficient video VAE that compresses video data substantially, allowing the model to run effectively in community environments. The architecture balances text and visual processing through multi-modal self-attention, so that generated videos are both visually compelling and contextually accurate. Mochi 1 encodes prompts with a single T5-XXL language model, supporting complex reasoning over long video token contexts, and uses 3D attention to capture both the spatial and temporal dimensions of video.
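To make the multi-modal self-attention idea concrete, here is a minimal NumPy sketch of joint attention over two streams of different widths. All dimensions, weights, and names are illustrative assumptions, not Mochi 1's actual configuration: the point is only that each modality keeps its own projection weights (the asymmetric part) while every token attends over the concatenation of text and video tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)

# Illustrative sizes (not Mochi 1's real dimensions): a wider visual
# stream and a narrower text stream, projected to a shared head size.
n_vid, d_vid = 16, 64   # video tokens (flattened space-time patches)
n_txt, d_txt = 4, 32    # text tokens from the prompt encoder
d_head = 24             # shared attention dimension

vid = rng.standard_normal((n_vid, d_vid))
txt = rng.standard_normal((n_txt, d_txt))

# Per-modality Q/K/V projections: each stream has its own width and
# its own weights, which is what "asymmetric" refers to here.
Wq_v, Wk_v, Wv_v = (rng.standard_normal((d_vid, d_head)) for _ in range(3))
Wq_t, Wk_t, Wv_t = (rng.standard_normal((d_txt, d_head)) for _ in range(3))

# Joint self-attention: queries, keys, and values from both streams
# are concatenated, so every token attends over text and video alike.
q = np.concatenate([vid @ Wq_v, txt @ Wq_t])
k = np.concatenate([vid @ Wk_v, txt @ Wk_t])
v = np.concatenate([vid @ Wv_v, txt @ Wv_t])

attn = softmax(q @ k.T / np.sqrt(d_head))
out = attn @ v
print(out.shape)  # one fused output row per token: (20, 24)
```

In a real 3D-attention setup the video tokens would carry space-time positional information so that attention can resolve where and when each patch occurs; that bookkeeping is omitted here for brevity.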
Mochi 1 is freely accessible through a hosted playground where users can generate videos from their own prompts at 480p resolution. Genmo plans to release an HD version later, supporting 720p with enhanced fidelity and smoother motion. The project is positioned as a research and creative tool with applications in entertainment, advertising, education, robotics, and synthetic data generation. Ongoing development aims to improve image-to-video capabilities and fine-grained control over output styling. Mochi 1 underscores Genmo's mission to advance AI-driven creativity by enabling open research and fostering a community around video generation technologies.