Inferix: A Block-Diffusion based Next-Generation Inference Engine for World Simulation
Inferix Team, Tianyu Feng, Yizeng Han, Jiahao He, Yuanyu He, Xi Lin, Teng Liu, Hanfeng Lu, Jiasheng Tang, Wei Wang, Zhiyuan Wang, Jichao Wu, Mingyang Yang, Yinghao Yu, Zeyu Zhang, Bohan Zhuang
2025-11-27
Summary
This paper introduces Inferix, a new system designed to efficiently create and interact with realistic, long-form videos, which are essentially simulated worlds. It focuses on improving how these 'world models' are built and used, moving beyond current approaches that rely primarily on large language models for vision tasks.
What's the problem?
Creating high-quality, long videos that look realistic and respond logically to interactions is really hard. Existing methods either struggle to maintain consistency over time, run too slowly, or can't handle videos of significant length. Standard video generation techniques, like diffusion models, hit limits when generating extended, coherent sequences, and current inference systems aren't optimized for the specific demands of building these complex simulated worlds.
What's the solution?
The researchers developed Inferix, which uses a technique called 'semi-autoregressive decoding': the video is built block by block, with each new block conditioned on what came before so that everything flows together smoothly, much like writing a story one passage at a time while building on earlier sentences. Inferix also includes tools for streaming these videos in real time, profiling their performance, and benchmarking them against a new standard called LV-Bench, which specifically tests long video generation. A rough sketch of the decoding loop appears below.
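To make the idea concrete, here is a minimal, toy sketch of semi-autoregressive decoding: diffusion runs within each block, while a growing cache of previously decoded blocks provides the conditioning context. All names (`denoise_step`, `decode_block`, `generate`), shapes, and constants are illustrative assumptions, not Inferix's actual API, and the denoiser is a stand-in rather than a real model.

```python
# Toy sketch of semi-autoregressive (block-diffusion) decoding.
# Hypothetical names and shapes; NOT Inferix's real interface.
import torch

BLOCK_SIZE = 16   # latent frames per block (toy value)
NUM_STEPS = 4     # diffusion denoising steps per block (toy value)
DIM = 8           # latent channel dimension (toy value)

def denoise_step(x_t, step, kv_cache):
    """Stand-in for one denoiser forward pass. A real model would attend
    to kv_cache (keys/values from previously decoded blocks) here."""
    context = kv_cache[-1].mean() if kv_cache else torch.zeros(())
    return x_t * 0.5 + context * 0.1  # toy update, not a real denoiser

def decode_block(kv_cache):
    """Run the full diffusion loop for one block, conditioned on the cache."""
    x = torch.randn(BLOCK_SIZE, DIM)         # start from pure noise
    for step in reversed(range(NUM_STEPS)):  # iterative denoising
        x = denoise_step(x, step, kv_cache)
    return x

def generate(num_blocks):
    kv_cache = []                      # grows block by block, LLM-style
    video = []
    for _ in range(num_blocks):
        block = decode_block(kv_cache) # diffusion *within* the block
        kv_cache.append(block.detach())# cache clean latents as context
        video.append(block)
    return torch.cat(video)            # variable-length output

latents = generate(num_blocks=3)
print(latents.shape)  # torch.Size([48, 8])
```

The key structural point is the hybrid: the outer loop is autoregressive over blocks (like an LLM consuming its own output), while the inner loop is a diffusion process, which is what lets each block stay conditioned on everything generated so far.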
Why it matters?
This work is important because it could unlock more advanced AI systems that can truly understand and interact with the visual world. Instead of just recognizing objects in images, these 'world models' could allow AI to predict what will happen next, plan actions, and learn from experience in a simulated environment. This has huge potential for things like robotics, game development, and creating more intelligent virtual assistants, and represents a shift away from relying solely on language models for visual understanding.
Abstract
World models serve as core simulators for fields such as agentic AI, embodied AI, and gaming, capable of generating long, physically realistic, and interactive high-quality videos. Moreover, scaling these models could unlock emergent capabilities in visual perception, understanding, and reasoning, paving the way for a new paradigm that moves beyond current LLM-centric vision foundation models. A key breakthrough empowering them is the semi-autoregressive (block-diffusion) decoding paradigm, which merges the strengths of diffusion and autoregressive methods by generating video tokens in blocks, applying diffusion within each block while conditioning on previous ones, resulting in more coherent and stable video sequences. Crucially, it overcomes limitations of standard video diffusion by reintroducing LLM-style KV cache management, enabling efficient, variable-length, and high-quality generation. Therefore, Inferix is specifically designed as a next-generation inference engine to enable immersive world synthesis through optimized semi-autoregressive decoding processes. This dedicated focus on world simulation distinctly sets it apart from systems engineered for high-concurrency LLM serving (like vLLM or SGLang) and from classic video diffusion engines (such as xDiT). Inferix further enhances its offering with interactive video streaming and profiling, enabling real-time interaction and realistic simulation to accurately model world dynamics. Additionally, it supports efficient benchmarking through seamless integration of LV-Bench, a new fine-grained evaluation benchmark tailored for minute-long video generation scenarios. We hope the community will work together to advance Inferix and foster world model exploration.
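The abstract's point about LLM-style KV cache management is that minute-long generation is only feasible if memory for past blocks is bounded. Below is a minimal sketch of one plausible policy, a sliding window over past blocks; the class name `BlockKVCache` and the eviction strategy are assumptions for illustration, since the abstract does not specify Inferix's actual cache policy.

```python
# Sketch of bounded, LLM-style KV cache management for long rollouts.
# Sliding-window eviction is an illustrative assumption, not Inferix's
# documented strategy.
from collections import deque

class BlockKVCache:
    """Keeps keys/values for at most `max_blocks` past blocks in memory."""
    def __init__(self, max_blocks):
        self.blocks = deque(maxlen=max_blocks)  # oldest block evicted first

    def append(self, kv):
        self.blocks.append(kv)    # O(1) insert; deque handles eviction

    def context(self):
        return list(self.blocks)  # what the denoiser attends to next

cache = BlockKVCache(max_blocks=4)
for i in range(10):               # 10 generated blocks, only the last 4 kept
    cache.append(f"kv_block_{i}")
print(cache.context())            # ['kv_block_6', ..., 'kv_block_9']
```

Bounding the cache this way keeps per-block decoding cost roughly constant regardless of how long the video grows, which is what makes variable-length, minute-scale generation tractable.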