Epona's world model uses a multimodal spatiotemporal transformer to encode historical driving context, then conditions two diffusion-transformer (DiT) heads on that context: a next-frame prediction DiT that generates the frame at timestep T+1, and a trajectory planning DiT that forecasts the pose trajectory over the next N frames. By adopting a chain-of-forward strategy, Epona produces high-quality, long-horizon video in an autoregressive manner. This design supports minutes-long video generation, trajectory-controlled generation, and generalization to diverse driving scenes.
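To make the three-part layout concrete, here is a minimal PyTorch sketch of this structure. It is an illustration under stated assumptions, not the authors' implementation: the class name `EponaSketch`, the pooling step, and the linear stand-ins for the two DiT heads (a real DiT would also take a noised target and a diffusion timestep) are all hypothetical.

```python
import torch
import torch.nn as nn

class EponaSketch(nn.Module):
    """Illustrative skeleton: a spatiotemporal context encoder feeding
    two prediction heads (next frame and future trajectory)."""

    def __init__(self, dim=512, n_traj_frames=8, pose_dim=3):
        super().__init__()
        self.n_traj_frames, self.pose_dim = n_traj_frames, pose_dim
        # Multimodal spatiotemporal transformer: fuses tokens from past
        # frames and poses into a shared driving-context representation.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.context_encoder = nn.TransformerEncoder(layer, num_layers=6)
        # Stand-ins for the two DiT heads conditioned on the context.
        self.next_frame_head = nn.Linear(dim, dim)  # frame at T+1 (latent)
        self.traj_head = nn.Linear(dim, n_traj_frames * pose_dim)  # N poses

    def forward(self, history_tokens):
        # history_tokens: (batch, seq_len, dim) multimodal history
        ctx = self.context_encoder(history_tokens)
        summary = ctx.mean(dim=1)  # pooled conditioning signal
        next_frame_latent = self.next_frame_head(summary)
        future_poses = self.traj_head(summary).view(
            -1, self.n_traj_frames, self.pose_dim)
        return next_frame_latent, future_poses
```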
Epona's experimental results demonstrate state-of-the-art performance, with a 7.4% FVD improvement and prediction horizons minutes longer than prior work. The learned world model also serves as a real-time motion planner, outperforming strong end-to-end planners on the NAVSIM benchmark. Its ability to capture real-world traffic knowledge and predict future trajectories makes it a promising component for autonomous driving systems, and its modular architecture and chain-of-forward training strategy underpin the high-quality, long-horizon video generation that makes it a valuable tool for researchers and developers.
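A sketch of how the autoregressive, chain-of-forward rollout could look at inference time, assuming a model with the interface sketched above: each newly generated frame latent is appended to a sliding context window and fed back in, which is what lets the horizon extend to minutes. The function name `rollout` and the fixed-length-window choice are illustrative assumptions.

```python
import torch

@torch.no_grad()
def rollout(model, history_tokens, n_steps=60):
    """Illustrative chain-of-forward rollout: generate one frame at a
    time, feeding each prediction back into the context window."""
    frames, trajectories = [], []
    context = history_tokens  # (batch, seq_len, dim)
    for _ in range(n_steps):
        next_frame, future_poses = model(context)
        frames.append(next_frame)
        trajectories.append(future_poses)
        # Slide the window: drop the oldest token and append the new
        # frame latent, keeping a fixed context length for long rollouts.
        context = torch.cat([context[:, 1:], next_frame.unsqueeze(1)], dim=1)
    return torch.stack(frames, dim=1), trajectories
```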