OmniLayout: Enabling Coarse-to-Fine Learning with LLMs for Universal Document Layout Generation
Hengrui Kang, Zhuangcheng Gu, Zhiyuan Zhao, Zichen Wen, Bin Wang, Weijia Li, Conghui He
2025-10-31
Summary
This paper focuses on creating realistic document layouts using artificial intelligence, specifically a system that can *generate* how a document looks rather than just understand existing layouts.
What's the problem?
Current AI systems are good at analyzing the layout of documents, like figuring out where text and images are on a page, but they struggle to *create* new, diverse layouts. Most of the data used to train these systems focuses on very structured documents like academic papers, meaning they aren't very good at handling the more varied layouts found in things like newspapers or magazines. There's a lack of large, diverse datasets to train these systems effectively, and existing methods have trouble with complex layouts and long documents.
What's the solution?
The researchers created a new dataset called OmniLayout-1M, which contains one million examples of document layouts spanning six common document types. They also developed a new AI model, OmniLayout-LLM, which uses a two-stage learning process: first it learns universal layout principles from the large OmniLayout-1M dataset using coarse category labels, and then it transfers that knowledge to a specific document type using fine-grained annotations. Despite being relatively small at 0.5 billion parameters, the model outperforms both specialized layout-generation models and several recent general-purpose LLMs.
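To make the coarse-to-fine idea concrete, here is a minimal sketch of how a layout might be serialized into a text sequence an LLM can model, with coarse labels for stage-1 pretraining and fine-grained labels for stage-2 fine-tuning. The category names, coordinate grid, and serialization format are illustrative assumptions, not the paper's exact scheme.

```python
# Hypothetical coarse-to-fine layout serialization.
# Category names and format are assumptions for illustration only.

# Stage 2: fine-grained categories for one domain (e.g. newspapers),
# each mapped back to a coarse parent used in stage-1 pretraining.
FINE_TO_COARSE = {
    "headline": "title",
    "byline": "text",
    "caption": "text",
    "photo": "figure",
}

def serialize_layout(elements, coarse=True):
    """Serialize a layout as a plain-text sequence an LLM can model.

    Each element is (category, x, y, w, h) with coordinates on a
    0-999 grid; coarse=True collapses fine labels to their coarse
    parents, mimicking the stage-1 view of the data.
    """
    tokens = []
    for cat, x, y, w, h in elements:
        label = FINE_TO_COARSE.get(cat, cat) if coarse else cat
        tokens.append(f"{label} {x} {y} {w} {h}")
    return " | ".join(tokens)

layout = [("headline", 40, 20, 900, 80), ("photo", 40, 120, 440, 300)]
print(serialize_layout(layout, coarse=True))
# → title 40 20 900 80 | figure 40 120 440 300
print(serialize_layout(layout, coarse=False))
# → headline 40 20 900 80 | photo 40 120 440 300
```

The same layout thus yields two training views: a generic one shared across all six document types, and a domain-specific one used only during fine-tuning.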
Why it matters?
This work is important because it pushes the field of Document AI forward by tackling the challenging problem of document layout generation. By creating a large, diverse dataset and a new model, the researchers have significantly improved the ability of AI to create realistic and varied document layouts, which could be useful for automating document creation, improving accessibility, and more.
Abstract
Document AI has advanced rapidly and is attracting increasing attention. Yet, while most efforts have focused on document layout analysis (DLA), its generative counterpart, document layout generation, remains underexplored. A major obstacle lies in the scarcity of diverse layouts: academic papers with Manhattan-style structures dominate existing studies, while open-world genres such as newspapers and magazines remain severely underrepresented. To address this gap, we curate OmniLayout-1M, the first million-scale dataset of diverse document layouts, covering six common document types and comprising contemporary layouts collected from multiple sources. Moreover, since existing methods struggle in complex domains and often fail to arrange long sequences coherently, we introduce OmniLayout-LLM, a 0.5B model with a designed two-stage Coarse-to-Fine learning paradigm: 1) learning universal layout principles from OmniLayout-1M with coarse category definitions, and 2) transferring the knowledge to a specific domain with fine-grained annotations. Extensive experiments demonstrate that our approach achieves strong performance on multiple domains in the M^{6}Doc dataset, substantially surpassing both existing layout generation experts and several of the latest general-purpose LLMs. Our code, models, and dataset will be publicly released.