
Visual Generation Without Guidance

Huayu Chen, Kai Jiang, Kaiwen Zheng, Jianfei Chen, Hang Su, Jun Zhu

2025-01-28


Summary

This paper introduces Guidance-Free Training (GFT), a way to train visual generative models that produce high-quality images on their own, without the extra "guidance" step that most current systems rely on. Because a GFT model does not need to run a second, unconditional model at every sampling step, it generates images with roughly half the computation while matching the quality of Classifier-Free Guidance (CFG).

What's the problem?

Classifier-Free Guidance (CFG) has become the default trick for making conditional image generators produce sharp outputs that match the prompt, but it requires running both a conditional and an unconditional model at every sampling step, which doubles the cost of generating each image. Previous attempts to remove guidance work by distilling a pretrained CFG network, so they add an extra training stage and still depend on having a guided model first.

What's the solution?

The researchers propose Guidance-Free Training (GFT), which trains a single model directly from scratch so that its raw outputs already behave like CFG-guided ones. GFT keeps the same maximum likelihood objective as CFG and changes only how the conditional model is parameterized, so it needs just minimal modifications to existing codebases and inherits most design choices and hyperparameters from CFG. At sampling time only one network is evaluated, halving compute, while offering a diversity-fidelity trade-off similar to CFG's; a simplified sketch of the sampling-time difference is shown below. Across five visual models spanning diffusion, autoregressive, and masked-prediction generation, GFT matches or even improves on CFG's FID scores.
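
To make the cost difference concrete, here is a minimal sketch (not the authors' code) of one denoising step under each approach; `model`, `gft_model`, and the guidance scale `w` are illustrative placeholders:

```python
def cfg_step(model, x_t, t, cond, w):
    """Classifier-Free Guidance: two forward passes per denoising step."""
    eps_cond = model(x_t, t, cond)      # conditional prediction
    eps_uncond = model(x_t, t, None)    # unconditional prediction
    # Standard CFG combination: w = 1 recovers the plain conditional model,
    # w > 1 pushes the prediction further toward the condition.
    return eps_uncond + w * (eps_cond - eps_uncond)

def guidance_free_step(gft_model, x_t, t, cond):
    """Guidance-free sampling: one forward pass, so roughly half the compute."""
    return gft_model(x_t, t, cond)
```

With CFG, every step calls the network twice; the guidance-free model calls it once, which is where the halved sampling cost reported in the paper comes from.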

Why it matters?

This research matters because guidance effectively doubles the cost of every image produced by today's diffusion, autoregressive, and masked-prediction generators. A model that reaches the same quality without guidance makes image generation cheaper and simpler to deploy, and because GFT trains from scratch rather than distilling an already-trained guided model, it can slot into standard training pipelines with only small changes. That combination of lower inference cost and easy adoption could make high-quality visual generation more practical across many applications.

Abstract

Classifier-Free Guidance (CFG) has been a default technique in various visual generative models, yet it requires inference from both conditional and unconditional models during sampling. We propose to build visual models that are free from guided sampling. The resulting algorithm, Guidance-Free Training (GFT), matches the performance of CFG while reducing sampling to a single model, halving the computational cost. Unlike previous distillation-based approaches that rely on pretrained CFG networks, GFT enables training directly from scratch. GFT is simple to implement. It retains the same maximum likelihood objective as CFG and differs mainly in the parameterization of conditional models. Implementing GFT requires only minimal modifications to existing codebases, as most design choices and hyperparameters are directly inherited from CFG. Our extensive experiments across five distinct visual models demonstrate the effectiveness and versatility of GFT. Across domains of diffusion, autoregressive, and masked-prediction modeling, GFT consistently achieves comparable or even lower FID scores, with similar diversity-fidelity trade-offs compared with CFG baselines, all while being guidance-free. Code will be available at https://github.com/thu-ml/GFT.
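
The abstract's key technical point is that GFT keeps CFG's maximum likelihood objective and changes only how the conditional model is parameterized. As a loose illustration of what such a reparameterization can look like (an assumption-laden sketch, not the paper's exact formulation; `sampling_model`, `uncond_model`, `beta`, and the noise schedule are all placeholders), one can define the conditional prediction as a mix of a trainable guidance-free sampling model and an unconditional prediction, and apply the usual denoising loss to that mix:

```python
import torch
import torch.nn.functional as F

def gft_style_training_step(sampling_model, uncond_model, x0, cond, beta=0.5):
    """Illustrative denoising training step: the conditional prediction is
    defined implicitly through the guidance-free sampling model."""
    t = torch.rand(x0.shape[0], device=x0.device)       # random timesteps in [0, 1)
    noise = torch.randn_like(x0)
    # Illustrative variance-preserving noise schedule (not the paper's choice).
    a = torch.cos(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    s = torch.sin(0.5 * torch.pi * t).view(-1, 1, 1, 1)
    x_t = a * x0 + s * noise

    eps_s = sampling_model(x_t, t, cond)        # the only network used at sampling time
    with torch.no_grad():
        eps_u = uncond_model(x_t, t, None)      # unconditional prediction (no gradient)
    # Implicit conditional prediction: a convex combination of the two.
    eps_c = beta * eps_s + (1.0 - beta) * eps_u
    # Same standard noise-prediction objective as a CFG-trained model.
    return F.mse_loss(eps_c, noise)
```

The point of such a setup is that the ordinary training objective is untouched; only the way the conditional prediction is assembled changes, so the single `sampling_model` can be used on its own at inference time. For the actual formulation and hyperparameters, see the paper and the released code.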