From Data to Rewards: a Bilevel Optimization Perspective on Maximum Likelihood Estimation
Abdelhakim Benechehab, Gabriel Singer, Corentin Léger, Youssef Attia El Hili, Giuseppe Paolo, Albert Thomas, Maurizio Filippone, Balázs Kégl
2025-10-14
Summary
This paper explores a new way to train generative models, the engines behind applications such as realistic image creation and text generation. It focuses on improving how these models learn and on avoiding problems like forgetting previously learned information.
What's the problem?
Traditionally, generative models are trained using a method called Maximum Likelihood Estimation, but this method doesn't always lead to the best results, especially when it comes to adapting to new situations or retaining what the model has already learned. Another approach, Reinforcement Learning, guides the model with 'rewards' (like scores in a video game) and tends to work better, but it requires someone to *define* those rewards, which isn't possible when all you have is a dataset of good examples. So the core issue is how to train these models effectively when you only have good data, not explicit instructions on what counts as 'good' output.
What's the solution?
The researchers propose a technique called Bilevel Optimization. Think of it as two levels of learning: the outer level figures out what 'reward' the model *should* be aiming for, and the inner level trains the model to actually achieve that reward. This way, the model learns what's good without a human having to specify it directly. The researchers also analyze the method theoretically in a simplified, tractable setting and show that the resulting insights carry over to applications such as tabular classification and model-based reinforcement learning.
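To make the two levels concrete, here is a minimal, self-contained sketch in a toy setting of our own construction (a categorical "generative model" over four outcomes; the names, hyperparameters, and finite-difference outer step are illustrative assumptions, not the paper's algorithm). The inner level runs policy-gradient ascent on an entropy-regularized expected reward, and the outer level adjusts the reward vector so that the inner-optimized policy assigns high likelihood to the data:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "dataset": samples from an unknown categorical distribution over K outcomes.
K = 4
data = rng.choice(K, size=1000, p=[0.5, 0.3, 0.15, 0.05])
data_dist = np.bincount(data, minlength=K) / len(data)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def inner_solve(reward, steps=200, lr=0.5):
    """Inner level: policy-gradient ascent on E_pi[reward] + entropy(pi)."""
    theta = np.zeros(K)
    for _ in range(steps):
        pi = softmax(theta)
        # Exact gradient of the entropy-regularized objective for a categorical policy.
        adv = reward - np.log(pi) - 1.0
        grad = pi * (adv - pi @ adv)
        theta += lr * grad
    return softmax(theta)

def outer_loss(reward):
    """Outer level: negative log-likelihood of the data under the inner-optimized policy."""
    pi = inner_solve(reward)
    return -np.mean(np.log(pi[data]))

# Outer optimization of the reward via finite differences (simple, not scalable).
reward = np.zeros(K)
eps, lr = 1e-3, 1.0
for _ in range(50):
    grad = np.zeros(K)
    for k in range(K):
        e = np.zeros(K)
        e[k] = eps
        grad[k] = (outer_loss(reward + e) - outer_loss(reward - e)) / (2 * eps)
    reward -= lr * grad

pi_star = inner_solve(reward)
print(np.round(pi_star, 2))  # approaches the empirical data distribution
```

Because the entropy-regularized inner problem is solved by a softmax of the reward, the outer loop effectively recovers a reward whose induced policy matches the data distribution. The paper's actual algorithms and experimental settings differ, but the nesting of the two objectives is the same idea.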
Why does it matter?
This work is important because it offers a way to train powerful generative models even when it's hard to define clear goals or rewards. This opens up possibilities for creating more adaptable and reliable AI systems in areas like image generation, natural language processing, and even robotics, where providing explicit rewards is difficult or impossible.
Abstract
Generative models form the backbone of modern machine learning, underpinning state-of-the-art systems in text, vision, and multimodal applications. While Maximum Likelihood Estimation has traditionally served as the dominant training paradigm, recent work has highlighted its limitations, particularly in generalization and susceptibility to catastrophic forgetting compared to Reinforcement Learning techniques, such as Policy Gradient methods. However, these approaches depend on explicit reward signals, which are often unavailable in practice, leaving open the fundamental problem of how to align generative models when only high-quality datasets are accessible. In this work, we address this challenge via a Bilevel Optimization framework, where the reward function is treated as the optimization variable of an outer-level problem, while a policy gradient objective defines the inner level. We then conduct a theoretical analysis of this optimization problem in a tractable setting and extract insights that, as we demonstrate, generalize to applications such as tabular classification and model-based reinforcement learning. We release the code at https://github.com/abenechehab/nll_to_po .
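In notation of our own choosing (the abstract does not fix symbols), the framework described above can be written as a bilevel problem: the reward parameters $\phi$ are the outer variable, the policy parameters $\theta$ solve the inner policy-gradient objective, and the outer loss $\mathcal{L}$ measures how well the resulting model fits the dataset $\mathcal{D}$:

$$
\min_{\phi} \; \mathcal{L}\big(\pi_{\theta^\star(\phi)}; \mathcal{D}\big)
\quad \text{s.t.} \quad
\theta^\star(\phi) \in \arg\max_{\theta} \; \mathbb{E}_{x \sim \pi_{\theta}}\big[r_{\phi}(x)\big]
$$

Here the inner maximization is the standard Policy Gradient objective with reward $r_{\phi}$; the paper's exact formulation may include additional terms (e.g., regularization) not shown in this sketch.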