
Feasible Learning

Juan Ramirez, Ignacio Hounie, Juan Elenter, Jose Gallego-Posada, Meraj Hashemizadeh, Alejandro Ribeiro, Simon Lacoste-Julien

2025-01-28


Summary

This paper introduces Feasible Learning (FL), a new way of training AI models that demands good performance on every single training example, not just on average. Instead of minimizing the average error, as standard training does, FL treats training as a feasibility problem: find a model whose loss on each individual data point stays below a prescribed threshold.

What's the problem?

Almost all AI models today are trained with Empirical Risk Minimization (ERM), which only optimizes average performance across the dataset. A model trained this way can look great on average while still failing badly on particular examples, which is a problem whenever consistent, per-sample reliability matters. There has been no standard training framework that directly demands satisfactory performance on every individual data point, and it is also unclear how to pick a meaningful per-sample performance threshold in practice.

What's the solution?

The researchers formulate training as a feasibility problem that bounds the loss on each training sample, so any model that meets the threshold everywhere counts as a valid solution. To find such models, they study a primal-dual algorithm that dynamically re-weights each sample's importance during training: examples that still violate their loss bound automatically receive more attention. Because choosing a sensible threshold ahead of time is hard, they also introduce a relaxed version of FL with slack variables of minimal norm, which loosens the constraints as little as possible when the original threshold cannot be met.
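In notation consistent with the abstract (the exact symbols here are illustrative, not copied from the paper), the Feasible Learning problem and its slack-based relaxation can be sketched as:

```latex
% Feasible Learning: find any model whose per-sample loss meets the bound
\text{find } \theta \quad \text{s.t.} \quad \ell\big(f_\theta(x_i), y_i\big) \le \epsilon
\quad \forall i \in \{1, \dots, n\}

% Relaxation with minimal-norm slack variables, for when \epsilon is hard to set:
% each constraint may be loosened by s_i \ge 0, but the slacks are kept as small
% as possible
\min_{\theta,\, s} \; \|s\|^2 \quad \text{s.t.} \quad
\ell\big(f_\theta(x_i), y_i\big) \le \epsilon + s_i, \qquad s_i \ge 0
```

Unlike ERM, which minimizes the single number \(\frac{1}{n}\sum_i \ell_i\), every sample here contributes its own constraint, which is what forces satisfactory performance on each individual data point.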

Why it matters?

This research matters because it offers a practical alternative to average-case training whenever reliability on every example counts. In experiments spanning image classification, age regression, and preference optimization in large language models, models trained with FL showed improved tail behavior, meaning fewer badly-handled examples, while giving up only a marginal amount of average performance compared to ERM. It also highlights that the choice of optimization algorithm shapes which feasible model you end up with, opening a new design space for training methods.

Abstract

We introduce Feasible Learning (FL), a sample-centric learning paradigm where models are trained by solving a feasibility problem that bounds the loss for each training sample. In contrast to the ubiquitous Empirical Risk Minimization (ERM) framework, which optimizes for average performance, FL demands satisfactory performance on every individual data point. Since any model that meets the prescribed performance threshold is a valid FL solution, the choice of optimization algorithm and its dynamics play a crucial role in shaping the properties of the resulting solutions. In particular, we study a primal-dual approach which dynamically re-weights the importance of each sample during training. To address the challenge of setting a meaningful threshold in practice, we introduce a relaxation of FL that incorporates slack variables of minimal norm. Our empirical analysis, spanning image classification, age regression, and preference optimization in large language models, demonstrates that models trained via FL can learn from data while displaying improved tail behavior compared to ERM, with only a marginal impact on average performance.
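As a concrete illustration of the primal-dual dynamics described above, here is a minimal sketch on a toy least-squares problem. This is not the paper's exact algorithm; the function name, learning rates, and threshold are illustrative choices. Each sample gets its own dual multiplier, dual ascent raises the multiplier while that sample violates its loss bound, and the primal step descends on the multiplier-weighted loss, so hard samples are dynamically up-weighted:

```python
import numpy as np

def feasible_learning_ls(X, y, eps=0.05, lr_primal=0.02, lr_dual=0.05, steps=3000):
    """Primal-dual sketch of Feasible Learning for least squares.

    Feasibility target: per-sample loss (x_i . w - y_i)^2 <= eps for every i.
    lam[i] is the dual multiplier of sample i; it grows while that sample
    violates its bound, dynamically re-weighting the hard examples.
    """
    n, d = X.shape
    w = np.zeros(d)
    lam = np.ones(n)                      # one multiplier per training sample
    for _ in range(steps):
        resid = X @ w - y
        losses = resid ** 2
        total = lam.sum()
        if total > 0:                     # primal descent on the weighted loss
            p = lam / total               # normalized per-sample weights
            w -= lr_primal * 2.0 * (X.T @ (p * resid))
        # dual ascent, projected to stay non-negative:
        # lam_i <- max(0, lam_i + lr * (loss_i - eps))
        lam = np.maximum(0.0, lam + lr_dual * (losses - eps))
    return w, lam

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))
w_true = np.array([1.0, -2.0])
y = X @ w_true                            # exactly realizable, so FL is feasible
w, lam = feasible_learning_ls(X, y)
```

On this realizable toy problem every per-sample constraint ends up satisfied; once all losses fall below `eps`, the multipliers decay toward zero and training effectively stops, matching the idea that any model meeting the threshold is a valid FL solution.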