Cognitively Inspired Energy-Based World Models

Alexi Gladstone, Ganesh Nanduru, Md Mofijul Islam, Aman Chadha, Jundong Li, Tariq Iqbal

2024-06-14

Summary

This paper introduces Energy-Based World Models (EBWM), a new approach to training AI models that aims to mimic human cognitive abilities. It focuses on improving how these models predict future events and make decisions based on those predictions.

What's the problem?

Traditional AI models, like Large Language Models (LLMs) and the autoregressive models used in computer vision, predict the next element of a sequence (the next word in a sentence or the next frame, token, or pixel in an image) but do not think or reason the way humans do. They cannot adapt their internal processing based on their predictions, evaluate how plausible those predictions are, or vary how much time they spend making a prediction. This limits their effectiveness on complex reasoning tasks.

What's the solution?

To address these issues, the authors propose EBWM, which trains an Energy-Based Model (EBM) to score how well a predicted future state is compatible with the current context. This lets the model reproduce human-like cognitive processes, such as judging the plausibility of its own predictions and adjusting how long it spends reasoning based on the difficulty of the task. The authors also developed the Energy-Based Transformer (EBT), a variant of the autoregressive transformer tailored to energy-based training, which improves performance in both computer vision and natural language processing tasks.
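A rough way to picture the core idea: the energy model takes an encoding of the context and a candidate future state and returns a single scalar, with lower values meaning the prediction fits the context better. The sketch below is a minimal illustration of that scoring step in PyTorch; the class name, layer sizes, and embedding dimension are illustrative assumptions, not the paper's actual Energy-Based Transformer.

```python
# Minimal sketch of the energy-based world-model idea: a network scores how
# compatible a candidate future state is with the current context.
# All names and sizes below are illustrative assumptions, not the authors' code.
import torch
import torch.nn as nn

class EnergyWorldModel(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # Maps (context, candidate future state) to a single scalar energy;
        # lower energy = more plausible prediction given the context.
        self.score = nn.Sequential(
            nn.Linear(2 * dim, dim),
            nn.GELU(),
            nn.Linear(dim, 1),
        )

    def forward(self, context: torch.Tensor, future: torch.Tensor) -> torch.Tensor:
        # context, future: (batch, dim) embeddings of the observed sequence
        # and of a proposed next state.
        return self.score(torch.cat([context, future], dim=-1)).squeeze(-1)

model = EnergyWorldModel()
ctx = torch.randn(4, 256)        # embeddings of the current context
candidate = torch.randn(4, 256)  # embeddings of a predicted future state
energy = model(ctx, candidate)   # one compatibility score per example
print(energy.shape)              # torch.Size([4])
```

Because the model outputs a compatibility score rather than the next token directly, the same network can be queried repeatedly on different candidate futures, which is what makes plausibility checking and variable "thinking time" possible.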

Why it matters?

This research is important because it aims to create AI models that can think more like humans, particularly in terms of reasoning and planning. By developing models that can evaluate their own predictions and adjust their thinking processes, we can improve AI's ability to handle complex tasks in real-world applications, such as decision-making, problem-solving, and understanding context in communication.

Abstract

One of the predominant methods for training world models is autoregressive prediction in the output space of the next element of a sequence. In Natural Language Processing (NLP), this takes the form of Large Language Models (LLMs) predicting the next token; in Computer Vision (CV), this takes the form of autoregressive models predicting the next frame/token/pixel. However, this approach differs from human cognition in several respects. First, human predictions about the future actively influence internal cognitive processes. Second, humans naturally evaluate the plausibility of predictions regarding future states. Third, building on this capability, humans assess when a prediction is sufficient and thereby allocate a dynamic amount of time to making it. This adaptive process is analogous to System 2 thinking in psychology. All these capabilities are fundamental to the success of humans at high-level reasoning and planning. Therefore, to address the limitations of traditional autoregressive models lacking these human-like capabilities, we introduce Energy-Based World Models (EBWM). EBWM involves training an Energy-Based Model (EBM) to predict the compatibility of a given context and a predicted future state. In doing so, EBWM enables models to achieve all three facets of human cognition described. Moreover, we developed a variant of the traditional autoregressive transformer tailored for Energy-Based Models, termed the Energy-Based Transformer (EBT). Our results demonstrate that EBWM scales better with data and GPU hours than traditional autoregressive transformers in CV, and that EBWM offers promising early scaling in NLP. Consequently, this approach offers an exciting path toward training future models capable of System 2 thinking and intelligently searching across state spaces.
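One concrete way to read the System 2 claim is that, at inference time, an energy-based world model can keep refining a candidate future state until its energy is low enough, so harder predictions receive more compute. The loop below is a hedged sketch of that behavior under my own assumptions: the gradient-descent refinement, step budget, learning rate, and stopping threshold are illustrative, not the authors' procedure.

```python
# Sketch of dynamic inference-time compute with an energy-based world model:
# refine a candidate future state by gradient descent on its energy until it
# looks plausible enough or a step budget runs out. Hyperparameters here are
# illustrative assumptions.
import torch

def refine_prediction(energy_fn, context, future, max_steps=32, lr=0.1, threshold=0.0):
    """energy_fn(context, future) -> per-example energies; lower = more plausible."""
    future = future.clone().requires_grad_(True)
    steps_used = 0
    for _ in range(max_steps):
        energy = energy_fn(context, future).sum()
        if energy.item() < threshold:      # prediction already judged plausible: stop early
            break
        grad, = torch.autograd.grad(energy, future)
        with torch.no_grad():
            future -= lr * grad            # move toward a lower-energy, more compatible state
        steps_used += 1
    return future.detach(), steps_used     # refined state and compute actually spent

# Toy usage: energy = squared distance between context and prediction, so
# refinement pulls the guess toward the context.
ctx = torch.randn(2, 8)
guess = torch.randn(2, 8)
refined, steps = refine_prediction(lambda c, f: ((c - f) ** 2).sum(dim=-1), ctx, guess)
print(steps, ((ctx - refined) ** 2).sum().item())
```

The point of the sketch is only that stopping is conditioned on the energy itself, which is how an energy-based formulation can spend a variable amount of time per prediction, unlike a fixed single forward pass in a standard autoregressive transformer.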