The Geometry of Tokens in Internal Representations of Large Language Models

Karthik Viswanathan, Yuri Gardinazzi, Giada Panerai, Alberto Cazzaniga, Matteo Biagetti

2025-01-22

Summary

This paper studies the geometry of token representations inside large language models: where each token's vector sits relative to the others as it moves through the layers of a transformer. It's like tracking a cloud of points that gets reshaped layer by layer, and asking whether the shape of that cloud tells you how well the model will predict the next token.

What's the problem?

Transformers rewrite every token's representation at each layer, but we lack good tools for describing how the overall structure of those representations evolves, or whether that structure has anything to do with how well the model performs. It's like knowing where every point ends up but having no way to summarize the shape of the whole cloud, so it's hard to say what a "good" internal representation looks like.

What's the solution?

The researchers treat the token representations at each layer as a point cloud and probe its geometry with several metrics: intrinsic dimension (how many directions the cloud effectively spreads in), neighborhood overlap (whether a token keeps the same nearest neighbors from one layer to the next), and cosine similarity. As a sanity check, they run the same measurements on prompts whose tokens have been shuffled, which destroys the syntactic and semantic structure of the text and should therefore produce different geometry.
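To make the two point-cloud metrics concrete, here is a minimal NumPy sketch of how intrinsic dimension (via the standard TwoNN estimator) and neighborhood overlap between two layers could be computed. This is an illustrative reconstruction, not the authors' code; the function names and the brute-force distance computation are our own choices.

```python
import numpy as np

def two_nn_intrinsic_dimension(points):
    """TwoNN estimate of intrinsic dimension from a (N, D) point cloud.

    Uses the ratio of each point's second- to first-nearest-neighbor
    distance (Facco et al.'s estimator, in its simple MLE form)."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    np.fill_diagonal(dists, np.inf)          # exclude self-distances
    nearest = np.sort(dists, axis=1)
    mu = nearest[:, 1] / nearest[:, 0]       # ratio r2 / r1 per point
    return len(points) / np.sum(np.log(mu))

def neighborhood_overlap(layer_a, layer_b, k=5):
    """Average fraction of each token's k nearest neighbors that are
    the same in two layers' point clouds (both shaped (N, D))."""
    def knn_indices(points):
        d = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
        np.fill_diagonal(d, np.inf)
        return np.argsort(d, axis=1)[:, :k]
    na, nb = knn_indices(layer_a), knn_indices(layer_b)
    return float(np.mean([len(set(a) & set(b)) / k for a, b in zip(na, nb)]))
```

In practice the point clouds would be the hidden states of one prompt's tokens at a given layer; comparing these metrics across layers, and against a shuffled-token control, is the kind of probe the paper describes.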

Why it matters?

This matters because the geometry turns out to track performance: prompts on which the model has a higher cross-entropy loss tend to have their tokens represented in higher-dimensional spaces. That gives researchers a quantitative window into what is happening inside a transformer, which could help with interpretability, with diagnosing when a model is struggling on a given prompt, and with understanding how linguistic structure is organized across layers.

Abstract

We investigate the relationship between the geometry of token embeddings and their role in next token prediction within transformer models. An important aspect of this connection uses the notion of empirical measure, which encodes the distribution of token point clouds across transformer layers and drives the evolution of token representations in the mean-field interacting picture. We use metrics such as intrinsic dimension, neighborhood overlap, and cosine similarity to observationally probe these empirical measures across layers. To validate our approach, we compare these metrics on a dataset where the tokens are shuffled, which disrupts the syntactic and semantic structure. Our findings reveal a correlation between the geometric properties of token embeddings and the cross-entropy loss of next token predictions, implying that prompts with higher loss values have tokens represented in higher-dimensional spaces.