Explain other AI papers

Better Language Model Inversion by Compactly Representing Next-Token Distributions

Murtaza Nazir, Matthew Finlayson, John X. Morris, Xiang Ren, Swabha Swayamdipta

2025-06-23

Summary

This paper introduces a new method called Prompt Inversion from Logprob Sequences (PILS), which recovers the hidden prompts given to a language model by compactly representing the model's next-word probability distributions.

What's the problem?

The problem is that it is very difficult to recover the original hidden instructions or prompts given to a language model just by looking at its outputs, and previous inversion techniques had low success rates.

What's the solution?

The researchers observed that a model's next-word probability outputs, despite having one entry per vocabulary word, are confined to a much lower-dimensional space. This lets them compress long sequences of these outputs into compact representations that an inversion model can use. With PILS, they recover hidden prompts far more accurately than prior methods, and the approach generalizes well to new types of prompts.
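The low-dimensional structure comes from how language models produce their outputs: logits are a linear map (the unembedding matrix) applied to a hidden state, so every logit vector lies in a subspace whose dimension is at most the hidden size, not the vocabulary size. The toy sketch below illustrates this with made-up dimensions (`d_model`, `vocab`, and the matrix `W` are all illustrative stand-ins, not the paper's actual setup):

```python
import numpy as np

# Toy dimensions; real models have d_model in the thousands and vocab ~100k+.
rng = np.random.default_rng(0)
d_model, vocab = 8, 100

# Hypothetical unembedding matrix mapping hidden states to vocabulary logits.
W = rng.normal(size=(vocab, d_model))

# Every next-token logit vector is W @ h for some hidden state h, so all
# logit vectors live in a subspace of dimension at most d_model, even
# though each one has `vocab` entries.
hiddens = rng.normal(size=(50, d_model))  # 50 different contexts
logits = hiddens @ W.T                    # shape (50, vocab)

# The rank of the stacked logits reveals the low-dimensional structure.
rank = int(np.linalg.matrix_rank(logits))
print(rank)  # at most d_model = 8, far below vocab = 100

# Compact representation: project each logit vector back onto the
# d_model-dimensional subspace via the pseudoinverse of W. This stores
# d_model numbers per step instead of `vocab`, with no information lost.
compact = logits @ np.linalg.pinv(W).T    # shape (50, d_model)
reconstructed = compact @ W.T
print(np.allclose(reconstructed, logits))
```

This is only a linear-algebra illustration of why the compression in PILS is possible, not the paper's actual inversion pipeline, which additionally trains a model to map the compressed sequences back to prompts.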

Why it matters?

This matters because it shows how much information language models leak through their outputs alone, which has significant implications for privacy and security and can help improve the accountability of AI systems.

Abstract

A new method called Prompt Inversion from Logprob Sequences (PILS) recovers hidden prompts in language models by analyzing the low-dimensional subspace of the model's next-token probabilities, achieving higher recovery rates and better generalization than previous methods.