Promote, Suppress, Iterate: How Language Models Answer One-to-Many Factual Queries
Tianyi Lorena Yan, Robin Jia
2025-03-11
Summary
This paper examines how AI language models list multiple correct answers to a single question (e.g., naming all the cities in a country) without repeating themselves: the model first recalls all possible answers, then suppresses the ones it has already generated.
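The recall-then-suppress behavior described above can be sketched as a toy scoring loop. This is purely illustrative (real LMs implement both steps in activation space, not with explicit dictionaries); the score values and the suppression penalty are made-up assumptions.

```python
# Conceptual toy of the promote-then-suppress mechanism (illustrative
# only; real models do this with attention and MLP updates internally).
def next_answer(answer_scores, already_said):
    """Recall promotes every correct answer; suppression then pushes
    down answers that were already generated."""
    scores = dict(answer_scores)   # promote: all correct answers recalled
    for ans in already_said:
        scores[ans] -= 10.0        # suppress: penalize previous answers
    return max(scores, key=scores.get)

# Toy query: "list cities of Italy" with assumed recall scores.
cities = {"Rome": 3.0, "Milan": 2.5, "Naples": 2.0}
said = []
for _ in range(3):
    said.append(next_answer(cities, said))
# said → ['Rome', 'Milan', 'Naples']: each city appears exactly once
```

Because suppression only lowers the scores of answers already produced, the model enumerates every answer once instead of looping on the strongest one.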
What's the problem?
When AI models need to list multiple related facts (e.g., cities in Italy), they can repeat answers or miss some, because they must keep track of what they have already said while recalling new information.
What's the solution?
The model uses a two-step process: it first promotes all correct answers internally, then suppresses already-used ones by attending to its previous responses and pushing down their scores at later steps. A tool the authors call 'Token Lens' let them see which answers the attention updates from specific tokens promote or suppress.
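The 'Token Lens' idea can be sketched as follows: an attention head writes a weighted sum of per-source-token updates into the residual stream, and Token Lens aggregates only the updates from a chosen set of source tokens (e.g., the previous-answer tokens) and decodes that partial update through the unembedding matrix. All names, shapes, and numbers here are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the Token Lens idea (toy shapes, not the real code).
def token_lens(attn_weights, value_updates, token_ids, unembed):
    """attn_weights: attention weight on each source token.
    value_updates: residual-stream update vector from each source token.
    token_ids: source tokens to aggregate (e.g., previous answers).
    unembed: one vocabulary direction per row; returns per-vocab logits."""
    dim = len(value_updates[0])
    agg = [0.0] * dim
    for t in token_ids:                    # aggregate only chosen tokens
        for d in range(dim):
            agg[d] += attn_weights[t] * value_updates[t][d]
    # Decode the aggregated update: dot product with each vocab row.
    return [sum(w * a for w, a in zip(row, agg)) for row in unembed]

# Toy example: 3 source tokens, hidden dim 2, vocabulary of 2 answers.
weights = [0.5, 0.3, 0.2]
values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
unembed = [[1.0, 0.0], [0.0, 1.0]]
logits = token_lens(weights, values, [1, 2], unembed)  # → [0.2, 0.5]
```

Negative logits for an answer would indicate that the selected tokens' attention updates suppress that answer, which is the signal the paper reads off previous-answer tokens.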
Why it matters?
Understanding this helps improve AI assistants, search engines, and chatbots that need to provide complete, non-repeating answers to questions like 'list the ingredients of a pizza' or 'name historical events in the 1800s.'
Abstract
To answer one-to-many factual queries (e.g., listing cities of a country), a language model (LM) must simultaneously recall knowledge and avoid repeating previous answers. How are these two subtasks implemented and integrated internally? Across multiple datasets and models, we identify a promote-then-suppress mechanism: the model first recalls all answers, and then suppresses previously generated ones. Specifically, LMs use both the subject and previous answer tokens to perform knowledge recall, with attention propagating subject information and MLPs promoting the answers. Then, attention attends to and suppresses previous answer tokens, while MLPs amplify the suppression signal. Our mechanism is corroborated by extensive experimental evidence: in addition to using early decoding and causal tracing, we analyze how components use different tokens by introducing both Token Lens, which decodes aggregated attention updates from specified tokens, and a knockout method that analyzes changes in MLP outputs after removing attention to specified tokens. Overall, we provide new insights into how LMs' internal components interact with different input tokens to support complex factual recall. Code is available at https://github.com/Lorenayannnnn/how-lms-answer-one-to-many-factual-queries.
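The knockout method mentioned in the abstract can be sketched in the same toy setting: remove the attention updates contributed by chosen source tokens from the residual stream, rerun the MLP, and measure how its output changes. The shapes and the linear "MLP" below are assumptions for illustration, not the paper's actual model.

```python
# Sketch of an attention-knockout analysis (toy shapes, toy linear MLP).
def attention_knockout(resid, attn_updates, knock_tokens, mlp):
    """Return mlp(full residual) minus mlp(residual with the attention
    updates written by the knocked-out source tokens removed)."""
    full = mlp(resid)
    ablated_resid = list(resid)
    for t in knock_tokens:                 # remove chosen tokens' updates
        for d in range(len(ablated_resid)):
            ablated_resid[d] -= attn_updates[t][d]
    ablated = mlp(ablated_resid)
    return [f - a for f, a in zip(full, ablated)]

# Toy example: hidden dim 2, two source tokens, a doubling "MLP".
mlp = lambda x: [2.0 * v for v in x]
resid = [1.0, 1.0]
attn_updates = [[0.5, 0.0], [0.0, 0.25]]   # update written by each token
delta = attention_knockout(resid, attn_updates, [1], mlp)  # → [0.0, 0.5]
```

A nonzero difference shows that the MLP's output depended on attention to the knocked-out tokens, which is how the paper argues MLPs amplify the suppression signal carried from previous-answer tokens.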