SIFT: Grounding LLM Reasoning in Contexts via Stickers
Zihao Zeng, Xuyao Huang, Boxiu Li, Zhijie Deng
2025-02-24
Summary
This paper introduces SIFT, a new method that helps AI language models better understand and use information from the context they're given, making their reasoning more accurate and reliable.
What's the problem?
AI language models sometimes misunderstand important details in the information they're given, which can lead to mistakes in their calculations and reasoning. For example, they might not correctly interpret phrases like '10 dollars per kilo,' causing errors in their responses.
What's the solution?
The researchers created SIFT (Stick to the Facts), which uses a 'Sticker' to highlight key information in the context. SIFT compares two predictions: one from the original question and one with the highlighted information. If they're different, SIFT refines the Sticker to make sure the AI focuses on the most important facts and reasons more accurately.
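The check-and-refine loop described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: `model` is a hypothetical callable that maps a prompt string to a text completion, the prompt wordings are invented for illustration, and the refinement step is collapsed into a single revision prompt rather than the paper's separate forward-optimization and inverse-generation stages.

```python
def sift_answer(model, query, max_rounds=3):
    """Sketch of the SIFT loop.

    `model` is a hypothetical prompt -> completion callable;
    all prompt templates below are illustrative, not from the paper.
    """
    # 1. Ask the model to produce a Sticker: the key facts in the query.
    sticker = model(f"Extract the key facts from this question:\n{query}")

    pred_sticker = None
    for _ in range(max_rounds):
        # 2. Two predictions: one from the raw query,
        #    one from the query augmented with the Sticker.
        pred_plain = model(f"Answer the question:\n{query}")
        pred_sticker = model(
            f"Answer the question:\n{query}\nKey facts:\n{sticker}"
        )

        # 3. If the two predictions agree, accept the answer.
        if pred_plain == pred_sticker:
            return pred_sticker

        # 4. Otherwise refine the Sticker (the paper does this in two
        #    stages, forward optimization then inverse generation;
        #    here both are folded into one revision prompt).
        sticker = model(
            f"Question:\n{query}\nCurrent key facts:\n{sticker}\n"
            "Revise the key facts so they faithfully reflect the question."
        )

    # Fall back to the Sticker-grounded answer after the last round.
    return pred_sticker
```

The key design point is that the Sticker is generated and refined by the same model that answers the question, so no extra training is needed; SIFT only spends additional inference-time compute.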
Why it matters?
This matters because it makes AI language models more trustworthy and accurate, especially for tasks that require precise reasoning like math problems. SIFT raised the accuracy of a top AI model (DeepSeek-R1) on a challenging math competition benchmark (AIME2024), showing it can help AI systems make fewer mistakes and provide more reliable answers in various fields.
Abstract
This paper identifies that misinterpretation of the context can be a significant issue during the reasoning process of large language models, spanning from smaller models like Llama3.2-3B-Instruct to cutting-edge ones like DeepSeek-R1. For example, in the phrase "10 dollars per kilo," LLMs might not recognize that "per" means "for each," leading to calculation errors. We introduce a novel, post-training approach called **Stick to the Facts (SIFT)** to tackle this. SIFT leverages increasing inference-time compute to ground LLM reasoning in contexts. At the core of SIFT lies the *Sticker*, which is generated by the model itself to explicitly emphasize the key information within the context. Given the curated Sticker, SIFT generates two predictions -- one from the original query and one from the query augmented with the Sticker. If they differ, the Sticker is sequentially refined via *forward* optimization (to better align the extracted facts with the query) and *inverse* generation (to conform with the model's inherent tendencies) for more faithful reasoning outcomes. Studies across diverse models (from 3B to 100B+) and benchmarks (e.g., GSM8K, MATH-500) reveal consistent performance improvements. Notably, SIFT improves the pass@1 accuracy of DeepSeek-R1 on AIME2024 from 78.33% to **85.67%**, establishing a new state-of-the-art in the open-source community. The code is available at https://github.com/zhijie-group/SIFT.