Context Engineering for Trustworthiness: Rescorla Wagner Steering Under Mixed and Inappropriate Contexts
Rushi Wang, Jiateng Liu, Cheng Qian, Yifan Shen, Yanzhou Pan, Zhaozhuo Xu, Ahmed Abbasi, Heng Ji, Denghui Zhang
2025-09-15
Summary
This paper investigates how large language models (LLMs) handle information from external sources, specifically when those sources mix helpful and harmful content. It finds that LLMs tend to over-weight the less common information in a context, even when that information is inappropriate, and then proposes a training method that teaches models to recognize and disregard the inappropriate content.
What's the problem?
LLMs are getting better at using external information to give more informed answers, but real-world information isn't always clean. Often, useful information is mixed with inappropriate or unreliable content. The core issue is understanding *how* LLMs decide what to pay attention to when presented with this mixed bag of information, and whether they can be tricked into prioritizing harmful content simply because it's less frequent.
What's the solution?
The researchers created a testing environment that pairs questions with realistic contexts containing both relevant and inappropriate information. To measure how LLMs weigh competing pieces of information, they adapted the Rescorla-Wagner model, a classic account of how animals learn associations between cues and outcomes. The analysis showed that LLMs give disproportionate weight to less common information. To fix this, they developed a training method called RW-Steering, which uses two fine-tuning stages to help the model internally recognize and disregard inappropriate signals. Because it doesn't require extensive examples covering every mixture of good and bad content, the method generalizes to different situations.
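For intuition, the classic Rescorla-Wagner rule updates each cue's associative strength in proportion to a shared prediction error. The sketch below is the textbook form of that rule, not the paper's adaptation; the variable names, salience values, and the framing of "relevant" vs. "inappropriate" content as competing cues are illustrative assumptions.

```python
def rw_step(V, salience, beta, lam):
    """One Rescorla-Wagner trial.

    V:        dict mapping each cue to its current associative strength
    salience: dict mapping each cue to its salience (alpha)
    beta:     learning rate tied to the outcome
    lam:      lambda, the maximum strength the outcome supports

    All cues share one prediction error computed from the *total*
    associative strength, so cues compete for a fixed amount of learning.
    """
    error = lam - sum(V.values())  # shared prediction error
    return {cue: V[cue] + salience[cue] * beta * error for cue in V}


# Two competing "cues": relevant content and inappropriate content,
# with hypothetical saliences. Repeated trials drive the total
# strength toward lambda, split according to salience.
V = {"relevant": 0.0, "inappropriate": 0.0}
salience = {"relevant": 0.3, "inappropriate": 0.1}
for _ in range(20):
    V = rw_step(V, salience, beta=1.0, lam=1.0)
```

Under the classic rule, a more salient cue captures more of the associative strength; the paper's finding is that LLMs behave differently, assigning outsized influence to the less prevalent signal in the context.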
Why it matters?
This research is important because it highlights a significant safety vulnerability in LLMs. If a model easily prioritizes even small amounts of harmful information, it could lead to biased, inaccurate, or even dangerous responses. RW-Steering offers a promising solution to make LLMs more robust and reliable when dealing with the messy reality of information found online, ultimately improving their safety for real-world applications.
Abstract
Incorporating external context can significantly enhance the response quality of Large Language Models (LLMs). However, real-world contexts often mix relevant information with disproportionate inappropriate content, posing reliability risks. How do LLMs process and prioritize mixed context? To study this, we introduce the Poisoned Context Testbed, pairing queries with real-world contexts containing relevant and inappropriate content. Inspired by associative learning in animals, we adapt the Rescorla-Wagner (RW) model from neuroscience to quantify how competing contextual signals influence LLM outputs. Our adapted model reveals a consistent behavioral pattern: LLMs exhibit a strong tendency to incorporate information that is less prevalent in the context. This susceptibility is harmful in real-world settings, where small amounts of inappropriate content can substantially degrade response quality. Empirical evaluations on our testbed further confirm this vulnerability. To tackle this, we introduce RW-Steering, a two-stage finetuning-based approach that enables the model to internally identify and ignore inappropriate signals. Unlike prior methods that rely on extensive supervision across diverse context mixtures, RW-Steering generalizes robustly across varying proportions of inappropriate content. Experiments show that our best fine-tuned model improves response quality by 39.8% and reverses the undesirable behavior curve, establishing RW-Steering as a robust, generalizable context engineering solution for improving LLM safety in real-world use.