
ContextCite: Attributing Model Generation to Context

Benjamin Cohen-Wang, Harshay Shah, Kristian Georgiev, Aleksander Madry

2024-09-04


Summary

This paper introduces ContextCite, a new method that helps us understand how language models use the context they are given when generating responses, and whether those responses are actually grounded in the provided information.

What's the problem?

When language models generate text, it is often unclear whether their statements are actually grounded in the context they were given. A model might misinterpret the context or fabricate information entirely, and researchers have lacked a practical way to pinpoint which parts of the context influenced a given output.

What's the solution?

ContextCite introduces a simple and scalable approach to identifying which parts of the context led to a specific generated statement. It works by analyzing the input context and attributing portions of it to the model's response, and it can be applied on top of any existing language model. The authors highlight three main applications: verifying generated statements, improving response quality by pruning the context, and detecting poisoning attacks, where malicious content planted in the context manipulates the model's output.
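The summary above does not spell out the mechanics, but one natural way to realize this kind of context attribution is to randomly ablate pieces of the context, observe how the likelihood of the generated statement changes, and fit a sparse linear surrogate whose weights act as attribution scores. The sketch below illustrates that idea only; score_statement and its model wrapper are hypothetical placeholders, not ContextCite's actual API.

```python
import numpy as np
from sklearn.linear_model import Lasso

def attribute_context(sources, score_statement, n_samples=64, seed=0):
    """Attribute a generated statement to context sources.

    sources: list of context pieces (e.g., sentences).
    score_statement: hypothetical callable that takes a list of kept
        sources and returns the model's log-probability of the statement
        given that ablated context.
    Returns one attribution score per source (higher = more influential).
    """
    rng = np.random.default_rng(seed)
    # Randomly include/exclude each source and record the statement's score.
    masks = rng.integers(0, 2, size=(n_samples, len(sources)))
    scores = np.array([
        score_statement([s for s, keep in zip(sources, mask) if keep])
        for mask in masks
    ])
    # Fit a sparse linear surrogate: each weight estimates how much
    # including that source changes the statement's score.
    surrogate = Lasso(alpha=0.01).fit(masks, scores)
    return surrogate.coef_
```

Sources with the largest positive weights would then be the natural candidates to cite when verifying whether the statement is grounded in the context.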

Why it matters?

This research is important because it enhances our understanding of how AI language models operate. By making it easier to track how context influences their responses, ContextCite can help improve the reliability of these models in various applications, such as customer service, education, and content creation.

Abstract

How do language models use information provided as context when generating a response? Can we infer whether a particular generated statement is actually grounded in the context, a misinterpretation, or fabricated? To help answer these questions, we introduce the problem of context attribution: pinpointing the parts of the context (if any) that led a model to generate a particular statement. We then present ContextCite, a simple and scalable method for context attribution that can be applied on top of any existing language model. Finally, we showcase the utility of ContextCite through three applications: (1) helping verify generated statements (2) improving response quality by pruning the context and (3) detecting poisoning attacks. We provide code for ContextCite at https://github.com/MadryLab/context-cite.
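As a small illustration of application (2) from the abstract, the snippet below prunes the context down to its highest-scoring sources before regenerating a response. It assumes the hypothetical attribute_context sketch above; generate stands in for any language-model call and is not part of ContextCite's released code (see the repository linked above for the authors' implementation).

```python
import numpy as np

def prune_context(sources, scores, top_k=5):
    """Keep only the top_k most influential sources, preserving their order."""
    top = sorted(np.argsort(scores)[-top_k:])
    return [sources[i] for i in top]

# Hypothetical usage: re-ask the query with only the most relevant context.
# scores = attribute_context(sources, score_statement)
# pruned = prune_context(sources, scores)
# better_response = generate(query, context=" ".join(pruned))
```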