
Position: Privacy Is Not Just Memorization!

Niloofar Mireshghallah, Tianshi Li

2025-10-07


Summary

This paper argues that the way we currently think about privacy risks with Large Language Models (LLMs) like ChatGPT is too focused on whether these models simply repeat information they were trained on. It claims there are many other, more practical and widespread privacy concerns that aren't being addressed.

What's the problem?

The main problem is that most privacy research and public concern centers on LLMs memorizing and spitting out exact pieces of their training data. While that's a valid concern, it overshadows much bigger issues. These include how the data used to *create* the LLM was collected, how information is leaked when you're actually *using* the model, the privacy risks that come with LLMs acting as autonomous agents, and how easily these models can be used for surveillance. Current privacy rules and technical solutions aren't equipped to handle these broader threats.

What's the solution?

The researchers analyzed 1,322 research papers on AI and privacy published at leading conferences over the past decade (2016–2025) to see what's actually being studied. They found that memorization gets outsized attention, while the more pressing privacy problems – those tied to data collection, inference-time use, autonomous agents, and potential misuse for surveillance – receive far less. They propose a taxonomy of privacy risks spanning the whole LLM lifecycle, and a way of thinking about LLM privacy that considers not just the technology, but also the social and ethical implications of these models.

Why it matters?

This research is important because it highlights that we need to change our approach to LLM privacy. Simply trying to prevent memorization isn't enough. We need to consider the entire lifecycle of these models, from how data is gathered to how they're used in the real world, and involve experts from different fields to develop effective solutions. Ignoring these broader risks could lead to significant privacy violations and misuse of this powerful technology.

Abstract

The discourse on privacy risks in Large Language Models (LLMs) has disproportionately focused on verbatim memorization of training data, while a constellation of more immediate and scalable privacy threats remains underexplored. This position paper argues that the privacy landscape of LLM systems extends far beyond training data extraction, encompassing risks from data collection practices, inference-time context leakage, autonomous agent capabilities, and the democratization of surveillance through deep inference attacks. We present a comprehensive taxonomy of privacy risks across the LLM lifecycle – from data collection through deployment – and demonstrate through case studies how current privacy frameworks fail to address these multifaceted threats. Through a longitudinal analysis of 1,322 AI/ML privacy papers published at leading conferences over the past decade (2016–2025), we reveal that while memorization receives outsized attention in technical research, the most pressing privacy harms lie elsewhere, where current technical approaches offer little traction and viable paths forward remain unclear. We call for a fundamental shift in how the research community approaches LLM privacy, moving beyond the narrow focus of current technical solutions and embracing interdisciplinary approaches that address the sociotechnical nature of these emerging threats.