Granular Privacy Control for Geolocation with Vision Language Models
Ethan Mendes, Yang Chen, James Hays, Sauvik Das, Wei Xu, Alan Ritter
2024-07-08

Summary
This paper introduces GPTGeoChat, a new benchmark for evaluating privacy controls in Vision Language Models (VLMs) that can identify locations from images. It focuses on how these models can moderate conversations about geolocation while protecting sensitive location information.
What's the problem?
The main problem is that as VLMs become more advanced and are deployed in everyday applications, they introduce new privacy risks. These models can identify where a photo was taken, or even identify people in images, which could expose personal location information without a person's consent. This is a serious concern for anyone who wants to keep their location private.
What's the solution?
To address this issue, the authors created GPTGeoChat, a benchmark that tests how well VLMs can moderate conversations about image geolocation. They collected 1,000 conversations in which users asked about the locations of images, annotated with the granularity of location information revealed at each turn. The researchers then evaluated various VLMs on identifying when too much location information had been revealed. They found that custom fine-tuned models performed on par with prompted API-based models at coarser granularities (such as country or city), but that fine-tuning on supervised data was needed to accurately moderate finer granularities (such as the name of a restaurant or building).
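To make the moderation task concrete, here is a minimal sketch of how a granularity-threshold check might work. This is not the authors' code: the label set (country, city, neighborhood, exact) and the function names are illustrative assumptions, not the benchmark's exact annotation scheme.

```python
# Hypothetical sketch of granularity-based moderation (not from the paper).
# Granularities are ordered from coarsest to finest; a response should be
# withheld once it reveals location info finer than the user's threshold.

GRANULARITY_ORDER = ["none", "country", "city", "neighborhood", "exact"]

def finest_granularity_revealed(response_annotations: list[str]) -> str:
    """Return the finest granularity label present in a model response."""
    finest = "none"
    for label in response_annotations:
        if GRANULARITY_ORDER.index(label) > GRANULARITY_ORDER.index(finest):
            finest = label
    return finest

def should_withhold(response_annotations: list[str], user_threshold: str) -> bool:
    """True if the response reveals location info finer than the threshold."""
    revealed = finest_granularity_revealed(response_annotations)
    return GRANULARITY_ORDER.index(revealed) > GRANULARITY_ORDER.index(user_threshold)

# Example: a user who allows only city-level detail.
annotations = ["country", "city", "exact"]  # e.g. the response names a restaurant
print(should_withhold(annotations, user_threshold="city"))  # True -> moderate
```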
Why it matters?
This research matters because it helps protect people's privacy as VLM technology advances. By developing better moderation systems for VLMs, this work aims to prevent sensitive location information from being shared unintentionally, which is crucial for maintaining safety and privacy in an increasingly connected world.
Abstract
Vision Language Models (VLMs) are rapidly advancing in their capability to answer information-seeking questions. As these models are widely deployed in consumer applications, they could lead to new privacy risks due to emergent abilities to identify people in photos, geolocate images, etc. As we demonstrate, somewhat surprisingly, current open-source and proprietary VLMs are very capable image geolocators, making widespread geolocation with VLMs an immediate privacy risk, rather than merely a theoretical future concern. As a first step to address this challenge, we develop a new benchmark, GPTGeoChat, to test the ability of VLMs to moderate geolocation dialogues with users. We collect a set of 1,000 image geolocation conversations between in-house annotators and GPT-4v, which are annotated with the granularity of location information revealed at each turn. Using this new dataset, we evaluate the ability of various VLMs to moderate GPT-4v geolocation conversations by determining when too much location information has been revealed. We find that custom fine-tuned models perform on par with prompted API-based models when identifying leaked location information at the country or city level; however, fine-tuning on supervised data appears to be needed to accurately moderate finer granularities, such as the name of a restaurant or building.
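As a rough illustration of the prompted-moderation setup the abstract describes, the sketch below constructs a moderation prompt for an API-based VLM. Here `query_vlm` is a hypothetical stand-in for whatever chat-completion API is used, and the prompt wording is our own assumption, not the prompt from the paper.

```python
# Illustrative only: `query_vlm` is a hypothetical placeholder for an
# API-based VLM call; the prompt wording is an assumption, not the
# prompt used in the paper.

MODERATION_PROMPT = (
    "You are a privacy moderator. The user has asked to reveal location "
    "information no finer than the {threshold} level.\n\n"
    "Conversation so far:\n{dialogue}\n\n"
    "Candidate assistant response:\n{response}\n\n"
    "Does the candidate response reveal location information finer than "
    "the {threshold} level? Answer YES or NO."
)

def build_moderation_prompt(dialogue: str, response: str, threshold: str) -> str:
    """Fill in the moderation prompt for a single conversation turn."""
    return MODERATION_PROMPT.format(
        threshold=threshold, dialogue=dialogue, response=response
    )

def moderate_turn(dialogue: str, response: str, threshold: str, query_vlm) -> bool:
    """Return True if the response should be withheld, per the moderator VLM."""
    verdict = query_vlm(build_moderation_prompt(dialogue, response, threshold))
    return verdict.strip().upper().startswith("YES")
```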