
Phi: Preference Hijacking in Multi-modal Large Language Models at Inference Time

Yifan Lan, Yuanpu Cao, Weitong Zhang, Lu Lin, Jinghui Chen

2025-09-17

Summary

This paper explores a new security vulnerability in Multimodal Large Language Models (MLLMs), which are AI systems that can understand both text and images. It shows how attackers can subtly influence the responses these models give by manipulating the images they are shown.

What's the problem?

MLLMs are becoming more popular, but they also come with safety risks. Most of the worry has been about models generating obviously harmful content; this paper highlights a sneakier problem: attackers can change what the model *prefers* to say, producing biased or misleading answers that are never overtly offensive. The attack is hard to detect because the responses still seem reasonable and relevant to the situation.

What's the solution?

The researchers developed a technique called 'Preference Hijacking' (Phi). The method adds a carefully optimized perturbation to an image; when the MLLM sees the altered image, its preferences shift and it starts generating responses that align with what the attacker wants, even if those responses aren't truthful or neutral. Importantly, the attack happens at inference time, while the model is being *used*, and requires no changes to the model itself. The researchers also introduce a universal hijacking perturbation: a single transferable alteration that can be embedded into different images to steer the model toward any attacker-specified preference.
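
To give a rough sense of the mechanics (this is a minimal sketch, not the authors' implementation, which lives in the linked repository): the attack comes down to gradient-based optimization of an image perturbation against a loss that rewards the attacker-preferred kind of response. In the PyTorch sketch below, `mllm_target_nll` is a stand-in for that loss, and the image size, step count, and `eps` budget are illustrative assumptions rather than values from the paper.

```python
import torch

def mllm_target_nll(image: torch.Tensor) -> torch.Tensor:
    # Placeholder differentiable loss. In practice this would run the MLLM on
    # the perturbed image plus a text prompt and return the negative
    # log-likelihood of the attacker-preferred response.
    return ((image - 0.5) ** 2).mean()

image = torch.rand(1, 3, 336, 336)                    # clean image, values in [0, 1]
delta = torch.zeros_like(image, requires_grad=True)   # hijacking perturbation to learn
eps = 16 / 255                                        # assumed perturbation budget (illustrative)
optimizer = torch.optim.Adam([delta], lr=1e-2)

for step in range(300):
    optimizer.zero_grad()
    loss = mllm_target_nll((image + delta).clamp(0, 1))
    loss.backward()                                   # reward the attacker-preferred reply
    optimizer.step()
    with torch.no_grad():
        delta.clamp_(-eps, eps)                       # keep the image change small

hijacked_image = (image + delta).clamp(0, 1).detach()
```

For the universal hijacking perturbation, the same loop would average the loss over a batch of different carrier images, so that a single `delta` transfers across them.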

Why it matters?

This research is important because it reveals a hidden way to manipulate powerful AI systems. It demonstrates that even if an MLLM isn't explicitly programmed to be biased, an attacker can still influence its output. This has implications for how we trust and deploy these models in real-world applications, and it emphasizes the need for better security measures to protect against these kinds of subtle attacks.

Abstract

Recently, Multimodal Large Language Models (MLLMs) have gained significant attention across various domains. However, their widespread adoption has also raised serious safety concerns. In this paper, we uncover a new safety risk of MLLMs: the output preference of MLLMs can be arbitrarily manipulated by carefully optimized images. Such attacks often generate contextually relevant yet biased responses that are neither overtly harmful nor unethical, making them difficult to detect. Specifically, we introduce a novel method, Preference Hijacking (Phi), for manipulating the MLLM response preferences using a preference hijacked image. Our method works at inference time and requires no model modifications. Additionally, we introduce a universal hijacking perturbation -- a transferable component that can be embedded into different images to hijack MLLM responses toward any attacker-specified preferences. Experimental results across various tasks demonstrate the effectiveness of our approach. The code for Phi is accessible at https://github.com/Yifan-Lan/Phi.