Speak Easy: Eliciting Harmful Jailbreaks from LLMs with Simple Interactions
Yik Siu Chan, Narutatsu Ri, Yuxin Xiao, Marzyeh Ghassemi
2025-02-07
Summary
This paper shows that large language models (LLMs) can be pushed into producing genuinely harmful responses through simple, everyday interaction patterns rather than technically sophisticated jailbreak attacks. The authors introduce Speak Easy, a multi-step, multilingual attack framework, and HarmScore, a metric for how effectively a model's response enables harmful actions.
What's the problem?
Despite extensive safety alignment efforts, AI models called Large Language Models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. Existing studies mostly examine attacks requiring technical expertise, leaving two questions underexplored: whether jailbroken responses are actually useful to average users trying to carry out harmful actions, and whether safety vulnerabilities also exist in more common, simple human-LLM interactions.
What's the solution?
The researchers observe that LLM responses most effectively facilitate harmful actions when they are both actionable and informative, and that these two attributes are easily elicited through multi-step, multilingual interactions. Building on this insight, they propose HarmScore, a jailbreak metric that measures how effectively a response enables harmful actions, and Speak Easy, a simple attack framework that poses a harmful request as a series of seemingly benign sub-queries across multiple steps and languages.
Why does it matter?
This research matters because it exposes a critical yet often overlooked vulnerability: malicious users do not need technical expertise to extract harmful content from LLMs. Simply layering Speak Easy onto direct requests and existing jailbreak baselines increased Attack Success Rate by an average of 0.319 and HarmScore by 0.426 across four safety benchmarks, on both open-source and proprietary models, suggesting that safety alignment must account for common, everyday interaction patterns, not just sophisticated attacks.
Abstract
Despite extensive safety alignment efforts, large language models (LLMs) remain vulnerable to jailbreak attacks that elicit harmful behavior. While existing studies predominantly focus on attack methods that require technical expertise, two critical questions remain underexplored: (1) Are jailbroken responses truly useful in enabling average users to carry out harmful actions? (2) Do safety vulnerabilities exist in more common, simple human-LLM interactions? In this paper, we demonstrate that LLM responses most effectively facilitate harmful actions when they are both actionable and informative--two attributes easily elicited in multi-step, multilingual interactions. Using this insight, we propose HarmScore, a jailbreak metric that measures how effectively an LLM response enables harmful actions, and Speak Easy, a simple multi-step, multilingual attack framework. Notably, by incorporating Speak Easy into direct request and jailbreak baselines, we see an average absolute increase of 0.319 in Attack Success Rate and 0.426 in HarmScore in both open-source and proprietary LLMs across four safety benchmarks. Our work reveals a critical yet often overlooked vulnerability: Malicious users can easily exploit common interaction patterns for harmful intentions.
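The abstract states that a response enables harm when it is both actionable and informative. As a toy illustration of a HarmScore-style metric, the sketch below scores a response by combining per-step actionability and informativeness ratings; the function name, the per-step ratings, and the aggregation (averaging the product of the two attributes) are illustrative assumptions, not the paper's actual scoring model.

```python
# Hypothetical sketch of a HarmScore-like metric. Each step of a response is
# rated for actionability and informativeness in [0, 1]; a step contributes
# only insofar as it is BOTH actionable and informative, which the product
# of the two ratings captures. The paper's real formulation may differ.

def harmscore_sketch(steps):
    """steps: list of (actionability, informativeness) pairs, each in [0, 1]."""
    if not steps:
        return 0.0
    return sum(a * i for a, i in steps) / len(steps)

# Two fully actionable-and-informative steps plus one vague step.
print(harmscore_sketch([(1.0, 1.0), (1.0, 1.0), (0.2, 0.5)]))  # → 0.7
```

Under this toy scheme, a response that is highly informative but gives no concrete steps (or vice versa) scores low, matching the paper's intuition that both attributes are needed to facilitate harm.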