Qwen3Guard Technical Report
Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, Baosong Yang, Chen Cheng, Jialong Tang, Jiandong Jiang, Jianwei Zhang, Jijie Xu, Ming Yan, Minmin Sun, Pei Zhang, Pengjun Xie, Qiaoyu Tang, Qin Zhu
2025-10-17
Summary
This paper introduces Qwen3Guard, a family of multilingual safety guardrail models designed to make large language models (LLMs) safer to use as they become more widely deployed.
What's the problem?
Current systems that check LLM outputs for safety have two big issues. First, they usually give only a binary 'safe' or 'unsafe' label, which is too coarse: different people, applications, and regions have different ideas about what is acceptable. Second, these checks happen *after* the LLM has finished generating its response, so harmful content can be produced and shown to someone before it is flagged. This is a particular problem for LLMs that generate text bit by bit, or 'stream' their responses.
What's the solution?
The researchers created Qwen3Guard, which comes in two variants. The first, Generative Qwen3Guard, doesn't just say 'safe' or 'unsafe' but gives a three-way judgment: safe, controversial, or unsafe. The second, Stream Qwen3Guard, analyzes text as the LLM is *creating* it, checking each token as it is produced and allowing immediate intervention if something unsafe is detected (see the sketch below). Qwen3Guard also covers up to 119 languages and dialects and comes in three sizes (0.6B, 4B, and 8B parameters) to balance accuracy and speed.
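To make the streaming idea concrete, here is a minimal Python sketch of per-token moderation in the spirit of Stream Qwen3Guard. The `classify_token` function is a hypothetical stand-in for the model's token-level classification head; a real deployment would score tokens with the released checkpoint rather than the toy rule used here.

```python
# Minimal sketch of streaming moderation, assuming a per-token classifier
# in the style of Stream Qwen3Guard. Names here are illustrative, not the
# released API.
from enum import Enum
from typing import Iterable, Iterator

class Verdict(Enum):
    SAFE = "safe"
    CONTROVERSIAL = "controversial"
    UNSAFE = "unsafe"

def classify_token(context: str, token: str) -> Verdict:
    # Hypothetical stand-in for the guard model's token-level head:
    # a real system would score `context + token` with the checkpoint.
    banned = {"<harmful>"}  # toy rule, for illustration only
    return Verdict.UNSAFE if token in banned else Verdict.SAFE

def moderated_stream(tokens: Iterable[str]) -> Iterator[str]:
    """Yield tokens to the user until the guard flags one, then stop."""
    context = ""
    for token in tokens:
        if classify_token(context, token) is Verdict.UNSAFE:
            yield "[generation stopped by guardrail]"
            return
        context += token
        yield token

if __name__ == "__main__":
    demo = ["Hello", ", here is", "<harmful>", " content"]
    print("".join(moderated_stream(demo)))
```

Because the check runs on each token before it is shown, the stream can be cut off mid-generation instead of after the full response exists, which is the key difference from post-hoc moderation.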
Why does it matter?
This work matters because it makes LLM deployments more reliable and responsible. By providing more fine-grained safety assessments and enabling real-time monitoring, Qwen3Guard helps prevent harmful or inappropriate content from reaching users, making these powerful AI tools safer to use globally.
Abstract
As large language models (LLMs) become more capable and widely used, ensuring the safety of their outputs is increasingly critical. Existing guardrail models, though useful in static evaluation settings, face two major limitations in real-world applications: (1) they typically output only binary "safe/unsafe" labels, which can be interpreted inconsistently across diverse safety policies, rendering them incapable of accommodating varying safety tolerances across domains; and (2) they require complete model outputs before performing safety checks, making them fundamentally incompatible with streaming LLM inference, thereby preventing timely intervention during generation and increasing exposure to harmful partial outputs. To address these challenges, we present Qwen3Guard, a series of multilingual safety guardrail models with two specialized variants: Generative Qwen3Guard, which casts safety classification as an instruction-following task to enable fine-grained tri-class judgments (safe, controversial, unsafe); and Stream Qwen3Guard, which introduces a token-level classification head for real-time safety monitoring during incremental text generation. Both variants are available in three sizes (0.6B, 4B, and 8B parameters) and support up to 119 languages and dialects, providing comprehensive, scalable, and low-latency safety moderation for global LLM deployments. Evaluated across English, Chinese, and multilingual benchmarks, Qwen3Guard achieves state-of-the-art performance in both prompt and response safety classification. All models are released under the Apache 2.0 license for public use.
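The abstract describes casting safety classification as an instruction-following task. The sketch below illustrates that framing under stated assumptions: the prompt template, label wording, and parsing are placeholders for illustration, not Qwen3Guard's actual chat format.

```python
# Hedged sketch: tri-class safety judgment as instruction following.
# The prompt template and parsing below are illustrative assumptions,
# not the released Qwen3Guard format.
from typing import Optional

LABELS = ("unsafe", "controversial", "safe")

def build_guard_prompt(user_prompt: str, model_response: Optional[str] = None) -> str:
    """Build an instruction asking the guard for one of three labels."""
    target = f"User prompt:\n{user_prompt}"
    if model_response is not None:
        target += f"\n\nModel response:\n{model_response}"
    return (
        "Classify the following content as exactly one of: "
        "safe, controversial, unsafe.\n\n" + target + "\n\nLabel:"
    )

def parse_verdict(generation: str) -> str:
    """Map the guard's free-text output onto the tri-class label set."""
    text = generation.strip().lower()
    for label in LABELS:  # check "unsafe" first: "safe" is a substring of it
        if label in text:
            return label
    return "unsafe"  # conservative fallback; our design choice, not the paper's

print(parse_verdict("Label: controversial"))  # -> controversial
```

One advantage of the instruction-following framing is that the same guard model can classify either a user prompt alone or a prompt-response pair, matching the prompt- and response-classification settings the paper evaluates.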