A Flexible Large Language Model Guardrail Development Methodology Applied to Off-Topic Prompt Detection
Gabriel Chua, Shing Yee Chan, Shaun Khoo
2024-11-25

Summary
This paper introduces a methodology for developing guardrails that keep large language models (LLMs) from responding to off-topic prompts, a common route to misuse of deployed models.
What's the problem?
LLMs are powerful tools, but they can be misused when users prompt them to perform tasks outside their intended scope. Current approaches to preventing this misuse often rely on curated examples or custom classifiers, which produce many false positives (legitimate, on-topic prompts incorrectly flagged as off-topic) and adapt poorly to new types of misuse. They also typically require real-world usage data, which is not available before the model is deployed.
What's the solution?
The authors propose a flexible, data-free guardrail development methodology. Instead of collecting real-world examples, they qualitatively define the problem space and use an LLM to generate a diverse synthetic dataset of on-topic and off-topic prompts. This dataset is then used to train and benchmark guardrails that classify whether a user prompt is relevant to the system prompt, outperforming heuristic baselines while reducing false positives. Because of this framing, the guardrails also generalize to other misuse categories, such as jailbreak and harmful prompts.
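To make the synthetic-data step concrete, the following is a minimal sketch (not the authors' exact pipeline) of how an LLM could be prompted to invent on-topic and off-topic user prompts for a given system prompt. It assumes the OpenAI Python SDK with an API key in the environment; the model name, prompt wording, and the `generate_synthetic_prompts` helper are illustrative assumptions.

```python
# Illustrative sketch of synthetic prompt generation, not the paper's exact pipeline.
# Assumes the OpenAI Python SDK and an OPENAI_API_KEY in the environment.
import json
from openai import OpenAI

client = OpenAI()

def generate_synthetic_prompts(system_prompt: str, n: int = 20, on_topic: bool = True) -> list[str]:
    """Ask an LLM to invent user prompts that are on- or off-topic for a given system prompt."""
    instruction = (
        f"Here is a chatbot's system prompt:\n---\n{system_prompt}\n---\n"
        f"Write {n} distinct user prompts that are "
        f"{'clearly within' if on_topic else 'clearly outside'} this chatbot's scope. "
        "Return them as a JSON list of strings."
    )
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumption: any capable chat model could be used here
        messages=[{"role": "user", "content": instruction}],
    )
    # Sketch-level parsing: assumes the model returns valid JSON.
    return json.loads(response.choices[0].message.content)

# Each (system prompt, user prompt, label) triple becomes one training or benchmark example.
system_prompt = "You are a customer support assistant for a telecom provider."
dataset = [(system_prompt, p, 1) for p in generate_synthetic_prompts(system_prompt, on_topic=True)]
dataset += [(system_prompt, p, 0) for p in generate_synthetic_prompts(system_prompt, on_topic=False)]
```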
Why it matters?
This research is important because it provides a more effective way to ensure that LLMs operate within their intended scope, making them safer and more reliable for users. By open-sourcing their synthetic dataset and guardrail models, the authors contribute valuable resources for further research and development in AI safety, helping to promote responsible use of language models.
Abstract
Large Language Models are prone to off-topic misuse, where users may prompt these models to perform tasks beyond their intended scope. Current guardrails, which often rely on curated examples or custom classifiers, suffer from high false-positive rates, limited adaptability, and the impracticality of requiring real-world data that is not available in pre-production. In this paper, we introduce a flexible, data-free guardrail development methodology that addresses these challenges. By thoroughly defining the problem space qualitatively and passing this to an LLM to generate diverse prompts, we construct a synthetic dataset to benchmark and train off-topic guardrails that outperform heuristic approaches. Additionally, by framing the task as classifying whether the user prompt is relevant with respect to the system prompt, our guardrails effectively generalize to other misuse categories, including jailbreak and harmful prompts. Lastly, we further contribute to the field by open-sourcing both the synthetic dataset and the off-topic guardrail models, providing valuable resources for developing guardrails in pre-production environments and supporting future research and development in LLM safety.
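As a rough illustration of the relevance-classification framing described above, the sketch below scores a (system prompt, user prompt) pair and flags low-relevance prompts as off-topic. It uses an off-the-shelf sentence-transformers cross-encoder as a stand-in for the fine-tuned guardrail models the authors release; the model name and threshold are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of the "is the user prompt relevant to the system prompt?" framing.
# A generic cross-encoder stands in for the authors' fine-tuned guardrail models;
# the model name and threshold below are illustrative assumptions.
from sentence_transformers import CrossEncoder

scorer = CrossEncoder("cross-encoder/stsb-roberta-base")  # outputs a 0-1 relatedness score

def is_off_topic(system_prompt: str, user_prompt: str, threshold: float = 0.3) -> bool:
    """Flag the user prompt as off-topic when its relevance to the system prompt is low."""
    score = scorer.predict([(system_prompt, user_prompt)])[0]
    return score < threshold

system_prompt = "You are a customer support assistant for a telecom provider."
print(is_off_topic(system_prompt, "My internet bill seems too high this month."))   # likely False
print(is_off_topic(system_prompt, "Write me a poem about the French Revolution."))  # likely True
```

Because the classifier conditions on the system prompt rather than on a fixed list of banned topics, the same guardrail can be reused across applications by swapping in a different system prompt.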