Safe Flow Q-Learning: Offline Safe Reinforcement Learning with Reachability-Based Flow Policies
Mumuksh Tayal, Manan Tayal, Ravi Prakash
2026-03-24
Summary
This paper introduces a new method called Safe Flow Q-Learning (SafeFQL) for teaching AI agents to make good decisions from existing data, while also ensuring they stay safe and avoid dangerous situations.
What's the problem?
Currently, teaching AI agents from past data without letting them experiment in the real world is tricky, especially when safety is crucial. Existing methods either aren't strict enough in guaranteeing safety or are too slow to react in real-time situations like controlling a robot or navigating a vehicle. They often rely on complex calculations or on repeatedly generating and filtering candidate safe actions at decision time, which is too slow for real-time control.
What's the solution?
SafeFQL builds upon a technique called Flow Q-Learning and adds a safety component. It learns what actions are safe by figuring out a 'safety boundary' and then uses this information to guide the agent's choices. The agent learns to mimic safe behaviors from the data and then uses a simplified 'one-step' approach to quickly pick the best safe action, without repeatedly generating and filtering candidate actions at deployment. To deal with imperfections in the learned safety boundary, it also uses a technique called conformal prediction to provide a probabilistic guarantee of safety, even with limited data.
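The conformal-prediction step described above can be sketched as split conformal calibration over a held-out set of states with known safety labels. This is a minimal illustration, not the paper's implementation: the function names, the scalar toy states, and the sign convention (higher learned safety value means "predicted safer") are all assumptions made here for clarity.

```python
import numpy as np

def calibrate_safety_threshold(h, calib_states, unsafe_mask, alpha=0.1):
    """Split-conformal adjustment of a learned safety threshold (illustrative).

    h: learned safety value function; by the assumed sign convention,
       larger h(s) means the state is predicted to be safer.
    calib_states, unsafe_mask: held-out states with ground-truth unsafe labels.
    Returns tau such that, under exchangeability, a new truly unsafe state
    satisfies h(s) <= tau with probability >= 1 - alpha.
    """
    # Nonconformity scores: learned safety values on truly unsafe states.
    scores = np.array([h(s) for s, u in zip(calib_states, unsafe_mask) if u])
    n = len(scores)
    # Finite-sample (n+1)-corrected empirical quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return np.sort(scores)[min(k, n) - 1]

# Toy example: scalar states, identity safety value.
states = [0.1 * i for i in range(1, 11)] + [2.0, 3.0]
unsafe = [True] * 10 + [False, False]
tau = calibrate_safety_threshold(lambda s: s, states, unsafe, alpha=0.1)
# At deployment, an action's resulting state is accepted only when h(s) > tau.
```

The point of the finite-sample correction is that the guarantee holds for any dataset size, which matches the paper's claim of finite-sample probabilistic safety coverage even with limited data.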
Why it matters?
This research is important because it allows for the creation of safer AI systems that can operate in the real world without constant human supervision. SafeFQL is faster than other safe learning methods, making it suitable for applications where quick reactions are essential, like self-driving cars or robotics. It shows promising results in tasks like boat navigation and controlling robotic arms, demonstrating its potential for real-world use.
Abstract
Offline safe reinforcement learning (RL) seeks reward-maximizing policies from static datasets under strict safety constraints. Existing methods often rely on soft expected-cost objectives or iterative generative inference, which can be insufficient for safety-critical real-time control. We propose Safe Flow Q-Learning (SafeFQL), which extends Flow Q-Learning (FQL) to safe offline RL by combining a Hamilton--Jacobi reachability-inspired safety value function with an efficient one-step flow policy. SafeFQL learns the safety value via a self-consistency Bellman recursion, trains a flow policy by behavioral cloning, and distills it into a one-step actor for reward-maximizing safe action selection without rejection sampling at deployment. To account for finite-data approximation error in the learned safety boundary, we add a conformal prediction calibration step that adjusts the safety threshold and provides finite-sample probabilistic safety coverage. Empirically, SafeFQL trades modestly higher offline training cost for substantially lower inference latency than diffusion-style safe generative baselines, which is advantageous for real-time safety-critical deployment. Across boat navigation and Safety Gymnasium MuJoCo tasks, SafeFQL matches or exceeds prior offline safe RL performance while substantially reducing constraint violations.
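The reachability-inspired safety value mentioned in the abstract is typically learned through a discounted Hamilton--Jacobi Bellman backup of the form V(s) = (1-γ)·l(s) + γ·min(l(s), max_a V(s')), where l(s) is a signed safety margin (negative inside the failure set). A minimal tabular sketch on a toy 1D grid of our own construction (the grid, margins, and dynamics are illustrative assumptions, not the paper's environments):

```python
import numpy as np

# Toy 1D grid: states 0..9, with the failure set at the two walls {0, 9}.
# l(s) is an assumed signed safety margin: negative inside the failure set.
n_states = 10
l = np.array([-1.0] + [1.0] * 8 + [-1.0])
actions = [-1, 0, +1]          # move left, stay, move right
gamma = 0.95

def step(s, a):
    # Deterministic dynamics, clipped at the boundary.
    return min(max(s + a, 0), n_states - 1)

# Discounted HJ-reachability value iteration:
#   V(s) <- (1 - gamma) * l(s) + gamma * min(l(s), max_a V(step(s, a)))
V = l.copy()
for _ in range(200):
    best_next = np.array(
        [max(V[step(s, a)] for a in actions) for s in range(n_states)]
    )
    V = (1 - gamma) * l + gamma * np.minimum(l, best_next)

# States with V[s] >= 0 are (approximately) in the safe set: from them,
# some action sequence keeps the margin nonnegative forever.
```

The min with l(s) is what distinguishes this backup from a standard reward Bellman equation: the value tracks the worst margin ever encountered along the best-controlled trajectory, so its zero level set approximates the safety boundary that SafeFQL's policy is constrained by.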