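The method treats scaffold generation as part of the policy rather than as a fixed, hand-engineered harness. Reward from downstream rollouts is propagated back to scaffold construction, so the scaffold is optimized together with the agent's actions. Because this gives the policy more ways to game its reward, several safeguards are kept outside its control to reduce reward hacking: immutable outer trust boundaries, deterministic monitors, and a frozen LLM judge.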
Ornith-1.0 is useful for researchers studying agentic reinforcement learning, scaffold search, coding agents, and long-rollout optimization. The project highlights two training techniques: pipeline RL for asynchronous training, and staleness weighting to control the effect of off-policy tokens in long trajectories.
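The idea behind staleness weighting can be illustrated with a minimal sketch. The description above does not give the project's actual formula, so everything here is an assumption made for illustration: the exponential decay, the hard cutoff, and the names `staleness_weights`, `weighted_pg_loss`, `decay`, and `max_lag` are all hypothetical. The sketch assumes each token records the policy version that sampled it, and that tokens from older versions should contribute less to the policy-gradient loss.

```python
import math
from typing import List, Sequence


def staleness_weights(
    token_versions: Sequence[int],
    learner_version: int,
    decay: float = 0.5,   # hypothetical decay rate, not from the source
    max_lag: int = 4,     # hypothetical cutoff, not from the source
) -> List[float]:
    """Return one weight per token based on how stale its sampling policy is.

    Assumed scheme: weight = exp(-decay * lag), and 0 once lag exceeds max_lag.
    """
    weights = []
    for version in token_versions:
        lag = learner_version - version
        if lag < 0:
            raise ValueError("token version is newer than the learner version")
        weights.append(0.0 if lag > max_lag else math.exp(-decay * lag))
    return weights


def weighted_pg_loss(
    logprobs: Sequence[float],
    advantages: Sequence[float],
    weights: Sequence[float],
) -> float:
    """Policy-gradient loss averaged over tokens, scaled by staleness weights."""
    total = sum(weights)
    if total == 0.0:
        return 0.0  # every token was too stale to use
    weighted = sum(w * lp * a for w, lp, a in zip(weights, logprobs, advantages))
    return -weighted / total


# A long trajectory whose tokens were sampled under policy versions 5, 5, 4, and 0,
# scored by a learner that is now at version 5.
w = staleness_weights([5, 5, 4, 0], learner_version=5)
loss = weighted_pg_loss([-1.0, -2.0, -0.5, -3.0], [1.0, 1.0, -1.0, 2.0], w)
```

In this sketch, tokens sampled by the current policy keep full weight, slightly stale tokens are down-weighted, and tokens beyond the cutoff are dropped. Normalizing by the sum of weights keeps the loss scale comparable across batches with different amounts of stale data; other choices, such as clipped importance ratios, would be equally plausible.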


