Key Features

Optimizes both task scaffolds and solution rollouts through reinforcement learning.
Lets task-specific orchestration strategies emerge instead of hand-coding every harness.
Propagates rollout reward back into the scaffold-authoring stage.
Uses immutable outer trust boundaries to constrain self-generated scaffolds.
Adds deterministic monitoring for forbidden environment or verifier access.
Uses a frozen LLM judge as an additional veto against intent-level reward hacking.
Applies pipeline RL for long asynchronous rollouts.
Uses staleness weighting to downweight older off-policy tokens.

The method treats scaffold generation as part of the policy rather than a fixed hand-engineered harness. Reward from downstream rollouts is propagated back to scaffold construction, while safeguards such as immutable outer trust boundaries, deterministic monitors, and a frozen LLM judge reduce reward hacking.
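As a rough illustration of how the two safeguards could be layered on top of a rollout reward, here is a minimal Python sketch. It is not Ornith-1.0's implementation: the `ToolCall` record, the `FORBIDDEN_PREFIXES` paths, the `judge_vetoes` callback, and the choice to zero out the reward on a veto are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List

# Hypothetical protected locations; the source does not name the real ones.
FORBIDDEN_PREFIXES = ("/verifier/", "/env/hidden_tests/")


@dataclass(frozen=True)
class ToolCall:
    """One recorded action from a rollout (illustrative structure only)."""
    tool: str
    path: str


def monitor_flags(calls: Iterable[ToolCall]) -> bool:
    """Deterministic check: True if any call touches a forbidden prefix."""
    return any(call.path.startswith(FORBIDDEN_PREFIXES) for call in calls)


def gated_reward(
    raw_reward: float,
    calls: Iterable[ToolCall],
    judge_vetoes: Callable[[List[ToolCall]], bool],
) -> float:
    """Return the rollout reward only if neither safeguard objects.

    The deterministic monitor runs first because it is cheap and exact.
    `judge_vetoes` stands in for a frozen LLM judge and is assumed to
    return True when it considers the trajectory to be reward hacking.
    """
    recorded = list(calls)
    if monitor_flags(recorded):
        return 0.0  # assumed penalty: the reward is simply withheld
    if judge_vetoes(recorded):
        return 0.0
    return raw_reward
```

Running the deterministic check before the judge keeps the exact, inexpensive rule as the first line of defense, with the judge acting only as an additional veto on trajectories the monitor lets through.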


Ornith-1.0 is useful for researchers studying agentic reinforcement learning, scaffold search, coding agents, and long-rollout optimization. The project highlights pipeline RL for asynchronous training and staleness weighting, which limits the influence of older off-policy tokens in long trajectories.
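To make the idea of staleness weighting concrete, the sketch below downweights each token's loss according to how many policy updates have happened since it was sampled. The description above does not give the weighting function, so the exponential decay, the `decay` parameter, the per-token version tags, and the weight normalization are all assumptions chosen for illustration.

```python
from typing import Sequence


def staleness_weight(token_version: int, current_version: int, decay: float = 0.9) -> float:
    """Weight in (0, 1] that shrinks as the token's policy version ages.

    Exponential decay is an assumed form, not the project's documented one.
    """
    lag = max(0, current_version - token_version)
    return decay ** lag


def weighted_token_loss(
    token_losses: Sequence[float],
    token_versions: Sequence[int],
    current_version: int,
    decay: float = 0.9,
) -> float:
    """Average per-token losses, giving older off-policy tokens less weight."""
    if len(token_losses) != len(token_versions):
        raise ValueError("token_losses and token_versions must have equal length")
    weights = [staleness_weight(v, current_version, decay) for v in token_versions]
    total_weight = sum(weights)
    if total_weight == 0:
        return 0.0  # empty batch
    weighted_sum = sum(w * loss for w, loss in zip(weights, token_losses))
    return weighted_sum / total_weight
```

Tagging each token with a version, rather than tagging the whole rollout, matters when a single long trajectory spans several policy updates, so its early tokens are staler than its late ones. Dividing by the total weight keeps the loss scale comparable across batches with different staleness; an unnormalized sum would be an equally plausible choice.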
