Bloom's operational flow is organized into four sequential pipeline stages. It begins with the Understanding Agent, which interprets the target behavior and its examples to grasp the underlying scientific motivation. The Ideation Agent then generates diverse evaluation scenarios designed to elicit that behavior, using intelligent batching for efficiency. The Rollout Agent executes the generated interactions against the specified target model. Finally, the Judgment and Meta-Judgment Agents score the outcomes for the target behavior and any configured secondary qualities, with the Meta-Judgment Agent synthesizing the findings into a comprehensive report.
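The four stages above can be sketched as a simple linear pipeline. The sketch below is purely illustrative: the class and function names are hypothetical, and each stage is a stub standing in for what would be an LLM-backed agent in the real system.

```python
# Hypothetical sketch of a four-stage Understanding -> Ideation -> Rollout ->
# Judgment pipeline. Stage names and data fields are assumptions for
# illustration, not Bloom's actual internals.
from dataclasses import dataclass, field


@dataclass
class PipelineState:
    behavior: str
    understanding: str = ""
    scenarios: list = field(default_factory=list)
    transcripts: list = field(default_factory=list)
    scores: dict = field(default_factory=dict)


def understand(state: PipelineState) -> PipelineState:
    # Understanding Agent: interpret the target behavior.
    state.understanding = f"motivation behind '{state.behavior}'"
    return state


def ideate(state: PipelineState, diversity: int = 3) -> PipelineState:
    # Ideation Agent: generate diverse scenarios (stubbed).
    state.scenarios = [
        f"scenario {i} eliciting {state.behavior}" for i in range(diversity)
    ]
    return state


def rollout(state: PipelineState) -> PipelineState:
    # Rollout Agent: run each scenario against the target model (stubbed).
    state.transcripts = [f"transcript for: {s}" for s in state.scenarios]
    return state


def judge(state: PipelineState) -> PipelineState:
    # Judgment Agent: score each transcript (placeholder scores).
    state.scores = {t: 0.0 for t in state.transcripts}
    return state


def run_pipeline(behavior: str) -> PipelineState:
    state = PipelineState(behavior=behavior)
    for stage in (understand, ideate, rollout, judge):
        state = stage(state)
    return state
```

Because each stage only reads from and writes to the shared state object, stages can be swapped or instrumented independently, which mirrors how a staged agent pipeline is typically composed.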
The system offers substantial flexibility over the evaluation process, supporting everything from quick local debugging to large-scale experiments tracked with Weights & Biases. Users configure an entire run through a central `seed.yaml` file, specifying parameters such as the target model, evaluation diversity, maximum conversation length, and whether to use extended reasoning effort or web search during scenario generation. Bloom also integrates with external tools: an interactive web-based viewer for browsing results, and LiteLLM for unified API access across multiple LLM providers, which simplifies model comparisons and aids reproducibility.
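A configuration along these lines could drive such a run. The field names below are illustrative guesses based on the parameters described above, not Bloom's exact `seed.yaml` schema:

```yaml
# Illustrative seed.yaml sketch -- keys are assumptions, not the real schema.
target_model: provider/model-name   # any LiteLLM-style model identifier
behavior: target-behavior-name      # the behavior under evaluation
diversity: 0.5                      # how varied the generated scenarios are
max_turns: 10                       # maximum conversation length per rollout
reasoning_effort: high              # extended reasoning during generation
web_search: true                    # allow web search while ideating
```

Centralizing these knobs in one file is what makes runs easy to reproduce: rerunning an experiment or comparing models amounts to rerunning the pipeline with the same or a minimally edited seed file.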

