BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

Samuel Stevens

2025-11-21

Summary

This paper introduces BioBench, a new benchmark designed to test how well computer vision models perform on real-world ecological images. It argues that the field's default yardstick, ImageNet, is a poor predictor of performance in this domain.

What's the problem?

Researchers typically use a model's ImageNet score to judge how well it 'sees' and understands images in general. However, this paper shows that doing well on ImageNet doesn't mean a model will do well on images from nature, like pictures of plants, animals, or ecosystems: across 46 modern vision models, ImageNet accuracy explained only 34% of the variance in performance on ecology tasks, and it mis-ranked 30% of the strongest models. ImageNet simply doesn't capture the complexities of ecological imagery, and can actively mislead researchers about which models are truly effective for environmental science.

What's the solution?

To address this, the researchers created BioBench, a collection of 9 ecology tasks totaling 3.1 million images from diverse sources such as drone footage, micrographs, and camera-trap photos. These images span four taxonomic kingdoms - plants, fungi, fish, and more - and six different acquisition modalities. The benchmark also provides a simple Python API to evaluate models on this data and report a single clear score, macro-F1, which weights every class equally and so is well suited to the imbalanced class distributions common in ecological data. The whole evaluation is designed to be quick and easy to run: a large ViT-L model finishes in about 6 hours on a single GPU.
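To see why macro-F1 is the right metric for imbalanced ecological data, here is a minimal sketch (not BioBench's actual code) of the standard macro-F1 computation: F1 is computed per class and then averaged with equal weight, so a rare species counts as much as a common one. The example labels are made up for illustration.

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Compute F1 per class, then average with equal weight per class,
    so rare classes count as much as common ones."""
    classes = set(y_true) | set(y_pred)
    tp = defaultdict(int)  # true positives per class
    fp = defaultdict(int)  # false positives per class
    fn = defaultdict(int)  # false negatives per class
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1
            fn[t] += 1
    f1s = []
    for c in classes:
        denom = 2 * tp[c] + fp[c] + fn[c]
        f1s.append(2 * tp[c] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Always guessing the majority class looks fine on accuracy (0.80)
# but is punished by macro-F1 (~0.44), because the rare class gets F1 = 0.
y_true = ["plant"] * 8 + ["fungus"] * 2
y_pred = ["plant"] * 10
print(f"macro-F1: {macro_f1(y_true, y_pred):.2f}")
```

This is exactly the failure mode plain accuracy hides on long-tailed species distributions, which is why class-balanced macro-F1 is the headline number.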

Why it matters?

BioBench is important because it provides a more realistic and reliable way to evaluate computer vision models for ecological applications. This will help researchers develop better AI tools for studying and protecting the environment, and it also serves as a model for creating similar benchmarks in other scientific fields where standard tests aren't sufficient.

Abstract

ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
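The evaluation recipe the abstract describes - fit a lightweight classifier on features from a frozen backbone, never fine-tuning the backbone itself - can be sketched generically as follows. This is not BioBench's API; the synthetic embeddings and the nearest-centroid head are stand-ins chosen to keep the sketch self-contained.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for frozen-backbone embeddings: in the real recipe these
# come from a pretrained vision model whose weights are never updated.
dim, n_per_class, n_classes = 32, 100, 3
centers = rng.normal(size=(n_classes, dim)) * 3.0
feats = np.concatenate(
    [c + rng.normal(size=(n_per_class, dim)) for c in centers]
)
labels = np.repeat(np.arange(n_classes), n_per_class)

# Lightweight head: nearest class centroid in embedding space.
# Only this head is "trained"; the backbone stays frozen.
centroids = np.stack(
    [feats[labels == c].mean(axis=0) for c in range(n_classes)]
)

def predict(x):
    # Squared distance from each embedding to each class centroid.
    d = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return d.argmin(axis=1)

acc = (predict(feats) == labels).mean()
print(f"accuracy of lightweight head on frozen features: {acc:.2f}")
```

Because only a small head is fit per task, evaluating a new backbone across all 9 tasks stays cheap, which is how the full benchmark fits in a few GPU-hours.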