OlmoEarth: Stable Latent Image Modeling for Multimodal Earth Observation

Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, Joshua Hansen, Andrew Howe, Patrick Alan Johnson, Mark Otterlee, Ted Schmitt, Hunter Pitelka, Stephen Daspit, Rachel Ratner, Christopher Wilhelm, Sebastian Wood, Mike Jacobi, Hannah Kerner

2025-11-18

Summary

This paper introduces OlmoEarth, a new artificial intelligence model designed to understand and analyze Earth observation data, such as satellite imagery captured repeatedly over time.

What's the problem?

Earth observation data is hard to analyze because it combines three properties at once. It is spatial like an image (where things are located), sequential like a video (how things change over time), and multimodal (many different types of data at once). Existing AI models struggle to handle all of these aspects effectively.

What's the solution?

The researchers created OlmoEarth, a model built specifically for this type of data. They trained it with a new self-supervised method in which the model learns by trying to predict missing parts of the Earth observation data. This 'fill-in-the-blanks' approach is combined with a masking strategy (the rule for choosing which parts to hide) and a loss function (the measure of how far off a prediction is), both designed for Earth observation. Together, these let OlmoEarth outperform previous models on most of the tasks tested. The researchers also built a platform around OlmoEarth to make it easier for others to use.

Why does it matter?

OlmoEarth matters because it analyzes Earth observation data more accurately than earlier foundation models on most of the tasks tested. This can help organizations like nonprofits and NGOs tackle big global challenges, such as tracking deforestation, monitoring climate change, or responding to natural disasters, by providing them with powerful AI tools and data management resources.

Abstract

Earth observation data presents a unique challenge: it is spatial like images, sequential like video or text, and highly multimodal. We present OlmoEarth: a multimodal, spatio-temporal foundation model that employs a novel self-supervised learning formulation, masking strategy, and loss all designed for the Earth observation domain. OlmoEarth achieves state-of-the-art performance compared to 12 other foundation models across a variety of research benchmarks and real-world tasks from external partners. When evaluating embeddings OlmoEarth achieves the best performance on 15 out of 24 tasks, and with full fine-tuning it is the best on 19 of 29 tasks. We deploy OlmoEarth as the backbone of an end-to-end platform for data collection, labeling, training, and inference of Earth observation models. The OlmoEarth Platform puts frontier foundation models and powerful data management tools into the hands of non-profits and NGOs working to solve the world's biggest problems. OlmoEarth source code, training data, and pre-trained weights are available at https://github.com/allenai/olmoearth_pretrain.
