HISTAI: An Open-Source, Large-Scale Whole Slide Image Dataset for Computational Pathology

Dmitry Nechaev, Alexey Pchelnikov, Ekaterina Ivanova

2025-05-20

Summary

This paper introduces HISTAI, a large, open-access collection of whole slide images (digitized microscope slides of tissue samples) that researchers and AI developers can use to train computers to better understand and diagnose diseases.

What's the problem?

The problem is that most public collections of medical slide images are too small, don't include enough different kinds of tissues, or are missing important information about the patients and their diagnoses. This makes it hard for AI models to learn well and work reliably in real medical situations.

What's the solution?

To solve this, the researchers created the HISTAI dataset, which includes over 60,000 whole slide images from many different tissue types. Each case comes with detailed clinical information: the diagnosis, demographic data, pathological annotations, and standardized diagnostic codes. The dataset is open-access and is designed to help AI models learn more effectively and be more accurate.

Why does it matter?

This matters because a big, diverse, and well-annotated dataset can help scientists and doctors build better AI tools for diagnosing diseases, which could lead to more accurate results and improved patient care in hospitals and clinics.

Abstract

Recent advancements in Digital Pathology (DP), particularly through artificial intelligence and Foundation Models, have underscored the importance of large-scale, diverse, and richly annotated datasets. Despite their critical role, publicly available Whole Slide Image (WSI) datasets often lack sufficient scale, tissue diversity, and comprehensive clinical metadata, limiting the robustness and generalizability of AI models. In response, we introduce the HISTAI dataset, a large, multimodal, open-access WSI collection comprising over 60,000 slides from various tissue types. Each case in the HISTAI dataset is accompanied by extensive clinical metadata, including diagnosis, demographic information, detailed pathological annotations, and standardized diagnostic coding. The dataset aims to fill gaps identified in existing resources, promoting innovation, reproducibility, and the development of clinically relevant computational pathology solutions. The dataset can be accessed at https://github.com/HistAI/HISTAI.