
No Tokens Wasted: Leveraging Long Context in Biomedical Vision-Language Models

Min Woo Sun, Alejandro Lozano, Javier Gamazo Tejero, Vishwesh Nath, Xiao Xiao Sun, James Burgess, Yuhui Zhang, Kun Yuan, Robert Tibshirani, Sean Huver, Serena Yeung-Levy

2025-10-08


Summary

This paper investigates how well vision-language models, which 'look' at images and 'read' text, work when given much longer descriptions to read. The authors find that giving these models access to more text actually helps them understand images better, especially in the biomedical field.

What's the problem?

Current vision-language models are typically pretrained with short text windows (around 77 tokens), so longer, more detailed descriptions get truncated and important information is cut off. This is a big issue in areas like biomedical imaging, where captions from research papers are often lengthy and contain crucial details. Essentially, the models are missing out on a lot of useful information because of this limitation.

What's the solution?

The researchers created a new dataset called BIOMEDICA-LongCAP, which pairs one million images with long, context-aware captions drawn from full research articles. They then used this dataset to train a new model, BMC-LongCLIP, whose text encoder can process inputs of up to 512 tokens, a 6.6x increase over the usual 77-token window. This lets the model use almost all of the caption text, cutting the discarded portion from over half to about 2%, as illustrated in the sketch below.
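To make the "token waste" idea concrete, here is a minimal sketch (not the authors' code) of how one might measure the fraction of caption tokens lost to truncation at different context lengths. The tokenizer name and the example captions are placeholders; any CLIP- or BERT-style text tokenizer would do for the illustration.

```python
# Hedged sketch: estimate "token waste", i.e. the fraction of caption tokens
# discarded when captions are truncated to a fixed context window.
from transformers import AutoTokenizer

# Placeholder tokenizer; the actual model's tokenizer would be used in practice.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

def token_waste(captions, context_length):
    """Return the fraction of tokens that fall outside the context window."""
    total, wasted = 0, 0
    for caption in captions:
        n = len(tokenizer(caption, truncation=False)["input_ids"])
        total += n
        wasted += max(0, n - context_length)
    return wasted / max(total, 1)

# Hypothetical long biomedical figure captions.
captions = [
    "Figure 2: Axial CT slice showing a hypodense lesion in the right hepatic lobe ...",
    "Figure 5: Immunohistochemical staining of tumor tissue with detailed methods ...",
]
print("waste @  77 tokens:", token_waste(captions, 77))   # the paper reports ~55% on its corpus
print("waste @ 512 tokens:", token_waste(captions, 512))  # the paper reports ~2.2%
```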

Why it matters?

This work shows that enabling vision-language models to consider longer context improves their performance on tasks like finding relevant images based on text descriptions and classifying images correctly. This is particularly important in the biomedical field, where accurate image understanding can lead to better diagnoses and research. It suggests that building models that can handle longer text is a promising path forward for improving these types of AI systems.

Abstract

Embedding vision-language models (VLMs) are typically pretrained with short text windows (<77 tokens), which forces the truncation of long-format captions. Yet, the distribution of biomedical captions from large-scale open-source literature reveals that a huge portion of captions far exceeds 77 tokens. To this end, we investigate the impact of pretraining on long-format biomedical captions by extending the context length of text encoders in VLMs. We find that longer context (thus, enabling additional supervision provided in long-format captions) correlates with better retrieval and classification performance. Given this finding, we introduce BIOMEDICA-LongCAP, a dataset of 1M image-caption pairs enriched with context-aware descriptions from full-text articles, providing longer and additional textual supervision. Using BIOMEDICA-LongCAP, we train BMC-LongCLIP, a long-context biomedical VLM with a text encoder supporting windows of up to 512 tokens. Our model extends context capacity by 6.6x, reducing token waste from 55% to just 2.2%. On long-caption retrieval benchmarks, BMC-LongCLIP achieves up to +30% absolute gains in Recall@1 and +2% average improvements in classification, while also converging faster than short-context baselines. Our results demonstrate that long-context modeling is a promising direction for advancing biomedical VLMs.
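For readers unfamiliar with the Recall@1 metric mentioned in the abstract, the following is a minimal sketch of how text-to-image retrieval is typically scored with a CLIP-style embedding model. The embeddings here are random placeholders; in the actual evaluation they would come from BMC-LongCLIP's image and text encoders applied to the benchmark pairs.

```python
# Hedged sketch: text-to-image Recall@1 for a CLIP-style embedding model.
import torch
import torch.nn.functional as F

num_pairs, dim = 1000, 512  # placeholder benchmark size and embedding dimension

# Placeholder embeddings; in practice these are produced by the image and
# (long-context) text encoders, then L2-normalized.
image_emb = F.normalize(torch.randn(num_pairs, dim), dim=-1)
text_emb = F.normalize(torch.randn(num_pairs, dim), dim=-1)

# Cosine similarity between every caption and every image.
sim = text_emb @ image_emb.T                       # shape: (num_pairs, num_pairs)
top1 = sim.argmax(dim=-1)                          # best-matching image per caption
recall_at_1 = (top1 == torch.arange(num_pairs)).float().mean()
print(f"Recall@1: {recall_at_1:.3f}")
```

With random embeddings this prints roughly 1/num_pairs; the reported gains mean the long-context model ranks the correct image first far more often than the short-context baseline.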