
For-Value: Efficient Forward-Only Data Valuation for finetuning LLMs and VLMs

Wenlong Deng, Qi Zeng, Jiaming Zhang, Minghui Chen, Zixin Ding, Christos Thrampoulidis, Boying Gong, Xiaoxiao Li

2026-04-28


Summary

This paper introduces a new method, called For-Value, for measuring how much each piece of data contributes when fine-tuning powerful AI models like large language models (LLMs) and vision-language models (VLMs).

What's the problem?

Currently, determining the value of training data for these large AI models is slow and computationally expensive. Existing methods rely on 'backpropagation', essentially retracing the steps the model took to learn, and must do so separately for each training example. This is extremely demanding for models with billions of parameters, and because every example needs its own gradient computation, the work cannot be sped up with efficient batch processing.
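To make the cost concrete, here is a minimal PyTorch sketch of a TracIn-style gradient influence score, one common gradient-based approach (an illustration of the cost pattern, not necessarily the exact baselines used in the paper; all names are hypothetical). Every training example requires its own forward and backward pass:

```python
import torch

def tracin_style_scores(model, loss_fn, x_train, y_train, x_val, y_val):
    # Gradient of the validation loss, computed once up front.
    params = list(model.parameters())
    g_val = torch.autograd.grad(loss_fn(model(x_val), y_val), params)
    scores = []
    for x, y in zip(x_train, y_train):
        # Each training example needs its own forward AND backward pass;
        # per-example gradients cannot share a batch, which is the bottleneck.
        g_i = torch.autograd.grad(
            loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0)), params)
        scores.append(sum((a * b).sum() for a, b in zip(g_val, g_i)))
    return torch.stack(scores)
```

For a billion-parameter model, each of those backward passes touches every parameter, which is exactly the expense For-Value is designed to avoid.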

What's the solution?

For-Value offers a much faster way to assess data importance. Instead of backpropagation, it needs only a single 'forward' pass of the data through the model. The core idea is that an example's value can be read off from how well the model's final hidden representations align with the prediction errors the model makes on that example. This reduces valuation to a simple closed-form calculation that can be run over whole batches at once, significantly speeding up the process.
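To illustrate the idea, here is a minimal PyTorch sketch of a forward-only score in this spirit, assuming a softmax output layer; the function and variable names are illustrative, and the paper's exact estimator may differ. A training example's score multiplies how much its prediction error aligns with a validation example's error by how much their final hidden representations align:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def forward_only_scores(h_train, y_train, h_val, y_val, W):
    num_classes = W.shape[1]
    # Prediction errors at the output layer: softmax probabilities minus one-hot labels.
    e_train = F.softmax(h_train @ W, dim=-1) - F.one_hot(y_train, num_classes).float()
    e_val = F.softmax(h_val @ W, dim=-1) - F.one_hot(y_val, num_classes).float()
    # Score = (error alignment) * (hidden-representation alignment),
    # a closed-form expression evaluated for the whole batch at once.
    return (e_train @ e_val) * (h_train @ h_val)

# Example: score 4 training examples against one validation example.
h_train = torch.randn(4, 16); y_train = torch.tensor([0, 1, 2, 1])
h_val = torch.randn(16); y_val = torch.tensor(2)
W = torch.randn(16, 3)
print(forward_only_scores(h_train, y_train, h_val, y_val, W))
```

Because the score needs only the final hidden states and the output-layer logits, it can be computed for thousands of examples in one batched forward pass, which is where the efficiency gain comes from.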

Why it matters?

This research is important because understanding data value helps us build better and more trustworthy AI. It lets us identify which data points are most crucial for learning and which might be incorrectly labeled, and ultimately improves the performance and reliability of these increasingly powerful models. The efficiency gains also make it practical to apply data valuation to very large models that were previously too expensive to analyze.

Abstract

Data valuation is essential for enhancing the transparency and accountability of large language models (LLMs) and vision-language models (VLMs). However, existing methods typically rely on gradient computations, making them computationally prohibitive for billion-parameter models and precluding batch parallelization. In this work, we introduce For-Value, a forward-only data valuation framework that enables efficient batch-scalable value estimation while maintaining effectiveness. Leveraging the expressive power of pretrained LLMs/VLMs, we theoretically demonstrate that data valuation can be captured by the alignment between the final hidden representations and prediction errors at the last layer. In light of this insight, For-Value computes data value using a simple closed-form expression with a single forward pass, eliminating the need for costly backpropagation and enabling efficient batch computation at scale. Extensive experiments show that For-Value matches or outperforms gradient-based baselines in detecting influential data and mislabeled data, while achieving significant efficiency improvements.
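One standard identity makes the abstract's claim concrete, under the assumption of a softmax head with cross-entropy loss (the paper's derivation may be more general): the last-layer gradient factorizes into a prediction error times a final hidden representation, so gradient inner products split into two alignments, both obtainable from a single forward pass.

```latex
% Last-layer gradient for example i, with softmax probabilities p_i,
% one-hot label y_i, and final hidden representation h_i:
\nabla_{W}\,\ell_i = (p_i - y_i)\, h_i^{\top}
% Hence the gradient inner product factorizes into two forward-computable terms:
\langle \nabla_{W}\,\ell_i,\ \nabla_{W}\,\ell_j \rangle
  = \underbrace{\langle p_i - y_i,\ p_j - y_j \rangle}_{\text{error alignment}}
    \cdot
    \underbrace{\langle h_i,\ h_j \rangle}_{\text{representation alignment}}
```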