TLDR: Token-Level Detective Reward Model for Large Vision Language Models
Deqing Fu, Tong Xiao, Rui Wang, Wang Zhu, Pengchuan Zhang, Guan Pang, Robin Jia, Lawrence Chen
2024-10-08

Summary
This paper introduces the Token-Level Detective Reward Model (TLDR), which improves how large vision language models learn by providing fine-grained feedback on each token of their outputs instead of a single yes-or-no judgment for the whole response.
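To make the contrast in granularity concrete, here is a toy illustration of the two kinds of feedback; the caption, tokens, and labels below are invented for exposition and are not drawn from the paper's data.

```python
# Illustrative only: a hypothetical caption and hand-made labels.

caption_tokens = ["A", "red", "bus", "parked", "next", "to", "a", "tree"]

# Conventional reward model: one binary judgment for the whole caption,
# even if only a single token ("red") is wrong for the given image.
sequence_level_reward = 0  # 0 = rejected, 1 = accepted

# Token-level detective reward: one judgment per token, so training signal
# points to exactly where the caption stops being grounded in the image.
token_level_rewards = [1, 0, 1, 1, 1, 1, 1, 1]  # only "red" flagged as incorrect

assert len(token_level_rewards) == len(caption_tokens)
```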
What's the problem?
Existing reward models for training language models provide very coarse feedback: a single binary judgment (such as 'correct' or 'incorrect') for an entire piece of text, regardless of its length. This is especially problematic for multimodal models that handle both images and text, because the reward model can pick up implicit biases toward the text and become less grounded in the visual input. Such all-or-nothing feedback limits how effectively a model can learn and how well it captures the nuances of both text and images.
What's the solution?
To address these issues, the authors developed TLDR, which evaluates each individual token of a response rather than the response as a whole. They introduce a perturbation-based method for generating challenging examples (hard negatives) along with token-level labels, which train the model to pinpoint exactly where errors occur. This fine-grained feedback helps off-the-shelf models improve and self-correct their outputs, and it doubles as a tool for identifying hallucinations (incorrect or ungrounded content). TLDR also speeds up human annotation, roughly tripling the rate at which high-quality training data can be collected.
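A minimal sketch of how perturbation-based hard negatives and their token-level labels might be constructed is shown below. The substitution table, function name, and single-word edit are illustrative assumptions; the paper's actual pipeline may rely on stronger perturbation sources (for example, a language model) and richer edit types.

```python
import random

# Hypothetical substitution table standing in for the paper's perturbation source.
SWAPS = {
    "dog": "cat",
    "red": "blue",
    "two": "three",
    "standing": "sitting",
}

def make_hard_negative(caption: str, rng: random.Random):
    """Perturb one token of a correct caption and return (tokens, labels).

    Labels are 1 for tokens left unchanged and 0 for the perturbed token,
    mirroring the kind of token-level supervision TLDR is trained on.
    """
    tokens = caption.split()
    candidates = [i for i, t in enumerate(tokens) if t.lower() in SWAPS]
    if not candidates:
        return tokens, [1] * len(tokens)  # nothing to perturb; keep caption positive
    i = rng.choice(candidates)
    tokens[i] = SWAPS[tokens[i].lower()]
    labels = [1] * len(tokens)
    labels[i] = 0
    return tokens, labels

rng = random.Random(0)
print(make_hard_negative("a red dog standing on the grass", rng))
```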
Why it matters?
This research is important because it gives large vision language models more informative feedback during training. By improving how these models learn from both text and images, TLDR could lead to better performance in applications such as image captioning, automated content generation, and other tasks that require a deep understanding of multimodal information.
Abstract
Although reward models have been successful in improving multimodal large language models, the reward models themselves remain brutal and contain minimal information. Notably, existing reward models only mimic human annotations by assigning only one binary feedback to any text, no matter how long the text is. In the realm of multimodal language models, where models are required to process both images and texts, a naive reward model may learn implicit biases toward texts and become less grounded in images. In this paper, we propose a Token-Level Detective Reward Model (TLDR) to provide fine-grained annotations to each text token. We first introduce a perturbation-based method to generate synthetic hard negatives and their token-level labels to train TLDR models. Then we show the rich usefulness of TLDR models both in assisting off-the-shelf models to self-correct their generations, and in serving as a hallucination evaluation tool. Finally, we show that TLDR models can significantly speed up human annotation by 3 times to acquire a broader range of high-quality vision language data.
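For a concrete picture of what "fine-grained annotations to each text token" could look like architecturally, the following PyTorch sketch attaches a per-token binary classification head to per-token hidden states from a multimodal backbone (not shown). The hidden size, class name, and single linear layer are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn

class TokenLevelRewardHead(nn.Module):
    """Minimal sketch of a per-token reward head.

    Assumes a multimodal backbone already produces one hidden state per text
    token, conditioned on the image; hidden_size and the single-linear-layer
    design are illustrative choices.
    """

    def __init__(self, hidden_size: int = 4096):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, 1)

    def forward(self, token_hidden_states: torch.Tensor) -> torch.Tensor:
        # token_hidden_states: (batch, seq_len, hidden_size)
        logits = self.classifier(token_hidden_states).squeeze(-1)
        return torch.sigmoid(logits)  # (batch, seq_len) per-token "correct" probability


# Toy usage with random features standing in for real backbone outputs.
head = TokenLevelRewardHead(hidden_size=4096)
fake_states = torch.randn(2, 8, 4096)   # 2 captions, 8 tokens each
per_token_scores = head(fake_states)    # values in (0, 1), one per token
print(per_token_scores.shape)           # torch.Size([2, 8])
```

At inference time, tokens whose scores fall below a threshold can be surfaced for self-correction or aggregated for hallucination evaluation, matching the uses described in the abstract.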