Basic Category Usage in Vision Language Models

Hunter Sawyer, Jesse Roberts, Kyle Moore

2025-03-18

Summary

This paper investigates whether vision-language models (VLMs) share the human tendency to label visual stimuli at the "basic" level of categorization (e.g., "dog" rather than the more general "animal" or the more specific "beagle"), and finds that two open-source VLMs do so in ways consistent with human behavior.

What's the problem?

Psychology has long recognized that humans prefer a basic level of categorization when naming what they see, a phenomenon identified by Rosch in 1976. It has been unclear, however, whether vision-language models, which are trained on large amounts of human-generated data, inherit this cognitive behavior or label images at different levels of abstraction.

What's the solution?

The authors probe two recently released open-source VLMs, Llama 3.2 Vision Instruct (11B) and Molmo 7B-D, and find that both prefer basic level labels, consistent with human behavior. The models also reproduce more nuanced human patterns, including the difference between biological and non-biological basic level effects and the well-established shift that experts show away from the basic level in their domain of expertise.
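To make the idea of category levels concrete, here is a minimal, runnable sketch of how one might score a model's free-form label as superordinate, basic, or subordinate. The tiny taxonomy and the `category_level` function are illustrative assumptions for this summary, not the paper's actual materials or method.

```python
# Illustrative sketch: classify a label by Rosch-style category level.
# The taxonomy below is a small hand-built example, not the paper's stimuli.
TAXONOMY = {
    # superordinate -> {basic: [subordinates]}
    "animal": {
        "dog": ["beagle", "golden retriever"],
        "bird": ["sparrow", "robin"],
    },
    "furniture": {
        "chair": ["rocking chair", "office chair"],
        "table": ["coffee table", "desk"],
    },
}

def category_level(label: str) -> str:
    """Return 'superordinate', 'basic', 'subordinate', or 'unknown'."""
    label = label.lower().strip()
    for superordinate, basics in TAXONOMY.items():
        if label == superordinate:
            return "superordinate"
        for basic, subordinates in basics.items():
            if label == basic:
                return "basic"
            if label in subordinates:
                return "subordinate"
    return "unknown"

# A study like this would show a VLM an image, ask "What is this?",
# and tally how often its answer falls at each level.
print(category_level("dog"))        # basic
print(category_level("beagle"))     # subordinate
print(category_level("furniture"))  # superordinate
```

In practice the model's raw answer would first be normalized (e.g., stripping articles or full sentences) before being matched against a taxonomy like this.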

Why it matters?

These results suggest that VLMs acquire human cognitive categorization behaviors from the human data on which they are trained. This matters for understanding how closely model behavior aligns with human cognition, and for anticipating how VLMs will name and describe visual content in downstream applications.

Abstract

The field of psychology has long recognized a basic level of categorization that humans use when labeling visual stimuli, a term coined by Rosch in 1976. This level of categorization has been found to be used most frequently, to have higher information density, and to aid in visual language tasks with priming in humans. Here, we investigate basic level categorization in two recently released, open-source vision-language models (VLMs). This paper demonstrates that Llama 3.2 Vision Instruct (11B) and Molmo 7B-D both prefer basic level categorization consistent with human behavior. Moreover, the models' preferences are consistent with nuanced human behaviors like the biological versus non-biological basic level effects and the well-established expert basic level shift, further suggesting that VLMs acquire cognitive categorization behaviors from the human data on which they are trained.