
SOLAMI: Social Vision-Language-Action Modeling for Immersive Interaction with 3D Autonomous Characters

Jianping Jiang, Weiye Xiao, Zhengyu Lin, Huaizhong Zhang, Tianxiang Ren, Yang Gao, Zhiqian Lin, Zhongang Cai, Lei Yang, Ziwei Liu

2024-12-03

Summary

This paper introduces SOLAMI, a new framework designed to create 3D characters that can understand and interact with humans in a social way, using a combination of vision, language, and action.

What's the problem?

Creating 3D characters that can interact with people the way humans do is a major challenge. Current characters often lack the social intelligence to perceive and respond to human behavior; they typically respond only through basic text or voice, which limits their ability to hold meaningful conversations and interactions.

What's the solution?

SOLAMI addresses this problem with an end-to-end system that lets a 3D character take in the user's speech and body motion and respond with its own speech and motion. It has three main components: a Social VLA architecture, a unified model that generates the character's multimodal response (speech and motion) from the user's multimodal input; SynMSI, a synthetic multimodal social interaction dataset built by an automatic pipeline from existing motion datasets to address data scarcity; and an immersive virtual reality interface that lets users interact with characters driven by this and other architectures. Together, these components help the characters respond more naturally and accurately during interactions.
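To make the input-to-output flow concrete, here is a minimal, hypothetical sketch of a social VLA forward pass: the user's speech and motion, assumed to be already discretized into tokens, go into one model that predicts speech tokens and motion tokens for the character. The class name SocialVLASketch, the shared token vocabulary, and the small Transformer backbone are illustrative assumptions, not SOLAMI's actual architecture or API.

# Hypothetical sketch of a social VLA turn, not SOLAMI's real implementation.
import torch
import torch.nn as nn

class SocialVLASketch(nn.Module):
    """Toy model mapping interleaved user speech/motion tokens to
    character speech/motion tokens (assumed pre-tokenized)."""
    def __init__(self, vocab_size=1024, dim=256, heads=4, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, layers)
        self.speech_head = nn.Linear(dim, vocab_size)  # predicts speech tokens
        self.motion_head = nn.Linear(dim, vocab_size)  # predicts motion tokens

    def forward(self, user_tokens):
        h = self.backbone(self.embed(user_tokens))
        return self.speech_head(h), self.motion_head(h)

# One interaction turn: random tokens stand in for the user's tokenized
# speech and body motion captured in VR.
model = SocialVLASketch()
user_tokens = torch.randint(0, 1024, (1, 32))        # [batch, sequence]
speech_logits, motion_logits = model(user_tokens)
print(speech_logits.shape, motion_logits.shape)      # both [1, 32, 1024]

The key design point this sketch illustrates is that a single model handles both output modalities, so the character's words and body motion can be generated together rather than by separate pipelines.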

Why it matters?

This research is important because it advances how we create interactive digital characters that mimic human social behavior. By improving how these characters understand and respond to human actions, SOLAMI can be applied to video games, virtual reality environments, and social robots, making interactions with technology feel more natural and engaging.

Abstract

Human beings are social animals. How to equip 3D autonomous characters with similar social intelligence that can perceive, understand and interact with humans remains an open yet fundamental problem. In this paper, we introduce SOLAMI, the first end-to-end Social vision-Language-Action (VLA) Modeling framework for Immersive interaction with 3D autonomous characters. Specifically, SOLAMI builds 3D autonomous characters from three aspects: (1) Social VLA Architecture: We propose a unified social VLA framework to generate multimodal response (speech and motion) based on the user's multimodal input to drive the character for social interaction. (2) Interactive Multimodal Data: We present SynMSI, a synthetic multimodal social interaction dataset generated by an automatic pipeline using only existing motion datasets to address the issue of data scarcity. (3) Immersive VR Interface: We develop a VR interface that enables users to immersively interact with these characters driven by various architectures. Extensive quantitative experiments and user studies demonstrate that our framework leads to more precise and natural character responses (in both speech and motion) that align with user expectations with lower latency.