
Simultaneous Tactile-Visual Perception for Learning Multimodal Robot Manipulation

Yuyang Li, Yinghan Chen, Zihang Zhao, Puhao Li, Tengyu Liu, Siyuan Huang, Yixin Zhu

2025-12-18


Summary

This research focuses on improving how robots interact with the world by giving them a sensor that can see and feel at the same time, and by teaching them how to use those combined senses effectively.

What's the problem?

Current robotic hands often struggle with complex tasks because they don't get information about what they're touching and what they're seeing *at the same time*. Existing 'see-through-skin' sensors, which try to combine touch and vision, either can't capture both kinds of information simultaneously or can't reliably track the touch signals. On top of that, it remains an open challenge to actually use this combined information to *teach* a robot how to do things.

What's the solution?

The researchers created a new sensor called TacThru, a flexible 'skin' that lets a robot 'see' and 'feel' through the same surface at the same time, and do so reliably. They also developed a learning system, TacThru-UMI, which uses a powerful type of artificial intelligence called a Transformer-based Diffusion Policy to interpret the sensor's combined signals and learn how to manipulate objects. The system learns by watching demonstrations of the task.
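To make the idea concrete, here is a minimal sketch of how a Transformer-based diffusion policy can fuse the two senses. This is not TacThru-UMI's actual architecture; the class name, layer sizes, and wiring are illustrative assumptions. The core pattern is the standard one, though: each modality becomes a token, the Transformer attends across all tokens, and the network learns to denoise a sequence of actions taken from demonstrations.

```python
# Hypothetical sketch only: names, layer sizes, and wiring are illustrative
# assumptions, not the paper's actual TacThru-UMI architecture.
import torch
import torch.nn as nn

class MultimodalDiffusionPolicy(nn.Module):
    def __init__(self, act_dim=7, d_model=256):
        super().__init__()
        # One small encoder per modality; each image becomes a single token.
        def enc():
            return nn.Sequential(
                nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
                nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(32, d_model))
        self.visual_enc, self.tactile_enc = enc(), enc()
        # Embedding for the diffusion timestep t.
        self.time_emb = nn.Sequential(
            nn.Linear(1, d_model), nn.SiLU(), nn.Linear(d_model, d_model))
        self.act_proj = nn.Linear(act_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(d_model, act_dim)

    def forward(self, vis_img, tac_img, noisy_actions, t):
        # Conditioning tokens: one visual, one tactile, one timestep.
        cond = torch.stack([self.visual_enc(vis_img),
                            self.tactile_enc(tac_img),
                            self.time_emb(t.float().unsqueeze(-1))], dim=1)
        tokens = torch.cat([cond, self.act_proj(noisy_actions)], dim=1)
        out = self.transformer(tokens)
        # Predict the noise on the action tokens only.
        return self.head(out[:, cond.shape[1]:])

# One (simplified) training step: the policy learns to predict the noise
# added to a demonstrated action chunk. A real setup would use a proper
# noise schedule rather than this plain addition.
policy = MultimodalDiffusionPolicy()
vis, tac = torch.randn(2, 3, 64, 64), torch.randn(2, 3, 64, 64)
clean = torch.randn(2, 16, 7)            # action chunk from a demonstration
noise = torch.randn_like(clean)
t = torch.randint(0, 100, (2,))
loss = nn.functional.mse_loss(policy(vis, tac, clean + noise, t), noise)
```

At inference time the same network would be applied repeatedly, starting from pure noise and denoising step by step into an action sequence, which is what makes the diffusion-policy approach well suited to learning from demonstrations.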

Why it matters?

This work is important because it shows that giving robots richer, simultaneous sensory information and pairing it with advanced learning techniques produces robots that are much more skilled at handling objects, even delicate or soft ones, and at performing complex tasks with greater precision and adaptability. This could lead to robots that are more helpful in manufacturing, healthcare, and everyday life.

Abstract

Robotic manipulation requires both rich multimodal perception and effective learning frameworks to handle complex real-world tasks. See-through-skin (STS) sensors, which combine tactile and visual perception, offer promising sensing capabilities, while modern imitation learning provides powerful tools for policy acquisition. However, existing STS designs lack simultaneous multimodal perception and suffer from unreliable tactile tracking. Furthermore, integrating these rich multimodal signals into learning-based manipulation pipelines remains an open challenge. We introduce TacThru, an STS sensor enabling simultaneous visual perception and robust tactile signal extraction, and TacThru-UMI, an imitation learning framework that leverages these multimodal signals for manipulation. Our sensor features a fully transparent elastomer, persistent illumination, novel keyline markers, and efficient tracking, while our learning system integrates these signals through a Transformer-based Diffusion Policy. Experiments on five challenging real-world tasks show that TacThru-UMI achieves an average success rate of 85.5%, significantly outperforming the baselines of alternating tactile-visual (66.3%) and vision-only (55.4%). The system excels in critical scenarios, including contact detection with thin and soft objects and precision manipulation requiring multimodal coordination. This work demonstrates that combining simultaneous multimodal perception with modern learning frameworks enables more precise, adaptable robotic manipulation.
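The abstract credits much of the sensor's reliability to its keyline markers and efficient tracking, neither of which is detailed here. The sketch below therefore only illustrates the general idea behind marker-based tactile sensing, using plain nearest-neighbour matching as a hypothetical stand-in: contact deforms the elastomer, the markers shift, and the per-marker displacement field is the tactile signal.

```python
# Generic marker-tracking sketch in NumPy. The function name, threshold, and
# toy data are hypothetical; the paper's keyline markers and its actual
# tracking algorithm are not described in this summary.
import numpy as np

def track_markers(prev_pts, curr_pts, max_shift=5.0):
    """Match each previous marker to its nearest current detection and
    return per-marker displacement vectors (NaN where the match is lost)."""
    disp = np.full_like(prev_pts, np.nan, dtype=float)
    for i, p in enumerate(prev_pts):
        d = np.linalg.norm(curr_pts - p, axis=1)   # distance to all detections
        j = np.argmin(d)
        if d[j] <= max_shift:                      # reject implausible jumps
            disp[i] = curr_pts[j] - p
    return disp

# Toy example: a 3x3 marker grid displaced by contact-induced shear.
grid = np.stack(np.meshgrid(np.arange(3), np.arange(3)), -1).reshape(-1, 2) * 10.0
shifted = grid + np.array([1.5, -0.5])             # uniform shear for illustration
print(track_markers(grid, shifted))
```

The resulting displacement field is exactly the kind of dense contact signal that, per the abstract, gets fed alongside the visual stream into the Transformer-based Diffusion Policy.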