Iwin Transformer: Hierarchical Vision Transformer using Interleaved Windows

Simin Huo, Ning Li

2025-07-25

Summary

This paper introduces the Iwin Transformer, a hierarchical vision transformer that improves how computers understand images and videos by using a new attention method, called interleaved window attention, combined with convolution to share information globally.

What's the problem?

Traditional vision transformers rely on explicit position embeddings and require multiple stacked steps to connect distant parts of an image, which makes them complex and slow, especially for high-resolution images or videos.

What's the solution?

The researchers designed the Iwin Transformer around interleaved windows, which group pixels sampled from distant image regions into the same attention window, paired with depthwise separable convolution to connect neighboring pixels. Together, these let the model exchange information globally in a single step without explicit position data, improving both speed and accuracy.
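To make the interleaving idea concrete, here is a minimal sketch of how an interleaved window partition might work on a small 2D grid. This is an illustrative toy, not the paper's actual implementation: the function name `interleaved_windows` and the plain-Python list representation are assumptions for clarity. The key point it demonstrates is that each window collects pixels strided across the whole image, so attention within one window already sees every region.

```python
def interleaved_windows(grid, num_windows):
    # Illustrative sketch (not the paper's code): partition an H x W grid
    # into num_windows x num_windows interleaved windows. Window (i, j)
    # holds every pixel whose row % num_windows == i and
    # col % num_windows == j, so a single window spans the entire image.
    windows = {}
    for r, row in enumerate(grid):
        for c, value in enumerate(row):
            key = (r % num_windows, c % num_windows)
            windows.setdefault(key, []).append(value)
    return windows

# Example: a 4x4 grid of pixel indices, split into 2x2 interleaved windows.
grid = [[r * 4 + c for c in range(4)] for r in range(4)]
windows = interleaved_windows(grid, 2)
# Window (0, 0) gathers pixels 0, 2, 8, 10 -- one from each quadrant,
# so attention inside this window is already global.
```

Contrast this with standard window attention, where window (0, 0) would hold only the contiguous top-left block (pixels 0, 1, 4, 5) and would need extra steps, such as shifting, to reach the rest of the image.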

Why it matters?

This matters because the Iwin Transformer achieves strong results on important tasks like image classification, video action recognition, and semantic segmentation, while being more efficient. It also opens new possibilities for future vision models in video generation and beyond.

Abstract

Iwin Transformer, a hierarchical vision transformer without position embeddings, uses interleaved window attention and depthwise separable convolution to enable global information exchange, achieving competitive performance in image classification, semantic segmentation, and video action recognition.