Scaling Up Your Kernels: Large Kernel Design in ConvNets towards Universal Representations

Yiyuan Zhang, Xiaohan Ding, Xiangyu Yue

2024-10-13

Summary

This paper discusses a new approach to designing Convolutional Neural Networks (ConvNets) using large convolutional kernels to improve performance and efficiency across various tasks.

What's the problem?

Traditional ConvNets stack many small kernels (typically 3x3) on top of each other, so each layer sees only a tiny neighborhood of the image. Capturing broad spatial relationships then requires very deep networks, which slows processing and makes learning less effective on complex tasks that depend on long-range context.
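The receptive-field arithmetic behind this trade-off can be sketched as follows; the helper function below is illustrative, not from the paper, and assumes stride-1 convolutions:

```python
def stacked_rf(n_layers, kernel_size):
    """Effective receptive field (one side, in pixels) of n stacked
    stride-1 convolutions that all use the same square kernel."""
    return n_layers * (kernel_size - 1) + 1

# Three stacked 3x3 convolutions together see only a 7x7 region...
print(stacked_rf(3, 3))   # 7
# ...while a single 13x13 kernel covers 13x13 in one layer,
# matching what six stacked 3x3 layers would need.
print(stacked_rf(1, 13))  # 13
print(stacked_rf(6, 3))   # 13
```

This is why depth becomes the bottleneck with small kernels: the receptive field grows only linearly with the number of layers.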

What's the solution?

The authors propose a new architecture called UniRepLKNet that uses a few large kernels instead of many small ones. A single large kernel captures extensive spatial information in one layer, so the network does not need to stack many layers just to enlarge its receptive field. The paper also contributes a set of design guidelines for building efficient large-kernel ConvNets. In experiments, this approach yields strong accuracy on tasks like image classification and object detection, including 88.0% top-1 accuracy on ImageNet.
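Large-kernel ConvNets typically keep such kernels affordable by applying them depthwise, one filter per channel. A minimal NumPy sketch of a depthwise large-kernel layer is shown below; this is an illustrative reimplementation of the general idea, not the authors' UniRepLKNet code:

```python
import numpy as np

def depthwise_conv2d(x, kernel):
    """Naive depthwise convolution: each channel gets its own square
    kernel; 'same' zero padding, stride 1. x: (C, H, W), kernel: (C, K, K)."""
    c, h, w = x.shape
    k = kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for ch in range(c):
        for i in range(h):
            for j in range(w):
                # Each output pixel aggregates a full KxK neighborhood.
                out[ch, i, j] = np.sum(xp[ch, i:i + k, j:j + k] * kernel[ch])
    return out

# One depthwise 13x13 layer: every output pixel sees a 13x13 region,
# a receptive field that would otherwise need six stacked 3x3 layers.
x = np.random.randn(8, 32, 32)       # 8 channels, 32x32 feature map
kernel = np.random.randn(8, 13, 13)  # one 13x13 filter per channel
y = depthwise_conv2d(x, kernel)
print(y.shape)  # (8, 32, 32)
```

Because each channel is filtered independently, the cost grows with K^2 per channel rather than K^2 times the channel count squared, which is what makes very large kernels practical.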

Why it matters?

This research is significant because it challenges the conventional small-kernel design of ConvNets and demonstrates that large kernels can deliver better performance across many applications, not only image recognition but also audio, video, point-cloud, and time-series processing. By improving how these models learn and process information, this work could enhance many AI systems used in real-world scenarios.

Abstract

This paper proposes the paradigm of large convolutional kernels in designing modern Convolutional Neural Networks (ConvNets). We establish that employing a few large kernels, instead of stacking multiple smaller ones, can be a superior design strategy. Our work introduces a set of architecture design guidelines for large-kernel ConvNets that optimize their efficiency and performance. We propose the UniRepLKNet architecture, which offers systematic architecture design principles specifically crafted for large-kernel ConvNets, emphasizing their unique ability to capture extensive spatial information without deep layer stacking. This results in a model that not only surpasses its predecessors with an ImageNet accuracy of 88.0%, an ADE20K mIoU of 55.6%, and a COCO box AP of 56.4% but also demonstrates impressive scalability and performance on various modalities such as time-series forecasting, audio, point cloud, and video recognition. These results indicate the universal modeling abilities of large-kernel ConvNets with faster inference speed compared with vision transformers. Our findings reveal that large-kernel ConvNets possess larger effective receptive fields and a higher shape bias, moving away from the texture bias typical of smaller-kernel CNNs. All code and models are publicly available at https://github.com/AILab-CVC/UniRepLKNet, promoting further research and development in the community.