AutoNeural: Co-Designing Vision-Language Models for NPU Inference

Wei Chen, Liangmin Wu, Yunhai Hu, Zhiyuan Li, Zhiyuan Cheng, Yicheng Qian, Lingyue Zhu, Zhipeng Hu, Luoyi Liang, Qiang Tang, Zhen Liu, Han Yang

2025-12-04

Summary

This paper focuses on making advanced AI models that understand both images and language, called Vision-Language Models (VLMs), work efficiently on specialized computer chips called Neural Processing Units (NPUs). These NPUs are designed for AI tasks on devices like phones and cars, but current VLMs built for standard computer chips often don't perform well on them.

What's the problem?

The main issue is that VLMs designed for powerful graphics cards (GPUs) are a poor fit for NPUs, for two reasons. First, a key component of these models, the Vision Transformer, loses accuracy when its numbers are compressed into the low-precision integer formats NPUs require (a process called quantization). Second, the step-by-step way these models generate text, using a mechanism called 'attention', moves a growing amount of cached data back and forth from memory at every step. This slows the NPU down, because NPUs excel at raw calculation but not at rapid memory access.
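The memory-access problem above can be made concrete with some back-of-envelope arithmetic. This sketch uses hypothetical model shapes (the paper does not give these numbers): during attention-based generation, every new token must re-read the entire Key-Value cache, so memory traffic per token grows with the length of the text so far.

```python
# Sketch with assumed (illustrative) model dimensions, not figures
# from the AutoNeural paper: how much Key-Value cache memory an
# attention-based model must re-read for each generated token.

def kv_cache_bytes(layers, heads, head_dim, seq_len, bytes_per_elem=2):
    """Bytes of K and V cache read during one decode step (FP16 elements)."""
    return 2 * layers * heads * head_dim * seq_len * bytes_per_elem

# Hypothetical small-VLM shape: 24 layers, 16 heads, head dim 128.
layers, heads, head_dim = 24, 16, 128
for seq_len in (1_024, 4_096):
    mb = kv_cache_bytes(layers, heads, head_dim, seq_len) / 1e6
    print(f"seq_len={seq_len}: ~{mb:.0f} MB read per generated token")
```

Because this traffic scales linearly with sequence length, a chip with high arithmetic throughput but modest memory bandwidth spends most of each decoding step waiting on memory rather than computing.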

What's the solution?

The researchers created a new VLM architecture called AutoNeural, built specifically for NPUs. They replaced the standard Vision Transformer with an image encoder based on 'depthwise separable convolutions' (in the style of MobileNet), which keeps accuracy stable even after quantization. They also redesigned the language part to combine traditional Transformer layers with a more efficient method called 'State-Space Models', implemented with 'gated convolutions', which removes the need to constantly re-read a growing cache from memory. The new design enables faster generation and the ability to handle much longer inputs.
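To see why depthwise separable convolutions are so much lighter, compare parameter counts. This is a generic sketch of the standard textbook comparison, not code from the paper: a standard 3x3 convolution mixes all input and output channels at once, while the separable version splits that into a per-channel 3x3 filter plus a 1x1 channel-mixing step.

```python
# Generic comparison (not the paper's actual encoder): parameter counts
# for a standard KxK convolution vs. a depthwise separable one
# (depthwise KxK filter per channel, then pointwise 1x1 channel mixing).

def standard_conv_params(c_in, c_out, k=3):
    return c_in * c_out * k * k

def depthwise_separable_params(c_in, c_out, k=3):
    return c_in * k * k + c_in * c_out  # depthwise + pointwise parts

c_in = c_out = 256  # assumed channel width for illustration
std = standard_conv_params(c_in, c_out)
dws = depthwise_separable_params(c_in, c_out)
print(f"standard: {std}, separable: {dws}, ratio: {std / dws:.1f}x")
```

Fewer parameters and simpler per-channel operations also mean activation values stay in narrower, more predictable ranges, which is part of why such encoders survive low-bit integer quantization better than Vision Transformers.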

Why it matters?

This work is important because it shows that simply taking a model designed for one type of chip and trying to run it on another doesn't work well. To truly unlock the potential of AI on edge devices like cars and phones, we need to design models specifically with the limitations and strengths of those chips in mind. AutoNeural demonstrates a significant improvement in speed and efficiency, paving the way for more powerful and responsive AI applications in the real world.

Abstract

While Neural Processing Units (NPUs) offer high theoretical efficiency for edge AI, state-of-the-art Vision-Language Models (VLMs) tailored for GPUs often falter on these substrates. We attribute this hardware-model mismatch to two primary factors: the quantization brittleness of Vision Transformers (ViTs) and the I/O-bound nature of autoregressive attention mechanisms, which fail to utilize the high arithmetic throughput of NPUs. To bridge this gap, we propose AutoNeural, an NPU-native VLM architecture co-designed for integer-only inference. We replace the standard ViT encoder with a MobileNetV5-style backbone utilizing depthwise separable convolutions, which ensures bounded activation distributions for stable INT4/8/16 quantization. Complementing this, our language backbone integrates State-Space Model (SSM) principles with Transformer layers, employing efficient gated convolutions to achieve linear-time complexity. This hybrid design eliminates the heavy memory I/O overhead of Key-Value caching during generation. Our approach delivers substantial efficiency gains, reducing the quantization error of the vision encoder by up to 7x and end-to-end latency by 14x compared to conventional baselines. AutoNeural also delivers 3x faster decoding and a 4x longer context window than the baseline. We validate these improvements via a real-world automotive case study on the Qualcomm SA8295P SoC, demonstrating real-time performance for cockpit applications. Our results highlight that rethinking model topology specifically for NPU constraints is a prerequisite for robust multi-modal edge intelligence.
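The abstract's claim that the hybrid design "eliminates the heavy memory I/O overhead of Key-Value caching" can be illustrated with a toy recurrence. This is a minimal sketch of the general state-space idea, not the paper's actual model: each decoding step updates a fixed-size state, so per-token memory traffic stays constant no matter how long the sequence grows.

```python
# Toy scalar state-space recurrence (illustrative only, with assumed
# coefficients a and b): h' = a*h + b*x. Unlike attention, which re-reads
# an ever-growing KV cache, each step touches only the fixed-size state h.

def ssm_step(h, x, a=0.9, b=0.1):
    """One recurrent decode step; cost and memory are independent of position."""
    return a * h + b * x

h = 0.0
for x in [1.0, 1.0, 1.0, 1.0]:  # four decode steps, constant work each
    h = ssm_step(h, x)
print(round(h, 4))
```

In a real model the state is a small vector or matrix per layer, but the key property is the same: generation becomes compute-bound rather than memory-bound, which matches the strengths of an NPU.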