Virtual Width Networks

Seed, Baisheng Li, Banggu Wu, Bole Ma, Bowen Xiao, Chaoyi Zhang, Cheng Li, Chengyi Wang, Chenyin Xu, Chi Zhang, Chong Hu, Daoguang Zan, Defa Zhu, Dongyu Xu, Du Li, Faming Wu, Fan Xia, Ge Zhang, Guang Shi, Haobin Chen, Hongyu Zhu, Hongzhi Huang

2025-11-17

Summary

This paper introduces a new technique called Virtual Width Networks (VWN) that aims to improve the performance of large language models without drastically increasing the computational resources needed to train them.

What's the problem?

Large language models keep getting bigger to achieve better results, but increasing the 'width' of these models – the size of their internal hidden representations – quickly becomes very expensive: compute and parameter costs grow roughly quadratically with the hidden size. It's like widening a highway where each extra lane costs more than the last, which makes it hard to scale these models effectively.

What's the solution?

VWN tackles this problem by cleverly separating how the model *represents* information from the size of its core processing 'backbone'. It expands the embedding space – where tokens are turned into numerical vectors – to be many times wider, while keeping the compute of the backbone nearly constant. Think of it like adding extra rooms to a house without rebuilding the foundation. The researchers found that expanding this virtual width speeds up learning for both next-token prediction and next-2-token prediction.
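One simple way to picture this decoupling is a wide embedding table with cheap linear maps into and out of a narrower backbone. The sketch below is an illustrative assumption, not the paper's exact architecture: the shapes, the projection matrices `W_in`/`W_out`, and the choice to tie the output to the embedding table are all placeholders chosen to show the idea.

```python
import numpy as np

# Minimal sketch of the virtual-width idea (illustrative, not the paper's design):
# embeddings live in a wide "virtual" space, but the transformer backbone
# runs at its original, narrower hidden size.

rng = np.random.default_rng(0)

vocab, d_backbone, expansion = 1000, 64, 8
d_virtual = d_backbone * expansion            # 8x wider embedding space

E = rng.standard_normal((vocab, d_virtual)) * 0.02      # wide embedding table
W_in = rng.standard_normal((d_virtual, d_backbone)) * 0.02   # compress into backbone
W_out = rng.standard_normal((d_backbone, d_virtual)) * 0.02  # expand back out

tokens = rng.integers(0, vocab, size=16)
h = E[tokens] @ W_in               # backbone only ever sees d_backbone features
# ... backbone layers would operate on h at width d_backbone ...
logits = (h @ W_out) @ E.T         # score tokens against the wide embedding table

print(h.shape)        # backbone activations stay narrow: (16, 64)
print(logits.shape)   # (16, 1000)
```

The point of the sketch is the cost structure: the backbone's per-layer compute depends only on `d_backbone`, so widening `d_virtual` by 8x grows the embedding side without touching the expensive core.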

Why does it matter?

This work is important because it offers a way to get the benefits of wider models without the huge cost increases that come with simply making them bigger. In the paper's large-scale experiment, an 8-times expansion of the embedding space sped up optimization by more than 2 times for next-token prediction and 3 times for next-2-token prediction, with the advantage growing as training progressed. The researchers also identified an approximately log-linear relationship between how much you expand the virtual width and how much the loss improves. This suggests that virtual width could become a new, predictable dimension for building more efficient and powerful AI systems.
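The abstract describes this predictable relationship as "approximately log-linear". A hedged sketch of what such a fit looks like follows; the coefficients and the exact functional form are illustrative assumptions, not values reported in the paper:

```latex
% Illustrative form of a log-linear width-loss relation.
% Let k be the virtual-width expansion factor (e.g., k = 8) and
% \Delta L(k) the loss reduction relative to no expansion (k = 1).
\[
  \Delta L(k) \;\approx\; a \log k + b ,
\]
% Under this form, each doubling of the virtual width buys a roughly
% constant extra loss reduction of a \log 2 (a, b are placeholder fit
% coefficients, not numbers from the paper).
```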

Abstract

We introduce Virtual Width Networks (VWN), a framework that delivers the benefits of wider representations without incurring the quadratic cost of increasing the hidden size. VWN decouples representational width from backbone width, expanding the embedding space while keeping backbone compute nearly constant. In our large-scale experiment, an 8-times expansion accelerates optimization by over 2 times for next-token and 3 times for next-2-token prediction. The advantage amplifies over training as both the loss gap grows and the convergence-speedup ratio increases, showing that VWN is not only token-efficient but also increasingly effective with scale. Moreover, we identify an approximately log-linear scaling relation between virtual width and loss reduction, offering an initial empirical basis and motivation for exploring virtual-width scaling as a new dimension of large-model efficiency.