The Strong Lottery Ticket Hypothesis for Multi-Head Attention Mechanisms
Hikari Otsuka, Daiki Chijiwa, Yasuyuki Okoshi, Daichi Fujiki, Susumu Takeuchi, Masato Motomura
2025-11-07
Summary
This paper investigates the 'strong lottery ticket hypothesis' (SLTH) for transformer neural networks, the architecture behind most modern language models. The hypothesis states that a sufficiently large, randomly initialized network already contains hidden subnetworks, obtainable purely by pruning (removing weights) without any training, that perform as well as a separately trained target network.
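As a toy illustration (not taken from the paper, and assuming PyTorch), the sketch below shows what a strong lottery ticket formally is: a binary mask applied to random, untrained weights. The mask here is chosen at random purely to show the mechanics; the hypothesis concerns the existence of a well-chosen mask, and finding one requires a search procedure that is not shown.

import torch

d_in, d_hidden, d_out = 16, 256, 16

# Randomly initialized, untrained weights of the large source network.
W1 = torch.randn(d_hidden, d_in) / d_in ** 0.5
W2 = torch.randn(d_out, d_hidden) / d_hidden ** 0.5

# A binary mask defines a subnetwork: kept weights are used as-is,
# pruned weights are zeroed out; no weight is ever updated.
mask1 = (torch.rand_like(W1) < 0.5).float()
mask2 = (torch.rand_like(W2) < 0.5).float()

def subnetwork(x):
    # Two-layer ReLU network evaluated with the masked random weights.
    h = torch.relu(x @ (mask1 * W1).T)
    return h @ (mask2 * W2).T

x = torch.randn(4, d_in)
print(subnetwork(x).shape)  # torch.Size([4, 16])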
What's the problem?
While the SLTH has been proven for many other neural architectures, the theory for transformers was incomplete. In particular, 'multi-head attention,' a core component of transformers, had not been covered by existing SLTH results. In other words, there was no theoretical explanation of whether, or under what conditions, a randomly initialized multi-head attention layer contains one of these high-performing hidden subnetworks.
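For reference, the component in question is the standard multi-head attention operation; the notation below follows the usual transformer formulation rather than the paper's exact symbols:

\[
\mathrm{MHA}(X) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_H)\, W^{O},
\qquad
\mathrm{head}_i = \mathrm{softmax}\!\left(\frac{(X W_i^{Q})(X W_i^{K})^{\top}}{\sqrt{d_k}}\right) X W_i^{V},
\]

where X is the input sequence, H is the number of heads, and d_k is the key/value hidden dimension. The open question the paper addresses is how large this hidden dimension must be for a randomly initialized MHA of this form to contain a high-performing subnetwork.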
What's the solution?
The researchers give a mathematical proof that if the 'hidden dimension' of the key and value projections in a randomly initialized multi-head attention layer is large enough (on the order of the input dimension times a logarithmic factor in the number of heads), then with high probability it contains a strong lottery ticket: a pruned subnetwork that approximates any target multi-head attention layer with the same input dimension. Building on this result, they extend the SLTH to transformers that do not use normalization layers. Finally, their experiments confirm the theory, showing that as the hidden dimension of the random source model grows, the approximation error between the strong lottery ticket and the target model decreases exponentially.
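Schematically, and with the norm, the failure probability, and the dependence on the approximation accuracy left implicit (the paper states them precisely), the main result for multi-head attention has the following shape:

\[
d_{\mathrm{hid}} = O\!\left(d \log\!\left(H d^{3/2}\right)\right)
\;\Longrightarrow\;
\text{w.h.p. } \exists\, \text{binary mask } M:\;
\left\| \mathrm{MHA}_{\mathrm{target}}(X) - \mathrm{MHA}_{M \odot \theta_{\mathrm{rand}}}(X) \right\| \le \varepsilon,
\]

where d is the input dimension, H the number of heads, d_hid the key/value hidden dimension of the randomly initialized source MHA with weights \theta_rand, and M \odot \theta_rand the pruned subnetwork, i.e. the strong lottery ticket approximating the target MHA.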
Why it matters?
This work matters because it puts pruning-based compression of transformers on firmer theoretical ground. If strong lottery tickets exist inside randomly initialized transformers, then high-performing models could in principle be obtained by pruning alone, without training the weights, which points toward smaller, faster transformers that save computational resources and energy. The result provides a theoretical foundation for pruning and subnetwork-search methods applied to transformer models.
Abstract
The strong lottery ticket hypothesis (SLTH) conjectures that high-performing subnetworks, called strong lottery tickets (SLTs), are hidden in randomly initialized neural networks. Although recent theoretical studies have established the SLTH across various neural architectures, the SLTH for transformer architectures still lacks theoretical understanding. In particular, the current theory of the SLTH does not yet account for the multi-head attention (MHA) mechanism, a core component of transformers. To address this gap, we introduce a theoretical analysis of the existence of SLTs within MHAs. We prove that, if a randomly initialized MHA of H heads and input dimension d has the hidden dimension O(d log(H d^{3/2})) for the key and value, it contains an SLT that approximates an arbitrary MHA with the same input dimension with high probability. Furthermore, by leveraging this theory for MHAs, we extend the SLTH to transformers without normalization layers. We empirically validate our theoretical findings, demonstrating that the approximation error between the SLT within a source model (MHA and transformer) and an approximate target counterpart decreases exponentially by increasing the hidden dimension of the source model.
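To make the measured quantity concrete, the sketch below (assuming PyTorch, and not the authors' code) evaluates the error between a target MHA and a pruned, randomly initialized source MHA whose key/value hidden dimension is wider than the target's. The masks here are random placeholders; an actual experiment would search for the masks minimizing this error, for example with an edge-popup-style subnetwork search, which is not shown.

import torch

def mha(x, Wq, Wk, Wv, Wo):
    # x: (n, d); Wq, Wk, Wv: (H, d_hid, d); Wo: (d, H * d_hid)
    H, d_hid, _ = Wq.shape
    q = torch.einsum('hkd,nd->hnk', Wq, x)
    k = torch.einsum('hkd,nd->hnk', Wk, x)
    v = torch.einsum('hkd,nd->hnk', Wv, x)
    att = torch.softmax(q @ k.transpose(-1, -2) / d_hid ** 0.5, dim=-1)
    heads = att @ v                                           # (H, n, d_hid)
    concat = heads.permute(1, 0, 2).reshape(x.shape[0], -1)   # (n, H * d_hid)
    return concat @ Wo.T                                      # (n, d)

d, H, n = 8, 2, 5
d_tgt, d_src = d // H, 64      # source key/value hidden dim is much wider

x = torch.randn(n, d)
tgt = [torch.randn(H, d_tgt, d) for _ in range(3)] + [torch.randn(d, H * d_tgt)]
src = [torch.randn(H, d_src, d) for _ in range(3)] + [torch.randn(d, H * d_src)]

# Binary masks define the subnetwork of the random source MHA; random here,
# but in an experiment they would be the result of a mask search.
masks = [torch.randint(0, 2, w.shape).float() for w in src]
pruned = [m * w for m, w in zip(masks, src)]

err = (mha(x, *tgt) - mha(x, *pruned)).norm()
print(float(err))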