
Zipfian Whitening

Sho Yokoi, Han Bao, Hiroto Kurita, Hidetoshi Shimodaira

2024-11-04


Summary

This paper introduces Zipfian Whitening, a method that improves how words are represented in neural models by accounting for the highly uneven distribution of word frequencies. This adjustment leads to better performance on downstream language tasks.

What's the problem?

The word embedding space in neural models is skewed, and most methods for modeling or correcting that skew implicitly assume that all words occur with equal frequency. In reality, a small number of words are used far more often than the rest, following a pattern known as Zipf's law. This mismatch between assumption and reality can hurt performance on tasks that rely on understanding language.
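To make Zipf's law concrete, here is a tiny illustrative snippet (not from the paper); the toy sentence is a stand-in, but with any sizable English corpus the same qualitative pattern appears: frequency falls off roughly in proportion to 1/rank.

```python
from collections import Counter

# Toy illustration of Zipf's law: in natural text, a word's frequency is
# roughly proportional to 1 / rank. The corpus below is only a stand-in.
corpus = "the cat sat on the mat and the dog sat on the rug".split()
counts = Counter(corpus)

for rank, (word, freq) in enumerate(counts.most_common(), start=1):
    # Under Zipf's law, freq * rank stays roughly constant.
    print(f"rank={rank:2d}  word={word:4s}  freq={freq}  freq*rank={freq * rank}")
```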

What's the solution?

The authors apply PCA whitening weighted by the empirical word frequencies, which follow Zipf's law. Weighting the embedding statistics by real-world usage rather than treating every word equally significantly improves task performance, surpassing established baselines. They also show that their approach and existing methods fall into a clear theoretical categorization, and that the Zipfian variant naturally emphasizes low-frequency but informative words.
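As a rough sketch of what frequency-weighted PCA whitening can look like (our own minimal implementation, not the authors' code; the function name and the toy data are invented for illustration):

```python
import numpy as np

def zipfian_whitening(embeddings: np.ndarray, word_probs: np.ndarray) -> np.ndarray:
    """Whiten word embeddings using frequency-weighted statistics.

    embeddings: (V, d) matrix, one row per vocabulary word.
    word_probs: (V,) empirical word probabilities (roughly Zipfian), summing to 1.
    """
    # Frequency-weighted mean: frequent words dominate the centering,
    # unlike uniform centering, which treats every word equally.
    mu = word_probs @ embeddings                         # (d,)
    centered = embeddings - mu                           # (V, d)

    # Frequency-weighted covariance.
    cov = (centered * word_probs[:, None]).T @ centered  # (d, d)

    # PCA whitening: rotate to the principal axes and rescale to unit variance.
    eigvals, eigvecs = np.linalg.eigh(cov)
    whitening = eigvecs / np.sqrt(eigvals + 1e-12)       # (d, d)
    return centered @ whitening

# Toy usage with random data; real embeddings and corpus counts would be used instead.
rng = np.random.default_rng(0)
V, d = 1000, 50
emb = rng.normal(size=(V, d))
probs = 1.0 / np.arange(1, V + 1)                        # Zipf-like 1/rank weights
probs /= probs.sum()
whitened = zipfian_whitening(emb, probs)
```

The only change from standard whitening is that the mean and covariance are computed under the empirical word probabilities instead of uniform weights, which is the core idea of Zipfian Whitening.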

Why it matters?

This research is important because it provides a better way to represent words in AI models, leading to improved performance in various natural language processing tasks. By acknowledging the real distribution of word usage, Zipfian Whitening helps create more accurate and efficient language models, which can enhance applications like chatbots, translation services, and content generation.

Abstract

The word embedding space in neural models is skewed, and correcting this can improve task performance. We point out that most approaches for modeling, correcting, and measuring the symmetry of an embedding space implicitly assume that the word frequencies are uniform; in reality, word frequencies follow a highly non-uniform distribution, known as Zipf's law. Surprisingly, simply performing PCA whitening weighted by the empirical word frequency that follows Zipf's law significantly improves task performance, surpassing established baselines. From a theoretical perspective, both our approach and existing methods can be clearly categorized: word representations are distributed according to an exponential family with either uniform or Zipfian base measures. By adopting the latter approach, we can naturally emphasize informative low-frequency words in terms of their vector norm, which becomes evident from the information-geometric perspective, and in terms of the loss functions for imbalanced classification. Additionally, our theory corroborates that popular natural language processing methods, such as skip-gram negative sampling, WhiteningBERT, and headless language models, work well just because their word embeddings encode the empirical word frequency into the underlying probabilistic model.
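For readers who prefer the formulas, the frequency-weighted statistics described above can be written out as follows (our notation and rendering, not the paper's):

```latex
% p(w): empirical word probability; v_w: embedding of word w.
\begin{align}
  \boldsymbol{\mu} &= \sum_{w \in V} p(w)\, \mathbf{v}_w
    && \text{(mean under the Zipfian measure)} \\
  \boldsymbol{\Sigma} &= \sum_{w \in V} p(w)\,
    (\mathbf{v}_w - \boldsymbol{\mu})(\mathbf{v}_w - \boldsymbol{\mu})^{\top}
    && \text{(frequency-weighted covariance)} \\
  \tilde{\mathbf{v}}_w &= \boldsymbol{\Sigma}^{-1/2}(\mathbf{v}_w - \boldsymbol{\mu})
    && \text{(whitened embedding)}
\end{align}
% Uniform whitening replaces p(w) with 1/|V|; using the empirical,
% Zipfian p(w) instead is what the paper calls Zipfian Whitening.
```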