Unveiling Intrinsic Dimension of Texts: from Academic Abstract to Creative Story
Vladislav Pedashenko, Laida Kushnareva, Yana Khassan Nibal, Eduard Tulchinskii, Kristian Kuznetsov, Vladislav Zharchinskii, Yury Maximov, Irina Piontkovskaya
2025-11-24
Summary
This research investigates which properties of a text determine how complex large language models (LLMs) find it to represent, using a measure called 'intrinsic dimension' (ID). It aims to connect the way LLMs process information to specific, interpretable characteristics of the writing itself.
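To make 'intrinsic dimension' concrete: it is typically estimated from the geometry of a text's token representations inside the model. Below is a minimal sketch using the TwoNN estimator, a standard choice in this line of work; the exact estimator and layer the authors use are not specified in this summary, so treat those details as assumptions.

```python
# Sketch: TwoNN intrinsic-dimension estimate (Facco et al., 2017) of a point
# cloud, e.g. one layer's hidden states for every token of a single text.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def two_nn_id(points: np.ndarray) -> float:
    """Estimate intrinsic dimension via the TwoNN maximum-likelihood formula.

    points: (n_tokens, hidden_dim) array of token representations.
    """
    # Distances to the two nearest neighbors of each point
    # (column 0 is the point itself at distance 0).
    nn = NearestNeighbors(n_neighbors=3).fit(points)
    dists, _ = nn.kneighbors(points)
    r1, r2 = dists[:, 1], dists[:, 2]
    # Drop duplicate points to avoid division by zero.
    mask = r1 > 0
    mu = r2[mask] / r1[mask]
    # MLE: d = N / sum(log(mu)).
    return mask.sum() / np.log(mu).sum()
```

On this view, a text whose token representations fill out more directions of the embedding space gets a higher ID, which is the sense in which fiction "requires additional degrees of freedom" below.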
What's the problem?
While 'intrinsic dimension' is a useful tool for understanding LLMs, it wasn't clear *why* certain texts have higher or lower intrinsic dimension. Researchers knew *that* different types of writing affected it, but not *which* specific features were responsible, or how ID relates to how well a model can predict the text.
What's the solution?
The researchers used several complementary techniques to figure this out. They analyzed texts with a 'cross-encoder' to relate ID to text properties, examined linguistic features such as tone and style, and used 'sparse autoencoders' (SAEs) to pinpoint which features of a text drive its intrinsic dimension. They also ran steering experiments that deliberately amplified or suppressed those features to see how the dimension changed, confirming that the effects are causal rather than merely correlational (a sketch of such steering follows below). They found that scientific writing is easier for LLMs to represent, while creative and opinion writing is more complex.
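Steering with an SAE amounts to nudging the model's hidden states along one feature's direction (say, a 'formal report' or 'personal narrative' feature) and re-measuring ID. A hedged sketch of that operation; the names (sae_decoder, feature_id, alpha) are illustrative, not the authors' actual API:

```python
# Sketch: steer activations along one SAE feature direction.
import torch

def steer_along_feature(hidden: torch.Tensor, sae_decoder: torch.Tensor,
                        feature_id: int, alpha: float) -> torch.Tensor:
    """Shift every token's hidden state along a single SAE feature direction.

    hidden:      (n_tokens, hidden_dim) residual-stream activations.
    sae_decoder: (n_features, hidden_dim) decoder matrix of a trained SAE;
                 row i is the direction that feature i writes back.
    alpha:       steering strength; positive amplifies the feature,
                 negative suppresses it.
    """
    direction = sae_decoder[feature_id]
    direction = direction / direction.norm()  # unit norm for comparable scale
    return hidden + alpha * direction         # broadcast over all tokens
```

Recomputing ID on the steered activations then tests causality: if amplifying a scientific-style feature lowers ID while amplifying an emotion or narrative feature raises it, that supports the causal reading reported in the paper.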
Why it matters?
This work helps us better understand how LLMs 'think' about different kinds of text. It shows that intrinsic dimension is not just a proxy for how predictable a text is: once length is accounted for, the two are uncorrelated, so ID captures a geometric kind of complexity in its own right. Knowing that scientific writing is representationally simpler for LLMs, and that features like personalization, emotion, and narrative increase complexity, can help us use these models more effectively and interpret ID-based results more accurately.
Abstract
Intrinsic dimension (ID) is an important tool in modern LLM analysis, informing studies of training dynamics, scaling behavior, and dataset structure, yet its textual determinants remain underexplored. We provide the first comprehensive study grounding ID in interpretable text properties through cross-encoder analysis, linguistic features, and sparse autoencoders (SAEs). In this work, we establish three key findings. First, ID is complementary to entropy-based metrics: after controlling for length, the two are uncorrelated, with ID capturing geometric complexity orthogonal to prediction quality. Second, ID exhibits robust genre stratification: scientific prose shows low ID (~8), encyclopedic content medium ID (~9), and creative/opinion writing high ID (~10.5) across all models tested. This reveals that contemporary LLMs find scientific text "representationally simple" while fiction requires additional degrees of freedom. Third, using SAEs, we identify causal features: scientific signals (formal tone, report templates, statistics) reduce ID; humanized signals (personalization, emotion, narrative) increase it. Steering experiments confirm these effects are causal. Thus, for contemporary models, scientific writing appears comparatively "easy", whereas fiction, opinion, and affect add representational degrees of freedom. Our multi-faceted analysis provides practical guidance for the proper use of ID and the sound interpretation of ID-based results.
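As a concrete reading of the first finding ("after controlling for length, the two are uncorrelated"): one standard way to control for a confounder is partial correlation, i.e., correlating the residuals of ID and an entropy-based metric after regressing out text length. A minimal sketch under that assumption; the authors' exact procedure is not given here:

```python
# Sketch: partial correlation of two per-text metrics, controlling for length.
import numpy as np
from scipy import stats

def partial_corr(x: np.ndarray, y: np.ndarray, z: np.ndarray):
    """Pearson correlation of x and y after linearly regressing out z.

    Example use: x = per-text intrinsic dimension, y = per-text mean
    token entropy (or loss), z = text length in tokens.
    """
    resid_x = x - np.polyval(np.polyfit(z, x, 1), z)
    resid_y = y - np.polyval(np.polyfit(z, y, 1), z)
    return stats.pearsonr(resid_x, resid_y)  # (correlation, p-value)
```

A near-zero partial correlation is what "complementary to entropy-based metrics" means operationally: the two measures rank texts differently once length is factored out.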