Language Models Model Language
Łukasz Borchmann
2025-10-20
Summary
This paper argues that the way we've been thinking about how well large language models (LLMs) understand language is wrong, and proposes a new way to look at it based on the work of a linguist named Witold Mańczak.
What's the problem?
Currently, many people judge LLMs based on older ideas about language from thinkers like Saussure and Chomsky. These ideas suggest language needs a hidden, underlying structure and a connection to real-world experience to be truly understood. Because LLMs don't necessarily *have* these things, critics argue they don't actually 'understand' language, just mimic it. This leads to unproductive debates about whether LLMs can ever achieve true linguistic 'competence'.
What's the solution?
The paper suggests we shift our focus to the ideas of Witold Mańczak, who believed language is simply everything that *is* said and written. He emphasized that how often words and phrases are used is the most important factor in how language works. By applying this idea, the paper argues, we can stop worrying about whether LLMs have 'deep understanding' and instead focus on how well they reflect actual language use. This provides a more practical way to build, test, and interpret these models.
Why does it matter?
This new perspective is important because it offers a more constructive way to evaluate LLMs. Instead of setting unrealistic expectations based on abstract theories, it allows us to assess them based on their ability to accurately reproduce and utilize the patterns found in real-world language data, which is ultimately what makes them useful.
Abstract
Linguistic commentary on LLMs, heavily influenced by the theoretical frameworks of de Saussure and Chomsky, is often speculative and unproductive. Critics challenge whether LLMs can legitimately model language, citing the need for "deep structure" or "grounding" to achieve an idealized linguistic "competence." We argue for a radical shift in perspective towards the empiricist principles of Witold Mańczak, a prominent general and historical linguist. He defines language not as a "system of signs" or a "computational system of the brain" but as the totality of all that is said and written. Above all, he identifies frequency of use of particular language elements as language's primary governing principle. Using his framework, we challenge prior critiques of LLMs and provide a constructive guide for designing, evaluating, and interpreting language models.