ArXiv-to-Model: A Practical Study of Scientific LM Training

Anuj Gupta

2026-02-20

Summary

This paper details the practical steps and challenges involved in training a specialized language model for scientific fields such as mathematics, computer science, and physics. It is a how-to guide for building such models rather than a proposal for a new model design.

What's the problem?

While powerful language models exist, it's not always clear *how* to train a new one specifically for scientific topics, starting from the raw research papers themselves. There's a lack of detailed information about the entire process, from getting the data to actually training the model, especially when you don't have access to massive computing resources.

What's the solution?

The researchers trained a 1.36-billion-parameter language model on research papers from arXiv. They documented each step: filtering papers by metadata, validating the source archives, converting the LaTeX source into normalized text, tokenizing it with a domain-aware tokenizer, and training a dense transformer on limited hardware (two A100 GPUs). Across 24 experimental runs, they tracked training stability, how much data was lost at each preprocessing step, and where the bottlenecks were, such as storage and I/O.
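
To make the LaTeX-to-text step concrete, here is a minimal sketch of what such a conversion pass might look like. It is not the authors' extractor: the function names `latex_to_text` and `yield_ratio`, the handful of commands handled, and the choice to leave inline math untouched are all assumptions made for illustration. The yield ratio is a crude character-level stand-in for the data yield losses the paper measures.

```python
import re


def latex_to_text(source: str) -> str:
    """Rough LaTeX-to-text pass (hypothetical; not the paper's pipeline)."""
    # Keep only the document body when a begin/end{document} pair exists.
    body = re.search(r"\\begin\{document\}(.*?)\\end\{document\}", source, re.DOTALL)
    text = body.group(1) if body else source
    # Drop comments: an unescaped % through the end of the line.
    text = re.sub(r"(?<!\\)%.*", "", text)
    # Turn (sub)section headings into plain lines of their own.
    text = re.sub(r"\\(?:sub)*section\*?\{([^{}]*)\}", r"\n\1\n", text)
    # Unwrap simple formatting commands, keeping their argument.
    text = re.sub(r"\\(?:emph|textbf|textit)\{([^{}]*)\}", r"\1", text)
    # Remove citation, reference, and label commands entirely.
    text = re.sub(r"\\(?:cite|ref|label)\{[^{}]*\}", "", text)
    # Normalize whitespace; math such as $x^2$ is deliberately left as-is.
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


def yield_ratio(raw: str, cleaned: str) -> float:
    """Fraction of raw characters that survive cleaning (assumed metric)."""
    return len(cleaned) / len(raw) if raw else 0.0


sample = (
    "\\documentclass{article}\n"
    "\\begin{document}\n"
    "\\section{Intro}\n"
    "We show \\emph{that} $x^2 \\ge 0$ holds \\cite{foo}. % draft note\n"
    "\\end{document}\n"
)
cleaned = latex_to_text(sample)
ratio = yield_ratio(sample, cleaned)
```

Real arXiv sources contain macros, nested braces, and multi-file projects that simple regular expressions like these cannot handle, which is one reason preprocessing choices can change the usable token volume so much.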

Why does it matter?

This work is important because it provides a realistic and transparent guide for researchers who want to build their own specialized scientific language models but don't have huge budgets for computing. It shows what to expect, what decisions matter most, and how to overcome common obstacles, making this technology more accessible.

Abstract

While frontier large language models demonstrate strong reasoning and mathematical capabilities, the practical process of training domain-specialized scientific language models from raw sources remains under-documented. In this work, we present a detailed case study of training a 1.36B-parameter scientific language model directly from raw arXiv LaTeX sources spanning mathematics, computer science, and theoretical physics. We describe an end-to-end pipeline covering metadata filtering, archive validation, LaTeX extraction, text normalization, domain-aware tokenization, and dense transformer training under constrained compute (2xA100 GPUs). Through 24 experimental runs, we analyze training stability, scaling behavior, data yield losses, and infrastructure bottlenecks. Our findings highlight how preprocessing decisions significantly affect usable token volume, how tokenization impacts symbolic stability, and how storage and I/O constraints can rival compute as limiting factors. We further analyze convergence dynamics and show stable training behavior in a data-rich regime (52B pretraining tokens). Rather than proposing a novel architecture, this work provides an engineering-grounded, transparent account of training a small scientific language model from scratch. We hope these insights support researchers operating under moderate compute budgets who seek to build domain-specialized models.
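
As a back-of-the-envelope illustration of the scale described in the abstract, the sketch below works out tokens per parameter and an approximate training cost for a single pass over the data. Only the parameter count (1.36B), the token count (52B), and the GPU count (2) come from the abstract. The "6 x parameters x tokens" FLOP estimate is a common rule of thumb for dense transformers, 312 TFLOP/s is the A100's peak dense BF16 throughput, and the 40% utilization figure is purely an assumption; none of these are reported by the paper.

```python
PARAMS = 1.36e9           # model parameters (from the abstract)
TOKENS = 52e9             # pretraining tokens (from the abstract)
NUM_GPUS = 2              # 2xA100 (from the abstract)
PEAK_FLOPS = 312e12       # A100 peak dense BF16 FLOP/s
ASSUMED_UTILIZATION = 0.4 # assumed fraction of peak actually achieved


def tokens_per_parameter(tokens: float, params: float) -> float:
    """Ratio often used to judge how data-rich a training run is."""
    return tokens / params


def training_flops(tokens: float, params: float) -> float:
    """Rule-of-thumb estimate: about 6 FLOPs per parameter per token."""
    return 6.0 * params * tokens


def wall_clock_days(flops: float, gpus: int, peak: float, util: float) -> float:
    """Idealized single-run duration; ignores I/O stalls and restarts."""
    seconds = flops / (gpus * peak * util)
    return seconds / 86_400


ratio = tokens_per_parameter(TOKENS, PARAMS)
flops = training_flops(TOKENS, PARAMS)
days = wall_clock_days(flops, NUM_GPUS, PEAK_FLOPS, ASSUMED_UTILIZATION)
```

Under these assumptions the run sees about 38 tokens per parameter and costs roughly 4.2e20 FLOPs. The wall-clock estimate depends entirely on the assumed utilization and ignores the storage and I/O constraints the paper identifies as limiting factors, so treat it as a lower bound rather than a reported result.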
