Arcee Trinity Large Technical Report
Varun Singh, Lucas Krauss, Sami Jaghouar, Matej Sirovatka, Charles Goddard, Fares Obied, Jack Min Ong, Jannik Straube, Fern, Aria Harley, Conner Stewart, Colin Kealty, Maziyar Panahi, Simon Kirsten, Anushka Deshpande, Anneketh Vij, Arthur Bresnu, Pranav Veldurthi, Raghav Ravishankar, Hardik Bishnoi, DatologyAI Team, Arcee AI Team
2026-02-20
Summary
This paper introduces three new language models, Arcee Trinity Large, Trinity Nano, and Trinity Mini, built with a 'Mixture-of-Experts' approach: each piece of input is routed to only a small set of specialized 'expert' sub-networks rather than through the whole model. The models vary in size, with Trinity Large being the biggest and Trinity Nano the smallest.
What's the problem?
Building really large language models is hard because they require enormous amounts of computing power and memory. In traditional 'dense' models every parameter is used for every piece of input, so processing becomes slower and more expensive as the model grows. Keeping such massive models stable during training, without problems like sudden loss spikes, is also a significant challenge.
What's the solution?
The researchers built these models with a modern architecture that interleaves local and global attention, adds a new way of balancing how the 'experts' within the model are used (called SMEBU, introduced for the largest model), and trains with the Muon optimizer. They trained the models on huge amounts of text, 10 to 17 trillion tokens, and made the trained models publicly available.
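As a rough, illustrative sketch (not the Trinity implementation), the PyTorch snippet below shows the core Mixture-of-Experts idea the summary refers to: a sigmoid router scores all experts, but each token is processed by only its top-k experts. The layer sizes, expert count, and top-k value are hypothetical placeholders.

```python
import torch
import torch.nn as nn

class SigmoidRoutedMoE(nn.Module):
    """Illustrative sparse MoE layer: sigmoid router scores, top-k expert selection.

    All dimensions are hypothetical and do not correspond to any Trinity model.
    """
    def __init__(self, d_model=1024, d_ff=2048, num_experts=64, top_k=4):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, num_experts, bias=False)
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_ff), nn.SiLU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        ])

    def forward(self, x):  # x: (num_tokens, d_model)
        scores = torch.sigmoid(self.router(x))           # independent per-expert gate scores
        weights, idx = scores.topk(self.top_k, dim=-1)    # keep only the top-k experts per token
        weights = weights / weights.sum(dim=-1, keepdim=True)  # renormalize the kept gates
        out = torch.zeros_like(x)
        # Naive dispatch loop for clarity; real systems batch tokens per expert.
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out
```

With 64 experts and top_k = 4 as in this toy configuration, only a small fraction of the expert parameters runs for any given token, which is where the efficiency gain described below comes from.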
Why it matters?
These models demonstrate a way to build very large language models more efficiently: by activating only a small portion of the network for each token, they keep the computational cost well below that of a dense model with the same total parameter count. Completing training with zero loss spikes is also a notable result, and releasing the model checkpoints openly allows other researchers to build on this work.
Abstract
We present the technical report for Arcee Trinity Large, a sparse Mixture-of-Experts model with 400B total parameters and 13B activated per token. Additionally, we report on Trinity Nano and Trinity Mini: Trinity Nano has 6B total parameters with 1B activated per token, and Trinity Mini has 26B total parameters with 3B activated per token. The models' modern architecture includes interleaved local and global attention, gated attention, depth-scaled sandwich norm, and sigmoid routing for Mixture-of-Experts. For Trinity Large, we also introduce a new MoE load balancing strategy called Soft-clamped Momentum Expert Bias Updates (SMEBU). We train the models using the Muon optimizer. All three models completed training with zero loss spikes. Trinity Nano and Trinity Mini were pre-trained on 10 trillion tokens, and Trinity Large was pre-trained on 17 trillion tokens. The model checkpoints are available at https://huggingface.co/arcee-ai.
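The abstract names Soft-clamped Momentum Expert Bias Updates (SMEBU) as Trinity Large's load-balancing strategy; the actual update rule is specified later in the report. Purely as a name-based illustration of the broader family of bias-based (auxiliary-loss-free) load balancing, the sketch below nudges a per-expert routing bias toward under-used experts with a momentum-smoothed, soft-clamped update. The exact form of the update and every hyperparameter here are assumptions, not the paper's algorithm.

```python
import torch

def smebu_style_bias_update(expert_load, bias, velocity,
                            lr=1e-3, momentum=0.9, clamp=1.0):
    """Name-based sketch only; NOT the SMEBU rule defined in the report.

    Experts that received more than their fair share of tokens get their
    routing bias pushed down; under-used experts get it pushed up.
    `lr`, `momentum`, and `clamp` are hypothetical hyperparameters.
    """
    # Signed imbalance: positive for under-loaded experts, negative for over-loaded ones.
    imbalance = expert_load.mean() - expert_load
    # Momentum smooths the update direction across training steps.
    velocity = momentum * velocity + (1.0 - momentum) * imbalance
    # Soft clamp (tanh) keeps the bias bounded so routing scores are only gently nudged.
    bias = clamp * torch.tanh((bias + lr * velocity) / clamp)
    return bias, velocity
```

In bias-based balancing schemes of this kind, the bias is typically added to the router scores only when selecting the top-k experts (not when weighting their outputs), steering tokens toward under-used experts without an auxiliary loss term.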