TESS 2: A Large-Scale Generalist Diffusion Language Model

Jaesung Tae, Hamish Ivison, Sachin Kumar, Arman Cohan

2025-02-20

Summary

This paper introduces TESS 2, a new kind of AI language model that uses a technique called diffusion to understand and follow instructions. Instead of writing an answer one word at a time, it drafts a whole response and then refines it step by step, like a writer polishing successive drafts.

What's the problem?

Most current AI language models are autoregressive: they generate answers one token at a time, and once a token is written it can't be revised. That design makes it hard to trade extra computation for better answers. It's like having a smart friend who can answer questions quickly, but can't give a better answer even if you give them more time to think.

What's the solution?

The researchers built TESS 2 by taking a strong existing autoregressive model and adapting it, through continued pretraining with a diffusion objective, so that it generates text by gradually refining a noisy draft into a clean answer; instruction tuning then teaches it to follow user requests. They also propose reward guidance, a way to steer the model toward higher-scoring responses at inference time without retraining the underlying system. And because diffusion works in refinement steps, TESS 2 keeps improving its answers when given more steps, which autoregressive models can't do as easily.
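The post doesn't spell out the sampling loop, but the reward-guidance idea can be sketched in miniature. Everything below is an illustrative assumption: `reward_guided_denoise`, the noise schedule, and the direction returned by `reward_fn` are hypothetical stand-ins, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def reward_guided_denoise(logits, reward_fn, steps=10, guidance_weight=0.5, seed=0):
    """Refine token logits over several diffusion-style steps, nudging each
    step toward outputs that a reward function scores more highly.
    (Toy sketch: names and update rule are assumptions, not the paper's.)"""
    rng = np.random.default_rng(seed)
    x = logits.copy()
    for t in range(steps, 0, -1):
        # denoising-style update: the perturbation shrinks as t decreases
        x = x - 0.1 * rng.normal(scale=t / steps, size=x.shape)
        # reward guidance: nudge logits in the direction reward_fn suggests
        x = x + guidance_weight * reward_fn(softmax(x))
    return softmax(x).argmax(axis=-1)
```

Here `reward_fn` plays the role of a learned reward model returning a direction in logit space. The key property is modularity: guidance is applied during sampling, so swapping in a different reward model requires no retraining.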

Why it matters?

This matters because it could lead to AI assistants that are not only smarter but also more flexible. They could give quick answers when needed, but also provide more thoughtful and accurate responses when given more time. This could be really useful in fields like education, research, or any area where the quality of information matters more than speed. It's a step towards AI that can adapt its thinking process to different situations, just like humans do.

Abstract

We introduce TESS 2, a general instruction-following diffusion language model that outperforms contemporary instruction-tuned diffusion models, as well as matches and sometimes exceeds strong autoregressive (AR) models. We train TESS 2 by first adapting a strong AR model via continued pretraining with the usual cross-entropy as diffusion loss, and then performing further instruction tuning. We find that adaptation training as well as the choice of the base model is crucial for training good instruction-following diffusion models. We further propose reward guidance, a novel and modular inference-time guidance procedure to align model outputs without needing to train the underlying model. Finally, we show that TESS 2 further improves with increased inference-time compute, highlighting the utility of diffusion LMs in having fine-grained controllability over the amount of compute used at inference time. Code and models are available at https://github.com/hamishivi/tess-2.
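The adaptation step the abstract mentions, continued pretraining with "the usual cross-entropy as diffusion loss", can be sketched roughly as follows. This is a toy illustration under my own assumptions (the function name, the Gaussian corruption of one-hot token encodings, and the `denoiser` interface are hypothetical, not taken from the paper):

```python
import numpy as np

def diffusion_ce_loss(token_ids, vocab_size, t, denoiser, seed=0):
    """Corrupt a one-hot (simplex) encoding of the tokens with noise at
    level t, let the model predict the clean tokens, and score it with
    ordinary cross-entropy. Toy sketch, not the paper's implementation."""
    rng = np.random.default_rng(seed)
    token_ids = np.asarray(token_ids)
    one_hot = np.eye(vocab_size)[token_ids]                # (seq, vocab)
    noisy = one_hot + t * rng.normal(size=one_hot.shape)   # corrupted input
    logits = denoiser(noisy, t)                            # predict clean tokens
    m = logits.max(-1, keepdims=True)                      # stable log-softmax
    log_probs = logits - m - np.log(np.exp(logits - m).sum(-1, keepdims=True))
    return -log_probs[np.arange(len(token_ids)), token_ids].mean()
```

The point of reusing plain cross-entropy is that the adapted model's training signal looks just like standard language-model pretraining, only computed from noised inputs, which is what lets a strong autoregressive base model be converted rather than trained from scratch.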