
Sessa: Selective State Space Attention

Liubomyr Horbatko

2026-04-27


Summary

This paper introduces Sessa, a new kind of decoder model for sequence data such as text, designed to improve how models handle very long inputs.

What's the problem?

Current leading models for sequence processing, Transformers and structured state-space models, both struggle with long sequences. When a Transformer's attention is spread across everything in a long input, the influence of any single token gets diluted, making it hard to focus on the details that matter. Structured state-space models push information through a fixed-size recurrent state, so they can 'forget' information over long distances unless it is actively preserved. Essentially, both have trouble remembering and selectively retrieving the right information when dealing with lengthy inputs.

What's the solution?

The researchers created Sessa, which combines the strengths of both approaches. It places attention mechanisms *inside* a recurrent feedback loop. Imagine a loop where information constantly cycles, and at each step attention helps decide which parts of the past are most relevant. This creates multiple pathways for information to flow, making it easier to preserve and access details from earlier in the sequence. The researchers also prove, under explicit assumptions, that Sessa can retain information over longer distances and retrieve it more flexibly than comparable Transformer and Mamba-style models.
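To make the idea concrete, here is a minimal, hedged sketch of "attention inside a recurrent feedback loop." It is not the authors' implementation: the dimensions, parameter names, the softmax attention read, and the tanh state update are all illustrative assumptions.

```python
# Toy sketch (assumed, not the paper's architecture): at each step, the new
# recurrent state is built from an attention read over the *history of past
# states*, so attention sits inside the recurrent feedback path.
import numpy as np

rng = np.random.default_rng(0)
d = 16                     # hidden size (assumed)
T = 32                     # sequence length (assumed)

# Random matrices stand in for learned parameters.
W_q, W_k, W_v = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(3))
W_in, W_rec   = (rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(2))

def softmax(x):
    x = x - x.max()
    e = np.exp(x)
    return e / e.sum()

tokens = rng.standard_normal((T, d))   # embedded input tokens
states = [np.zeros(d)]                 # recurrent state history

for t in range(T):
    h_prev = states[-1]
    # Attention read over past states: the query comes from the current
    # recurrence, keys/values come from the stored state history.
    q = W_q @ (W_rec @ h_prev + W_in @ tokens[t])
    K = np.stack(states) @ W_k.T
    V = np.stack(states) @ W_v.T
    attn = softmax(K @ q / np.sqrt(d)) @ V
    # The attended summary is fed back into the next state, so a past token
    # can influence future states through many attention-based paths.
    h_new = np.tanh(W_rec @ h_prev + W_in @ tokens[t] + attn)
    states.append(h_new)

print(len(states) - 1, "states produced; last state norm:", np.linalg.norm(states[-1]))
```

Because every new state is appended to the history that later steps attend over, a token's influence can reach the future through many compounding attention reads rather than through a single attention pass or a single recurrent chain.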

Why does it matter?

Sessa represents a step forward in handling long-context tasks, like understanding very long documents or conversations. Because it performs better on long sequences while still being competitive on shorter ones, it could lead to improvements in many areas of artificial intelligence, especially those requiring models to process and understand extensive amounts of information.

Abstract

Modern sequence modeling is dominated by two families: Transformers, whose self-attention can access arbitrary elements of the visible sequence, and structured state-space models, which propagate information through an explicit recurrent state. These mechanisms face different limitations on long contexts: when attention is diffuse, the influence of individual tokens is diluted across the effective support, while recurrent state propagation can lose long-range sensitivity unless information is actively preserved. As a result, both mechanisms face challenges in preserving and selectively retrieving information over long contexts. We propose Sessa, a decoder that places attention inside a recurrent feedback path. This creates many attention-based paths through which past tokens can influence future states, rather than relying on a single attention read or a single recurrent chain. We prove that, under explicit assumptions and matched regimes, Sessa admits power-law memory tails O(ℓ^{-β}) for 0 < β < 1, with slower decay than in the corresponding Transformer and Mamba-style baselines. We further give an explicit construction that achieves this power-law rate. Under the same assumptions, Sessa is the only model class among those considered that realizes flexible selective retrieval, including profiles whose influence does not decay with distance. Consistent with this theoretical advantage, across matched experiments, Sessa achieves the strongest performance on long-context benchmarks while remaining competitive with Transformer and Mamba-style baselines on short-context language modeling.
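To unpack the memory-tail claim, the following is one hedged reading. The abstract does not spell out the exact quantity being bounded; measuring memory as the sensitivity of the current state to a token ℓ steps back, and the exponential tail shown for a plain fixed-size recurrence, are assumptions made for illustration only.

```latex
% Assumed reading: "memory at lag ell" = sensitivity of the current state h_t
% to the input x_{t-ell}. The paper's precise definition may differ.
\[
  \underbrace{\Bigl\|\tfrac{\partial h_t}{\partial x_{t-\ell}}\Bigr\|
      = \Theta\!\bigl(\ell^{-\beta}\bigr),\ 0<\beta<1}_{\text{power-law tail (Sessa)}}
  \qquad\text{vs.}\qquad
  \underbrace{\Bigl\|\tfrac{\partial h_t}{\partial x_{t-\ell}}\Bigr\|
      = O\!\bigl(\rho^{\ell}\bigr),\ 0<\rho<1}_{\text{exponential tail (plain fixed-size recurrence, assumed)}}
\]
% For large lags, ell^{-beta} eventually dominates rho^ell, so a power-law
% tail keeps distant tokens influential far longer than an exponentially
% decaying recurrent state.
```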