Gecko: An Efficient Neural Architecture Inherently Processing Sequences with Arbitrary Lengths

Xuezhe Ma, Shicheng Wen, Linghao Jin, Bilge Acun, Ruihang Lai, Bohan Hou, Will Lin, Hao Zhang, Songlin Yang, Ryan Lee, Mengxi Wu, Jonathan May, Luke Zettlemoyer, Carole-Jean Wu

2026-01-13

Summary

This paper introduces Gecko, a new neural network architecture designed to process long sequences of data, such as text, more efficiently than existing models based on the Transformer.

What's the problem?

Current models, particularly those based on the Transformer architecture, struggle with very long sequences. Their computational cost grows quadratically with sequence length, and they extrapolate poorly to sequences longer than those seen during training, so they aren't great at capturing relationships between pieces of information that are far apart. This limits their ability to handle tasks requiring a large amount of context, like summarizing long documents or answering questions based on extensive information.

What's the solution?

The researchers built Gecko on the foundation of two earlier models, Mega and Megalodon, which combine an exponential moving average with gated attention. They then added several improvements: timestep decay normalization, which normalizes activations based on their position in the sequence; a sliding chunk attention mechanism, which restricts attention to smaller, sliding windows of the sequence; and an adaptive working memory, which lets the model retain important information across long distances. They pretrained Gecko on a large amount of data and compared it against Llama2 and Megalodon.
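To make two of these ideas concrete, here is a minimal sketch of (a) a scalar exponential moving average recurrence like the one underlying Mega/Megalodon-style models, and (b) a causal sliding-window attention mask of the kind a sliding chunk attention mechanism relies on. This is an illustration only, not the authors' implementation: the real models use multi-dimensional, learned EMA parameters and chunked attention kernels, and the function names and the `window` parameter here are assumptions for the example.

```python
import numpy as np

def ema_smooth(x, alpha=0.9):
    """Exponential moving average along the time axis.

    Illustrates the recurrence h_t = alpha * h_{t-1} + (1 - alpha) * x_t,
    which propagates information forward with cost linear in sequence length.
    (The actual models use multi-dimensional, learned/damped EMA parameters.)
    """
    h = np.zeros_like(x, dtype=float)
    prev = 0.0
    for t in range(len(x)):
        prev = alpha * prev + (1.0 - alpha) * x[t]
        h[t] = prev
    return h

def sliding_window_mask(seq_len, window):
    """Causal mask where position t attends only to the last `window`
    positions (itself included), so attention cost grows linearly with
    sequence length instead of quadratically."""
    mask = np.zeros((seq_len, seq_len), dtype=bool)
    for t in range(seq_len):
        lo = max(0, t - window + 1)
        mask[t, lo:t + 1] = True
    return mask

# Example: each row of the mask covers at most `window` positions.
m = sliding_window_mask(6, window=3)
print(m.sum(axis=1))  # row t can see min(t + 1, window) positions
```

The point of the sketch is the division of labor the summary describes: the EMA carries a compressed signal across the whole sequence, while attention is kept cheap by limiting it to a local window.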

Why it matters?

Gecko shows promising results, achieving better performance and efficiency than other models when dealing with long sequences. It can handle sequences much longer than its core attention mechanism would normally allow, meaning it can process and retrieve information from very large amounts of text without needing special tricks to extend its context window. This is a big step towards building AI systems that can truly understand and work with complex, lengthy information.

Abstract

Designing a unified neural network to efficiently and inherently process sequential data with arbitrary lengths is a central and challenging problem in sequence modeling. The design choices in Transformer, including quadratic complexity and weak length extrapolation, have limited their ability to scale to long sequences. In this work, we propose Gecko, a neural architecture that inherits the design of Mega and Megalodon (exponential moving average with gated attention), and further introduces multiple technical components to improve its capability to capture long range dependencies, including timestep decay normalization, sliding chunk attention mechanism, and adaptive working memory. In a controlled pretraining comparison with Llama2 and Megalodon in the scale of 7 billion parameters and 2 trillion training tokens, Gecko achieves better efficiency and long-context scalability. Gecko reaches a training loss of 1.68, significantly outperforming Llama2-7B (1.75) and Megalodon-7B (1.70), and landing close to Llama2-13B (1.67). Notably, without relying on any context-extension techniques, Gecko exhibits inherent long-context processing and retrieval capabilities, stably handling sequences of up to 4 million tokens and retrieving information from contexts up to 4x longer than its attention window. Code: https://github.com/XuezheMax/gecko-llm