GEMS: Agent-Native Multimodal Generation with Memory and Skills

Zefeng He, Siyuan Huang, Xiaoye Qu, Yafu Li, Tong Zhu, Yu Cheng, Yang Yang

2026-04-01

Summary

This paper introduces GEMS, a new framework designed to improve how AI models generate content, especially when given complicated instructions or asked to perform specific tasks.

What's the problem?

Current AI models are good at general tasks like writing stories or creating images, but they often struggle with complex or specialized requests, such as following detailed instructions for a specific design or solving a technical problem. They lack the ability to think through a problem step by step and to remember what they have already tried.

What's the solution?

The researchers created GEMS, which works like a team of AI 'agents' that collaborate to generate better results. It has three main parts: an 'Agent Loop' that iteratively refines the output, an 'Agent Memory' that records past attempts and lessons learned to avoid repeating mistakes, and 'Agent Skills', which are like specialized tools the AI can load for different tasks. Together, these let the system break down complex problems, learn from its experience, and apply the right expertise when needed.
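To make the three parts concrete, here is a minimal sketch of how such a closed loop could fit together. All names, the scoring scheme, and the stopping rule are illustrative assumptions, not the paper's actual implementation; `generate` and `evaluate` stand in for whatever generative backend and judge the framework uses.

```python
# Hypothetical sketch of a GEMS-style agent loop (names and scoring
# are illustrative, not the paper's actual implementation).

def load_skill(task_type, skills):
    # On-demand skill loading: fetch domain expertise only when needed.
    return skills.get(task_type, "no specialized guidance")

def agent_loop(prompt, generate, evaluate, skills, task_type, max_iters=5):
    memory = []  # trajectory-level memory: (attempt, score, critique)
    skill = load_skill(task_type, skills)
    best_output, best_score = None, float("-inf")
    for step in range(max_iters):
        # Condition generation on the prompt, the loaded skill, and a
        # compressed summary of recent attempts, so the agent avoids
        # repeating mistakes it has already made.
        summary = "; ".join(critique for _, _, critique in memory[-3:])
        output = generate(prompt, skill, summary)
        score, critique = evaluate(prompt, output)
        memory.append((output, score, critique))
        if score > best_score:
            best_output, best_score = output, score
        if score >= 1.0:  # good enough: close the loop early
            break
    return best_output, best_score, memory
```

The key design point the summary describes is the closed loop: each iteration's critique flows back into the next generation via memory, instead of the model producing one shot and stopping.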

Why it matters?

This work is important because it shows how to make even smaller AI models perform at a higher level, sometimes better than larger, more powerful models. It demonstrates that a smart framework like GEMS can significantly boost an AI's capabilities and make it useful for a wider range of real-world applications; we don't always need massive models to get great results.

Abstract

Recent multimodal generation models have achieved remarkable progress on general-purpose generation tasks, yet continue to struggle with complex instructions and specialized downstream tasks. Inspired by the success of advanced agent frameworks such as Claude Code, we propose GEMS (Agent-Native Multimodal GEneration with Memory and Skills), a framework that pushes beyond the inherent limitations of foundational models on both general and downstream tasks. GEMS is built upon three core components. Agent Loop introduces a structured multi-agent framework that iteratively improves generation quality through closed-loop optimization. Agent Memory provides a persistent, trajectory-level memory that hierarchically stores both factual states and compressed experiential summaries, enabling a global view of the optimization process while reducing redundancy. Agent Skill offers an extensible collection of domain-specific expertise with on-demand loading, allowing the system to effectively handle diverse downstream applications. Across five mainstream tasks and four downstream tasks, evaluated on multiple generative backends, GEMS consistently achieves significant performance gains. Most notably, it enables the lightweight 6B model Z-Image-Turbo to surpass the state-of-the-art Nano Banana 2 on GenEval2, demonstrating the effectiveness of agent harness in extending model capabilities beyond their original limits.
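The abstract describes Agent Memory as hierarchical: it keeps full factual states alongside compressed experiential summaries, giving a global view of the optimization trajectory without redundancy. A small sketch of that idea, with an entirely hypothetical structure (the class, field names, and `compress` hook are assumptions for illustration):

```python
# Hypothetical sketch of a hierarchical, trajectory-level memory in the
# spirit of GEMS's Agent Memory (structure is illustrative, not the
# paper's actual design).

class AgentMemory:
    def __init__(self, compress, window=3):
        self.states = []       # full factual record of every attempt
        self.summaries = []    # compressed lesson per attempt
        self.compress = compress  # turns a raw state into a short summary
        self.window = window

    def record(self, state):
        # Store both levels of the hierarchy for each step.
        self.states.append(state)
        self.summaries.append(self.compress(state))

    def global_view(self):
        # Recent summaries give a compact view of the whole trajectory
        # without replaying every raw state, reducing redundancy.
        return " | ".join(self.summaries[-self.window:])
```

The two-level split is what lets the loop condition on a short summary string each iteration while still retaining the full factual history for inspection.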