MonetGPT: Solving Puzzles Enhances MLLMs' Image Retouching Skills

Niladri Shekhar Dutt, Duygu Ceylan, Niloy J. Mitra

2025-05-13


Summary

This paper introduces MonetGPT, a multimodal large language model (MLLM) that can both understand and edit photos through explicit, step-by-step operations, keeping the main objects in each picture intact.

What's the problem?

Most AI models that edit images either can't explain what they're doing or end up changing important parts of the picture, such as people's faces or key objects, which users often don't want.

What's the solution?

The researchers trained MonetGPT to solve visual puzzles and then use that skill to suggest and make edits to photos in a way that's easy to understand. The model can explain its editing steps and ensures that the main parts of the image aren't changed accidentally.
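To make the idea of explainable, step-by-step edits concrete, here is a minimal Python sketch. It is not the authors' implementation: the `EditStep` structure, the two operations, and the parameter ranges are all hypothetical, chosen only to show how a plan of named global adjustments, each paired with a reason, could be applied in order. Because every operation here only rescales pixel values, the image layout and the objects in it stay where they are.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Hypothetical image representation: rows of (r, g, b) pixels in [0.0, 1.0].
Pixel = Tuple[float, float, float]
Image = List[List[Pixel]]


def clamp(value: float) -> float:
    """Keep a channel value inside the valid [0.0, 1.0] range."""
    return max(0.0, min(1.0, value))


def adjust_exposure(img: Image, amount: float) -> Image:
    """Scale every channel by (1 + amount); returns a new image."""
    return [[tuple(clamp(c * (1.0 + amount)) for c in px) for px in row]
            for row in img]


def adjust_saturation(img: Image, amount: float) -> Image:
    """Push each channel away from (or toward) the pixel's mean gray value."""
    out = []
    for row in img:
        new_row = []
        for r, g, b in row:
            gray = (r + g + b) / 3.0
            new_row.append(tuple(clamp(gray + (c - gray) * (1.0 + amount))
                                 for c in (r, g, b)))
        out.append(new_row)
    return out


# Assumed operation registry; a real retouching library would offer many more.
OPERATIONS = {"exposure": adjust_exposure, "saturation": adjust_saturation}


@dataclass
class EditStep:
    operation: str  # key into OPERATIONS
    amount: float   # signed strength of the adjustment
    reason: str     # human-readable explanation for this step


def apply_plan(img: Image, plan: List[EditStep]) -> Image:
    """Apply each step in order; the input image is left untouched."""
    for step in plan:
        img = OPERATIONS[step.operation](img, step.amount)
    return img


if __name__ == "__main__":
    photo = [[(0.2, 0.4, 0.6), (0.5, 0.5, 0.5)]]
    plan = [
        EditStep("exposure", 0.5, "The photo looks underexposed."),
        EditStep("saturation", 0.2, "Colors look slightly washed out."),
    ]
    for step in plan:
        print(f"{step.operation} {step.amount:+.2f}: {step.reason}")
    print(apply_plan(photo, plan))
```

In this sketch the plan itself serves as the explanation: each entry names the operation, its strength, and why it was chosen, so a user could inspect or change any single step before it is applied.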

Why does it matter?

This matters because it helps people get better, more reliable photo edits from AI, with clear explanations for each change. This is useful for artists, photographers, and anyone who wants to improve their pictures without losing important details.

Abstract

A multimodal large language model is trained to suggest and apply procedural edits to photographs, preserving object identity and providing explainable results.
