
MMFactory: A Universal Solution Search Engine for Vision-Language Tasks

Wan-Cyuan Fan, Tanzila Rahman, Leonid Sigal

2024-12-27


Summary

This paper introduces MMFactory, a system that helps users find the best solution for a given vision-language task by acting like a search engine over available models and techniques, combining them into complete, deployable pipelines.

What's the problem?

Many models have been developed for visual tasks, but no single model can effectively handle every task. Existing approaches, such as visual programming and tool-augmented multimodal LLMs, often ignore user needs like performance requirements or computational budgets, and they tend to produce per-sample solutions that are difficult to deploy. This makes it hard for non-experts to get workable solutions for their specific tasks.

What's the solution?

To address these problems, the authors developed MMFactory, a universal framework that suggests multiple candidate solutions for a user-defined task. Users provide a task description, a few example input-output pairs, and (optionally) performance or resource constraints; MMFactory then generates programmatic solutions by combining vision and language tools from its model repository. It also proposes evaluation metrics and benchmarks each candidate's accuracy and resource cost, so users can pick the solution that fits their constraints. Internally, the system uses a committee-based approach in which multiple LLM agents converse to produce solutions that are executable, diverse, and robust.
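The core idea of filtering candidate solutions by user constraints and ranking the survivors can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the solution names, accuracy/latency numbers, and the `route` function are all hypothetical stand-ins for MMFactory's actual model routing and benchmarking components.

```python
from dataclasses import dataclass

@dataclass
class Solution:
    name: str          # illustrative pipeline name, not from the paper
    accuracy: float    # score benchmarked on the user's example pairs
    latency_ms: float  # measured resource cost per query

# Hypothetical pool of candidate pipelines the framework might propose.
CANDIDATES = [
    Solution("clip_retrieval + llm_rerank", accuracy=0.82, latency_ms=350),
    Solution("vqa_model_direct", accuracy=0.74, latency_ms=90),
    Solution("detector + captioner + llm", accuracy=0.88, latency_ms=1200),
]

def route(candidates, min_accuracy=0.0, max_latency_ms=float("inf")):
    """Keep only solutions meeting the user's constraints, ranked by accuracy."""
    feasible = [s for s in candidates
                if s.accuracy >= min_accuracy and s.latency_ms <= max_latency_ms]
    return sorted(feasible, key=lambda s: s.accuracy, reverse=True)

# A user who needs responses in under 500 ms:
ranked = route(CANDIDATES, min_accuracy=0.7, max_latency_ms=500)
print([s.name for s in ranked])
# → ['clip_retrieval + llm_rerank', 'vqa_model_direct']
```

The slowest pipeline is the most accurate, but it is excluded because it violates the latency constraint; this is the trade-off MMFactory surfaces by reporting both performance and resource characteristics for each proposed solution.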

Why it matters?

This research is important because it makes advanced technology more accessible to everyone, even those without technical expertise. By providing tailored solutions for various tasks, MMFactory can enhance how users interact with AI systems in fields like computer vision and natural language processing. This could lead to better applications in areas such as education, healthcare, and content creation.

Abstract

With advances in foundational and vision-language models, and effective fine-tuning techniques, a large number of both general and special-purpose models have been developed for a variety of visual tasks. Despite the flexibility and accessibility of these models, no single model is able to handle all tasks and/or applications that may be envisioned by potential users. Recent approaches, such as visual programming and multimodal LLMs with integrated tools, aim to tackle complex visual tasks by way of program synthesis. However, such approaches overlook user constraints (e.g., performance / computational needs), produce test-time sample-specific solutions that are difficult to deploy, and, sometimes, require low-level instructions that may be beyond the abilities of a naive user. To address these limitations, we introduce MMFactory, a universal framework that includes model and metrics routing components, acting like a solution search engine across various available models. Based on a task description, a few sample input-output pairs, and (optionally) resource and/or performance constraints, MMFactory can suggest a diverse pool of programmatic solutions by instantiating and combining visio-lingual tools from its model repository. In addition to synthesizing these solutions, MMFactory also proposes metrics and benchmarks performance / resource characteristics, allowing users to pick a solution that meets their unique design constraints. From the technical perspective, we also introduce a committee-based solution proposer that leverages multi-agent LLM conversation to generate executable, diverse, universal, and robust solutions for the user. Experimental results show that MMFactory outperforms existing methods by delivering state-of-the-art solutions tailored to user problem specifications. Project page is available at https://davidhalladay.github.io/mmfactory_demo.