MoTE: Mixture of Ternary Experts for Memory-efficient Large Multimodal Models
Hongyu Wang, Jiayu Xu, Ruiping Wang, Yan Feng, Yitao Zhai, Peng Pei, Xunliang Cai, Xilin Chen
2025-06-19
Summary
This paper introduces MoTE, a method for making large multimodal AI models more memory-efficient by combining many small expert modules that use ternary precision instead of full precision.
What's the problem?
The problem is that large AI models, especially those that process multiple types of information like images and text, usually need a lot of memory and computing power, making them hard to use on smaller devices like phones or robots.
What's the solution?
The researchers designed MoTE, a Mixture-of-Experts architecture in which each expert stores its weights as low-precision ternary values (only -1, 0, or 1) rather than full-precision numbers. Because a ternary weight needs far fewer bits to store, the model's memory footprint shrinks dramatically while performance stays high, making it small enough to run on edge devices.
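To give a feel for why ternary weights save memory, here is a minimal sketch of ternary quantization using an absolute-mean scale factor. This is an illustrative assumption, not MoTE's exact recipe: the function name `ternary_quantize` and the absmean scaling rule are hypothetical choices for this example.

```python
import numpy as np

def ternary_quantize(W, eps=1e-8):
    """Map a full-precision weight matrix to ternary values {-1, 0, +1}
    plus one shared full-precision scale (absmean scheme; an assumption,
    not necessarily MoTE's exact method)."""
    scale = np.abs(W).mean() + eps              # single scale for the whole matrix
    Wq = np.clip(np.round(W / scale), -1, 1)    # every entry becomes -1, 0, or +1
    return Wq.astype(np.int8), scale            # int8 storage instead of float32

# The original weights are approximated as scale * Wq.
W = np.random.randn(64, 64).astype(np.float32)
Wq, s = ternary_quantize(W)
```

Stored this way, each weight needs at most 2 bits of information instead of 32, which is the source of the memory savings the paper targets.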
Why does it matter?
This matters because it makes powerful multimodal AI models more accessible on everyday devices by reducing their memory needs, helping create smarter apps and robots that can operate without needing huge computers.
Abstract
MoTE is a scalable and memory-efficient approach that builds Mixture-of-Experts models from low-precision ternary experts, improving performance while reducing the memory footprint for deployment on edge devices.