Modality Alignment with Multi-scale Bilateral Attention for Multimodal Recommendation

Kelin Ren, Chan-Yang Ju, Dong-Ho Lee

2025-09-12

Summary

This paper introduces a new recommendation system called MambaRec, designed to better suggest items to users by understanding both what they've liked before and the details of the items themselves, like images and descriptions.

What's the problem?

Current recommendation systems that use multiple types of information, like pictures and text, often struggle with two main issues. First, they are not good at connecting fine-grained details across the different types of information, for example, understanding how a specific object in an image relates to a word in the item's description. Second, the overall patterns of how images and text are represented can drift apart, leaving the system with inconsistent, biased representations across the different types of data.

What's the solution?

MambaRec tackles these problems with a two-pronged approach. A module called DREAM applies multi-scale dilated convolutions with channel and spatial attention, techniques borrowed from image recognition, to carefully align fine-grained patterns and relationships between images and text. At the same time, it uses two statistical objectives, Maximum Mean Discrepancy and a contrastive loss, to keep the overall distributions of image and text representations consistent and free of modality-specific bias. To make the system efficient on large amounts of data, it also reduces the dimensionality of the high-dimensional multimodal features before processing them.
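
To make the DREAM idea more concrete, here is a minimal PyTorch sketch of such a block. It is an illustration only: the class name, layer choices (1-D dilated convolutions, a squeeze-and-excitation-style channel gate, a single-channel spatial gate) and tensor shapes are assumptions, not taken from the paper or its released code.

import torch
import torch.nn as nn

class DilatedRefinementBlock(nn.Module):
    # Illustrative DREAM-style block: multi-scale dilated convolutions
    # followed by channel-wise and spatial attention over the concatenated
    # visual and textual token features. Names and shapes are assumptions.
    def __init__(self, dim, dilations=(1, 2, 4)):
        super().__init__()
        # One dilated 1-D convolution branch per scale; padding keeps length fixed.
        self.branches = nn.ModuleList(
            nn.Conv1d(dim, dim, kernel_size=3, padding=d, dilation=d)
            for d in dilations
        )
        # Channel attention: global pooling + bottleneck + sigmoid gate.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(dim, dim // 4, kernel_size=1), nn.ReLU(),
            nn.Conv1d(dim // 4, dim, kernel_size=1), nn.Sigmoid(),
        )
        # Spatial attention: one gating weight per token position.
        self.spatial_gate = nn.Sequential(
            nn.Conv1d(dim, 1, kernel_size=7, padding=3), nn.Sigmoid()
        )

    def forward(self, visual, textual):
        # visual, textual: (batch, tokens, dim); stack the two modalities.
        x = torch.cat([visual, textual], dim=1).transpose(1, 2)   # (batch, dim, tokens)
        multi_scale = sum(branch(x) for branch in self.branches) / len(self.branches)
        refined = multi_scale * self.channel_gate(multi_scale)    # reweight channels
        refined = refined * self.spatial_gate(refined)            # reweight positions
        return refined.transpose(1, 2)                            # (batch, tokens, dim)

# Example: 8 items, 16 visual tokens and 32 textual tokens of dimension 64
block = DilatedRefinementBlock(dim=64)
aligned = block(torch.randn(8, 16, 64), torch.randn(8, 32, 64))   # -> (8, 48, 64)

The dilated branches look at progressively wider neighborhoods of the feature sequence, while the two attention gates decide which channels and which token positions carry cross-modal signal; the actual module in the paper may combine these pieces differently.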

Why it matters?

This research is important because it improves the accuracy and reliability of recommendation systems. Better recommendations mean a more satisfying experience for users on e-commerce sites and content platforms, potentially leading to increased sales or engagement. The improvements in efficiency also mean these systems can handle larger amounts of data without slowing down.

Abstract

Multimodal recommendation systems are increasingly becoming foundational technologies for e-commerce and content platforms, enabling personalized services by jointly modeling users' historical behaviors and the multimodal features of items (e.g., visual and textual). However, most existing methods rely on either static fusion strategies or graph-based local interaction modeling, facing two critical limitations: (1) insufficient ability to model fine-grained cross-modal associations, leading to suboptimal fusion quality; and (2) a lack of global distribution-level consistency, causing representational bias. To address these, we propose MambaRec, a novel framework that integrates local feature alignment and global distribution regularization via attention-guided learning. At its core, we introduce the Dilated Refinement Attention Module (DREAM), which uses multi-scale dilated convolutions with channel-wise and spatial attention to align fine-grained semantic patterns between visual and textual modalities. This module captures hierarchical relationships and context-aware associations, improving cross-modal semantic modeling. Additionally, we apply Maximum Mean Discrepancy (MMD) and contrastive loss functions to constrain global modality alignment, enhancing semantic consistency. This dual regularization reduces mode-specific deviations and boosts robustness. To improve scalability, MambaRec employs a dimensionality reduction strategy to lower the computational cost of high-dimensional multimodal features. Extensive experiments on real-world e-commerce datasets show that MambaRec outperforms existing methods in fusion quality, generalization, and efficiency. Our code has been made publicly available at https://github.com/rkl71/MambaRec.
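
For readers curious about the global regularizers described above, the following is a minimal PyTorch sketch of an RBF-kernel MMD term and a symmetric InfoNCE-style contrastive loss applied to visual and textual item embeddings. The kernel bandwidth, temperature, and loss weights are placeholder assumptions, not values from the paper.

import torch
import torch.nn.functional as F

def rbf_mmd2(x, y, sigma=1.0):
    # Biased estimate of squared Maximum Mean Discrepancy between two
    # embedding sets x, y of shape (n, dim) using a Gaussian (RBF) kernel.
    def kernel(a, b):
        return torch.exp(-torch.cdist(a, b).pow(2) / (2 * sigma ** 2))
    return kernel(x, x).mean() + kernel(y, y).mean() - 2 * kernel(x, y).mean()

def contrastive_alignment(x, y, temperature=0.07):
    # Symmetric InfoNCE-style loss: the i-th visual embedding should be most
    # similar to the i-th textual embedding of the same item.
    x, y = F.normalize(x, dim=-1), F.normalize(y, dim=-1)
    logits = x @ y.t() / temperature                      # (n, n) similarity matrix
    targets = torch.arange(x.size(0), device=x.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def global_alignment_loss(visual_emb, text_emb, w_mmd=0.1, w_cl=0.1):
    # Combined distribution-level regularizer; the weights are illustrative.
    return w_mmd * rbf_mmd2(visual_emb, text_emb) + w_cl * contrastive_alignment(visual_emb, text_emb)

In training, a term like this would be added to the main recommendation objective so that the two modalities stay aligned both pair-by-pair (the contrastive part) and at the distribution level (the MMD part).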