Refining Contrastive Learning and Homography Relations for Multi-Modal Recommendation
Shouxing Ma, Yawen Zeng, Shiqing Wu, Guandong Xu
2025-08-21
Summary
This paper introduces a new method called REARM that improves recommendations for items such as movies and products by using both images and textual descriptions of items. It builds on previous methods that used graph neural networks to model relationships between users and items, but which struggled when interaction data was limited.
What's the problem?
Existing recommendation systems that combine image and text information often perform poorly when interaction data is sparse. Current techniques based on contrastive learning and grouping similar items together still have issues: they either let noise contaminate the information shared across modalities (such as text and images) or discard the unique details each modality carries, and they fail to fully capture how a user's interests relate to which items are often liked together.
What's the solution?
REARM tackles these problems by refining how the model learns from different types of data. It uses two techniques, meta-networks and orthogonal constraints, to filter noise out of the shared information while preserving the important modality-specific details. It also builds a more complete picture of user preferences and item relationships by combining several types of graphs, including newly constructed ones that capture user interests and item co-occurrence.
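To give a flavor of the orthogonal-constraint idea described above, here is a minimal, illustrative sketch (not the paper's actual code, and the function names are assumptions): a penalty that is zero when a modal-shared feature vector and a modal-unique feature vector are orthogonal, so minimizing it pushes the two representations to carry distinct information.

```python
# Illustrative sketch of an orthogonality penalty between a modal-shared
# and a modal-unique feature vector. Names are hypothetical; the paper's
# actual constraint may be formulated differently.

def dot(u, v):
    # Inner product of two equal-length feature vectors.
    return sum(a * b for a, b in zip(u, v))

def orthogonality_penalty(shared, unique):
    """Squared cosine similarity between the shared and unique features.

    Equals 0 when the vectors are orthogonal (fully distinct information)
    and 1 when they are parallel (fully redundant). Adding this term to
    the training loss discourages the unique features from duplicating
    what the shared features already encode.
    """
    num = dot(shared, unique) ** 2
    den = dot(shared, shared) * dot(unique, unique)
    return num / den if den > 0 else 0.0
```

For example, `orthogonality_penalty([1.0, 0.0], [0.0, 1.0])` is 0.0 (no redundancy to penalize), while `orthogonality_penalty([1.0, 0.0], [2.0, 0.0])` is 1.0 (the unique vector adds nothing new).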
Why it matters?
By solving these issues, REARM helps create more accurate and personalized recommendations, especially in situations where data is scarce. This means users are more likely to find things they'll enjoy, leading to a better overall experience with recommendation platforms.
Abstract
Multi-modal recommender systems focus on utilizing rich modal information (i.e., images and textual descriptions) of items to improve recommendation performance. Current methods have achieved remarkable success thanks to the powerful structure-modeling capability of graph neural networks. However, these methods are often hindered by sparse data in real-world scenarios. Although contrastive learning and homography (i.e., homogeneous graphs) are employed to address the data sparsity challenge, existing methods still suffer from two main limitations: 1) Simple multi-modal feature contrasts fail to produce effective representations, causing noisy modal-shared features and loss of valuable information in modal-unique features; 2) The lack of exploration of the homograph relations between user interests and item co-occurrence results in incomplete mining of user-item interplay. To address the above limitations, we propose a novel framework for REfining multi-modAl contRastive learning and hoMography relations (REARM). Specifically, we complement multi-modal contrastive learning by employing meta-network and orthogonal constraint strategies, which filter out noise in modal-shared features and retain recommendation-relevant information in modal-unique features. To mine homogeneous relationships effectively, we integrate a newly constructed user interest graph and an item co-occurrence graph with the existing user co-occurrence and item semantic graphs for graph learning. Extensive experiments on three real-world datasets demonstrate the superiority of REARM over various state-of-the-art baselines. Our visualization further shows an improvement made by REARM in distinguishing between modal-shared and modal-unique features. Code is available here: https://github.com/MrShouxingMa/REARM.
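The abstract mentions constructing an item co-occurrence graph from user-item interactions. A minimal sketch of one common way to build such a graph from implicit feedback (the function name, input format, and count-based weighting are assumptions for illustration, not the paper's exact construction):

```python
from collections import defaultdict
from itertools import combinations

def item_cooccurrence(interactions):
    """Build item co-occurrence edge weights from implicit feedback.

    interactions: dict mapping each user id to the set of item ids
    that user interacted with. Returns a dict mapping an ordered item
    pair (i, j) with i < j to the number of users who interacted with
    both items, i.e., the weight of edge (i, j) in the co-occurrence graph.
    """
    counts = defaultdict(int)
    for items in interactions.values():
        # Every pair of items consumed by the same user co-occurs once.
        for i, j in combinations(sorted(items), 2):
            counts[(i, j)] += 1
    return dict(counts)
```

In practice such raw counts are typically sparsified (e.g., keeping only each item's top-k neighbors) before being used as a homogeneous graph for message passing.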