Image-text retrieval aims to align image regions with textual words for semantic matching, enabling bidirectional retrieval between images and texts. While significant progress has been made in modeling both coarse-grained image-sentence and fine-grained region-word relationships, fully capturing multi-granularity correspondences remains a challenge. Many existing methods depend predominantly on region-level segmentation or recognition, which tends to introduce noise, compromise semantic consistency, and increase computational complexity, ultimately limiting retrieval performance. To address these issues, we propose a Multi-constraint Relational Semantic Alignment (McRSA) method, which incorporates three complementary loss-based constraints to enhance multi-granularity alignment while preserving complete information. First, Posterior Probability Estimation (PPE) uses Bayesian analysis to model causal relationships between image-text feature pairs and labels, reducing intra-class variations for fine-grained alignment. Second, a Momentum-driven Centroid Update (MCU) mechanism mitigates oscillations and improves modal consistency in coarse-grained representations. Third, a dynamic Feature Scale Adaptation (FSA) module adjusts feature scales across modalities to alleviate granularity discrepancies and improve alignment robustness. Extensive experiments on five public datasets (Flickr30K, MS-COCO, RSTPReid, CUHK-PEDES, and ICFG-PEDES) demonstrate that McRSA achieves competitive retrieval performance compared to existing methods. Code and pre-trained models are available at https://github.com/xiaoyiseu/McRSA.
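The abstract does not state the MCU update rule. As a rough illustration only, assuming it resembles a standard exponential moving average over per-class centroids, the idea of damping oscillations could be sketched as below. The function name `update_centroids`, the `momentum` parameter, and the dictionary layout are hypothetical and are not taken from the McRSA code.

```python
from typing import Dict, Hashable, List, Sequence


def update_centroids(
    centroids: Dict[Hashable, List[float]],
    features: Sequence[Sequence[float]],
    labels: Sequence[Hashable],
    momentum: float = 0.9,
) -> Dict[Hashable, List[float]]:
    """Return new per-class centroids after one batch (hypothetical EMA sketch).

    Assumption: new = momentum * old + (1 - momentum) * batch_mean.
    This is a generic moving-average rule, not necessarily the paper's MCU.
    """
    # Accumulate per-class feature sums and counts for the current batch.
    sums: Dict[Hashable, List[float]] = {}
    counts: Dict[Hashable, int] = {}
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
            counts[label] = 0
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1

    # Copy the old centroids so the input dictionary is left untouched.
    updated = {label: list(vec) for label, vec in centroids.items()}
    for label, total in sums.items():
        batch_mean = [s / counts[label] for s in total]
        if label not in updated:
            # First time this class is seen: initialise from the batch mean.
            updated[label] = batch_mean
        else:
            # Blend old centroid and batch mean; high momentum damps jumps.
            updated[label] = [
                momentum * c + (1.0 - momentum) * m
                for c, m in zip(updated[label], batch_mean)
            ]
    return updated
```

Under this assumption, a larger `momentum` keeps each centroid closer to its previous value, so a single noisy batch shifts it only slightly.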