This paper was accepted for publication in IEEE Transactions on Circuits and Systems for Video Technology (https://ieeexplore.ieee.org/document/10311392/).
The code and configurations are currently being organized.
Recent advances in semantic correspondence have shown growing interest in Stable Diffusion (SD) and DINO features. However, existing methods underutilize the matching potential of SD and DINOv2 features and suffer from similar patterns of background interference; they lack both texture-to-semantic learning and intra- and inter-image feature interaction. This study proposes Tex2Sem, a framework that learns from textures to semantics, to address these two problems. For the first problem, we propose a texture-to-semantic learning paradigm that achieves texture-semantic trade-offs on both features and correlation maps: SD and DINOv2 features are aggregated from textures to semantics to produce multi-stage progressive fusion features. For the second problem, we propose MamFormer, a hybrid architecture of Mamba-2 and Transformer, to improve intra- and inter-image feature aggregation and interaction, together with a terminal-stage aggregation and interaction mechanism (TAIM) that enhances feature-learning efficiency.
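The abstract describes aggregating SD and DINOv2 features "from textures to semantics" into multi-stage progressive fusion features. As a rough intuition (a minimal NumPy sketch, not the authors' implementation — the function name, stage count, and linear mixing schedule are all hypothetical assumptions), one can picture per-stage fused features whose mixing weight shifts from texture-oriented SD features toward semantic-oriented DINOv2 features:

```python
import numpy as np

def progressive_fusion(sd_feat, dino_feat, num_stages=4):
    """Hypothetical sketch of multi-stage texture-to-semantic fusion.

    sd_feat, dino_feat: arrays of shape (H*W, C), assumed spatially aligned.
    Early stages weight SD (texture) features more heavily; later stages
    weight DINOv2 (semantic) features more heavily. The actual Tex2Sem
    aggregation is learned, not a fixed linear schedule like this one.
    """
    fused = []
    for s in range(num_stages):
        # alpha goes 0 -> 1 across stages: texture-heavy to semantic-heavy.
        alpha = s / (num_stages - 1)
        fused.append((1.0 - alpha) * sd_feat + alpha * dino_feat)
    return fused

# Toy usage: 16 spatial positions, 8 feature channels.
sd = np.random.rand(16, 8)
dino = np.random.rand(16, 8)
stages = progressive_fusion(sd, dino)  # list of 4 fused feature maps
```

This only illustrates the texture-to-semantic progression at a conceptual level; the paper's fusion is performed by the learned MamFormer modules, not a fixed interpolation.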
@ARTICLE{Wang2025Tex2Sem,
  author={Wang, Zenghui and Du, Songlin and Yan, Yaping and Xiao, Guobao and Lu, Xiaobo},
  journal={IEEE Transactions on Circuits and Systems for Video Technology},
  title={Tex2Sem: Learning from Textures to Semantics for Robust Semantic Correspondence},
  year={2025},
  volume={35},
  number={11},
  pages={10875--10890},
  doi={10.1109/TCSVT.2025.3576772}}


