News | Abstract | Model | Dataset | Statement
- [2024/5/17] We have open-sourced the code of PriorCLIP.
- [2024/5/16] Our paper, "PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval," is up on arXiv.
Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet it remains challenging in both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose PriorCLIP, a visual prior-guided vision-language model that leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE builds a belief matrix from instruction embeddings to filter key features and mitigate semantic bias, while Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy: large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs with vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.
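For context, the symmetric contrastive term underlying this loss is the standard CLIP-style objective over both retrieval directions; a minimal sketch (the cluster-based attribution constraint on top of it is described only qualitatively here, see the paper for its exact form):

$$
\mathcal{L}_{\mathrm{sym}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]
$$

where $s_{ij}$ is the cosine similarity between image $i$ and text $j$ and $\tau$ is a learnable temperature; the Attribution Loss augments this with cluster-level constraints on inter-class relations in the shared embedding space.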
This project is built on the open_clip environment; see open_clip for installation details.
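A minimal environment sketch, assuming the repo follows the open_clip layout (the requirements file name and Python version are assumptions; defer to the open_clip instructions if they differ):

```bash
# Create an isolated environment; the Python version is an assumption.
conda create -n priorclip python=3.10 -y
conda activate priorclip
# open_clip ships its training dependencies in requirements-training.txt;
# adjust the file name if this repo's layout differs.
pip install -r requirements-training.txt
```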
If you use the Affiliation Loss, add the `--is_aff_loss` flag; the label information is derived from each sample's `image_name` in the dataset (a sketch of the label derivation follows the command below). For example, PriorCLIP can be trained with the following command:
python -m training.main \
--save-frequency 1 \
--report-to tensorboard \
--train-data="path/to/webdataset/tar" \
--dataset-resampled \
--train-num-samples num_dataset \
--dataset-type webdataset \
--warmup 10000 \
--batch-size=512 \
--precision amp \
--lr=1e-5 \
--wd=0.5 \
--epochs=20 \
--workers=4 \
--model=PIR \
--is_aff_loss
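The `--is_aff_loss` flag relies on a scene-class label per sample. A minimal sketch of deriving such labels from `image_name`, assuming RSICD/RSITMD-style file names such as `airport_12.jpg` whose prefix encodes the scene class (the helper below is illustrative, not this repo's API):

```python
import os

def label_from_image_name(image_name: str) -> str:
    """Derive a scene-class label from an RSICD/RSITMD-style file name.

    Assumes names like 'airport_12.jpg', where the text before the last
    underscore is the scene class; adapt this if your naming differs.
    """
    stem = os.path.splitext(os.path.basename(image_name))[0]
    return stem.rsplit("_", 1)[0]

# Map string labels to integer class ids for a batch of samples.
names = ["airport_12.jpg", "airport_3.jpg", "beach_7.jpg"]
classes = sorted({label_from_image_name(n) for n in names})
class_to_id = {c: i for i, c in enumerate(classes)}
labels = [class_to_id[label_from_image_name(n)] for n in names]
print(labels)  # -> [0, 0, 1]: the two airport images share a class
```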
Alternatively, launch multi-GPU training with torchrun:
torchrun --nproc_per_node 2 \
--rdzv_endpoint=$HOST_NODE_ADDR \
-m training.main \
--save-frequency 1 \
...
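Here `$HOST_NODE_ADDR` is the `host:port` rendezvous endpoint of the rank-0 node, as in the torchrun documentation; for example:

```bash
# Set on every participating node; the address below is just an example.
export HOST_NODE_ADDR=192.168.1.1:29500
```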
Retrieval evaluation follows the CLIP Benchmark protocol; checkpoints can be downloaded here: Baidu Disk.
python retrieval.py \
--model-name "PIR" \
--retrieval-images-dir "path/to/images" \
--retrieval-json-dir "path/to/dataset.json" \
--remoteclip-path "./checkpoints/PriorCLIP_RET-3.pt"
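For reference, remote sensing retrieval benchmarks typically report R@1/R@5/R@10 in both directions plus their mean (mR). A minimal sketch of computing recall@K from a query-by-candidate similarity matrix, assuming one ground-truth match per query on the diagonal (real datasets pair each image with several captions; the function is illustrative, not this script's API):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth candidate (column i for
    query i) appears among the top-k scored candidates."""
    order = np.argsort(-sim, axis=1)  # candidates sorted by score, per query
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return float(np.mean(ranks < k))

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.6, 0.5, 0.1]])   # toy query x candidate cosine scores
print(recall_at_k(sim, 1))  # 2/3: the third query ranks its match last
```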
All experiments are based on the RSITMD and RSICD datasets and the RS5M pre-training dataset.
This project references and makes use of the following open-source models and datasets:
- OpenCLIP
- RemoteCLIP (in part)
If you find this work useful, please cite the following papers:
@inproceedings{pan2023prior,
  title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={611--620},
  year={2023}
}

@misc{pan2024pir,
  title={PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval},
  author={Pan, Jiancheng and Ma, Muyuan and Ma, Qing and Bai, Cong and Chen, Shengyong},
  year={2024},
  eprint={2405.10160},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

