News | Abstract | Model | Dataset | Statement
- [2024/5/17] We have open-sourced the code of PriorCLIP.
- [2024/5/16] Our paper, "PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval," is up on arXiv.
Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet it remains challenging in both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose PriorCLIP, a visual prior-guided vision-language model that leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE builds a belief matrix from instruction embeddings to filter key features and mitigate semantic bias, while Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy: large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs with vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on the RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.
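For context, the symmetric contrastive term underlying this loss is the standard CLIP-style objective over both retrieval directions; a minimal sketch (the cluster-based attribution constraint on top of it is described only qualitatively here, see the paper for its exact form):

$$
\mathcal{L}_{\mathrm{sym}} = -\frac{1}{2N}\sum_{i=1}^{N}\left[\log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ij}/\tau)} + \log\frac{\exp(s_{ii}/\tau)}{\sum_{j=1}^{N}\exp(s_{ji}/\tau)}\right]
$$

where $s_{ij}$ is the cosine similarity between image $i$ and text $j$ and $\tau$ is a learnable temperature; the Attribution Loss augments this with cluster-level constraints on inter-class relations in the shared embedding space.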
This project is built on the open_clip environment; see open_clip for installation details.
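A minimal environment sketch, assuming the repo follows the open_clip layout (the requirements file name and Python version are assumptions; defer to the open_clip instructions if they differ):

```bash
# Create an isolated environment; the Python version is an assumption.
conda create -n priorclip python=3.10 -y
conda activate priorclip
# open_clip ships its training dependencies in requirements-training.txt;
# adjust the file name if this repo's layout differs.
pip install -r requirements-training.txt
```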
If you use the Affiliation Loss, add the `--is_aff_loss` flag; the label information is derived from each sample's `image_name` in the dataset (a sketch of the label derivation follows the command below). For example, PriorCLIP can be trained with the following command:
python -m training.main \
--save-frequency 1 \
--report-to tensorboard \
--train-data="path/to/webdataset/tar" \
--dataset-resampled \
--train-num-samples num_dataset \
--dataset-type webdataset \
--warmup 10000 \
--batch-size=512 \
--precision amp \
--lr=1e-5 \
--wd=0.5 \
--epochs=20 \
--workers=4 \
--model=PIR \
--is_aff_loss
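The `--is_aff_loss` flag relies on a scene-class label per sample. A minimal sketch of deriving such labels from `image_name`, assuming RSICD/RSITMD-style file names such as `airport_12.jpg` whose prefix encodes the scene class (the helper below is illustrative, not this repo's API):

```python
import os

def label_from_image_name(image_name: str) -> str:
    """Derive a scene-class label from an RSICD/RSITMD-style file name.

    Assumes names like 'airport_12.jpg', where the text before the last
    underscore is the scene class; adapt this if your naming differs.
    """
    stem = os.path.splitext(os.path.basename(image_name))[0]
    return stem.rsplit("_", 1)[0]

# Map string labels to integer class ids for a batch of samples.
names = ["airport_12.jpg", "airport_3.jpg", "beach_7.jpg"]
classes = sorted({label_from_image_name(n) for n in names})
class_to_id = {c: i for i, c in enumerate(classes)}
labels = [class_to_id[label_from_image_name(n)] for n in names]
print(labels)  # -> [0, 0, 1]: the two airport images share a class
```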
Alternatively, launch multi-GPU training with torchrun:
torchrun --nproc_per_node 2 \
--rdzv_endpoint=$HOST_NODE_ADDR \
-m training.main \
--save-frequency 1 \
...
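Here `$HOST_NODE_ADDR` is the `host:port` rendezvous endpoint of the rank-0 node, as in the torchrun documentation; for example:

```bash
# Set on every participating node; the address below is just an example.
export HOST_NODE_ADDR=192.168.1.1:29500
```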
Retrieval evaluation follows the CLIP Benchmark protocol; checkpoints can be downloaded here: Baidu Disk.
python retrieval.py \
--model-name "PIR" \
--retrieval-images-dir "path/to/images" \
--retrieval-json-dir "path/to/dataset.json" \
--remoteclip-path "./checkpoints/PriorCLIP_RET-3.pt"
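For reference, remote sensing retrieval benchmarks typically report R@1/R@5/R@10 in both directions plus their mean (mR). A minimal sketch of computing recall@K from a query-by-candidate similarity matrix, assuming one ground-truth match per query on the diagonal (real datasets pair each image with several captions; the function is illustrative, not this script's API):

```python
import numpy as np

def recall_at_k(sim: np.ndarray, k: int) -> float:
    """Fraction of queries whose ground-truth candidate (column i for
    query i) appears among the top-k scored candidates."""
    order = np.argsort(-sim, axis=1)  # candidates sorted by score, per query
    ranks = np.argmax(order == np.arange(len(sim))[:, None], axis=1)
    return float(np.mean(ranks < k))

sim = np.array([[0.9, 0.1, 0.3],
                [0.2, 0.8, 0.4],
                [0.6, 0.5, 0.1]])   # toy query x candidate cosine scores
print(recall_at_k(sim, 1))  # 2/3: the third query ranks its match last
```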
All experiments are based on the RSITMD and RSICD datasets and the RS5M pre-training dataset.
This project references and makes use of the following open-source models and datasets:
- OpenCLIP
- RemoteCLIP (in part)
If you find this work useful, please cite the following papers:
@inproceedings{pan2023prior,
  title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={611--620},
  year={2023}
}

@misc{pan2024pir,
  title={PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval},
  author={Pan, Jiancheng and Ma, Muyuan and Ma, Qing and Bai, Cong and Chen, Shengyong},
  year={2024},
  eprint={2405.10160},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}

