PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval

Jiancheng Pan,     Muyuan Ma,     Qing Ma,     Cong Bai✉,     Shengyong Chen

* Corresponding Author ✉

News | Abstract | Model | Dataset | Statement

News

  • [2024/5/17] We have open-sourced the code of PriorCLIP.
  • [2024/5/16] Our paper, "PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval," is up on arXiv.

Abstract

Remote sensing image-text retrieval plays a crucial role in remote sensing interpretation, yet remains challenging under both closed-domain and open-domain scenarios due to semantic noise and domain shifts. To address these issues, we propose a visual prior-guided vision-language model, PriorCLIP, which leverages visual priors for unbiased representation learning and adaptive vision-language alignment. In the closed-domain setting, PriorCLIP introduces two Progressive Attention Encoder (PAE) structures: Spatial-PAE constructs a belief matrix with instruction embeddings to filter key features and mitigate semantic bias. At the same time, Temporal-PAE exploits cyclic activation across time steps to enhance text representation. For the open-domain setting, we design a two-stage prior representation learning strategy, consisting of large-scale pre-training on coarse-grained image-text pairs, followed by fine-tuning on fine-grained pairs using vision-instruction, which enables robust retrieval across long-tail concepts and vocabulary shifts. Furthermore, a cluster-based symmetric contrastive Attribution Loss is proposed to constrain inter-class relations and alleviate semantic confusion in the shared embedding space. Extensive experiments on RSICD and RSITMD benchmarks demonstrate that PriorCLIP achieves substantial improvements, outperforming existing methods by 4.9% and 4.0% in closed-domain retrieval, and by 7.3% and 9.4% in open-domain retrieval, respectively.

Figure: Overview of the PriorCLIP pipeline.

Model

Environments

The environment is based on open_clip; for setup details, see the open_clip repository: open_clip.
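As a rough starting point, an environment can be created along the following lines. This is a minimal sketch assuming a standard open_clip-style Python setup; the environment name and package list are illustrative and are not pinned by this repository.

conda create -n priorclip python=3.10 -y
conda activate priorclip
# Install PyTorch first, choosing the build that matches your CUDA version
pip install torch torchvision
# open_clip plus packages used by training.main (webdataset loading, TensorBoard logging)
pip install open_clip_torch webdataset tensorboard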

Train

To train with the Affiliation loss, add the --is_aff_loss flag; the label information is derived from each sample's image_name in the dataset. For example, PriorCLIP can be trained with the following command:

python -m training.main \
    --save-frequency 1 \
    --report-to tensorboard \
    --train-data="path/to/webdataset/tar" \
    --dataset-resampled \
    --train-num-samples num_dataset \
    --dataset-type webdataset \
    --warmup 10000 \
    --batch-size=512 \
    --precision amp \
    --lr=1e-5 \
    --wd=0.5 \
    --epochs=20 \
    --workers=4 \
    --model=PIR \
    --is_aff_loss

or run parallel (multi-GPU) training as:

torchrun --nproc_per_node 2 \
    --rdzv_endpoint=$HOST_NODE_ADDR \
    -m training.main \
    --save-frequency 1 \
    ...

Retrieval

Retrieval evaluation follows the CLIP Benchmark protocol; checkpoints can be downloaded from here: Baidu Disk.

python retrieval.py \
    --model-name "PIR" \
    --retrieval-images-dir "path/to/images" \
    --retrieval-json-dir "path/to/dataset.json" \
    --remoteclip-path "./checkpoints/PriorCLIP_RET-3.pt"

Dataset

All experiments are based on the RSITMD and RSICD datasets, together with the RS5M pre-training dataset.
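For reference, the data might be organized as shown below. The directory and file names are assumptions for illustration only and are not enforced by the code; adjust the paths passed to --train-data, --retrieval-images-dir, and --retrieval-json-dir accordingly.

data/
├── RS5M/                      # webdataset .tar shards for pre-training (--train-data)
├── RSITMD/
│   ├── images/                # --retrieval-images-dir
│   └── dataset_rsitmd.json    # --retrieval-json-dir
└── RSICD/
    ├── images/
    └── dataset_rsicd.json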

Statement

Acknowledgement

This project references and utilizes the following open-source models and datasets.

Related Open Source Models

Related Open Source Datasets

Citation

If you find our work helpful, please cite the following papers.

@inproceedings{pan2023prior,
  title={A Prior Instruction Representation Framework for Remote Sensing Image-text Retrieval},
  author={Pan, Jiancheng and Ma, Qing and Bai, Cong},
  booktitle={Proceedings of the 31st ACM International Conference on Multimedia},
  pages={611--620},
  year={2023}
}

@misc{pan2024pir,
  title={PriorCLIP: Visual Prior Guided Vision-Language Model for Remote Sensing Image-Text Retrieval},
  author={Jiancheng Pan and Muyuan Ma and Qing Ma and Cong Bai and Shengyong Chen},
  year={2024},
  eprint={2405.10160},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
