In your paper, it says
At first, we encode TB into an embedded vector vB either by many-hot encoding or using a pre-trained NLP model such as BERT, FastText, or Word2Vec
Can you tell me exactly which text encoding model you are using in your released code? Could you release the encoding model for custom images?