A word embedding library based on the continuous bag-of-words (CBOW) model from word2vec.
The Word2Vec class generates word embeddings from the provided corpus using unsupervised learning.
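As a rough illustration of what a continuous bag-of-words model trains on, the sketch below pairs each target word with the words that surround it inside a fixed window. It is not this library's code: `contextWindows` and its parameters are hypothetical names used only for this example.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of the library): pairs each target word with
// the words that lie within `windowSize` positions on either side of it.
std::vector<std::pair<std::string, std::vector<std::string>>>
contextWindows(const std::vector<std::string>& corpus, std::size_t windowSize) {
    std::vector<std::pair<std::string, std::vector<std::string>>> examples;
    for (std::size_t i = 0; i < corpus.size(); ++i) {
        // Clamp the window to the corpus bounds.
        std::size_t begin = (i >= windowSize) ? i - windowSize : 0;
        std::size_t end = std::min(corpus.size(), i + windowSize + 1);

        std::vector<std::string> context;
        for (std::size_t j = begin; j < end; ++j) {
            if (j != i) context.push_back(corpus[j]);  // skip the target itself
        }
        examples.emplace_back(corpus[i], std::move(context));
    }
    return examples;
}
```

In CBOW the context words are the model input and the target word is the prediction. Words near the start or end of the corpus simply get a shorter context here; a real implementation may handle the edges differently.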

    /* Create Word2Vec Instance */
    std::vector<std::string> corpus = { "a", "very", "long", "list", "of", "words", "for", "training" };
    std::size_t contextWindowSize = 2;
    std::size_t negativeSampleCount = 4;
    std::size_t embedDimensions = 20;
    Word2Vec myWord2Vec(
        corpus,
        contextWindowSize,
        negativeSampleCount,
        embedDimensions
    );

    /* Train One Epoch */
    myWord2Vec.trainStochasticEpoch(learningRate);

    /* Post Process Embeddings */
    myWord2Vec.postProcess();

    /* View Embedding Vectors */
    std::vector<float> kingEmbedding = myWord2Vec.getEmbedding("king");

    /* Find Similar By Embedding */
    std::vector<std::string> nMostSimilarToEmbedding = myWord2Vec.findSimilarToEmbedding(embedding, n);

    /* Find Similar Words */
    std::string word = "cat";
    std::vector<std::string> nMostSimilarToWord = myWord2Vec.findSimilarToWord(word, n);

    /* Find Similar Words To Composition */
    std::vector<std::pair<std::string, float>> composition = {
        { "king", 1.0 },
        { "woman", 1.0 },
        { "man", -1.0 }
    };
    std::vector<std::string> nMostSimilarToComposition = myWord2Vec.findSimilarToLinearComposition(composition, n);

    /* Save Model Parameters */
    myWord2Vec.save("path/to/backup");

    /* Load Model Parameters */
    myWord2Vec.load("path/to/backup");

The example model was trained on the full Text8 corpus, an open-source 2006 Wikipedia snapshot.
The model uses 150-dimensional embeddings, a ±4-word context window, and a negative sample count of 10. The learning rate is fixed at 0.02. The table lists the three most similar words returned for each test query at each listed epoch.

| Epoch | Similarity Tests | Composition Tests |
|---|---|---|
| 0 | dog → trepp, kunsthistorisches, cornercopia<br>police → eigenji, arterious, ballymena<br>red → nomeansno, aerospacelegacyfoundation, volli<br>tree → altsasu, mayottensis, saccharalis<br>house → plyoffs, chfp, gpj | water + frozen → auditore, utukki, outputwait<br>king + woman - man → lucy, queynte, urkizu<br>plant + tall + wood → pocomoke, drawling, citg<br>nature - inside → sandoy, shaffers, kinchiltun<br>paris + italy - france → amereon, shangugu, bivalve |
| 25 | dog → cat, dogs, baby<br>police → military, officer, officers<br>red → blue, yellow, black<br>tree → trees, flowers, garden<br>house → palace, court, home | water + frozen → dry, wet, snow<br>king + woman - man → queen, prince, wife<br>plant + tall + wood → fish, sand, water<br>nature - inside → temperament, mastery, grotesqueries<br>paris + italy - france → milan, london, venice |
| 50 | dog → cat, bird, horse<br>police → military, officers, guards<br>red → blue, yellow, green<br>tree → trees, fish, leaf<br>house → palace, court, castle | water + frozen → salt, dry, fish<br>king + woman - man → queen, prince, princess<br>plant + tall + wood → stone, fish, water<br>nature - inside → morality, zoology, temperament<br>paris + italy - france → venice, milan, berlin |
| 100 | dog → dogs, cat, horse<br>police → military, officers, officer<br>red → blue, yellow, green<br>tree → trees, flowers, garden<br>house → hall, room, houses | water + frozen → dry, fish, ice<br>king + woman - man → queen, princess, prince<br>plant + tall + wood → fish, plants, stone<br>nature - inside → morality, altruism, paradoxologia<br>paris + italy - france → milan, venice, rome |
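
The composition tests above rank vocabulary words against a weighted sum of embeddings. The sketch below shows one common way to do that, using cosine similarity over a small in-memory table. It is an illustration under stated assumptions, not the library's implementation: `cosineSimilarity`, `nearestToComposition`, and the `table` parameter are hypothetical, and skipping the query words in the result is an assumption about the desired behavior.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <utility>
#include <vector>

using Embedding = std::vector<float>;

// Cosine similarity of two equal-length vectors; returns 0 if either is all zeros.
float cosineSimilarity(const Embedding& a, const Embedding& b) {
    float dot = 0.0f, normA = 0.0f, normB = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    if (normA == 0.0f || normB == 0.0f) return 0.0f;
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}

// Hypothetical stand-in for a composition query: sums the weighted embeddings,
// then returns up to n words closest to that sum, skipping the query words
// themselves (an assumption made for this sketch).
std::vector<std::string> nearestToComposition(
    const std::map<std::string, Embedding>& table,
    const std::vector<std::pair<std::string, float>>& composition,
    std::size_t n) {
    if (table.empty()) return {};

    Embedding target(table.begin()->second.size(), 0.0f);
    for (const auto& term : composition) {
        const Embedding& e = table.at(term.first);  // throws if the word is unknown
        for (std::size_t i = 0; i < target.size(); ++i) target[i] += term.second * e[i];
    }

    std::vector<std::pair<float, std::string>> scored;
    for (const auto& entry : table) {
        const std::string& candidate = entry.first;
        bool inQuery = std::any_of(
            composition.begin(), composition.end(),
            [&candidate](const std::pair<std::string, float>& term) {
                return term.first == candidate;
            });
        if (!inQuery) scored.emplace_back(cosineSimilarity(target, entry.second), candidate);
    }

    // Highest similarity first.
    std::sort(scored.begin(), scored.end(),
              [](const std::pair<float, std::string>& x,
                 const std::pair<float, std::string>& y) { return x.first > y.first; });

    std::vector<std::string> result;
    for (std::size_t i = 0; i < scored.size() && i < n; ++i) result.push_back(scored[i].second);
    return result;
}
```

A query such as `{ {"king", 1.0f}, {"woman", 1.0f}, {"man", -1.0f} }` can be passed as `composition`, mirroring the `king + woman - man` test. A linear scan over the whole vocabulary is simple but slow for large tables; whether the library does something faster is not stated here.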