Word2Vec

A word embedding library based on the continuous bag of words (CBOW) model from word2vec.

The Word2Vec class generates word embeddings from the provided corpus using unsupervised learning.

/* Create Word2Vec Instance */

std::vector<std::string> corpus = { "a", "very", "long", "list", "of", "words", "for", "training" };
std::size_t contextWindowSize = 2;
std::size_t negativeSampleCount = 4;
std::size_t embedDimensions = 20;

Word2Vec myWord2Vec(
    corpus,
    contextWindowSize,
    negativeSampleCount,
    embedDimensions
);
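The corpus passed to the constructor is just a flat, ordered list of tokens; how it is produced is up to the caller. As a rough sketch (not part of the library), a whitespace-delimited, already-normalized text file such as Text8 could be read like this:

#include <fstream>
#include <string>
#include <vector>

std::vector<std::string> loadCorpus(const std::string& path)
{
    std::ifstream file(path);
    std::vector<std::string> tokens;
    std::string token;

    // operator>> splits on whitespace, which is enough for a pre-normalized
    // corpus such as Text8 (lowercase, punctuation already stripped)
    while (file >> token) tokens.push_back(token);

    return tokens;
}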
/* Train One Epoch */

float learningRate = 0.02;

myWord2Vec.trainStochasticEpoch(learningRate);
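Each call performs a single stochastic pass over the corpus, so a full training run is just a loop. A minimal sketch (the epoch count here is illustrative):

std::size_t epochCount = 100;

for (std::size_t epoch = 0; epoch < epochCount; epoch++)
{
    // one stochastic pass over the entire training corpus
    myWord2Vec.trainStochasticEpoch(learningRate);
}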
/* Post Process Embeddings */

myWord2Vec.postProcess();
/* View Embedding Vectors */

std::vector<float> kingEmbedding = myWord2Vec.getEmbedding("king");
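The returned vector has embedDimensions components. As an illustration of working with the raw vectors (this helper is not part of the library), the cosine similarity between two embeddings can be computed directly:

#include <cmath>
#include <numeric>

float cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b)
{
    // dot(a, b) / (|a| * |b|)
    float dot = std::inner_product(a.begin(), a.end(), b.begin(), 0.0f);
    float normA = std::sqrt(std::inner_product(a.begin(), a.end(), a.begin(), 0.0f));
    float normB = std::sqrt(std::inner_product(b.begin(), b.end(), b.begin(), 0.0f));

    return dot / (normA * normB);
}

float kingQueenSimilarity = cosineSimilarity(kingEmbedding, myWord2Vec.getEmbedding("queen"));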
/* Find Similar By Embedding */

std::size_t n = 10;

std::vector<std::string> nMostSimilarToEmbedding = myWord2Vec.findSimilarToEmbedding(kingEmbedding, n);
/* Find Similar Words */

std::string word = "cat";

std::vector<std::string> nMostSimilarToWord = myWord2Vec.findSimilarToWord(word, n);
/* Find Similar Words To Composition */

std::vector<std::pair<std::string, float>> composition = {
    { "king", 1.0 },
    { "woman", 1.0 },
    { "man", -1.0 }
};

std::vector<std::string> nMostSimilarToComposition = myWord2Vec.findSimilarToLinearComposition(composition, n);
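Each pair is a word and its weight, so the composition above encodes the classic king + woman - man analogy. Assuming the linear composition is simply the weighted sum of the individual embeddings, an equivalent query can be sketched with the lower-level calls:

std::vector<float> compositionEmbedding(embedDimensions, 0.0f);

for (const auto& [word, weight] : composition)
{
    std::vector<float> wordEmbedding = myWord2Vec.getEmbedding(word);

    // accumulate weight * embedding(word) component-wise
    for (std::size_t i = 0; i < embedDimensions; i++)
    {
        compositionEmbedding[i] += weight * wordEmbedding[i];
    }
}

std::vector<std::string> similarToComposition = myWord2Vec.findSimilarToEmbedding(compositionEmbedding, n);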
/* Save Model Parameters */

myWord2Vec.save("path/to/backup");
/* Load Model Parameters */

myWord2Vec.load("path/to/backup");
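save and load make it straightforward to checkpoint a long training run. A minimal sketch that persists the parameters after every epoch (the path and epoch count are illustrative):

for (std::size_t epoch = 0; epoch < epochCount; epoch++)
{
    myWord2Vec.trainStochasticEpoch(learningRate);

    // overwrite the backup so the run can be resumed later with load()
    myWord2Vec.save("path/to/backup");
}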

Text8 Word Embeddings Example

Training Data

The example model was trained on the full Text8 corpus, an open-source snapshot of Wikipedia from 2006.

Model Parameters

The model uses 150-dimensional embeddings, a ±4-word context window, and a negative sample count of 10. The learning rate is fixed at 0.02.
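Put together, this configuration corresponds to roughly the following setup. The corpus path and loader are illustrative (see the corpus-loading sketch above); the numeric parameters come from the description above, and the 100 epochs mirror the progression table below:

std::vector<std::string> text8Corpus = loadCorpus("path/to/text8");

Word2Vec text8Model(
    text8Corpus,
    4,    // ±4 word context window
    10,   // negative sample count
    150   // embedding dimensions
);

float learningRate = 0.02;

for (std::size_t epoch = 0; epoch < 100; epoch++)
{
    text8Model.trainStochasticEpoch(learningRate);
}

text8Model.postProcess();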

Training Progression

Epoch 0

Similarity tests
dog → trepp, kunsthistorisches, cornercopia
police → eigenji, arterious, ballymena
red → nomeansno, aerospacelegacyfoundation, volli
tree → altsasu, mayottensis, saccharalis
house → plyoffs, chfp, gpj

Composition tests
water + frozen → auditore, utukki, outputwait
king + woman - man → lucy, queynte, urkizu
plant + tall + wood → pocomoke, drawling, citg
nature - inside → sandoy, shaffers, kinchiltun
paris + italy - france → amereon, shangugu, bivalve

Epoch 25

Similarity tests
dog → cat, dogs, baby
police → military, officer, officers
red → blue, yellow, black
tree → trees, flowers, garden
house → palace, court, home

Composition tests
water + frozen → dry, wet, snow
king + woman - man → queen, prince, wife
plant + tall + wood → fish, sand, water
nature - inside → temperament, mastery, grotesqueries
paris + italy - france → milan, london, venice

Epoch 50

Similarity tests
dog → cat, bird, horse
police → military, officers, guards
red → blue, yellow, green
tree → trees, fish, leaf
house → palace, court, castle

Composition tests
water + frozen → salt, dry, fish
king + woman - man → queen, prince, princess
plant + tall + wood → stone, fish, water
nature - inside → morality, zoology, temperament
paris + italy - france → venice, milan, berlin

Epoch 100

Similarity tests
dog → dogs, cat, horse
police → military, officers, officer
red → blue, yellow, green
tree → trees, flowers, garden
house → hall, room, houses

Composition tests
water + frozen → dry, fish, ice
king + woman - man → queen, princess, prince
plant + tall + wood → fish, plants, stone
nature - inside → morality, altruism, paradoxologia
paris + italy - france → milan, venice, rome
