Implement the core training loop of word2vec in pure NumPy (no PyTorch / TensorFlow or other ML frameworks). The applicant is free to choose any suitable text dataset. The task is to implement the optimization procedure (forward pass, loss, gradients, and parameter updates) for a standard word2vec variant (e.g. skip-gram with negative sampling or CBOW).
The project implements the Skip-gram model with Negative Sampling, following the approach described in the original Word2Vec paper.
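For one (center, context) pair with k negative samples, the forward pass, loss, gradients, and SGD update fit in a few lines of NumPy. The sketch below is illustrative rather than the project's actual code; the names (sgns_step, W_in, W_out) are made up for the example:

```python
import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    """One skip-gram negative-sampling update for a single (center, context) pair.

    W_in  -- (vocab_size, dim) input/center embedding matrix
    W_out -- (vocab_size, dim) output/context embedding matrix
    center, context -- vocabulary indices of the center and context words
    negatives -- array of k sampled negative word indices
    """
    v = W_in[center]                                   # center vector, shape (dim,)
    targets = np.concatenate(([context], negatives))   # true context first, then negatives
    labels = np.zeros(len(targets))
    labels[0] = 1.0
    U = W_out[targets]                                 # (k + 1, dim) output vectors

    # Forward pass: sigmoid of the dot products
    scores = U @ v
    probs = 1.0 / (1.0 + np.exp(-scores))

    # Negative-sampling loss: -log sigma(u_o . v) - sum log sigma(-u_n . v)
    loss = -np.log(probs[0] + 1e-10) - np.sum(np.log(1.0 - probs[1:] + 1e-10))

    # Gradients of the loss w.r.t. the center and output vectors
    g = probs - labels                                 # (k + 1,)
    grad_v = g @ U
    grad_U = np.outer(g, v)

    # Plain SGD parameter updates
    W_in[center] -= lr * grad_v
    np.subtract.at(W_out, targets, lr * grad_U)        # safe even with duplicate indices
    return loss
```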
Well, it all started with some notebook prototyping, and after formulating a plan I began building the target project. Along the way there was, of course, a lot of refactoring, which made me start over or rewrite quite a bit. The vast majority of these changes came from the fact that I was learning most of the concepts on the fly, which naturally led to frequent redesigns. Unsurprisingly (though still not exactly welcome), it also cost me a few nights with less sleep than I would like.
I wanted the code to be fairly modular so it can be extended later, for example with CBOW. That is not entirely done yet, but the foundation is there, and I plan to refactor the code around it in the future. Most importantly (and quite obviously), the whole process taught me a great deal.
My key personal goal was to make the king + woman - man = queen analogy from the original paper work, and it does, so I count the project as a success.
To sum up, there is certainly room for improvement, but the deadline is impending and I have already fallen behind on my studies, so they need some of my love too. Nevertheless, I do plan to come back to this project at a later date.
Let's find words similar to war
python word2vec.py infer model/embeddings_combined.npy model/vocab.json --similar war
outbreak 0.6947
civil 0.6346
hostilities 0.5884
world 0.5811
mithridatic 0.5791
ii 0.5743
boer 0.5714
crimean 0.5574
wars 0.5555
bregalnica 0.5513
And the famous equation: king + woman - man
python word2vec.py infer model/embeddings_combined.npy model/vocab.json --analogy king woman man
queen 0.6737
regnant 0.6505
consort 0.6236
eurypontid 0.5918
princess 0.5857
suddhodana 0.5820
hedvig 0.5773
athelstan 0.5737
rafohy 0.5712
peuta 0.5707
Python 3.13 is required.
Of course, start by creating a virtual environment, for instance using uv or python -m venv.
The model files are huge, so git lfs is necessary :)
You can easily download the dataset using the provided downloader script: dataset_downloader.py
Before training, the raw text must be converted into a binary format of vocabulary indices; a conceptual sketch of this step follows the command below.
python word2vec.py preprocess data/wiki.train.tokens \
--data-dir output \
--val-input data/wiki.valid.tokens \
--min-count 5
Outputs: corpus.bin, offsets.npy, and vocab.json inside the specified directory.
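Conceptually, the preprocessing step builds a vocabulary from word counts (dropping words rarer than --min-count) and re-encodes the corpus as an array of integer indices. The sketch below is a simplified illustration under assumed file layouts (it skips the offsets.npy and validation handling, and the real binary format and vocab.json structure may differ):

```python
import json
from collections import Counter
import numpy as np

def preprocess(tokens, min_count=5):
    """Build a word -> index vocabulary and encode the corpus as indices."""
    counts = Counter(tokens)
    kept = [w for w, c in counts.most_common() if c >= min_count]
    vocab = {w: i for i, w in enumerate(kept)}
    corpus = np.array([vocab[w] for w in tokens if w in vocab], dtype=np.int32)
    return vocab, corpus

with open("data/wiki.train.tokens", encoding="utf-8") as f:
    tokens = f.read().split()

vocab, corpus = preprocess(tokens)
corpus.tofile("output/corpus.bin")          # binary index file (dtype here is an assumption)
with open("output/vocab.json", "w", encoding="utf-8") as f:
    json.dump(vocab, f)
```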
Train a single model using the preprocessed data directory.
python word2vec.py train \
--data-dir output \
--embeddings output/my_embeddings \
--dim 100 \
--lr 0.025 \
--window-size 5 \
--epochs 3
Add --val if you generated validation offsets during the preprocessing step.
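For context, --window-size controls how many neighbours on each side of a word are paired with it as positive (center, context) examples. A simplified sketch of that pairing, using the randomly shrunk window from the original word2vec (the project's actual sampling details may differ):

```python
import numpy as np

def skipgram_pairs(corpus, window_size=5, seed=0):
    """Yield (center, context) index pairs from an index-encoded corpus."""
    rng = np.random.default_rng(seed)
    for pos, center in enumerate(corpus):
        w = int(rng.integers(1, window_size + 1))        # effective window in [1, window_size]
        lo, hi = max(0, pos - w), min(len(corpus), pos + w + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                yield int(center), int(corpus[ctx_pos])
```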
Once training is finished, you can load the generated .npy embedding matrices to query similar words or solve analogies.
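Both query modes boil down to cosine similarity over the loaded embedding matrix. A minimal sketch of the similarity query, assuming vocab.json maps each word to its row index (the project's actual loading code may differ):

```python
import json
import numpy as np

emb = np.load("output/my_embeddings_center.npy")         # (vocab_size, dim)
with open("output/vocab.json", encoding="utf-8") as f:
    word_to_idx = json.load(f)                           # assumed layout: word -> row index
idx_to_word = {i: w for w, i in word_to_idx.items()}

# L2-normalise rows once so that dot products are cosine similarities
normed = emb / np.linalg.norm(emb, axis=1, keepdims=True)

def most_similar(word, top_k=10):
    sims = normed @ normed[word_to_idx[word]]
    order = np.argsort(-sims)
    return [(idx_to_word[i], float(sims[i])) for i in order if i != word_to_idx[word]][:top_k]
```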
Find Similar Words:
python word2vec.py infer \
output/my_embeddings_center.npy \
output/vocab.json \
--similar "war" \
--top-k 10
Solve Analogies (A + B - C = D): e.g. king + woman - man = queen
python word2vec.py infer \
output/my_embeddings_center.npy \
output/vocab.json \
--analogy king woman man
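The analogy query follows the same idea: form the vector king + woman - man and rank every word by cosine similarity to it, excluding the three query words. A sketch reusing normed, word_to_idx, and idx_to_word from the similarity example above:

```python
def analogy(a, b, c, top_k=10):
    """Words closest to vec(a) + vec(b) - vec(c), e.g. analogy("king", "woman", "man")."""
    target = normed[word_to_idx[a]] + normed[word_to_idx[b]] - normed[word_to_idx[c]]
    target /= np.linalg.norm(target)
    sims = normed @ target
    exclude = {word_to_idx[a], word_to_idx[b], word_to_idx[c]}
    order = np.argsort(-sims)
    return [(idx_to_word[i], float(sims[i])) for i in order if i not in exclude][:top_k]
```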