Vyien/word2vec

Task:

Implement the core training loop of word2vec in pure NumPy (no PyTorch / TensorFlow or other ML frameworks). The applicant is free to choose any suitable text dataset. The task is to implement the optimization procedure (forward pass, loss, gradients, and parameter updates) for a standard word2vec variant (e.g. skip-gram with negative sampling or CBOW).

Project

The project implements the Skip-gram model with Negative Sampling, following the approach described in the original Word2Vec paper.
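
To give an idea of what the core update looks like, below is a minimal sketch of a single skip-gram negative-sampling step in pure NumPy. The function and variable names are illustrative and are not the ones used in word2vec.py.

import numpy as np

def sgns_step(W_in, W_out, center, context, negatives, lr=0.025):
    # W_in: (V, D) center-word vectors, W_out: (V, D) context-word vectors
    # center, context: integer word indices; negatives: array of K sampled indices
    v_c = W_in[center]                                 # (D,)
    idx = np.concatenate(([context], negatives))       # positive first, then negatives
    labels = np.zeros(len(idx))
    labels[0] = 1.0                                    # 1 for the true context, 0 for negatives
    u = W_out[idx]                                     # (K+1, D)

    scores = u @ v_c                                   # (K+1,)
    probs = 1.0 / (1.0 + np.exp(-scores))              # sigmoid
    loss = -np.log(probs[0] + 1e-10) - np.sum(np.log(1.0 - probs[1:] + 1e-10))

    g = probs - labels                                 # gradient of the loss w.r.t. the scores
    grad_v = g @ u                                     # (D,) gradient for the center vector
    grad_u = np.outer(g, v_c)                          # (K+1, D) gradients for the output vectors

    W_in[center] -= lr * grad_v                        # plain SGD updates
    np.add.at(W_out, idx, -lr * grad_u)                # add.at handles repeated indices safely
    return loss

Training then just sweeps this update over every (center, context) pair produced by the sliding window for the configured number of epochs.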

Dev logs

Well, it all started with some notebook prototyping, and after formulating a plan I started building the actual project. Along the way there was, of course, a lot of refactoring, which made me start over or rewrite large parts more than once. The vast majority of these changes came from the fact that I was learning most of the concepts on the fly, which naturally led to frequent redesigns. Unsurprisingly (though still not exactly welcome), it also cost me a few nights with less sleep than I would like.

I wanted the code to be fairly modular to allow future extensions, such as adding CBOW at a later date. While that is not entirely done, the foundation is there, and I plan to refactor towards it in the future. Most importantly (and rather obviously), the whole process taught me a significant amount.

My key personal goal was to make the king + woman - man = queen analogy from the original paper work, and it does, so I count the project as a success.

To sum up, there is certainly room for improvement, but the deadline is impending and I have already fallen behind on my studies, so they need some of my love too. Nevertheless, I do plan to come back to this project at a later date.

A few examples of my model:

Let's find words similar to war

python word2vec.py infer model/embeddings_combined.npy model/vocab.json --similar war
  outbreak             0.6947
  civil                0.6346
  hostilities          0.5884
  world                0.5811
  mithridatic          0.5791
  ii                   0.5743
  boer                 0.5714
  crimean              0.5574
  wars                 0.5555
  bregalnica           0.5513

And the famous analogy king + woman - man

python word2vec.py infer model/embeddings_combined.npy model/vocab.json --analogy king woman man
  queen                0.6737
  regnant              0.6505
  consort              0.6236
  eurypontid           0.5918
  princess             0.5857
  suddhodana           0.5820
  hedvig               0.5773
  athelstan            0.5737
  rafohy               0.5712
  peuta                0.5707

CLI Usage

Python 3.13

Of course, start by creating a virtual environment, for instance with uv or python -m venv.

The model files are huge, so git lfs is required to pull them :)

Downloading a dataset

You can download the dataset with the provided downloader script: dataset_downloader.py

The core program is divided into three main functionalities:

1. Preprocessing

Before training, the raw text must be converted into a binary format of vocabulary indices.

python word2vec.py preprocess data/wiki.train.tokens \
    --data-dir output \
    --val-input data/wiki.valid.tokens \
    --min-count 5

Outputs: corpus.bin, offsets.npy, and vocab.json inside the specified directory.
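
Conceptually, this step builds a vocabulary with the --min-count cutoff and maps every kept token to its integer index. A simplified sketch of that idea follows; the actual on-disk layout (including offsets.npy) is defined by word2vec.py, so treat this only as an illustration.

import json
import numpy as np
from collections import Counter

def preprocess(tokens, min_count=5):
    # keep only words that appear at least min_count times
    counts = Counter(tokens)
    vocab = {w: i for i, (w, c) in enumerate(counts.most_common()) if c >= min_count}
    # map the corpus to integer ids, dropping out-of-vocabulary tokens
    ids = np.array([vocab[w] for w in tokens if w in vocab], dtype=np.int32)
    return vocab, ids

# tokens = open("data/wiki.train.tokens").read().split()   # whitespace tokenization
# vocab, ids = preprocess(tokens)
# ids.tofile("output/corpus.bin")
# with open("output/vocab.json", "w") as f:
#     json.dump(vocab, f)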

2. Training

Train a single model using the preprocessed data directory.

python word2vec.py train \
    --data-dir output \
    --embeddings output/my_embeddings \
    --dim 100 \
    --lr 0.025 \
    --window-size 5 \
    --epochs 3

Add --val if you generated validation offsets during the preprocessing step.
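
For reference, the original paper draws negative samples from the unigram distribution raised to the 3/4 power. A minimal sampler along those lines could look like the following; it is illustrative and not necessarily how this repository implements it.

import numpy as np

def make_negative_sampler(counts, power=0.75, seed=0):
    # sample word ids with probability proportional to count ** 0.75
    probs = counts.astype(np.float64) ** power
    probs /= probs.sum()
    rng = np.random.default_rng(seed)
    return lambda k: rng.choice(len(probs), size=k, p=probs)

# counts = np.bincount(ids)             # ids from the preprocessing step
# sample_negatives = make_negative_sampler(counts)
# negatives = sample_negatives(5)       # e.g. 5 negatives per positive pair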

3. Inference

Once training is finished, you can load the generated .npy embedding matrices to query similar words or solve analogies.

Find Similar Words:

python word2vec.py infer \
    output/my_embeddings_center.npy \
    output/vocab.json \
    --similar "war" \
    --top-k 10
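
Under the hood this is a cosine-similarity nearest-neighbour search over the embedding matrix. A rough sketch, assuming vocab.json maps each word to its row index (names here are illustrative):

import json
import numpy as np

def most_similar(embeddings, vocab, word, top_k=10):
    # vocab: word -> row index mapping (assumed format of vocab.json)
    inv_vocab = {i: w for w, i in vocab.items()}
    # normalize rows so a dot product equals cosine similarity
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed[vocab[word]]
    order = np.argsort(-sims)
    return [(inv_vocab[i], float(sims[i])) for i in order if i != vocab[word]][:top_k]

# embeddings = np.load("output/my_embeddings_center.npy")
# with open("output/vocab.json") as f:
#     vocab = json.load(f)
# print(most_similar(embeddings, vocab, "war"))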

Solve Analogies (A + B - C = D): e.g. king + woman - man = queen

python word2vec.py infer \
    output/my_embeddings_center.npy \
    output/vocab.json \
    --analogy king woman man
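
The analogy query works the same way, except the query vector is vec(A) + vec(B) - vec(C) and the three input words are excluded from the results. Again, an illustrative sketch rather than the repository's actual code:

import numpy as np

def analogy(embeddings, vocab, a, b, c, top_k=10):
    # find words closest to vec(a) + vec(b) - vec(c) by cosine similarity
    inv_vocab = {i: w for w, i in vocab.items()}
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    query = normed[vocab[a]] + normed[vocab[b]] - normed[vocab[c]]
    query /= np.linalg.norm(query)
    sims = normed @ query
    exclude = {vocab[a], vocab[b], vocab[c]}
    order = np.argsort(-sims)
    return [(inv_vocab[i], float(sims[i])) for i in order if i not in exclude][:top_k]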
