Mikolov et al. · 2013 · arXiv
Word2Vec
Teaching machines the meaning of words
Your computer sees the word "cat" as the number 4,892 and "dog" as 7,231. To it, they're just as different as "cat" and "democracy."
How do you teach a machine that cats and dogs are related — that they're both animals, both pets, both furry?
This is the word representation problem — the central challenge of getting computers to understand language. Computers only understand numbers. How do you turn words into numbers in a way that preserves meaning?
Word2Vec is the breakthrough that solved this. It showed that meaning could be learned from raw text alone — no dictionaries, no human labels, no knowledge bases. Just patterns.
The obvious approach (and why it fails)
Let's say you have a vocabulary of 10,000 words. The simplest way to represent each word as a number is called one-hot encoding:
Each word gets a list of 10,000 numbers — all zeros except for a single 1:
"cat" = [0, 0, 0, ..., 1, ..., 0, 0] (1 in position 4,892)
"dog" = [0, 0, 0, ..., 1, ..., 0, 0] (1 in position 7,231)
"democracy" = [0, 0, 0, ..., 1, ..., 0, 0] (1 in position 12,847)
Seems reasonable. Each word has a unique identifier. But here's the fatal flaw.
These lists of numbers are called vectors. You can think of a vector as a point in space — if your vector has 2 numbers, it's a point in 2D space; with 3 numbers, a point in 3D space. One-hot encoding gives us vectors with 10,000 numbers, so each word is a point in 10,000-dimensional space.
And just like points in regular space, we can measure the distance between them. If two words are similar, we'd want them to be close together. If they're different, far apart.
So what's the distance between "cat" and "dog" in one-hot space?
Every word is equally distant from every other
Let's work it out. "Cat" has a 1 in position 4,892, "dog" has a 1 in position 7,231. The distance formula (Euclidean distance) squares the difference at each position, sums the squares, and takes the square root:
Position 4,892: (1 - 0)² = 1
Position 7,231: (0 - 1)² = 1
All other positions: (0 - 0)² = 0
Total: √(1 + 1) = √2
Try any pair of different words — you'll always get √2. Cat to dog? √2. Cat to democracy? √2. Cat to "the"? √2.
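You can verify this for any pair of words. A self-contained sketch (again using the illustrative positions from the text):

```python
import numpy as np

VOCAB_SIZE = 10_000

def one_hot(index):
    vec = np.zeros(VOCAB_SIZE)
    vec[index] = 1.0
    return vec

cat, dog, democracy = one_hot(4_892), one_hot(7_231), one_hot(2_847)

print(np.linalg.norm(cat - dog))        # 1.4142... = sqrt(2)
print(np.linalg.norm(cat - democracy))  # 1.4142... = sqrt(2)
print(np.linalg.norm(dog - democracy))  # 1.4142... = sqrt(2)
```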
In one-hot space, all words are equidistant. There's no structure, no meaning encoded.
In one-hot space, all words are isolated islands — equally far from everything else.
The key insight
One-hot encoding treats words as arbitrary IDs with no relationships. We need a representation where similar words are close together.
The insight that changed everything
"You shall know a word by the company it keeps."
— J.R. Firth, 1957
This is the distributional hypothesis — the idea that words appearing in similar contexts have similar meanings. It sounds simple, but it's profound.
Consider these sentences with a blank:
"The ___ sat on the mat"
Likely words: cat, dog, child, baby
"I adopted a ___ from the shelter"
Likely words: cat, dog, rabbit, bird
"The ___ barked loudly at the mailman"
Likely words: dog (very specific!)
Words that can fill the same blanks are semantically related. "Cat" and "dog" appear near words like "pet," "fur," "veterinarian." "Democracy" appears near "government," "voting," "citizens." Different contexts reveal different meanings.
Why this is revolutionary
We don't need a dictionary, a knowledge base, or human labels. Meaning emerges from patterns of usage in raw text. Give a computer enough text, and it can learn what words mean — without ever being told.
From insight to mechanism: Skip-gram
The distributional hypothesis tells us what to look for. But how do we turn "context patterns" into actual numbers? The answer: train a neural network on a clever task.
The Skip-gram trick
Word2Vec's Skip-gram model learns word vectors by playing a prediction game:
Take a sentence: "The quick brown fox jumps over the lazy dog"
1. Pick a word as the "center" — let's say "fox"
2. Define a "context window" around it — say, 2 words on each side
3. Train the network to predict the context words given the center word
Context window for "fox": [quick, brown, jumps, over]
The network's job: given "fox" as input, predict that "quick," "brown," "jumps," and "over" are likely nearby. Do this for every word in every sentence in a massive corpus.
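The pair-generation step is tiny in code. A simplified sketch (the real implementation also randomizes the window size per word and downsamples very frequent words like "the"):

```python
def skipgram_pairs(tokens, window=2):
    """Generate (center, context) training pairs from a token list."""
    pairs = []
    for i, center in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

sentence = "the quick brown fox jumps over the lazy dog".split()
for center, context in skipgram_pairs(sentence, window=2):
    if center == "fox":
        print(center, "->", context)
# fox -> quick, fox -> brown, fox -> jumps, fox -> over
```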
Context Window Explorer
Click any word to select it as the center word, then see which context words fall within the window.
What to notice
- • Each word generates multiple training pairs — one for every context word in the window
- • Words at the edges have fewer pairs (nothing to the left of the first word)
- • A larger window means more pairs per word — and broader context captured
- • The same context word at different positions creates different pairs
The key insight: By sliding through billions of sentences, the model sees which words tend to appear together — and words that share context become similar.
Where does the vector come from?
Here's the clever part. Remember, our input is a one-hot vector — 10,000 numbers, mostly zeros. That's a terrible representation (we established that all words are equally distant).
What if we forced that 10,000-number input through a much smaller bottleneck — say, just 300 numbers — before trying to predict the context words?
Information must squeeze through 300 numbers to predict context. Those 300 numbers are the word vector.
That middle layer — the bottleneck — is where the magic happens. The network learns to compress each word into just 300 numbers in a way that preserves what's needed to predict context.
Why compression forces meaning
With 10,000 words squeezed into 300 dimensions, the network can't memorize each word separately — there's not enough room. It must find patterns: words that appear in similar contexts get similar vectors.
Think of it like this: if "cat" and "dog" both predict the same context words (pet, furry, vet), why waste precious dimensions encoding them differently? Give them similar vectors.
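Under the hood, the network is essentially two weight matrices, and multiplying by a one-hot input just selects a row of the first one. A minimal NumPy sketch of the forward pass (training updates are omitted; real implementations also avoid the full softmax, as described in the math section below):

```python
import numpy as np

VOCAB_SIZE, EMBED_DIM = 10_000, 300

rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.01, size=(VOCAB_SIZE, EMBED_DIM))   # input embeddings
W_out = rng.normal(scale=0.01, size=(EMBED_DIM, VOCAB_SIZE))  # output ("context") weights

def forward(center_id):
    """A one-hot input times W_in just selects a row: the 300-number bottleneck."""
    hidden = W_in[center_id]                 # the word vector (300 numbers)
    scores = hidden @ W_out                  # one score per vocabulary word
    probs = np.exp(scores - scores.max())    # softmax over the vocabulary
    return probs / probs.sum()

# After training, W_in[word_id] is the embedding we keep.
```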
The beautiful circularity
We never tell the network what words mean. We just ask it to predict context. But to do that well, it has to learn meaning. Semantics emerge as a side effect of prediction.
The embedding space
Once trained, we have a vector for every word. Let's explore what this space looks like.
Below is a 2D projection of real word embeddings. Words are positioned based on their meaning — similar words cluster together. You can click on any word to see its nearest neighbors.
Embedding Space Explorer
Each dot is a word. Similar words cluster together. Click a word to see its nearest neighbors, or filter by category below.
What to notice
- • Animals cluster together
- • Countries cluster together
- • Verbs form their own region
- • Family terms are grouped
The geometry of meaning
This isn't just random clustering. The space has structure that mirrors human intuition about word relationships.
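You can poke at this structure yourself with pretrained vectors. A sketch using Gensim's downloader API and the published Google News Word2Vec model (a large download on first use):

```python
import gensim.downloader as api

# The original 300-dimensional vectors trained on Google News.
vectors = api.load("word2vec-google-news-300")

print(vectors.most_similar("cat", topn=5))
# Expect semantically related words: cats, kitten, feline, dog, ...

print(vectors.similarity("cat", "dog"))        # relatively high
print(vectors.similarity("cat", "democracy"))  # much lower
```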
The magic: vector arithmetic on words
Here's the most famous result from Word2Vec — the one that made headlines and blew minds:
king - man + woman ≈ queen
You can do math on words, and it makes semantic sense. This isn't a trick — it emerges naturally from how the vectors are learned.
Why does this work?
Think about what these vectors encode. The word "king" has certain properties: royalty, power, leadership, and also maleness. "Queen" has most of the same properties, except femaleness instead of maleness.
The vector arithmetic:
- vector("king") - vector("man") → removes "maleness," keeps royalty
- ... + vector("woman") → adds "femaleness"
- Result ≈ vector("queen") → royalty + femaleness = queen
The direction from "man" to "woman" in this space is the concept of gender. And that same direction appears between king/queen, uncle/aunt, brother/sister. The vectors learned abstract concepts as geometric relationships.
Analogy Calculator
"man" is to "king" as "woman" is to ___
Enter words to see how vector arithmetic completes the analogy.
This works for many relationships
- • Gender: king - man + woman ≈ queen
- • Capitals: Paris - France + Japan ≈ Tokyo
- • Comparative: bigger - big + small ≈ smaller
- • Family: father - man + woman ≈ mother
Why this matters
Before Word2Vec (2013), working with language computationally was hard:
Before Word2Vec
- • Words were arbitrary symbols
- • Similarity required hand-crafted features
- • WordNet and expert linguists needed
- • No way to generalize to new words
After Word2Vec
- • Words have geometry
- • Meaning is learned from raw text
- • Similarity is just distance
- • Scales to any vocabulary
Word2Vec democratized semantic understanding. You don't need expensive knowledge bases or teams of linguists. You need text and compute. This opened the floodgates.
The lineage to modern AI
Every modern language model builds on the idea that words (and later, sentences) can be vectors. Word2Vec was the proof of concept that launched a revolution. In just 5 years, we went from static word vectors to GPT.
The math (for the curious)
Skip-gram Objective
Maximize: (1/T) Σ_t Σ_{-c ≤ j ≤ c, j ≠ 0} log P(w_{t+j} | w_t)
For each center word, maximize the probability of observing its actual context words.
Softmax Probability
P(w_O | w_I) = exp(v'_O · v_I) / Σ_w exp(v'_w · v_I)
The probability of a context word w_O given a center word w_I, based on dot-product similarity. The sum in the denominator runs over every word in the vocabulary.
Negative Sampling (Speed Trick)
Computing the full softmax over the entire vocabulary is expensive. Negative sampling approximates it by contrasting positive examples with random negatives.
log σ(v'_O · v_I) + Σ_{k=1}^{K} log σ(-v'_k · v_I)   (v'_k: output vector of the k-th sampled negative word)
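The per-pair loss is simple enough to write by hand. A NumPy sketch (the vector shapes are assumptions for illustration; a full trainer would also need the gradients and the sampling of negative words):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def negative_sampling_loss(v_center, v_context, v_negatives):
    """Loss for one (center, context) pair with K randomly sampled negative words.

    v_center:    (d,)   input vector of the center word
    v_context:   (d,)   output vector of the true context word
    v_negatives: (K, d) output vectors of the K negative words
    """
    positive = np.log(sigmoid(v_context @ v_center))
    negative = np.sum(np.log(sigmoid(-v_negatives @ v_center)))
    return -(positive + negative)   # minimize this, i.e. maximize the objective above
```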
Common Hyperparameters
- • Embedding dimension: 100-300 (how many numbers per word)
- • Context window: 5-10 (how far to look for context)
- • Negative samples: 5-20 (how many negatives per positive)
- • Min word frequency: 5 (ignore rare words)
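These knobs map directly onto Gensim's Word2Vec API (parameter names as in Gensim 4.x, where sg=1 selects Skip-gram). A sketch with a toy corpus, meant only to show the call, not to produce meaningful vectors:

```python
from gensim.models import Word2Vec

# Toy corpus: a list of tokenized sentences. A real corpus would be billions of words.
sentences = [
    "the quick brown fox jumps over the lazy dog".split(),
    "i adopted a dog from the shelter".split(),
    "the cat sat on the mat".split(),
]

model = Word2Vec(
    sentences,
    vector_size=300,   # embedding dimension
    window=5,          # context window
    negative=5,        # negative samples per positive
    min_count=1,       # 1 only because this corpus is tiny (normally 5)
    sg=1,              # 1 = Skip-gram, 0 = CBOW
)

print(model.wv.most_similar("dog", topn=3))
```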
Limitations and what came next
Word2Vec was a breakthrough, but it has limitations that later models addressed:
- • One vector per word
"Bank" (river) and "bank" (financial) share the same vector. The model can't distinguish word senses. → Fixed by ELMo, BERT (contextual embeddings)
- • No word order
"Dog bites man" and "man bites dog" would have similar representations. → Fixed by RNNs, Transformers
- • Static embeddings
The vector for a word never changes, regardless of how it's used. → Modern LLMs compute dynamic representations
- • Training data biases
Embeddings reflect biases in the training text (e.g., gender stereotypes in professions). → Active area of research in fairness
Despite these limitations, Word2Vec's core insight — that meaning can be learned as geometry — remains foundational. Every language model since has built on this principle.
Original Paper
Efficient Estimation of Word Representations in Vector Space
Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean
arXiv:1301.3781, 2013
Read the original paper →