
Measuring Similarity in GPT-2 🕵️
How do real-world transformer models compute similarities between 50,000+ tokens? We dive into GPT-2's embedding space to find the semantic 'neighbors' of common names and objects.
A trained embedding space is not just a collection of vectors; it's a map of relationships. By calculating the similarity between one token and every other token in the vocabulary, we can reveal the model's internal "neighborhoods." In this lesson, we perform this "one-to-all" comparison on GPT-2.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
Finding related words in an LLM boils down to a Nearest Neighbors search in latent space:
- Latent Adjacency: Synonyms, related names, and associated concepts cluster together. Their vectors point in virtually the same direction.
- The "Mike" Neighborhood: In GPT-2, the closest neighbors to "Mike" aren't just other names like "Chris" or "Jim," but also semantic variations like "Michael" and the space-prefixed " Mike".
- Vectorization is Key: Comparing one vector against 50,000 others is slow with Python loops, but a single matrix multiplication (a batch of dot products) on normalized vectors makes it nearly instantaneous.
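The key identity behind all of this: once two vectors are scaled to unit length, their dot product equals their cosine similarity. A minimal sketch with made-up 4-dimensional vectors (stand-ins for real 768-dimensional embeddings):

```python
import numpy as np

# Hypothetical 4-dimensional "embeddings" for illustration only
a = np.array([1.0, 2.0, 3.0, 4.0])
b = np.array([2.0, 4.0, 6.0, 8.0])  # points in the same direction as a

# Normalize to unit length; the dot product of unit vectors IS cosine similarity
a_hat = a / np.linalg.norm(a)
b_hat = b / np.linalg.norm(b)

print(a_hat @ b_hat)  # parallel vectors give cosine similarity 1.0
```

Because `b` is just `2 * a`, the two unit vectors coincide and the similarity is exactly 1, regardless of magnitude. This magnitude-invariance is why cosine similarity, not raw dot product, is the standard choice for comparing embeddings.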
1. Environment Setup
We'll use numpy and matplotlib for analysis and visualization, and the transformers library to load GPT-2's internal weights.
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
2. Loading GPT-2 Embedding Weights
We extract the wte (Word Token Embeddings) matrix, which contains the 768-dimensional vectors for all 50,257 tokens in the GPT-2 vocabulary.
from transformers import GPT2Model, GPT2Tokenizer
# pretrained GPT-2 model and tokenizer
gpt2 = GPT2Model.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
# get the Word Token Embeddings matrix
embeddings = gpt2.wte.weight.detach().numpy()
3. Case Study: The Cosmic Banana
Tokens are the building blocks of words. We see that "apple" is a single token, whereas "banana" and "cosmic" are split into two sub-word tokens. We can visualize their vectors to see how the model differentiates between these segments.
# words of interest
word1, word2, word3 = 'banana', 'apple', 'cosmic'
for w in [word1, word2, word3]:
    t = tokenizer.encode(w)
    print(f'"{w}" comprises {len(t)} tokens: {[tokenizer.decode(i) for i in t]}')
"banana" comprises 2 tokens: ['ban', 'ana']
"apple" comprises 1 tokens: ['apple']
"cosmic" comprises 2 tokens: ['cos', 'mic']
4. Direct Vector Comparison
By plotting the 768 dimensions of "ban" and "ana" against each other, we can visually inspect their correlation. For segments of the same word, we often find a moderate-to-high cosine similarity.
# setup figure
fig = plt.figure(figsize=(10,7))
gs = GridSpec(2,2)
ax0, ax1 = fig.add_subplot(gs[0,:]), fig.add_subplot(gs[1,0])
# Plot ban/ana
v1 = embeddings[tokenizer.encode('ban')]
v2 = embeddings[tokenizer.encode('ana')]
cossim = np.sum(v1*v2)/(np.linalg.norm(v1)*np.linalg.norm(v2))
ax1.plot(v1,v2,'ko',markerfacecolor=[.7,.7,.7,.6])
ax1.set(xlim=[-.8,.8],ylim=[-.8,.8],xlabel='"ban"',ylabel='"ana"',
title=f'Cosine similarity = {cossim:.3f}')
plt.show()
5. Finding Semantic Neighbors (The Loop Way)
To find the top neighbors for "Mike", we can loop through every token in the vocabulary and calculate its cosine similarity to our "seed" vector.
# get seed vector
seed = tokenizer.encode('Mike')
seedvect = embeddings[seed].squeeze()
seedvectNorm = np.linalg.norm(seedvect)
# brute-force loop
cossims = np.zeros(embeddings.shape[0])
for idx, v in enumerate(embeddings):
    cossims[idx] = np.sum(seedvect*v) / (seedvectNorm*np.linalg.norm(v))
# top 20
top20 = np.argsort(cossims)[-20:]
6. Analyzing the "Mike" Cluster
The result shows that GPT-2's latent space is highly structured. The neighbors of "Mike" are other common male first names, showing that the model has learned a "category" for these tokens based on their usage patterns in the training data.
# tokens closest to "Mike"
for n in top20[::-1]:
    print(f'"{tokenizer.decode(n)}" | sim: {cossims[n]:.3f}')
"Mike" | sim: 1.000
" Mike" | sim: 0.857
"Michael" | sim: 0.685
"Chris" | sim: 0.647
"Jim" | sim: 0.640
7. Efficient Computation with Linear Algebra
Looping 50,000 times in Python is slow. We can perform all 50,000 comparisons at once by normalizing the entire embedding matrix and performing a single matrix-vector multiplication.
# vectorized version
Enorm = embeddings / np.linalg.norm(embeddings,axis=1,keepdims=True)
cossims2 = Enorm[seed] @ Enorm.T
cossims2 = np.squeeze(cossims2)
print(f'Vectorized computation correlation: {np.corrcoef(cossims,cossims2)[0,1]:.3f}')
Vectorized computation correlation: 1.000
8. Summary: Vectorized Semantic Search
Vectorization isn't just a performance trick; it's how large-scale search engines (like Vector Databases) perform similarity searches across millions of documents. By treating the embedding matrix as a map, we can navigate language with mathematical precision.
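The whole lesson can be condensed into one reusable helper. This is a sketch of a generic top-k nearest-neighbor search over any embedding matrix; the `top_k_neighbors` function name and the toy random matrix are illustrative stand-ins for GPT-2's 50,257 × 768 `wte` matrix:

```python
import numpy as np

def top_k_neighbors(E, seed_idx, k=5):
    """Return indices and cosine similarities of the k rows of E
    most similar to row seed_idx."""
    En = E / np.linalg.norm(E, axis=1, keepdims=True)  # unit-normalize every row
    sims = En @ En[seed_idx]                           # one matvec = all similarities
    order = np.argsort(sims)[::-1][:k]                 # sort descending, keep top k
    return order, sims[order]

# Toy 6-token, 3-dimensional "vocabulary" (random stand-in for real embeddings)
rng = np.random.default_rng(0)
E = rng.normal(size=(6, 3))

idx, sims = top_k_neighbors(E, seed_idx=0, k=3)
print(idx, sims)  # the seed itself always ranks first with similarity 1.0
```

Swapping in the real `embeddings` matrix and a tokenizer-derived `seed_idx` reproduces the "Mike" experiment above in three lines, and the same pattern, normalize once, then matrix-multiply, is essentially what vector databases do at scale.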