
The Unembedding Layer: From Latents to Literacy 🔓

How does a vector of 768 numbers turn back into a readable word? We explore the Unembedding Layer—the final 'map' that translates a model's internal thoughts into human-readable tokens.

Mar 2025 · 12 min read

A Large Language Model spends most of its time in a high-dimensional math world. But at the very end of the line, it must convert its internal 768-dimensional "thought" back into a single word. This is the job of the Unembedding Layer.

References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

The Unembedding Layer is the "exit gate" of the transformer architecture:

  1. Linear Projection: The model takes its final hidden state (a vector) and projects it onto a massive matrix that has one column for every word in the vocabulary (e.g., 50,257 columns).
  2. Dot Product as Scoring: The "score" for each word is the dot product between the hidden state and that word's unembedding vector. High scores mean high probability.
  3. Softmax and Sampling: These scores (logits) are pushed through a Softmax function to create a probability distribution. The model then picks (samples) the next token from this list.
  4. Weight Tying: In many models, the Unembedding matrix is actually just the transpose of the initial Embedding matrix. The model uses the same "dictionary" to read and write!
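
The four steps above can be sketched with a few lines of numpy. This is a toy illustration, not GPT-2: the sizes are made up (real GPT-2 uses d=768 and V=50,257), and the names `W_E`, `W_U`, and `h` are ours, not from any library.

```python
import numpy as np

rng = np.random.default_rng(0)

# toy sizes (real GPT-2: d=768, V=50257)
d, V = 64, 10

W_E = rng.standard_normal((V, d))   # embedding matrix: one row per vocab word
W_U = W_E.T                         # weight tying: unembedding is the transpose

# pretend the final hidden state is token 2's own embedding
h = W_E[2]

# steps 1-2: linear projection = one dot-product score (logit) per vocab word
logits = h @ W_U                    # shape (V,)

# step 3: softmax turns logits into a probability distribution
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# step 4: sample the next token (greedy decoding would take the argmax instead)
next_token = rng.choice(V, p=probs)

print(f'probabilities sum to {probs.sum():.3f}')
print(f'highest-scoring token: {np.argmax(logits)}')   # token 2 itself
```

Subtracting `logits.max()` before exponentiating is the standard numerical-stability trick; it leaves the resulting probabilities unchanged.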

1. Environment Setup

We'll use numpy and matplotlib to perform manual projections and verify how the model "sees" its own output.

import numpy as np
import matplotlib.pyplot as plt
 
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

2. Loading GPT-2 Config

Wait, how big is the "thought" versus the "dictionary"? We can find the size parameters in the model config.

from transformers import GPT2Model,GPT2Tokenizer
 
# pretrained GPT-2 model and tokenizer
gpt2 = GPT2Model.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
 
# find the size parameters in .config
gpt2.config
 
# embeddings matrix
embeddings = gpt2.wte.weight.detach().numpy()
 
# the properties we'll use later
print(f'Embedding dimensions: {gpt2.config.n_embd}')
print(f'Vocab size: {gpt2.config.vocab_size}')
print(f'Size of embeddings matrix: {embeddings.shape}')
Execution Output
Embedding dimensions: 768
Vocab size: 50257
Size of embeddings matrix: (50257, 768)

3. Real vs. Random Unembeddings

We create two unembedding matrices: one is the actual learned transpose of the embeddings, and the other is a random matrix of the same shape.

# Exercise 2: Real and random unembeddings
 
# unembeddings matrix as the transpose of the (real) embeddings
unembeddings = embeddings.T
 
# confirm the transpose is a distinct array object (note: .T returns a view, not a copy, so the data is shared but the ids differ)
print('id of embeddings:  ',id(embeddings))
print('id of unembeddings:',id(unembeddings))
 
# a random unembeddings matrix
unembeddingsRand = np.random.randn(gpt2.config.n_embd,gpt2.config.vocab_size)
 
print(f'         Size of embeddings matrix: {embeddings.shape}')
print(f'Size of random unembeddings matrix: {unembeddingsRand.shape}')
print(f'  Size of real unembeddings matrix: {unembeddings.shape}')
Execution Output
id of embeddings:   4921247472
id of unembeddings: 4921246512
         Size of embeddings matrix: (50257, 768)
Size of random unembeddings matrix: (768, 50257)
  Size of real unembeddings matrix: (768, 50257)

4. Projecting "California"

Let's take the embedding vector for the word "California" and project it onto both matrices. When using the real matrix, the dot product should be highest for the word itself.

# Exercise 3: California embedding
 
# pick a word
seedword = ' California'
 
# its token index
seed_idx = tokenizer.encode(seedword)
 
# make sure it's one token (3442)
seed_idx
 
# get its embedding vector
embed_vector = embeddings[seed_idx,:]
 
# project onto random and real matrices (dot products)
dpRand = embed_vector @ unembeddingsRand
dpReal = embed_vector @ unembeddings
 
# find the tokens with largest dot products
nextTokenRand_idx = np.argmax(dpRand)
nextTokenReal_idx = np.argmax(dpReal)
 
print(f'** Random matrix: "{tokenizer.decode(nextTokenRand_idx)}" (index {nextTokenRand_idx})')
print(f'** Real matrix:   "{tokenizer.decode(nextTokenReal_idx)}" (index {nextTokenReal_idx})')
Execution Output
** Random matrix: "ifling" (index 12431)
** Real matrix:   " California" (index 3442)

5. Visualizing the Dot Product "Skyline"

When we plot the dot products across the entire vocabulary, the real unembedding matrix shows a clear "peak" at the correct token. The random matrix is just noise.

# plot it!
_,axs = plt.subplots(1,2,figsize=(12,3))
axs[0].scatter(range(tokenizer.vocab_size),dpRand,s=30,c=abs(dpRand),cmap='RdPu',alpha=.4)
axs[0].axvline(nextTokenRand_idx,linestyle='--',color='k',alpha=1/3)
axs[0].plot(nextTokenRand_idx,dpRand[0,nextTokenRand_idx],'gv')
axs[0].set(xlabel='Unembedding dimension',ylabel='Dot product',xlim=[-11,tokenizer.vocab_size+10],
              title=f'(Random) dot products with "{tokenizer.decode(seed_idx)}"')
 
axs[1].scatter(range(tokenizer.vocab_size),dpReal,s=30,c=abs(dpReal),cmap='RdPu',alpha=.4)
axs[1].axvline(nextTokenReal_idx,linestyle='--',color='k',alpha=1/3)
axs[1].plot(nextTokenReal_idx,dpReal[0,nextTokenReal_idx],'gv')
axs[1].set(xlabel='Unembedding dimension',ylabel='Dot product',xlim=[-11,tokenizer.vocab_size+10],
              title=f'(Real) dot products with "{tokenizer.decode(seed_idx)}"')
 
 
plt.tight_layout()
plt.show()

Execution Output: side-by-side scatter plots — the random matrix produces flat noise, while the real matrix shows a sharp peak at the " California" token.
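
The sharp peak is no accident: a vector's dot product with itself is its squared norm, which in high dimensions is far larger than its near-zero dot product with an unrelated direction. A quick numpy check using made-up Gaussian vectors (GPT-2's learned embeddings have much smaller norms — the real peak above is only ≈10 — but the same self-vs-cross gap explains the skyline):

```python
import numpy as np

rng = np.random.default_rng(42)
d = 768

v = rng.standard_normal(d)   # stand-in for an embedding vector
u = rng.standard_normal(d)   # an unrelated random direction

self_dp  = v @ v             # squared norm: concentrates around d (≈ 768)
cross_dp = v @ u             # near zero, fluctuating on the order of ±sqrt(d) (≈ ±28)

print(f'self:  {self_dp:7.1f}')
print(f'cross: {cross_dp:7.1f}')
```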

6. Finding Top-10 Semantic Neighbors

The unembedding layer doesn't just find exact matches; it also scores related words highly. For "California", we see other variations of the word and neighboring states.

# Exercise 4: Find top-10 unembeddings
top10 = np.argsort(dpReal[0])[::-1][:10]
 
for i in top10:
  print(f'Dot product {dpReal[0,i]:6.3f} for token "{tokenizer.decode(i)}"')
Execution Output
Dot product 10.136 for token "California"
Dot product  9.617 for token " California"
Dot product  8.816 for token " Californ"
Dot product  7.359 for token " Nevada"
Dot product  7.158 for token "Arizona"

7. Simulating Sentence Generation

We can simulate a minimal language model: embed a seed word, project it through the unembedding matrix, randomly pick one of the top-10 highest-scoring tokens, and repeat. This is, in miniature, the autoregressive loop real LLMs use to generate text.

# Exercise 5: Generate a token sequence
# sequence length
seq_len = 10
 
# initial seed
nextword = 'budget'
 
# initializing a list that will contain the text
text = nextword
 
 
# loop to create the sequence
for i in range(seq_len-1):
 
  # step 1: tokenize
  token = tokenizer.encode(nextword)
 
  # step 2: get embedding vector
  embed_vector = embeddings[token,:]
 
  # step 3: project onto unembedding matrix (dot products)
  dp = embed_vector @ unembeddings
 
  # step 4: find top-10 projections (dp[0]: the first token's scores, since encode() can return several tokens)
  top10 = np.argsort(dp[0])[::-1][:10]
 
  # step 5: randomly pick one for next token
  aRandomToken = np.random.choice(top10)
  nextword = tokenizer.decode(aRandomToken)
 
  # step 6: append the text
  text += nextword
 
# print the final result!
print('Our very philosophically meaningful text:\n',text)
Execution Output
Our very philosophically meaningful text:
 budget budgets coffers wallets walletsertoddyipertoddconservancy sacrific

8. Compare with Random Generation

Repeating the same generative loop with the unembeddingsRand matrix yields literal noise, confirming that the structure of the unembedding layer is what allows the latent vectors to "speak" human language.

# Result with random matrix (loop omitted for brevity)
# budget Winnipeg mech decap Saur POLITICOtaatoninks Grimoire
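
The omitted loop is identical to the one in Section 7, with `unembeddings` swapped for `unembeddingsRand`. A self-contained toy version (tiny made-up matrices standing in for GPT-2's) shows why the random matrix derails generation: the tied transpose maps a token's embedding back to the token itself, while a random matrix scores an arbitrary, unrelated token highest.

```python
import numpy as np

rng = np.random.default_rng(7)
d, V = 64, 20                        # toy hidden dim and vocab size

W_E = rng.standard_normal((V, d))    # toy embedding matrix
W_real = W_E.T                       # real unembedding: the tied transpose
W_rand = rng.standard_normal((d, V)) # random unembedding of the same shape

token = 5
h = W_E[token]                       # embedding of the seed token

best_real = np.argmax(h @ W_real)    # recovers the seed token (self dot product wins)
best_rand = np.argmax(h @ W_rand)    # typically some unrelated token

print(best_real, best_rand)
```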

Summary: The Unembedding layer is the final projection that maps the model's abstract latent space back onto human language. It turns high-dimensional math back into the words we read.

© 2026 Driptanil Datta. All rights reserved.

Software Developer & Engineer

Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.


Built with Love ❤️ | Last updated: Mar 16 2026