
Position Embeddings: Space and Sequence 📍

How does a model know the difference between 'The dog bit the man' and 'The man bit the dog'? We explore Position Embeddings—the geometric signatures that give transformers a sense of order.

Mar 2025 · 15 min read

Unlike Recurrent Neural Networks (RNNs), Transformers process all tokens in a sequence simultaneously. This makes them fast, but it also makes them "order-blind." To fix this, we add Position Embeddings—a set of vectors that tell the model where each token sits in the timeline.
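To see this "order-blindness" concretely, here is a minimal sketch with made-up 4-dimensional word vectors: any order-insensitive operation over the token vectors (summing, here) produces exactly the same representation for both word orders.

```python
import numpy as np

# toy word embeddings (4-dimensional, invented for illustration)
emb = {
    'I':      np.array([0.1, 0.3, -0.2, 0.5]),
    'love':   np.array([0.4, -0.1, 0.2, 0.0]),
    'coding': np.array([-0.3, 0.2, 0.1, 0.4]),
}

def pool(tokens):
    """Order-insensitive pooling: sum of the token vectors."""
    return np.sum([emb[t] for t in tokens], axis=0)

a = pool(['I', 'love', 'coding'])
b = pool(['coding', 'love', 'I'])
print(np.allclose(a, b))  # True: without position information, order is invisible
```

Position embeddings break this symmetry by making the vector for each token depend on where it appears.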

References & Disclaimer

This content is adapted from "A deep understanding of AI language model mechanisms." It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

Position embeddings provide the "time signature" for our data:

  1. Sequence Awareness: Without these vectors, the sentences "I love coding" and "Coding love I" would look identical to the model.
  2. Additive Geometry: Position vectors are usually added directly to the word embedding vectors. The model learns to disentangle "what the word is" from "where the word is."
  3. Learned vs. Sinusoidal: GPT-2 uses Learned position embeddings (parameters adjusted during training). The original Transformer paper used Sinusoidal (fixed mathematical patterns).
  4. Context Window Limits: The size of the position embedding matrix (e.g., 1024 for GPT-2) defines the model's maximum context length.
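A minimal sketch of point 2 (additive geometry), using small random matrices in place of the real GPT-2 weights (the sizes here are toy values, not GPT-2's): the same word at two different positions produces two different input vectors.

```python
import numpy as np

rng = np.random.default_rng(0)

vocab_size, context_len, d_model = 50, 8, 16   # toy sizes, not GPT-2's
wte = rng.normal(size=(vocab_size, d_model))   # word (token) embeddings
wpe = rng.normal(size=(context_len, d_model))  # position embeddings

token_id = 7
x_pos0 = wte[token_id] + wpe[0]  # "what" + "where" at position 0
x_pos5 = wte[token_id] + wpe[5]  # same word, different position

# the word part is identical, but the summed inputs differ
print(np.allclose(x_pos0, x_pos5))  # False
```

Note that the difference between the two inputs is exactly the difference between the two position vectors, which is what lets the model disentangle content from location.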

1. Environment Setup

We'll use numpy for matrix analysis and the transformers library to inspect GPT-2's learned position weights.

import numpy as np
import matplotlib.pyplot as plt
 
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

2. Extracting GPT-2 Position Weights

In GPT-2, the word embeddings are stored in wte, while position embeddings are in wpe. Notice the shape: 1024 positions by 768 dimensions.

from transformers import GPT2Model
 
# load the model
gpt2 = GPT2Model.from_pretrained('gpt2')
 
# position embeddings matrix
positions = gpt2.wpe.weight.detach().numpy()
print(f'Position matrix shape: {positions.shape}')
Execution Output
Position matrix shape: (1024, 768)

3. Visualizing the Position Matrix

The heatmap below shows the "texture" of order. Each column represents a position in the text (0 to 1023), and each row is one of the 768 dimensions. The patterns here were learned by GPT-2 across billions of words.

# visualize the matrix
plt.figure(figsize=(10,4))
plt.imshow(positions.T,aspect='auto',vmin=-.2,vmax=.2)
plt.gca().set(xlabel='Token position (0-1023)',ylabel='Dimensions',
              title='GPT-2 position embedding matrix')
plt.show()

Output 1

4. Plotting Learned Position Vectors

We can pick random dimensions and see how their weight changes across token positions. Some dimensions might show smooth waves, while others appear noisier. This represents how GPT-2 "decided" to encode the passage of time.

_,axs = plt.subplots(3,4,figsize=(16,6))
 
# pick random vectors
for a in axs.flatten():
 
  # a random embedding dimension (a column of the position matrix)
  randidx = np.random.randint(positions.shape[1])
 
  # plot its weight at every token position
  a.plot(positions[:,randidx],'k',label=f'Dimension {randidx}')
  a.axhline(0,linestyle='--',color='gray',zorder=-3)
 
  a.set(xticks=[],yticks=[0],xlim=[0,positions.shape[0]])
  a.legend(fontsize=10)
 
 
# x-axis label on one plot
a.set_xlabel('Start of text <---> end of context window',fontsize=12)
plt.tight_layout()
plt.show()

Output 2

5. Similarity Across Time and Embeddings

How similar is position 5 to position 500? We can calculate cosine similarity for both token positions ("time") and embedding dimensions.

  • Token index similarity: High diagonal similarity shows that nearby positions are more related.
  • Embedding index similarity: Shows how various dimensions correlate with each other.

# cosine similarities for "time series" (token index)
Pnorm1 = positions / np.linalg.norm(positions,axis=1,keepdims=True)
cossim_tokens = Pnorm1 @ Pnorm1.T
 
# cosine similarities across embedding dimensions
Pnorm0 = positions / np.linalg.norm(positions,axis=0,keepdims=True)
cossim_embeds = Pnorm0.T @ Pnorm0
 
 
# draw the images
fig,axs = plt.subplots(1,2,figsize=(12,5))
 
h = axs[0].imshow(cossim_tokens,vmin=-1,vmax=1)
axs[0].set(xlabel='Token index ("time")',ylabel='Token index ("time")',title='$S_c$ over "time"')
ch = fig.colorbar(h,ax=axs[0],pad=.02,fraction=.046)
ch.ax.tick_params(labelsize=10)
ch.ax.set_yticks(np.arange(-1,1.1,.5))
 
h = axs[1].imshow(cossim_embeds,vmin=-1,vmax=1)
axs[1].set(xlabel='Embedding index',ylabel='Embedding index',title='$S_c$ across embeddings')
ch = fig.colorbar(h,ax=axs[1],pad=.02,fraction=.046)
ch.ax.tick_params(labelsize=10)
ch.ax.set_yticks(np.arange(-1,1.1,.5))
 
plt.tight_layout()
plt.show()

Output 3

6. Sinusoidal Embeddings (The "Attention" Way)

The landmark Attention Is All You Need paper used fixed sine and cosine functions of different frequencies. This allows the model to theoretically generalize to sequences longer than those seen during training without needing to "learn" new position vectors.
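Concretely, for token position $pos$, dimension pair index $k$, and model width $d$, the paper defines:

```latex
PE_{(pos,\,2k)} = \sin\!\left(\frac{pos}{10000^{2k/d}}\right), \qquad
PE_{(pos,\,2k+1)} = \cos\!\left(\frac{pos}{10000^{2k/d}}\right)
```

The code below fills the even columns with the sines and the odd columns with the cosines, with the loop variable i playing the role of the even index 2k.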

# same shape as GPT-2's learned matrix: (1024, 768)
positionsFormula = np.zeros(gpt2.wpe.weight.shape)
d = positionsFormula.shape[1]
 
# token position ("time")
th = np.arange(positionsFormula.shape[0])
 
# create the vectors
for i in range(0,positionsFormula.shape[1],2):
 
  # denominator scaling factor: 10000^(i/d), where i is the even dimension index
  denom = 10000 ** (i / d)
 
  # define the embeddings
  positionsFormula[:,i]   = np.sin(th / denom)
  positionsFormula[:,i+1] = np.cos(th / denom)
 
 
 
# and visualize
_,axs = plt.subplots(1,2,figsize=(12,4))
axs[0].imshow(positionsFormula.T,vmin=-1,vmax=1)
axs[0].set(ylabel='Embedding dimensions',xlabel='Token order ("time")',title='All position embeddings')
 
# a few individual embedding dimensions to highlight
pos2show = np.linspace(200,600,4,dtype=int)
h = axs[1].plot(positionsFormula[:,pos2show])
axs[1].set(ylabel='Weight value',xlabel='Token order ("time")',xlim=[0,len(th)],title='A few embedding dimensions')
 
for i,p in enumerate(pos2show):
  axs[0].axhline(p,linestyle='--',color=h[i].get_color(),linewidth=1.8)
 
 
plt.tight_layout()
plt.show()

Output 4

7. Why Multiple Frequencies?

By using a range of frequencies, the model can pinpoint both immediate local context (the word next to me) using high-frequency waves, and long-range global context (the beginning of the chapter) using low-frequency ones.

# Plotting individual dimension waves
plt.figure(figsize=(10,4))
plt.plot(positionsFormula[:, 200:600:100]) # dimensions 200, 300, 400, 500
plt.title('Different frequency waves for multi-scale positioning')
plt.show()

Output 5

8. Summary: The Coordinate System of Meaning

Embeddings give us the "what," and position embeddings give us the "where." Together, they form a complete coordinate system that allows the Transformer to build complex, order-dependent representations of language without ever needing a recurrent loop.

© 2026 Driptanil Datta. All rights reserved.


Built with Love ❤️ | Last updated: Mar 16 2026