
5. Byte Pair Encoding (BPE) Concepts 🧬
Rare words and vast vocabularies are the dual enemies of efficient language modeling. Byte Pair Encoding (BPE) solves both by breaking down language into a set of common subword units.
Byte Pair Encoding (BPE) is a cornerstone of modern NLP. It's a subword tokenization method that allows models to represent rare words by breaking them down into common character sequences.
This content is adapted from "A Deep Understanding of AI Language Model Mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
To solve the limitations of word-level tokenization, BPE introduces an iterative merging strategy:
- Iterative Merging: The most frequent pair of adjacent tokens is repeatedly identified and combined into a new token.
- Vocabulary Stability: Rare or unseen words are handled by breaking them into smaller, known subword pieces.
- Efficiency: The text sequence is compressed into fewer, more meaningful tokens as the vocabulary grows.
1. Why BPE?
Traditional tokenization (splitting by whitespace) has two major flaws:
- Vocabulary Size: Every unique word needs its own ID. This leads to massive vocabularies.
- Out-of-Vocabulary (OOV): If the model sees a word it wasn't trained on (e.g., "liker"), it fails.
BPE solves this by iteratively merging the most frequent character pairs into new tokens.
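To make the OOV problem concrete, here is a minimal sketch. The vocabularies are hand-picked for illustration (not the output of a trained tokenizer), and the greedy longest-match segmentation is a simplification: a real BPE tokenizer applies its learned merge rules in order rather than matching substrings.

```python
# A word-level vocabulary built from some training text
word_vocab = {"like", "love", "hug"}

# "liker" was never seen, so a word-level tokenizer has no ID for it
print("liker" in word_vocab)  # out-of-vocabulary

# With subword units (hypothetical learned pieces), "liker"
# decomposes into pieces that ARE in the vocabulary
subwords = {"like", "r", "love", "hug"}

def greedy_segment(word, pieces):
    """Greedily match the longest known piece from the left."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece matches {word[i:]!r}")
    return out

print(greedy_segment("liker", subwords))  # ['like', 'r']
```

The rare word is no longer a dead end: it is expressed as a sequence of known subword tokens.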
2. Dataset Initialization
We start by identifying all unique characters in our sample text. These characters form our "Base Vocabulary" (Level 0).
Let's walk through a manual iteration using a toy dataset.
import numpy as np
# A string with many repetitions
text = 'like liker love lovely hug hugs hugging hearts'
# Initial vocabulary: All unique characters
chars = sorted(list(set(text)))
vocab = { char: i for i, char in enumerate(chars) }
# Text must be a list of tokens (initially characters)
tokens = list(text)
print(f"Initial sequence: {tokens[:10]}...")
Finding the Most Frequent Pair
We count every pair of adjacent tokens in our sequence.
token_pairs = {}
for i in range(len(tokens) - 1):
    pair = tokens[i] + tokens[i+1]
    token_pairs[pair] = token_pairs.get(pair, 0) + 1
# Find the winner
most_freq_pair = max(token_pairs, key=token_pairs.get)
print(f'Most frequent pair: "{most_freq_pair}" ({token_pairs[most_freq_pair]} times)')Most frequent pair: " h" (4 times)3. Dynamic Vocabulary Growth
Once we find the most frequent pair, we add it as a new, single token to our vocabulary.
Now we add this new "merged" token to our vocabulary.
vocab[most_freq_pair] = max(vocab.values()) + 1
print(f"New Vocabulary size: {len(vocab)}")
4. Text Compression
Finally, we replace every instance of the character pair (' ' followed by 'h') with our new single token (' h'). Notice how the total sequence length decreases.
new_text = []
i = 0
while i < (len(tokens) - 1):
    if (tokens[i] + tokens[i+1]) == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2  # Skip the next character
    else:
        new_text.append(tokens[i])
        i += 1
if i < len(tokens):
    new_text.append(tokens[i])
print(f"Original length: {len(tokens)}")
print(f"New length: {len(new_text)}")
Original length: 46
New length: 42
💡 What's Next?
In the next lesson, we'll automate this process with a loop to build a vocabulary of any size!
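As a preview, the manual steps above can be collected into a single loop. This is a sketch of that automation, not a definitive implementation: `num_merges` is an illustrative choice, and the toy dataset is the same one used throughout this lesson.

```python
text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)
vocab = {c: i for i, c in enumerate(sorted(set(text)))}

num_merges = 10  # illustrative target; any size works
for _ in range(num_merges):
    # Count every adjacent pair in the current sequence
    pairs = {}
    for a, b in zip(tokens, tokens[1:]):
        pairs[a + b] = pairs.get(a + b, 0) + 1
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab[best] = max(vocab.values()) + 1
    # Replace every occurrence of the winning pair with one token
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] + tokens[i+1] == best:
            merged.append(best)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    tokens = merged

print(f"Final vocabulary size: {len(vocab)}")
print(f"Final sequence length: {len(tokens)}")
```

Each pass grows the vocabulary by one token and shortens the sequence, which is exactly the trade-off BPE exploits; note that concatenating the final tokens still reconstructs the original text.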