
5. Byte Pair Encoding (BPE) Concepts 🧬
Rare words and vast vocabularies are the dual enemies of efficient language modeling. Byte Pair Encoding (BPE) solves both by breaking down language into a set of common subword units.
Byte Pair Encoding (BPE) is a cornerstone of modern NLP. It's a subword tokenization method that allows models to represent rare words by breaking them down into common character sequences.
This content is adapted from "A Deep Understanding of AI Language Model Mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
To solve the limitations of word-level tokenization, BPE introduces an iterative merging strategy:
- Iterative Merging: The most frequent pair of adjacent tokens is repeatedly identified and combined into a new token.
- Vocabulary Stability: Rare or unseen words are handled by breaking them into smaller, known subword pieces.
- Efficiency: The text sequence is compressed into fewer, more meaningful tokens as the vocabulary grows.
1. Why BPE?
Traditional tokenization (splitting by whitespace) has two major flaws:
- Vocabulary Size: Every unique word needs its own ID. This leads to massive vocabularies.
- Out-of-Vocabulary (OOV): If the model sees a word it wasn't trained on (e.g., "liker"), it fails.
BPE solves this by iteratively merging the most frequent character pairs into new tokens.
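To make the OOV problem concrete, here is a minimal sketch. The vocabularies are hand-picked for illustration (not the output of a trained tokenizer), and the greedy longest-match segmentation is a simplification: a real BPE tokenizer applies its learned merge rules in order rather than matching substrings.

```python
# A word-level vocabulary built from some training text
word_vocab = {"like", "love", "hug"}

# "liker" was never seen, so a word-level tokenizer has no ID for it
print("liker" in word_vocab)  # out-of-vocabulary

# With subword units (hypothetical learned pieces), "liker"
# decomposes into pieces that ARE in the vocabulary
subwords = {"like", "r", "love", "hug"}

def greedy_segment(word, pieces):
    """Greedily match the longest known piece from the left."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in pieces:
                out.append(word[i:j])
                i = j
                break
        else:
            raise ValueError(f"no piece matches {word[i:]!r}")
    return out

print(greedy_segment("liker", subwords))  # ['like', 'r']
```

The rare word is no longer a dead end: it is expressed as a sequence of known subword tokens.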
2. Dataset Initialization
We start by identifying all unique characters in our sample text. These characters form our "Base Vocabulary" (Level 0).
Let's walk through a manual iteration using a toy dataset.
import numpy as np
# A string with many repetitions
text = 'like liker love lovely hug hugs hugging hearts'
# Initial vocabulary: All unique characters
chars = sorted(list(set(text)))
vocab = { char: i for i, char in enumerate(chars) }
# Text must be a list of tokens (initially characters)
tokens = list(text)
print(f"Initial sequence: {tokens[:10]}...")
Finding the Most Frequent Pair
We count every pair of adjacent tokens in our sequence.
token_pairs = {}
for i in range(len(tokens) - 1):
    pair = tokens[i] + tokens[i+1]
    token_pairs[pair] = token_pairs.get(pair, 0) + 1
# Find the winner
most_freq_pair = max(token_pairs, key=token_pairs.get)
print(f'Most frequent pair: "{most_freq_pair}" ({token_pairs[most_freq_pair]} times)')Most frequent pair: " h" (4 times)3. Dynamic Vocabulary Growth
Once we find the most frequent pair, we add it as a new, single token to our vocabulary.
Now we add this new "merged" token to our vocabulary.
vocab[most_freq_pair] = max(vocab.values()) + 1
print(f"New Vocabulary size: {len(vocab)}")
4. Text Compression
Finally, we replace every instance of the character pair (' ' followed by 'h') with our new single token (' h'). Notice how the total sequence length decreases.
new_text = []
i = 0
while i < (len(tokens) - 1):
    if (tokens[i] + tokens[i+1]) == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2  # Skip the next character
    else:
        new_text.append(tokens[i])
        i += 1
if i < len(tokens):
    new_text.append(tokens[i])
print(f"Original length: {len(tokens)}")
print(f"New length: {len(new_text)}")
Original length: 46
New length: 42
💡 What's Next?
In the next lesson, we'll automate this process with a loop to build a vocabulary of any size!
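As a preview, the manual steps above can be collected into a single loop. This is a sketch of that automation, not a definitive implementation: `num_merges` is an illustrative choice, and the toy dataset is the same one used throughout this lesson.

```python
text = 'like liker love lovely hug hugs hugging hearts'
tokens = list(text)
vocab = {c: i for i, c in enumerate(sorted(set(text)))}

num_merges = 10  # illustrative target; any size works
for _ in range(num_merges):
    # Count every adjacent pair in the current sequence
    pairs = {}
    for a, b in zip(tokens, tokens[1:]):
        pairs[a + b] = pairs.get(a + b, 0) + 1
    if not pairs:
        break
    best = max(pairs, key=pairs.get)
    vocab[best] = max(vocab.values()) + 1
    # Replace every occurrence of the winning pair with one token
    merged, i = [], 0
    while i < len(tokens):
        if i < len(tokens) - 1 and tokens[i] + tokens[i+1] == best:
            merged.append(best)
            i += 2
        else:
            merged.append(tokens[i])
            i += 1
    tokens = merged

print(f"Final vocabulary size: {len(vocab)}")
print(f"Final sequence length: {len(tokens)}")
```

Each pass grows the vocabulary by one token and shortens the sequence, which is exactly the trade-off BPE exploits; note that concatenating the final tokens still reconstructs the original text.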