
Coding Challenge: Byte Pair Encoding (BPE) 🧬
Byte Pair Encoding (BPE) is the backbone of most modern LLM tokenizers. In this lesson, we'll build a BPE loop from scratch to understand how it iteratively compresses sequences.
Byte Pair Encoding (BPE) is a subword tokenization method that iteratively merges the most frequent pairs of characters (or tokens) into new, larger tokens. This allows models to handle a vast vocabulary using a small set of base units.
This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
To scale beyond simple character splitting, BPE uses an iterative process to build a vocabulary of common subwords:
- Character-level Setup: Starting with individual characters as our initial vocabulary.
- Pair Counting: Identifying which adjacent tokens appear together most frequently.
- Merging: Creating a new token for the top pair and adding it to the vocabulary.
- Looping: Repeating the process until we reach our desired vocabulary size.
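The steps above can be sketched in miniature before we build the full version. This snippet (a standalone illustration using Python's `collections.Counter`, separate from the code we write below) performs the pair-counting for a single merge step on a toy sequence:

```python
from collections import Counter

# a toy token sequence: start from individual characters
tokens = list('abab')

# count every adjacent pair
pairs = Counter(zip(tokens, tokens[1:]))

# the most frequent pair is the one we would merge first
(top_a, top_b), freq = pairs.most_common(1)[0]
print(top_a + top_b, freq)  # -> ab 2
```

Here ('a', 'b') appears twice while ('b', 'a') appears once, so "ab" would become the first merged token.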
1. Initialize the Vocabulary
We start by identifying all unique characters in our sample text. These characters form our "Base Vocabulary" (Level 0).
We start with a simple string with many repetitions to see how BPE discovers patterns like "like" and "love".
import numpy as np
# some text with lots of repetitions
text = 'like liker love lovely hug hugs hugging hearts'
chars = list(set(text))
chars.sort() # initial vocab is sorted
for l in chars:
    print(f'"{l}" appears {text.count(l)} times.')

# make a vocabulary
vocab = { word:i for i,word in enumerate(chars) }
print(f"\nInitial Vocabulary: {vocab}")

# the text needs to be a list, not a string
# each element in the list is a token
origtext = list(text)
print(f"\nOriginal tokens: {origtext}")

" " appears 7 times.
"a" appears 1 times.
"e" appears 5 times.
"g" appears 5 times.
"h" appears 4 times.
"i" appears 3 times.
"k" appears 2 times.
"l" appears 5 times.
"n" appears 1 times.
"o" appears 2 times.
"r" appears 2 times.
"s" appears 2 times.
"t" appears 1 times.
"u" appears 3 times.
"v" appears 2 times.
"y" appears 1 times.
Initial Vocabulary: {' ': 0, 'a': 1, 'e': 2, 'g': 3, 'h': 4, 'i': 5, 'k': 6, 'l': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12, 'u': 13, 'v': 14, 'y': 15}

2. Core Functions
To automate the BPE process, we need three primary functions: one to count pairs, one to pick the winner and update the vocabulary, and one to actually perform the merge in our text.
We define three functions to handle the BPE process: counting pair frequencies, updating the vocabulary with the most frequent pair, and merging tokens in the sequence.
def get_pair_stats(text2pair):
    token_pairs = dict()
    # loop over tokens
    for i in range(len(text2pair)-1):
        # create a pair
        pair = text2pair[i] + text2pair[i+1]
        # increase pair frequencies
        if pair in token_pairs:
            token_pairs[pair] += 1
        else:
            token_pairs[pair] = 1
    return token_pairs
def update_vocab(token_pairs,vocab):
    # find the most frequent pair
    idx = np.argmax(list(token_pairs.values()))
    newtok = list(token_pairs.keys())[idx]
    # update the vocab
    vocab[newtok] = max(vocab.values())+1
    return vocab,newtok
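One subtlety in update_vocab worth knowing: np.argmax returns the index of the first maximum, so when two pairs tie in frequency, the pair that was counted first wins the merge. A small self-contained check (using made-up pair counts, not the lesson's text):

```python
import numpy as np

# two pairs tie at count 2; argmax picks the first maximal entry
token_pairs = {'li': 2, 'ke': 2, 'lo': 1}
idx = np.argmax(list(token_pairs.values()))
winner = list(token_pairs.keys())[idx]
print(winner)  # -> li
```

Since Python 3.7 dictionaries preserve insertion order, this tie-breaking is deterministic for a given text.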
def generate_new_token_seq(prevtext,newtoken):
    # initialize a new text list
    newtext = []
    # loop through the list
    i = 0
    while i<(len(prevtext)-1):
        # test whether this element and the next one match the newly-created pair
        if (prevtext[i]+prevtext[i+1]) == newtoken:
            newtext.append(newtoken)
            i += 2 # skip the next character
        # not a pair
        else:
            newtext.append(prevtext[i])
            i += 1 # move to the next character
    # append the final token if the loop stopped before the last element
    if i < len(prevtext):
        newtext.append(prevtext[i])
    return newtext

3. The BPE Loop
With our helper functions ready, we can run the main BPE loop. We'll iteratively find the most common pair, merge it, and grow our vocabulary from 16 to 25 items.
Now we run the main loop to expand the vocabulary from 16 initial characters to 25 tokens.
# re-initialize the vocab
vocab = { word:i for i,word in enumerate(chars) }
# how many tokens in the vocab?
vocab_size = 25
# make a copy of the original text
updated_text = origtext.copy()
while len(vocab)<vocab_size:
    # get pair statistics
    pairs = get_pair_stats(updated_text)
    # update the dictionary
    vocab,newtoken = update_vocab(pairs,vocab)
    # get a new list of tokens
    updated_text = generate_new_token_seq(updated_text,newtoken)
    print(f'Vocab has {len(vocab)} tokens after adding "{newtoken}"')

Vocab has 17 tokens after adding " h"
Vocab has 18 tokens after adding " l"
Vocab has 19 tokens after adding " hu"
Vocab has 20 tokens after adding " hug"
Vocab has 21 tokens after adding "ik"
Vocab has 22 tokens after adding "ike"
Vocab has 23 tokens after adding " lo"
Vocab has 24 tokens after adding " lov"
Vocab has 25 tokens after adding " love"

4. Final Results
After several iterations, we can compare our initial character-level text with the final compressed sequence. Notice how common words like "love" and "hug" have become single tokens.
Finally, let's look at the compressed token sequence and the final vocabulary.
# final tokenized text
print(f"Final sequence: {updated_text}")
# final vocabulary size and sample
print(f"Final vocab size: {len(vocab)}")
print(f"Final vocab: {vocab}")

Final sequence: ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y', ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g', ' h', 'e', 'a', 'r', 't', 's']
Final vocab size: 25

💡 Summary
By the end of this challenge, we've seen how BPE successfully merged characters into meaningful subwords like love, hug, and ike. This is the exact mechanism that enables Large Language Models to handle a vast range of languages and technical jargon with a relatively small, fixed-size vocabulary.
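As a possible next step, the learned merges can be replayed to tokenize unseen text. This is a sketch under the assumption that we record each newtoken from the training loop, in order, into a merges list; the encode helper and the example merges below are hypothetical, not part of the lesson code above:

```python
# hypothetical ordered merge list, as the training loop would record it
merges = [' h', ' hu', ' hug']

def encode(text, merges):
    """Replay learned merges, in training order, on new text."""
    tokens = list(text)
    for merged in merges:
        out, i = [], 0
        while i < len(tokens):
            # merge this token with the next if they form a learned pair
            if i + 1 < len(tokens) and tokens[i] + tokens[i+1] == merged:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(encode(' hugs', merges))  # -> [' hug', 's']
```

Applying merges in training order matters: " hug" can only form after " h" and " hu" have already been merged, which is exactly how production BPE tokenizers encode new input.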