
Coding Challenge: Byte Pair Encoding (BPE) 🧬
Byte Pair Encoding (BPE) is the backbone of most modern LLM tokenizers. In this lesson, we'll build a BPE loop from scratch to understand how it iteratively compresses sequences.
Byte Pair Encoding (BPE) is a subword tokenization method that iteratively merges the most frequent pairs of characters (or tokens) into new, larger tokens. This allows models to handle a vast vocabulary using a small set of base units.
This content is adapted from "A deep understanding of AI language model mechanisms". It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
To scale beyond simple character splitting, BPE uses an iterative process to build a vocabulary of common subwords:
- Character-level Setup: Starting with individual characters as our initial vocabulary.
- Pair Counting: Identifying which adjacent tokens appear together most frequently.
- Merging: Creating a new token for the top pair and adding it to the vocabulary.
- Looping: Repeating the process until we reach our desired vocabulary size.
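The steps above can be sketched in miniature before we build the full version. This snippet (a standalone illustration using Python's `collections.Counter`, separate from the code we write below) performs the pair-counting for a single merge step on a toy sequence:

```python
from collections import Counter

# a toy token sequence: start from individual characters
tokens = list('abab')

# count every adjacent pair
pairs = Counter(zip(tokens, tokens[1:]))

# the most frequent pair is the one we would merge first
(top_a, top_b), freq = pairs.most_common(1)[0]
print(top_a + top_b, freq)  # -> ab 2
```

Here ('a', 'b') appears twice while ('b', 'a') appears once, so "ab" would become the first merged token.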
1. Initialize the Vocabulary
We start by identifying all unique characters in our sample text. These characters form our "Base Vocabulary" (Level 0).
We start with a simple string with many repetitions to see how BPE discovers patterns like "like" and "love".
import numpy as np
# some text with lots of repetitions
text = 'like liker love lovely hug hugs hugging hearts'
chars = list(set(text))
chars.sort() # initial vocab is sorted
for l in chars:
    print(f'"{l}" appears {text.count(l)} times.')

# make a vocabulary
vocab = { word:i for i,word in enumerate(chars) }
print(f"\nInitial Vocabulary: {vocab}")

# the text needs to be a list, not a string
# each element in the list is a token
origtext = list(text)
print(f"\nOriginal tokens: {origtext}")

" " appears 7 times.
"a" appears 1 times.
"e" appears 5 times.
"g" appears 5 times.
"h" appears 4 times.
"i" appears 3 times.
"k" appears 2 times.
"l" appears 5 times.
"n" appears 1 times.
"o" appears 2 times.
"r" appears 2 times.
"s" appears 2 times.
"t" appears 1 times.
"u" appears 3 times.
"v" appears 2 times.
"y" appears 1 times.
Initial Vocabulary: {' ': 0, 'a': 1, 'e': 2, 'g': 3, 'h': 4, 'i': 5, 'k': 6, 'l': 7, 'n': 8, 'o': 9, 'r': 10, 's': 11, 't': 12, 'u': 13, 'v': 14, 'y': 15}

2. Core Functions
To automate the BPE process, we need three primary functions: one to count pairs, one to pick the winner and update the vocabulary, and one to actually perform the merge in our text.
We define three functions to handle the BPE process: counting pair frequencies, updating the vocabulary with the most frequent pair, and merging tokens in the sequence.
def get_pair_stats(text2pair):
    token_pairs = dict()
    # loop over tokens
    for i in range(len(text2pair)-1):
        # create a pair
        pair = text2pair[i] + text2pair[i+1]
        # increase pair frequencies
        if pair in token_pairs:
            token_pairs[pair] += 1
        else:
            token_pairs[pair] = 1
    return token_pairs
def update_vocab(token_pairs,vocab):
    # find the most frequent pair
    idx = np.argmax(list(token_pairs.values()))
    newtok = list(token_pairs.keys())[idx]
    # update the vocab
    vocab[newtok] = max(vocab.values())+1
    return vocab,newtok
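One subtlety in update_vocab worth knowing: np.argmax returns the index of the first maximum, so when two pairs tie in frequency, the pair that was counted first wins the merge. A small self-contained check (using made-up pair counts, not the lesson's text):

```python
import numpy as np

# two pairs tie at count 2; argmax picks the first maximal entry
token_pairs = {'li': 2, 'ke': 2, 'lo': 1}
idx = np.argmax(list(token_pairs.values()))
winner = list(token_pairs.keys())[idx]
print(winner)  # -> li
```

Since Python 3.7 dictionaries preserve insertion order, this tie-breaking is deterministic for a given text.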
def generate_new_token_seq(prevtext,newtoken):
    # initialize a new text list
    newtext = []
    # loop through the list
    i = 0
    while i<(len(prevtext)-1):
        # test whether this element and the next one match the newly-created pair
        if (prevtext[i]+prevtext[i+1]) == newtoken:
            newtext.append(newtoken)
            i += 2 # skip the next character
        # not a pair
        else:
            newtext.append(prevtext[i])
            i += 1 # move to the next character
    # append the final token if the loop stopped before the last element
    if i < len(prevtext):
        newtext.append(prevtext[i])
    return newtext

3. The BPE Loop
With our helper functions ready, we can run the main BPE loop. We'll iteratively find the most common pair, merge it, and grow our vocabulary from 16 to 25 items.
Now we run the main loop to expand the vocabulary from 16 initial characters to 25 tokens.
# re-initialize the vocab
vocab = { word:i for i,word in enumerate(chars) }
# how many tokens in the vocab?
vocab_size = 25
# make a copy of the original text
updated_text = origtext.copy()
while len(vocab)<vocab_size:
    # get pair statistics
    pairs = get_pair_stats(updated_text)
    # update the dictionary
    vocab,newtoken = update_vocab(pairs,vocab)
    # get a new list of tokens
    updated_text = generate_new_token_seq(updated_text,newtoken)
    print(f'Vocab has {len(vocab)} tokens after adding "{newtoken}"')

Vocab has 17 tokens after adding " h"
Vocab has 18 tokens after adding " l"
Vocab has 19 tokens after adding " hu"
Vocab has 20 tokens after adding " hug"
Vocab has 21 tokens after adding "ik"
Vocab has 22 tokens after adding "ike"
Vocab has 23 tokens after adding " lo"
Vocab has 24 tokens after adding " lov"
Vocab has 25 tokens after adding " love"

4. Final Results
After several iterations, we can compare our initial character-level text with the final compressed sequence. Notice how common words like "love" and "hug" have become single tokens.
Finally, let's look at the compressed token sequence and the final vocabulary.
# final tokenized text
print(f"Final sequence: {updated_text}")
# final vocabulary size and sample
print(f"Final vocab size: {len(vocab)}")
print(f"Final vocab: {vocab}")

Final sequence: ['l', 'ike', ' l', 'ike', 'r', ' love', ' love', 'l', 'y', ' hug', ' hug', 's', ' hug', 'g', 'i', 'n', 'g', ' h', 'e', 'a', 'r', 't', 's']
Final vocab size: 25

💡 Summary
By the end of this challenge, we've seen how BPE successfully merged characters into meaningful subwords like love, hug, and ike. This is the exact mechanism that enables Large Language Models to handle a vast range of languages and technical jargon with a relatively small, fixed-size vocabulary.
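As a possible next step, the learned merges can be replayed to tokenize unseen text. This is a sketch under the assumption that we record each newtoken from the training loop, in order, into a merges list; the encode helper and the example merges below are hypothetical, not part of the lesson code above:

```python
# hypothetical ordered merge list, as the training loop would record it
merges = [' h', ' hu', ' hug']

def encode(text, merges):
    """Replay learned merges, in training order, on new text."""
    tokens = list(text)
    for merged in merges:
        out, i = [], 0
        while i < len(tokens):
            # merge this token with the next if they form a learned pair
            if i + 1 < len(tokens) and tokens[i] + tokens[i+1] == merged:
                out.append(merged)
                i += 2
            else:
                out.append(tokens[i])
                i += 1
        tokens = out
    return tokens

print(encode(' hugs', merges))  # -> [' hug', 's']
```

Applying merges in training order matters: " hug" can only form after " h" and " hu" have already been merged, which is exactly how production BPE tokenizers encode new input.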