5. Byte Pair Encoding (BPE) Concepts 🧬

Rare words and vast vocabularies are the dual enemies of efficient language modeling. Byte Pair Encoding (BPE) solves both by breaking down language into a set of common subword units.

Mar 2025 · 10 min read

Byte Pair Encoding (BPE) is a cornerstone of modern NLP. It's a subword tokenization method that allows models to represent rare words by breaking them down into common character sequences.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

To solve the limitations of word-level tokenization, BPE introduces an iterative merging strategy:

  1. Iterative Merging: Identifying and combining the most frequent adjacent character pairs.
  2. Vocabulary Stability: Rare or unseen words are handled by breaking them into smaller, known subword pieces.
  3. Efficiency: The text sequence is compressed into fewer, more meaningful tokens as the vocabulary grows.

1. Why BPE?

Traditional tokenization (splitting by whitespace) has two major flaws:

  1. Vocabulary Size: Every unique word needs its own ID. This leads to massive vocabularies.
  2. Out-of-Vocabulary (OOV): If the model sees a word it wasn't trained on (e.g., "liker"), it fails.

BPE solves this by iteratively merging the most frequent character pairs into new tokens.
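To make the OOV failure concrete, here is a minimal sketch (the tiny word list and the dictionary lookup are illustrative, not part of the lesson's code): a word-level vocabulary simply has no ID for an unseen word, while a character-level fallback can always represent it.

```python
# Word-level vocabulary built from a tiny illustrative training corpus
train_words = "like love hug".split()
word_vocab = {w: i for i, w in enumerate(train_words)}

# An unseen word has no ID at all -- the OOV problem
print(word_vocab.get("liker"))  # None: no way to represent "liker"

# A character-level fallback can always represent it, at the cost of length
print(list("liker"))  # ['l', 'i', 'k', 'e', 'r']
```

BPE sits between these two extremes: frequent sequences get their own tokens, and everything else decomposes into known pieces.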


2. Dataset Initialization

We start by identifying all unique characters in our sample text. These characters form our "Base Vocabulary" (Level 0).

Let's walk through a manual iteration using a toy dataset.

# A string with many repetitions
text = 'like liker love lovely hug hugs hugging hearts'
 
# Initial vocabulary: All unique characters
chars = sorted(list(set(text)))
vocab = { char: i for i, char in enumerate(chars) }
 
# Text must be a list of tokens (initially characters)
tokens = list(text)
 
print(f"Initial sequence: {tokens[:10]}...")

Finding the Most Frequent Pair

We count every pair of adjacent tokens in our sequence.

token_pairs = {}
for i in range(len(tokens)-1):
    pair = tokens[i] + tokens[i+1]
    token_pairs[pair] = token_pairs.get(pair, 0) + 1
 
# Find the winner
most_freq_pair = max(token_pairs, key=token_pairs.get)
print(f'Most frequent pair: "{most_freq_pair}" ({token_pairs[most_freq_pair]} times)')
Execution Output
Most frequent pair: " h" (4 times)
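As an aside, the same pair count can be written more compactly with the standard library's `collections.Counter`; this is an equivalent alternative to the loop above, not the lesson's code:

```python
from collections import Counter

tokens = list('like liker love lovely hug hugs hugging hearts')

# Count every adjacent pair in one pass over (token, next-token) slices
pair_counts = Counter(a + b for a, b in zip(tokens, tokens[1:]))
most_freq_pair, count = pair_counts.most_common(1)[0]
print(f'Most frequent pair: "{most_freq_pair}" ({count} times)')  # " h" (4 times)
```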

3. Dynamic Vocabulary Growth

Once we find the most frequent pair, we add it to our vocabulary as a new, single "merged" token.

vocab[most_freq_pair] = max(vocab.values()) + 1
print(f"New Vocabulary size: {len(vocab)}")
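Production BPE implementations typically also record the merge rule itself (which pair produced which token), so the same merges can later be replayed on new text in order. A minimal sketch of that extra bookkeeping, using an illustrative two-entry vocabulary and a `merges` list that the lesson's code does not keep:

```python
# Illustrative slice of the base vocabulary, mirroring the lesson's variables
vocab = {' ': 0, 'h': 1}
most_freq_pair = ' h'

# Assign the next free ID, exactly as in the lesson
vocab[most_freq_pair] = max(vocab.values()) + 1

# Additionally record the merge rule (left piece, right piece) for later reuse
merges = [(' ', 'h')]
print(vocab[' h'], merges)  # 2 [(' ', 'h')]
```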

4. Text Compression

Finally, we replace every instance of the character pair (' ' followed by 'h') with our new single token (' h'). Notice how the total sequence length decreases.


new_text = []
i = 0
while i < (len(tokens) - 1):
    if (tokens[i] + tokens[i+1]) == most_freq_pair:
        new_text.append(most_freq_pair)
        i += 2 # Skip the next character
    else:
        new_text.append(tokens[i])
        i += 1
if i < len(tokens):
    new_text.append(tokens[i])
 
print(f"Original length: {len(tokens)}")
print(f"New length:      {len(new_text)}")
Execution Output
Original length: 46
New length:      42
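A learned merge is not tied to the training text: it can be applied to sequences the tokenizer has never seen. A hedged sketch that wraps the lesson's replace loop in a helper (`apply_merge` is an illustrative name, not from the lesson):

```python
def apply_merge(tokens, pair):
    """Replace every adjacent occurrence of `pair` with one merged token."""
    out, i = [], 0
    while i < len(tokens) - 1:
        if tokens[i] + tokens[i + 1] == pair:
            out.append(pair)
            i += 2  # skip both halves of the merged pair
        else:
            out.append(tokens[i])
            i += 1
    if i < len(tokens):
        out.append(tokens[i])  # keep a trailing unmerged token
    return out

# Unseen text: the learned ' h' merge still applies
new_tokens = apply_merge(list(' hello'), ' h')
print(new_tokens)  # [' h', 'e', 'l', 'l', 'o']
```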

💡 What's Next?

In the next lesson, we'll automate this process with a loop to build a vocabulary of any size!

© 2026 Driptanil Datta. All rights reserved.


Built with Love ❤️ | Last updated: Mar 16 2026