
GPT-4 Tokenizer
To scale beyond simple word-splitting, modern LLMs like GPT-4 use sophisticated subword tokenization. We will use OpenAI's `tiktoken` library to dissect how these models perceive and process language.
This content is adapted from "A deep understanding of AI language model mechanisms" and has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
While we previously built a simple word-level tokenizer, tiktoken (used by GPT-4) uses Byte-Pair Encoding (BPE) to handle the complexities of real-world text. Our exploration covers:
- Tiktoken Setup: Initializing the `cl100k_base` encoding.
- Vocabulary Depth: Exploring a 100,277-token universe.
- Tokenization Logic: Seeing how punctuation and subwords are handled.
- Statistical Insights: Visualizing how token lengths vary across the entire vocabulary.
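Before diving into `tiktoken` itself, the core BPE idea is worth seeing in miniature. The sketch below is a toy illustration (not tiktoken's actual merge table or code): starting from single characters, it repeatedly fuses the most frequent adjacent pair of symbols into a new, longer symbol.

```python
from collections import Counter

def bpe_merge_step(symbols):
    """One toy BPE step: merge the most frequent adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols
    (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
            merged.append(a + b)          # fuse the pair into one symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# start from individual characters and apply a few merges
symbols = list('low lower lowest')
for _ in range(4):
    symbols = bpe_merge_step(symbols)
print(symbols)   # -> ['low', ' lowe', 'r', ' lowe', 's', 't']
```

After only four merges the shared stem "low" has fused into multi-character symbols, and notice that one of them (" lowe") absorbed its leading space, a pattern we will meet again in GPT-4's real vocabulary.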
1. Setup & Imports
To use GPT-4's tokenizer, we need the tiktoken library. We'll also import numpy and matplotlib for analysis and visualization.
```python
import numpy as np
import matplotlib.pyplot as plt

# matplotlib defaults
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
```

```python
# need to install the tiktoken library to get OpenAI's tokenizer
# note: it's tik-token, not tiktok-en :P
!pip install tiktoken
import tiktoken
```

2. Loading the GPT-4 Encoding
OpenAI provides several encodings. For GPT-4 (and GPT-3.5), the standard is `cl100k_base`. Let's initialize it and see what's inside.
```python
# GPT-4's tokenizer
tokenizer = tiktoken.get_encoding('cl100k_base')
dir(tokenizer)
```

```
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 ...]
```

```python
# get help
tokenizer??
```

```
Type:        Encoding
String form: <Encoding 'cl100k_base'>
File:        ~/.pyenv/versions/3.12.6/lib/python3.12/site-packages/tiktoken/core.py
Source:
class Encoding:
    ...
```

3. Exploring Vocabulary Size
One of the reasons GPT-4 is so capable is its massive vocabulary. Unlike our simple word-level tokenizer, tiktoken manages over 100,000 unique tokens.
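One reason the vocabulary can stay this compact while still covering any input: `cl100k_base` is a byte-level BPE, so (as I understand it) its base alphabet is the 256 possible byte values, and everything above that is a learned merge. A quick stdlib-only check of how characters decompose into UTF-8 bytes:

```python
# byte-level BPE operates on UTF-8 bytes, not characters: every string
# decomposes into values from a 256-symbol base alphabet
for ch in ['a', 'â', '€']:
    b = ch.encode('utf-8')
    print(f'{ch!r} -> {len(b)} byte(s): {list(b)}')
# 'a' -> 1 byte(s): [97]
# 'â' -> 2 byte(s): [195, 162]
# '€' -> 3 byte(s): [226, 130, 172]
```

Because every byte value is in the base alphabet, no string is ever "out of vocabulary"; rare characters just cost more tokens.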
```python
# vocab size
tokenizer.n_vocab
```

```
100277
```

4. Special Tokens
BPE tokenizers use "special" tokens for specific purposes, like marking the end of a text string (`<|endoftext|>`).
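How do special tokens coexist with the regular vocabulary? Here is a toy sketch of the ID layout (the `<|endoftext|>` ID matches `cl100k_base`, but the lookup logic is illustrative only, not tiktoken's implementation): regular BPE tokens occupy the low IDs, special tokens are pinned to reserved IDs above them, and IDs in neither set are simply invalid.

```python
# toy sketch of cl100k_base's ID layout (illustrative, not tiktoken's code)
REGULAR_TOKEN_IDS = range(0, 100256)        # ordinary BPE merge tokens
SPECIAL_TOKENS = {100257: '<|endoftext|>'}  # reserved special-token IDs

def decode_special(token_id):
    """Decode a special token ID, failing like tiktoken does for invalid IDs."""
    if token_id in SPECIAL_TOKENS:
        return SPECIAL_TOKENS[token_id]
    if token_id in REGULAR_TOKEN_IDS:
        raise ValueError('regular token: look it up in the BPE table instead')
    raise KeyError(f'Invalid token for decoding: {token_id}')

print(decode_special(100257))   # -> <|endoftext|>
```

An ID such as 100261 falls into neither range, which is why decoding it fails below even though `n_vocab` is 100277.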
```python
tokenizer.decode([tokenizer.eot_token])
```

```
'<|endoftext|>'
```

```python
# but not all token IDs are valid, e.g.,
print(tokenizer.n_vocab)
tokenizer.decode([100261])
```

```
100277
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 3
      1 # but not all token IDs are valid, e.g.,
KeyError: 'Invalid token for decoding: 100261'
```

```python
# list of all tokens:
# https://github.com/vnglst/gpt4-tokens/blob/main/decode-tokens.ipynb
```

Explore some tokens
5. Exploring Individual Tokens
Let's look at what 50 tokens (starting at index 1000) actually represent. Notice how many tokens are pieces of words, like "indow" (from "window") or "lement" (from "element").
```python
for i in range(1000, 1050):
    print(f'{i} = {tokenizer.decode([i])}')
```

```
1000 = indow
1001 = lement
1002 = pect
1003 = ash
1004 = [i
...
```

Tokenization!
6. Tokenization in Practice
Now, let's see how a full sentence is broken down. We'll encode a string and then inspect how each "word" is actually composed of one or more tokens.
```python
text = "My name is Mike and I like toothpaste-flavored chocolate."
tokens = tokenizer.encode(text)
print(tokens)
```

```
[5159, 836, 374, 11519, 323, 358, 1093, 26588, 57968, 12556, 76486, 18414, 13]
```

```python
text.split()
```

```
['My',
 'name',
 'is',
 'Mike',
 'and',
 ...]
```

```python
for word in text.split():
    print(f'"{word}" comprises token(s) {tokenizer.encode(word)}')
```

```
"My" comprises token(s) [5159]
"name" comprises token(s) [609]
"is" comprises token(s) [285]
"Mike" comprises token(s) [35541]
"and" comprises token(s) [438]
...
```

```python
for t in tokens:
    print(f'Token {t:>6} is "{tokenizer.decode([t])}"')
```

```
Token   5159 is "My"
Token    836 is " name"
Token    374 is " is"
Token  11519 is " Mike"
Token    323 is " and"
...
```

```python
# with special (non-ASCII) characters
tokenizer.encode('â')
```

```
[9011]
```

7. Token Length Distribution
To understand the tokenizer's complexity, we can visualize the distribution of token lengths. Most tokens are between 3 and 6 characters long, which is the "sweet spot" for common subword units.
```python
# initialize lengths vector
token_lengths = np.zeros(tokenizer.n_vocab)

# get the number of characters in each token
for idx in range(tokenizer.n_vocab):
    try:
        token_lengths[idx] = len(tokenizer.decode([idx]))
    except KeyError:  # some IDs in the vocab range are invalid
        token_lengths[idx] = np.nan

# count unique lengths
uniqueLengths, tokenCount = np.unique(token_lengths, return_counts=True)

# visualize
_, axs = plt.subplots(1, 2, figsize=(12, 4))

axs[0].plot(token_lengths, 'k.', markersize=3, alpha=.4)
axs[0].set(xlim=[0, tokenizer.n_vocab], xlabel='Token index', ylabel='Token length (characters)',
           title='GPT-4 token lengths')

axs[1].bar(uniqueLengths, tokenCount, color='k', edgecolor='gray')
axs[1].set(xlim=[0, max(uniqueLengths)], yscale='log', xlabel='Token length (chars)',
           ylabel='Token count (log scale)', title='Distribution of token lengths')

plt.tight_layout()
plt.show()
```

Many word-tokens start with spaces
8. The Power of Leading Spaces
In BPE, a space prefix is often treated as part of the token itself. This is why " Michael" and "Michael" result in different token IDs.
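This behavior comes from pre-tokenization: before any BPE merging, the text is split into chunks that keep their leading space, so the common case of "space + word" can become a single token. A heavily simplified sketch of that splitting (illustrative only; the real `cl100k_base` pattern also handles contractions, digits, and punctuation):

```python
import re

# simplified pre-tokenization: a chunk is an optional leading space plus a
# word (cl100k_base's actual regex is far more elaborate)
chunks = re.findall(r" ?\w+|\S", "My name is Mike")
print(chunks)   # -> ['My', ' name', ' is', ' Mike']
```

Each chunk is then BPE-encoded on its own, so the space-prefixed and bare forms of a word are simply different strings to the merger, and end up with different token IDs.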
```python
# single-token words with vs. without spaces
print( tokenizer.encode(' Michael') )
print( tokenizer.encode('Michael') )
```

```
[8096]
[26597]
```

```python
# multi-token words without a space
print( tokenizer.encode(' Peach') )
print( tokenizer.encode('Peach') )
```

```
[64695]
[47, 9739]
```

```python
peach = tokenizer.encode('Peach')
[tokenizer.decode([p]) for p in peach]
```

```
['P', 'each']
```

9. Scaling to a Full Book
Finally, let's see how the tokenizer performs on a large corpus. We'll download "The Time Machine" from Project Gutenberg and encode the entire text.
```python
import requests
import re

text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text

# split by punctuation
words = re.split(r'([,.:;—?_!"“()\']|--|\s)', text)
words = [item.strip() for item in words if item.strip()]

print(f'There are {len(words)} words.')
words[10000:10050]
```

```
There are 37786 words.
['I',
 'was',
 'not',
 'loath',
 'to',
 ...]
```

```python
# tokens of a random word in the text
someRandomWord = np.random.choice(words)
print(f'"{someRandomWord}" has token {tokenizer.encode(someRandomWord)}')
```

```
"has" has token [4752]
```

```python
for t in words[:20]:
    print(f'"{t}" has {len(tokenizer.encode(t))} tokens')
```

```
"***" has 1 tokens
"START" has 1 tokens
"OF" has 1 tokens
"THE" has 1 tokens
"PROJECT" has 1 tokens
...
```

```python
for spelling in ['book','Book','bOok']:
    print(f'"{spelling}" has tokens {tokenizer.encode(spelling)}')
```

```
"book" has tokens [2239]
"Book" has tokens [7280]
"bOok" has tokens [65, 46, 564]
```

But do we need to separate the text into words?

```python
# what happens if we just tokenize the raw (unprocessed) text?
tmTokens = tokenizer.encode(text)
print(f'The text has {len(tmTokens):,} tokens and {len(words):,} words.')
```

```
The text has 43,053 tokens and 37,786 words.
```

```python
# check out some tokens
for t in tmTokens[9990:10020]:
    print(f'Token {t:>6}: "{tokenizer.decode([t])}"')
```

```
Token    264: " a"
Token   3094: " step"
Token   4741: " forward"
Token     11: ","
Token  20365: " hes"
...
```

```python
print(tokenizer.decode(tmTokens[9990:10020]))
```

```
 a step forward, hesitated, and then touched my
hand. Then I felt other soft little tentacles upon my back and
shoulders.
```