
GPT-4 Tokenizer
To scale beyond simple word-splitting, modern LLMs like GPT-4 use sophisticated subword tokenization. We will use OpenAI's `tiktoken` library to dissect how these models perceive and process language.
This content is adapted from "A deep understanding of AI language model mechanisms" and has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
While we previously built a simple word-level tokenizer, tiktoken (used by GPT-4) uses Byte-Pair Encoding (BPE) to handle the complexities of real-world text. Our exploration covers:
- Tiktoken Setup: Initializing the `cl100k_base` encoding.
- Vocabulary Depth: Exploring a 100,277-token universe.
- Tokenization Logic: Seeing how punctuation and subwords are handled.
- Statistical Insights: Visualizing how token lengths vary across the entire vocabulary.
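Before diving into `tiktoken` itself, the core BPE idea is worth seeing in miniature. The sketch below is a toy illustration (not tiktoken's actual merge table or code): starting from single characters, it repeatedly fuses the most frequent adjacent pair of symbols into a new, longer symbol.

```python
from collections import Counter

def bpe_merge_step(symbols):
    """One toy BPE step: merge the most frequent adjacent pair of symbols."""
    pairs = Counter(zip(symbols, symbols[1:]))
    if not pairs:
        return symbols
    (a, b), _ = pairs.most_common(1)[0]   # most frequent adjacent pair
    merged, i = [], 0
    while i < len(symbols):
        if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
            merged.append(a + b)          # fuse the pair into one symbol
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

# start from individual characters and apply a few merges
symbols = list('low lower lowest')
for _ in range(4):
    symbols = bpe_merge_step(symbols)
print(symbols)   # -> ['low', ' lowe', 'r', ' lowe', 's', 't']
```

After only four merges the shared stem "low" has fused into multi-character symbols, and notice that one of them (" lowe") absorbed its leading space, a pattern we will meet again in GPT-4's real vocabulary.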
1. Setup & Imports
To use GPT-4's tokenizer, we need the tiktoken library. We'll also import numpy and matplotlib for analysis and visualization.
```python
import numpy as np
import matplotlib.pyplot as plt

# matplotlib defaults
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
```

```python
# need to install the tiktoken library to get OpenAI's tokenizer
# note: it's tik-token, not tiktok-en :P
!pip install tiktoken
import tiktoken
```

2. Loading the GPT-4 Encoding
OpenAI provides several encodings. For GPT-4 (and GPT-3.5), the standard is `cl100k_base`. Let's initialize it and see what's inside.
```python
# GPT-4's tokenizer
tokenizer = tiktoken.get_encoding('cl100k_base')
dir(tokenizer)
```

```
['__class__',
 '__delattr__',
 '__dict__',
 '__dir__',
 '__doc__',
 ...]
```

```python
# get help
tokenizer??
```

```
Type:        Encoding
String form: <Encoding 'cl100k_base'>
File:        ~/.pyenv/versions/3.12.6/lib/python3.12/site-packages/tiktoken/core.py
Source:
class Encoding:
    ...
```

3. Exploring Vocabulary Size
One of the reasons GPT-4 is so capable is its massive vocabulary. Unlike our simple word-level tokenizer, tiktoken manages over 100,000 unique tokens.
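One reason the vocabulary can stay this compact while still covering any input: `cl100k_base` is a byte-level BPE, so (as I understand it) its base alphabet is the 256 possible byte values, and everything above that is a learned merge. A quick stdlib-only check of how characters decompose into UTF-8 bytes:

```python
# byte-level BPE operates on UTF-8 bytes, not characters: every string
# decomposes into values from a 256-symbol base alphabet
for ch in ['a', 'â', '€']:
    b = ch.encode('utf-8')
    print(f'{ch!r} -> {len(b)} byte(s): {list(b)}')
# 'a' -> 1 byte(s): [97]
# 'â' -> 2 byte(s): [195, 162]
# '€' -> 3 byte(s): [226, 130, 172]
```

Because every byte value is in the base alphabet, no string is ever "out of vocabulary"; rare characters just cost more tokens.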
```python
# vocab size
tokenizer.n_vocab
```

```
100277
```

4. Special Tokens
BPE tokenizers use "special" tokens for specific purposes, like marking the end of a text string (`<|endoftext|>`).
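How do special tokens coexist with the regular vocabulary? Here is a toy sketch of the ID layout (the `<|endoftext|>` ID matches `cl100k_base`, but the lookup logic is illustrative only, not tiktoken's implementation): regular BPE tokens occupy the low IDs, special tokens are pinned to reserved IDs above them, and IDs in neither set are simply invalid.

```python
# toy sketch of cl100k_base's ID layout (illustrative, not tiktoken's code)
REGULAR_TOKEN_IDS = range(0, 100256)        # ordinary BPE merge tokens
SPECIAL_TOKENS = {100257: '<|endoftext|>'}  # reserved special-token IDs

def decode_special(token_id):
    """Decode a special token ID, failing like tiktoken does for invalid IDs."""
    if token_id in SPECIAL_TOKENS:
        return SPECIAL_TOKENS[token_id]
    if token_id in REGULAR_TOKEN_IDS:
        raise ValueError('regular token: look it up in the BPE table instead')
    raise KeyError(f'Invalid token for decoding: {token_id}')

print(decode_special(100257))   # -> <|endoftext|>
```

An ID such as 100261 falls into neither range, which is why decoding it fails below even though `n_vocab` is 100277.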
```python
tokenizer.decode([tokenizer.eot_token])
```

```
'<|endoftext|>'
```

```python
# but not all token IDs are valid, e.g.,
print(tokenizer.n_vocab)
tokenizer.decode([100261])
```

```
100277
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[9], line 3
      1 # but not all token IDs are valid, e.g.,
KeyError: 'Invalid token for decoding: 100261'
```

```python
# list of all tokens:
# https://github.com/vnglst/gpt4-tokens/blob/main/decode-tokens.ipynb
```

Explore some tokens
5. Exploring Individual Tokens
Let's look at what 50 tokens (starting at index 1000) actually represent. Notice how many tokens are pieces of words, like "indow" (from "window") or "lement" (from "element").
```python
for i in range(1000, 1050):
    print(f'{i} = {tokenizer.decode([i])}')
```

```
1000 = indow
1001 = lement
1002 = pect
1003 = ash
1004 = [i
...
```

Tokenization!
6. Tokenization in Practice
Now, let's see how a full sentence is broken down. We'll encode a string and then inspect how each "word" is actually composed of one or more tokens.
```python
text = "My name is Mike and I like toothpaste-flavored chocolate."
tokens = tokenizer.encode(text)
print(tokens)
```

```
[5159, 836, 374, 11519, 323, 358, 1093, 26588, 57968, 12556, 76486, 18414, 13]
```

```python
text.split()
```

```
['My',
 'name',
 'is',
 'Mike',
 'and',
 ...]
```

```python
for word in text.split():
    print(f'"{word}" comprises token(s) {tokenizer.encode(word)}')
```

```
"My" comprises token(s) [5159]
"name" comprises token(s) [609]
"is" comprises token(s) [285]
"Mike" comprises token(s) [35541]
"and" comprises token(s) [438]
...
```

```python
for t in tokens:
    print(f'Token {t:>6} is "{tokenizer.decode([t])}"')
```

```
Token   5159 is "My"
Token    836 is " name"
Token    374 is " is"
Token  11519 is " Mike"
Token    323 is " and"
...
```

```python
# with special (non-ASCII) characters
tokenizer.encode('â')
```

```
[9011]
```

7. Token Length Distribution
To understand the tokenizer's complexity, we can visualize the distribution of token lengths. Most tokens are between 3 and 6 characters long, which is the "sweet spot" for common subword units.
```python
# initialize lengths vector
token_lengths = np.zeros(tokenizer.n_vocab)

# get the number of characters in each token
for idx in range(tokenizer.n_vocab):
    try:
        token_lengths[idx] = len(tokenizer.decode([idx]))
    except KeyError:  # some IDs in the vocab range are invalid
        token_lengths[idx] = np.nan

# count unique lengths
uniqueLengths, tokenCount = np.unique(token_lengths, return_counts=True)

# visualize
_, axs = plt.subplots(1, 2, figsize=(12, 4))

axs[0].plot(token_lengths, 'k.', markersize=3, alpha=.4)
axs[0].set(xlim=[0, tokenizer.n_vocab], xlabel='Token index', ylabel='Token length (characters)',
           title='GPT-4 token lengths')

axs[1].bar(uniqueLengths, tokenCount, color='k', edgecolor='gray')
axs[1].set(xlim=[0, max(uniqueLengths)], yscale='log', xlabel='Token length (chars)',
           ylabel='Token count (log scale)', title='Distribution of token lengths')

plt.tight_layout()
plt.show()
```

Many word-tokens start with spaces
8. The Power of Leading Spaces
In BPE, a space prefix is often treated as part of the token itself. This is why " Michael" and "Michael" result in different token IDs.
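This behavior comes from pre-tokenization: before any BPE merging, the text is split into chunks that keep their leading space, so the common case of "space + word" can become a single token. A heavily simplified sketch of that splitting (illustrative only; the real `cl100k_base` pattern also handles contractions, digits, and punctuation):

```python
import re

# simplified pre-tokenization: a chunk is an optional leading space plus a
# word (cl100k_base's actual regex is far more elaborate)
chunks = re.findall(r" ?\w+|\S", "My name is Mike")
print(chunks)   # -> ['My', ' name', ' is', ' Mike']
```

Each chunk is then BPE-encoded on its own, so the space-prefixed and bare forms of a word are simply different strings to the merger, and end up with different token IDs.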
```python
# single-token words with vs. without spaces
print( tokenizer.encode(' Michael') )
print( tokenizer.encode('Michael') )
```

```
[8096]
[26597]
```

```python
# multi-token words without a space
print( tokenizer.encode(' Peach') )
print( tokenizer.encode('Peach') )
```

```
[64695]
[47, 9739]
```

```python
peach = tokenizer.encode('Peach')
[tokenizer.decode([p]) for p in peach]
```

```
['P', 'each']
```

9. Scaling to a Full Book
Finally, let's see how the tokenizer performs on a large corpus. We'll download "The Time Machine" from Project Gutenberg and encode the entire text.
```python
import requests
import re

text = requests.get('https://www.gutenberg.org/files/35/35-0.txt').text

# split by punctuation
words = re.split(r'([,.:;—?_!"“()\']|--|\s)', text)
words = [item.strip() for item in words if item.strip()]

print(f'There are {len(words)} words.')
words[10000:10050]
```

```
There are 37786 words.
['I',
 'was',
 'not',
 'loath',
 'to',
 ...]
```

```python
# tokens of a random word in the text
someRandomWord = np.random.choice(words)
print(f'"{someRandomWord}" has token {tokenizer.encode(someRandomWord)}')
```

```
"has" has token [4752]
```

```python
for t in words[:20]:
    print(f'"{t}" has {len(tokenizer.encode(t))} tokens')
```

```
"***" has 1 tokens
"START" has 1 tokens
"OF" has 1 tokens
"THE" has 1 tokens
"PROJECT" has 1 tokens
...
```

```python
for spelling in ['book','Book','bOok']:
    print(f'"{spelling}" has tokens {tokenizer.encode(spelling)}')
```

```
"book" has tokens [2239]
"Book" has tokens [7280]
"bOok" has tokens [65, 46, 564]
```

But do we need to separate the text into words?

```python
# what happens if we just tokenize the raw (unprocessed) text?
tmTokens = tokenizer.encode(text)
print(f'The text has {len(tmTokens):,} tokens and {len(words):,} words.')
```

```
The text has 43,053 tokens and 37,786 words.
```

```python
# check out some tokens
for t in tmTokens[9990:10020]:
    print(f'Token {t:>6}: "{tokenizer.decode([t])}"')
```

```
Token    264: " a"
Token   3094: " step"
Token   4741: " forward"
Token     11: ","
Token  20365: " hes"
...
```

```python
print(tokenizer.decode(tmTokens[9990:10020]))
```

```
 a step forward, hesitated, and then touched my
hand. Then I felt other soft little tentacles upon my back and
shoulders.
```