
Translating Between Tokenizers
Tokenization is not a universal language. In this lesson, we explore why you can't simply pass tokens from one model to another and how to correctly bridge these distinct symbolic spaces.
Different models live in different "symbolic universes." GPT-4 and BERT use entirely different vocabularies and tokenization algorithms. If you try to pass GPT-4 tokens directly to a BERT model, the result will be nonsensical.
This content is adapted from "A deep understanding of AI language model mechanisms." It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
To correctly move data between two different models, you must understand that:
- Incompatibility: Token IDs are model-specific. ID 9906 in GPT-4 means something completely different in BERT.
- The Raw Text Bridge: The only safe way to "translate" is to decode the source tokens back into raw text and then re-tokenize that text using the target model's tokenizer.
- Compression Variance: Different tokenization strategies (BPE vs. WordPiece) produce different token counts for the exact same input string.
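As a toy illustration of the incompatibility point: the vocabularies below are made up, not the real GPT-4 or BERT mappings, but they show how one and the same ID list decodes to sensible text under one vocabulary and gibberish under another.

```python
# Hypothetical miniature vocabularies -- NOT the real GPT-4/BERT mappings.
gpt_vocab = {0: 'Hello', 1: ',', 2: ' world'}
bert_vocab = {0: '[CLS]', 1: 'hello', 2: '##ing'}

ids = [0, 1, 2]
print(''.join(gpt_vocab[i] for i in ids))    # Hello, world
print(' '.join(bert_vocab[i] for i in ids))  # [CLS] hello ##ing  (nonsense)
```

The IDs themselves carry no meaning; they are only indices into a particular model's lookup table.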
1. Importing the Tokenizers
We'll compare tiktoken (used by GPT-4) and the Hugging Face transformers implementation of the BERT tokenizer.
# GPT4
# !pip install tiktoken
import tiktoken
gpt4Tokenizer = tiktoken.get_encoding('cl100k_base')
# BERT
from transformers import BertTokenizer
bertTokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
2. The Wrong Way: Direct ID Swapping
Let's see what happens if we take tokens generated by GPT-4 and try to decode them using the BERT tokenizer. The result is pure gibberish.
# The two models use different tokenizers, so tokens must be decoded
# back to text and re-tokenized -- IDs cannot be passed across directly.
startingtext = 'Hello, my name is Mike and I like purple.'
# GPT4's tokens:
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# bert's tokens
bertToks = bertTokenizer.encode(startingtext)
print(f'Starting text:\n{startingtext}')
print(f'\n\nGPT4 tokens:\n{gpt4Toks}')
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(gpt4Toks)}")
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(gpt4Toks)}")
print(f'\n\nBERT tokens:\n{bertToks}')
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(bertToks)}")
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(bertToks)}")
3. The Right Way: Token-to-Text-to-Token
To transition between models, you must use the raw text as a bridge.
# text -> GPT4 tokens -> text -> BERT tokens
# 1) to GPT4 tokens
startingtext = 'Hello, my name is Mike and I like purple.'
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# 2) back to text
gpt4ReconText = gpt4Tokenizer.decode(gpt4Toks)
# 3) then to bert tokens
bertToks = bertTokenizer.encode(gpt4ReconText)
# 4) show the reconstruction
bertTokenizer.decode(bertToks)
'[CLS] hello, my name is mike and i like purple. [SEP]'
4. Comparing Compression Rates
Different tokenizers have different efficiencies. For the same text, BERT often generates more tokens than GPT-4 because its subword pieces (WordPiece) are typically smaller than BPE pieces.
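To see why WordPiece tends to emit more, smaller pieces, here is a minimal sketch of its greedy longest-match-first splitting. The toy vocabulary is made up for illustration; the real BERT tokenizer loads roughly 30,000 pieces.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.
    Pieces after the first carry a '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word is unknown
            return ['[UNK]']
        tokens.append(piece)
        start = end
    return tokens

vocab = {'un', 'break', '##break', '##able'}
print(wordpiece_tokenize('unbreakable', vocab))  # ['un', '##break', '##able']
```

One word can thus become several tokens, which is why BERT's counts run higher than GPT-4's on the same text.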
# warning about sizes:
txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
print(f'Text contains {len(txt)} characters,')
print(f' {len(gpt4Tokenizer.encode(txt))} GPT4 tokens, and')
print(f' {len(bertTokenizer.encode(txt))} Bert tokens.')
Text contains 445 characters,
 96 GPT4 tokens, and
 160 Bert tokens.
5. Hidden Complexity in Whitespace
Whitespace and special characters are handled very differently across tokenizers. BERT tends to normalize or strip certain sequences, while GPT-4 preserves more of the original layout.
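In rough terms, BERT's basic tokenizer treats any run of whitespace as a single word separator before WordPiece runs. A one-line approximation of that cleanup (a simplified sketch; the real BasicTokenizer also lowercases and strips accents):

```python
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'

# str.split() with no argument splits on ANY run of whitespace,
# so all the \r, \n, and \t characters collapse away between words.
cleaned = ' '.join(txt.split())
print(repr(cleaned))  # 'start end'
```

GPT-4's BPE tokenizer, by contrast, has dedicated tokens for whitespace runs and keeps them, as the example below demonstrates.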
# another source of confusion:
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'
# txt = 'start\t\t\t\t\t\t\tend'
# txt = 'start end'
bertToks = bertTokenizer.encode(txt)
gpt4Toks = gpt4Tokenizer.encode(txt)
print(f'Reconstruction in BERT:\n {bertToks}\n {bertTokenizer.decode(bertToks)}\n')
print(f'Reconstruction in GPT4:\n {gpt4Toks}\n {gpt4Tokenizer.decode(gpt4Toks)}')Reconstruction in BERT:
[101, 2707, 2203, 102]
[CLS] start end [SEP]
Reconstruction in GPT4: