
Translating Between Tokenizers
Tokenization is not a universal language. In this lesson, we explore why you can't simply pass tokens from one model to another and how to correctly bridge these distinct symbolic spaces.
Different models live in different "symbolic universes." GPT-4 and BERT use entirely different vocabularies and tokenization algorithms. If you try to pass GPT-4 tokens directly to a BERT model, the result will be nonsensical.
This content is adapted from "A deep understanding of AI language model mechanisms." It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
To correctly move data between two different models, you must understand that:
- Incompatibility: Token IDs are model-specific. ID 9906 in GPT-4 means something completely different in BERT.
- The Raw Text Bridge: The only safe way to "translate" is to decode the source tokens back into raw text and then re-tokenize that text using the target model's tokenizer.
- Compression Variance: Different tokenization strategies (BPE vs. WordPiece) produce different token counts for the exact same input string.
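As a toy illustration of the incompatibility point: the vocabularies below are made up, not the real GPT-4 or BERT mappings, but they show how one and the same ID list decodes to sensible text under one vocabulary and gibberish under another.

```python
# Hypothetical miniature vocabularies -- NOT the real GPT-4/BERT mappings.
gpt_vocab = {0: 'Hello', 1: ',', 2: ' world'}
bert_vocab = {0: '[CLS]', 1: 'hello', 2: '##ing'}

ids = [0, 1, 2]
print(''.join(gpt_vocab[i] for i in ids))    # Hello, world
print(' '.join(bert_vocab[i] for i in ids))  # [CLS] hello ##ing  (nonsense)
```

The IDs themselves carry no meaning; they are only indices into a particular model's lookup table.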
1. Importing the Tokenizers
We'll compare tiktoken (used by GPT-4) and the Hugging Face transformers implementation of the BERT tokenizer.
# GPT4
# !pip install tiktoken
import tiktoken
gpt4Tokenizer = tiktoken.get_encoding('cl100k_base')
# BERT
from transformers import BertTokenizer
bertTokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
2. The Wrong Way: Direct ID Swapping
Let's see what happens if we take tokens generated by GPT-4 and try to decode them using the BERT tokenizer. The result is pure gibberish.
# The two models use different tokenizers, so tokens must be decoded
# back to text and re-tokenized -- IDs cannot be passed across directly.
startingtext = 'Hello, my name is Mike and I like purple.'
# GPT4's tokens:
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# bert's tokens
bertToks = bertTokenizer.encode(startingtext)
print(f'Starting text:\n{startingtext}')
print(f'\n\nGPT4 tokens:\n{gpt4Toks}')
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(gpt4Toks)}")
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(gpt4Toks)}")
print(f'\n\nBERT tokens:\n{bertToks}')
print(f"\nDecoded using BERT:\n{bertTokenizer.decode(bertToks)}")
print(f"\nDecoded using GPT4:\n{gpt4Tokenizer.decode(bertToks)}")
3. The Right Way: Token-to-Text-to-Token
To transition between models, you must use the raw text as a bridge.
# text -> GPT4 tokens -> text -> BERT tokens
# 1) to GPT4 tokens
startingtext = 'Hello, my name is Mike and I like purple.'
gpt4Toks = gpt4Tokenizer.encode(startingtext)
# 2) back to text
gpt4ReconText = gpt4Tokenizer.decode(gpt4Toks)
# 3) then to bert tokens
bertToks = bertTokenizer.encode(gpt4ReconText)
# 4) show the reconstruction
bertTokenizer.decode(bertToks)
'[CLS] hello, my name is mike and i like purple. [SEP]'
4. Comparing Compression Rates
Different tokenizers have different efficiencies. For the same text, BERT often generates more tokens than GPT-4 because its subword pieces (WordPiece) are typically smaller than BPE pieces.
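To see why WordPiece tends to emit more, smaller pieces, here is a minimal sketch of its greedy longest-match-first splitting. The toy vocabulary is made up for illustration; the real BERT tokenizer loads roughly 30,000 pieces.

```python
def wordpiece_tokenize(word, vocab):
    """Greedy longest-match-first subword split, WordPiece-style.
    Pieces after the first carry a '##' continuation prefix."""
    tokens, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            sub = word[start:end]
            if start > 0:
                sub = '##' + sub
            if sub in vocab:
                piece = sub
                break
            end -= 1
        if piece is None:          # no piece matches: the whole word is unknown
            return ['[UNK]']
        tokens.append(piece)
        start = end
    return tokens

vocab = {'un', 'break', '##break', '##able'}
print(wordpiece_tokenize('unbreakable', vocab))  # ['un', '##break', '##able']
```

One word can thus become several tokens, which is why BERT's counts run higher than GPT-4's on the same text.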
# warning about sizes:
txt = 'Lorem ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est laborum.'
print(f'Text contains {len(txt)} characters,')
print(f' {len(gpt4Tokenizer.encode(txt))} GPT4 tokens, and')
print(f' {len(bertTokenizer.encode(txt))} Bert tokens.')
Text contains 445 characters,
 96 GPT4 tokens, and
 160 Bert tokens.
5. Hidden Complexity in Whitespace
Whitespace and special characters are handled very differently across tokenizers. BERT tends to normalize or strip certain sequences, while GPT-4 preserves more of the original layout.
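In rough terms, BERT's basic tokenizer treats any run of whitespace as a single word separator before WordPiece runs. A one-line approximation of that cleanup (a simplified sketch; the real BasicTokenizer also lowercases and strips accents):

```python
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'

# str.split() with no argument splits on ANY run of whitespace,
# so all the \r, \n, and \t characters collapse away between words.
cleaned = ' '.join(txt.split())
print(repr(cleaned))  # 'start end'
```

GPT-4's BPE tokenizer, by contrast, has dedicated tokens for whitespace runs and keeps them, as the example below demonstrates.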
# another source of confusion:
txt = 'start\r\n\r\n\r\n\n\r\n\r\n\t\t\t\n\r\n\rend'
# txt = 'start\t\t\t\t\t\t\tend'
# txt = 'start end'
bertToks = bertTokenizer.encode(txt)
gpt4Toks = gpt4Tokenizer.encode(txt)
print(f'Reconstruction in BERT:\n {bertToks}\n {bertTokenizer.decode(bertToks)}\n')
print(f'Reconstruction in GPT4:\n {gpt4Toks}\n {gpt4Tokenizer.decode(gpt4Toks)}')Reconstruction in BERT:
[101, 2707, 2203, 102]
[CLS] start end [SEP]
Reconstruction in GPT4: