
BERT Tokenizer 🤖
BERT introduced the WordPiece tokenization strategy, which revolutionized how models handle rare words by breaking them into meaningful sub-units.
BERT (Bidirectional Encoder Representations from Transformers) uses a subword tokenization strategy called WordPiece. Unlike the BPE used by GPT models, WordPiece is designed to optimize for the likelihood of the training data, making it highly effective at handling out-of-vocabulary words.
This content is adapted from A deep understanding of AI language model mechanisms and has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
WordPiece balances between character-level and word-level representations. It works by:
- Likelihood Optimization: Iteratively adding symbols to the vocabulary that maximize the likelihood of the training data.
- Subword Splitting: Rare words are broken into smaller pieces. Pieces that are not the start of a word are prefixed with ##.
- Special Tokens: BERT relies on specific markers like [CLS] (start of a sequence) and [SEP] (separator between sentences) for its bidirectional processing.
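The vocabulary itself is learned by likelihood maximization during training, but at inference time the splitting step is typically a greedy longest-match-first scan. A minimal sketch of that scan, using a tiny hand-made vocabulary (all names here are hypothetical, not BERT's real vocabulary):

```python
def wordpiece_split(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first splitting, as done at inference time.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = '##' + cand  # continuation pieces are stored with '##'
            if cand in vocab:
                piece = cand
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# toy vocabulary (hypothetical)
vocab = {'token', '##ization', 'play', '##ing'}
print(wordpiece_split('tokenization', vocab))  # ['token', '##ization']
print(wordpiece_split('playing', vocab))       # ['play', '##ing']
print(wordpiece_split('xyz', vocab))           # ['[UNK]']
```

The greedy scan is why a rare word like "tokenization" degrades gracefully into known pieces instead of becoming a single unknown token.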
1. Environment Setup
Before we start, ensure you have the transformers library installed, as it provides the standard implementation of the BERT tokenizer.
# !pip install transformers
2. Loading the Tokenizer
We'll use the bert-base-uncased model, which is one of the most common versions of BERT. "Uncased" means it treats all text as lowercase.
from transformers import BertTokenizer
# load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
3. Inspecting the Tokenizer
The tokenizer object has many methods and properties. We can use dir() to see what's available under the hood.
# inspect the tokenizer info
dir(tokenizer)
['SPECIAL_TOKENS_ATTRIBUTES',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 ...]
4. Exploring the Vocabulary
Let's take a peek at some tokens in the middle of BERT's vocabulary. You'll notice many tokens starting with ##, which indicates they are subword pieces.
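Because continuation pieces always carry the ## prefix, they are easy to pull out of any token list with a simple filter. A sketch with a hand-made list (with the real tokenizer you would filter `list(tokenizer.get_vocab().keys())` instead):

```python
# hand-made token list for illustration (hypothetical entries)
toy_tokens = ['chunk', '##ing', 'rigorous', '##ly', 'peabody', '198']

# continuation pieces are exactly the entries with the '##' prefix
subword_pieces = [t for t in toy_tokens if t.startswith('##')]
print(subword_pieces)  # ['##ing', '##ly']
```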
all_tokens = list(tokenizer.get_vocab().keys())
all_tokens[20000:20100]['chunk',
'rigorous',
'blaine',
'198',
'peabody',5. Checking Specific Tokens
We can verify the exact ID of a word like "science" and see the total vocabulary size.
print(tokenizer.vocab_size)
tokenizer.get_vocab()['science']
30522
2671
6. Tokenizing vs. Encoding
There's a subtle difference between getting the ID of a word and tokenizing it. If a word is in the vocabulary, its ID should match.
word = 'science'
res1 = tokenizer.convert_tokens_to_ids(word)
res2 = tokenizer.get_vocab()[word]
print(res1)
print(res2)
2671
2671
7. The Limits of Direct Access
Trying to get the ID for a full sentence through the vocabulary dictionary will fail because the vocabulary only contains individual tokens.
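A plain dictionary lookup raises KeyError on anything outside the vocabulary. A defensive pattern is `dict.get` with the unknown-token ID as the fallback; sketched here with a toy vocabulary (the IDs are hand-built for illustration, though they match the real BERT IDs used elsewhere on this page):

```python
# toy vocabulary with illustrative ids; a real one comes from tokenizer.get_vocab()
toy_vocab = {'science': 2671, 'is': 2003, 'great': 2307, '[UNK]': 100}
unk_id = toy_vocab['[UNK]']

# dict.get avoids the KeyError and falls back to the unknown token
print(toy_vocab.get('science', unk_id))           # 2671
print(toy_vocab.get('science is great', unk_id))  # 100: a sentence is not a token
```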
text = 'science is great'
res1 = tokenizer.convert_tokens_to_ids(text)
res2 = tokenizer.get_vocab()[text]
print(res1)
print(res2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 4
      1 text = 'science is great'
KeyError: 'science is great'
8. Proper Encoding
The encode method handles the full pipeline: tokenizing the text, adding special tokens ([CLS] and [SEP]), and converting them into IDs.
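Conceptually, the pipeline is three steps: tokenize, add special tokens, convert to IDs. A toy sketch of those steps (the IDs match the BERT output shown below, but the vocabulary mapping here is hand-built, and the naive whitespace split stands in for real WordPiece splitting):

```python
# hand-built mapping for illustration
toy_vocab = {'[CLS]': 101, '[SEP]': 102, 'science': 2671, 'is': 2003, 'great': 2307}

def toy_encode(text):
    tokens = text.lower().split()            # 1) tokenize (naive split, not WordPiece)
    tokens = ['[CLS]'] + tokens + ['[SEP]']  # 2) add special tokens
    return [toy_vocab[t] for t in tokens]    # 3) convert tokens to ids

print(toy_encode('science is great'))  # [101, 2671, 2003, 2307, 102]
```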
# better method:
res3 = tokenizer.encode(text)
for i in res3:
    print(f'Token {i} is "{tokenizer.decode(i)}"')
# [CLS] = classification
# [SEP] = sentence separation
print('')
print(tokenizer.decode(res3,skip_special_tokens=True))
print(tokenizer.decode(res3,skip_special_tokens=False))
Token 101 is "[CLS]"
Token 2671 is "science"
Token 2003 is "is"
Token 2307 is "great"
Token 102 is "[SEP]"

science is great
[CLS] science is great [SEP]
9. Special Tokens in Action
Notice how every time you encode text, BERT wraps it in special tokens. If you encode and decode repeatedly, you'll accumulate these markers.
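The accumulation happens because the encode step unconditionally wraps its input, even when the markers are already present. A self-contained toy sketch of that behavior:

```python
def wrap(tokens):
    # encode-like step: always adds the markers, even if they are already there
    return ['[CLS]'] + tokens + ['[SEP]']

once = wrap(['science', 'is', 'great'])
twice = wrap(once)  # a decode/re-encode round trip keeps the old markers
print(once)   # ['[CLS]', 'science', 'is', 'great', '[SEP]']
print(twice)  # ['[CLS]', '[CLS]', 'science', 'is', 'great', '[SEP]', '[SEP]']
```

In practice, decoding with skip_special_tokens=True before re-encoding avoids the pile-up, as the earlier output showed.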
# BERT adds [CLS]...[SEP] with each encode
tokenizer.decode(tokenizer.encode(tokenizer.decode(tokenizer.encode( text ))))
'[CLS] [CLS] science is great [SEP] [SEP]'
10. Direct Call Syntax
You can also call the tokenizer object directly like a function to get a full dictionary of inputs required for the model.
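For a single unpadded sequence, that dictionary can be assembled from the IDs alone, which makes the two extra fields easy to understand. A sketch (the helper name is mine, not part of the library):

```python
def build_inputs(input_ids):
    n = len(input_ids)
    return {
        'input_ids': input_ids,
        'token_type_ids': [0] * n,  # all 0: only one sentence segment
        'attention_mask': [1] * n,  # 1 = attend to this token (no padding here)
    }

print(build_inputs([101, 2671, 2003, 2307, 102]))
```

token_type_ids only become non-zero for sentence-pair inputs, and attention_mask only gains zeros once padding is involved.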
tokenizer(text)
{'input_ids': [101, 2671, 2003, 2307, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
11. End-to-End Workflow
Finally, let's look at the full workflow: tokenizing a sentence into its segments, converting those segments to IDs, and comparing that to a direct encoding from the original text.
sentence = 'AI is both exciting and terrifying.'
print('Original sentence:')
print(f' {sentence}\n')
# segment the text into tokens
tokenized = tokenizer.tokenize(sentence)
print('Tokenized (segmented) sentence:')
print(f' {tokenized}')
# encode the tokenized sentence
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokenized)
print(f' {ids_from_tokens}\n')
# and finally, encode from the original sentence
encodedText = tokenizer.encode(sentence)
print('Encoded from the original text:')
print(f' {encodedText}\n\n')
# now for decoding
print('Decoded from token-wise encoding:')
print(f' {tokenizer.decode(ids_from_tokens)}\n')
print('Decoded from text encoding:')
print(f' {tokenizer.decode(encodedText)}')
Original sentence:
AI is both exciting and terrifying.
Tokenized (segmented) sentence:
['ai', 'is', 'both', 'exciting', 'and', 'terrifying', '.']