
BERT Tokenizer 🤖
BERT introduced the WordPiece tokenization strategy, which revolutionized how models handle rare words by breaking them into meaningful sub-units.
BERT (Bidirectional Encoder Representations from Transformers) uses a subword tokenization strategy called WordPiece. Unlike the BPE used by GPT models, WordPiece is designed to optimize for the likelihood of the training data, making it highly effective at handling out-of-vocabulary words.
This content is adapted from A deep understanding of AI language model mechanisms and has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
🚀 The Core Concept
WordPiece balances between character-level and word-level representations. It works by:
- Likelihood Optimization: Iteratively adding symbols to the vocabulary that maximize the likelihood of the training data.
- Subword Splitting: Rare words are broken into smaller pieces. Pieces that are not the start of a word are prefixed with ##.
- Special Tokens: BERT relies on specific markers like [CLS] (start of a sequence) and [SEP] (separator between sentences) for its bidirectional processing.
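The vocabulary itself is learned by likelihood maximization during training, but at inference time the splitting step is typically a greedy longest-match-first scan. A minimal sketch of that scan, using a tiny hand-made vocabulary (all names here are hypothetical, not BERT's real vocabulary):

```python
def wordpiece_split(word, vocab, unk='[UNK]'):
    """Greedy longest-match-first splitting, as done at inference time.
    Non-initial pieces carry the '##' continuation prefix."""
    pieces, start = [], 0
    while start < len(word):
        end, piece = len(word), None
        while end > start:
            cand = word[start:end]
            if start > 0:
                cand = '##' + cand  # continuation pieces are stored with '##'
            if cand in vocab:
                piece = cand
                break
            end -= 1  # shrink the candidate from the right
        if piece is None:
            return [unk]  # no piece matches: the whole word is unknown
        pieces.append(piece)
        start = end
    return pieces

# toy vocabulary (hypothetical)
vocab = {'token', '##ization', 'play', '##ing'}
print(wordpiece_split('tokenization', vocab))  # ['token', '##ization']
print(wordpiece_split('playing', vocab))       # ['play', '##ing']
print(wordpiece_split('xyz', vocab))           # ['[UNK]']
```

The greedy scan is why a rare word like "tokenization" degrades gracefully into known pieces instead of becoming a single unknown token.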
1. Environment Setup
Before we start, ensure you have the transformers library installed, as it provides the standard implementation of the BERT tokenizer.
# !pip install transformers
2. Loading the Tokenizer
We'll use the bert-base-uncased model, which is one of the most common versions of BERT. "Uncased" means it treats all text as lowercase.
from transformers import BertTokenizer
# load BERT tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
3. Inspecting the Tokenizer
The tokenizer object has many methods and properties. We can use dir() to see what's available under the hood.
# inspect the tokenizer info
dir(tokenizer)
['SPECIAL_TOKENS_ATTRIBUTES',
 '__annotations__',
 '__call__',
 '__class__',
 '__delattr__',
 ...]
4. Exploring the Vocabulary
Let's take a peek at some tokens in the middle of BERT's vocabulary. You'll notice many tokens starting with ##, which indicates they are subword pieces.
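Because continuation pieces always carry the ## prefix, they are easy to pull out of any token list with a simple filter. A sketch with a hand-made list (with the real tokenizer you would filter `list(tokenizer.get_vocab().keys())` instead):

```python
# hand-made token list for illustration (hypothetical entries)
toy_tokens = ['chunk', '##ing', 'rigorous', '##ly', 'peabody', '198']

# continuation pieces are exactly the entries with the '##' prefix
subword_pieces = [t for t in toy_tokens if t.startswith('##')]
print(subword_pieces)  # ['##ing', '##ly']
```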
all_tokens = list(tokenizer.get_vocab().keys())
all_tokens[20000:20100]['chunk',
'rigorous',
'blaine',
'198',
'peabody',5. Checking Specific Tokens
We can verify the exact ID of a word like "science" and see the total vocabulary size.
print(tokenizer.vocab_size)
tokenizer.get_vocab()['science']
30522
2671
6. Tokenizing vs. Encoding
There's a subtle difference between getting the ID of a word and tokenizing it. If a word is in the vocabulary, its ID should match.
word = 'science'
res1 = tokenizer.convert_tokens_to_ids(word)
res2 = tokenizer.get_vocab()[word]
print(res1)
print(res2)
2671
2671
7. The Limits of Direct Access
Trying to get the ID for a full sentence through the vocabulary dictionary will fail because the vocabulary only contains individual tokens.
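A plain dictionary lookup raises KeyError on anything outside the vocabulary. A defensive pattern is `dict.get` with the unknown-token ID as the fallback; sketched here with a toy vocabulary (the IDs are hand-built for illustration, though they match the real BERT IDs used elsewhere on this page):

```python
# toy vocabulary with illustrative ids; a real one comes from tokenizer.get_vocab()
toy_vocab = {'science': 2671, 'is': 2003, 'great': 2307, '[UNK]': 100}
unk_id = toy_vocab['[UNK]']

# dict.get avoids the KeyError and falls back to the unknown token
print(toy_vocab.get('science', unk_id))           # 2671
print(toy_vocab.get('science is great', unk_id))  # 100: a sentence is not a token
```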
text = 'science is great'
res1 = tokenizer.convert_tokens_to_ids(text)
res2 = tokenizer.get_vocab()[text]
print(res1)
print(res2)
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
Cell In[8], line 4
      1 text = 'science is great'
KeyError: 'science is great'
8. Proper Encoding
The encode method handles the full pipeline: tokenizing the text, adding special tokens ([CLS] and [SEP]), and converting them into IDs.
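Conceptually, the pipeline is three steps: tokenize, add special tokens, convert to IDs. A toy sketch of those steps (the IDs match the BERT output shown below, but the vocabulary mapping here is hand-built, and the naive whitespace split stands in for real WordPiece splitting):

```python
# hand-built mapping for illustration
toy_vocab = {'[CLS]': 101, '[SEP]': 102, 'science': 2671, 'is': 2003, 'great': 2307}

def toy_encode(text):
    tokens = text.lower().split()            # 1) tokenize (naive split, not WordPiece)
    tokens = ['[CLS]'] + tokens + ['[SEP]']  # 2) add special tokens
    return [toy_vocab[t] for t in tokens]    # 3) convert tokens to ids

print(toy_encode('science is great'))  # [101, 2671, 2003, 2307, 102]
```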
# better method:
res3 = tokenizer.encode(text)
for i in res3:
    print(f'Token {i} is "{tokenizer.decode(i)}"')
# [CLS] = classification
# [SEP] = sentence separation
print('')
print(tokenizer.decode(res3,skip_special_tokens=True))
print(tokenizer.decode(res3,skip_special_tokens=False))
Token 101 is "[CLS]"
Token 2671 is "science"
Token 2003 is "is"
Token 2307 is "great"
Token 102 is "[SEP]"

science is great
[CLS] science is great [SEP]
9. Special Tokens in Action
Notice how every time you encode text, BERT wraps it in special tokens. If you encode and decode repeatedly, you'll accumulate these markers.
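The accumulation happens because the encode step unconditionally wraps its input, even when the markers are already present. A self-contained toy sketch of that behavior:

```python
def wrap(tokens):
    # encode-like step: always adds the markers, even if they are already there
    return ['[CLS]'] + tokens + ['[SEP]']

once = wrap(['science', 'is', 'great'])
twice = wrap(once)  # a decode/re-encode round trip keeps the old markers
print(once)   # ['[CLS]', 'science', 'is', 'great', '[SEP]']
print(twice)  # ['[CLS]', '[CLS]', 'science', 'is', 'great', '[SEP]', '[SEP]']
```

In practice, decoding with skip_special_tokens=True before re-encoding avoids the pile-up, as the earlier output showed.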
# BERT adds [CLS]...[SEP] with each encode
tokenizer.decode(tokenizer.encode(tokenizer.decode(tokenizer.encode( text ))))
'[CLS] [CLS] science is great [SEP] [SEP]'
10. Direct Call Syntax
You can also call the tokenizer object directly like a function to get a full dictionary of inputs required for the model.
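For a single unpadded sequence, that dictionary can be assembled from the IDs alone, which makes the two extra fields easy to understand. A sketch (the helper name is mine, not part of the library):

```python
def build_inputs(input_ids):
    n = len(input_ids)
    return {
        'input_ids': input_ids,
        'token_type_ids': [0] * n,  # all 0: only one sentence segment
        'attention_mask': [1] * n,  # 1 = attend to this token (no padding here)
    }

print(build_inputs([101, 2671, 2003, 2307, 102]))
```

token_type_ids only become non-zero for sentence-pair inputs, and attention_mask only gains zeros once padding is involved.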
tokenizer(text)
{'input_ids': [101, 2671, 2003, 2307, 102], 'token_type_ids': [0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1]}
11. End-to-End Workflow
Finally, let's look at the full workflow: tokenizing a sentence into its segments, converting those segments to IDs, and comparing that to a direct encoding from the original text.
sentence = 'AI is both exciting and terrifying.'
print('Original sentence:')
print(f' {sentence}\n')
# segment the text into tokens
tokenized = tokenizer.tokenize(sentence)
print('Tokenized (segmented) sentence:')
print(f' {tokenized}')
# encode the tokenized sentence
ids_from_tokens = tokenizer.convert_tokens_to_ids(tokenized)
print(f' {ids_from_tokens}\n')
# and finally, encode from the original sentence
encodedText = tokenizer.encode(sentence)
print('Encoded from the original text:')
print(f' {encodedText}\n\n')
# now for decoding
print('Decoded from token-wise encoding:')
print(f' {tokenizer.decode(ids_from_tokens)}\n')
print('Decoded from text encoding:')
print(f' {tokenizer.decode(encodedText)}')
Original sentence:
AI is both exciting and terrifying.
Tokenized (segmented) sentence:
['ai', 'is', 'both', 'exciting', 'and', 'terrifying', '.']