
BERT Character Counts 🧩
In this challenge, we'll dive deep into the BERT vocabulary to count the occurrence of every character. This analysis reveals the underlying patterns in how BERT represents language.
Now that we understand how the BERT tokenizer functions, let's analyze its vocabulary. A great way to understand a tokenizer's "bias" or "preference" is to look at the frequency of individual characters across its entire lexicon.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
In this challenge, we'll iterate through all 30,522 tokens in the BERT vocabulary to see which characters are most prevalent. We'll:
- Lexicon Iteration: Scan every token in the BERT bert-base-uncased vocabulary.
- Character Filtering: Count only lowercase letters and digits, while explicitly ignoring "unused" placeholder tokens.
- Visualization: Use Matplotlib to create a distribution plot that helps us visually identify the "alphabet" of BERT's world.
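The counting logic above can be sketched on a toy vocabulary before downloading the real tokenizer. The token strings below are made up for illustration; the real vocabulary comes from tokenizer.vocab.keys().

```python
import string
import numpy as np

# hypothetical stand-in for tokenizer.vocab.keys()
toy_vocab = ["the", "##ing", "run", "[unused0]", "42", "##er"]

digits_letters = string.digits + string.ascii_lowercase
char_count = np.zeros(len(digits_letters), dtype=int)

# for each character, count the non-"unused" tokens that contain it
for i, c in enumerate(digits_letters):
    char_count[i] = sum(c in tok for tok in toy_vocab if "unused" not in tok)

print(char_count[digits_letters.index("r")])  # 'r' appears in "run" and "##er" -> 2
```

Note that each character is counted at most once per token (membership, not raw occurrences), which is exactly what the full analysis below measures.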
1. Environment Setup
We'll use numpy for counting, string for easy access to character sets, and matplotlib for visualization.
import numpy as np
import string
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')

2. Loading the Tokenizer
We'll start by loading the standard BERT vocabulary.
# load BERT tokenizer
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

3. The Counting Logic
We define our search space (digits + lowercase letters) and then iterate through each character, summing its occurrences across all non-unused tokens.
# set of digits and letters
digitsLetters = string.digits + string.ascii_lowercase
# initialize results vector
charCount = np.zeros(len(digitsLetters),dtype=int)
# count the appearances (excluding "unused")
for i,c in enumerate(digitsLetters):
    charCount[i] = np.sum([ c in tok for tok in tokenizer.vocab.keys() if 'unused' not in tok ])

4. Plotting the Distribution
Visualizing the counts helps us immediately see the most common characters.
# and plot
plt.figure(figsize=(12,3))
plt.bar(range(len(charCount)),charCount,color=[.7,.7,.7],edgecolor='k')
plt.gca().set(xticks=range(len(charCount)),xticklabels=list(digitsLetters),
xlim=[-.6,len(charCount)-.4],xlabel='Character',ylabel='Count',
title='Frequency of characters in BERT tokens')
plt.show()

5. Final Report
Let's sort our findings and report the top characters. Notice how 'e', 'a', and 'i' dominate, mirroring their frequency in natural English text.
charOrder = np.argsort(charCount)[::-1]
for i in charOrder:
    print(f'"{digitsLetters[i]}" appears in {charCount[i]:6,} tokens.')

"e" appears in 14,633 tokens.
"a" appears in 12,381 tokens.
"i" appears in 11,614 tokens.
"r" appears in 10,991 tokens.
"n" appears in 10,735 tokens.