
Coding Challenge: Tokenizing "The Time Machine" 🕰️
To understand the scale of modern language modeling, we must move beyond short phrases and process a full-length book. In this challenge, we tokenize H.G. Wells' "The Time Machine". The exercise covers advanced text cleaning, building a larger vocabulary, and a fun experiment: decoding a "random walk" through the book.
This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.
The Core Concept
Processing a real-world corpus introduces several complexities that short examples hide:
- Massive Scale: Managing thousands of tokens and ensuring memory efficiency.
- Advanced Cleaning: Handling non-ASCII characters, formatting artifacts, and specialized punctuation.
- Lexicon Building: Constructing a robust word2idx mapping for an entire literary work.
- Random Decoding: Verifying our token mapping through random sampling (the "Random Walk").
1. Fetching & Cleaning the Text
We'll fetch the raw text directly from Project Gutenberg and apply a series of standard pre-processing steps, including removing non-ASCII characters and normalizing whitespace.
```python
import requests
import re
import string

# Fetch the raw text from Project Gutenberg
url = 'https://www.gutenberg.org/files/35/35-0.txt'
text = requests.get(url).text

# Advanced cleaning: collapse line breaks and underscores, and strip
# mis-decoded curly quotes ('â\x80\x9c', 'â\x80\x9d')
strings2replace = ['\r\n\r\n', '\r\n', '_', 'â\x80\x9c', 'â\x80\x9d']
for str2match in strings2replace:
    text = re.sub(str2match, ' ', text)

# Remove remaining non-ASCII characters, strip digits, and lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
text = re.sub(r'\d+', '', text).lower()
```
2. Building the Lexicon
With our text cleaned, we can now extract all unique "tokens" (words) and create our mapping dictionaries (word2idx and idx2word).
```python
# Split into words on punctuation and whitespace
words = re.split(fr'[{string.punctuation}\s]+', text)
words = [w.strip() for w in words if len(w.strip()) > 1]

# Create the Lexicon
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
```
Total words: 32742
Unique tokens (Lexicon): 6134
3. Decoding & Random Walk
Once the text is tokenized, we can manipulate it mathematically. A "random walk" picks random token IDs and decodes them, showing the breadth of the book's vocabulary.
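Before sampling randomly, it's worth a quick round-trip sanity check: encode a snippet with word2idx, decode it back with idx2word, and confirm we recover the original words. A minimal sketch; the toy vocabulary below stands in for the book's full lexicon (the real one has roughly 6,000 entries), but the same code applies to the mappings built above.

```python
# Toy stand-in for the book's lexicon
vocab = sorted({'the', 'time', 'machine', 'traveller', 'dim'})
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

# Encode a snippet, then decode it back
snippet = ['the', 'time', 'traveller']
ids = [word2idx[w] for w in snippet]
decoded = [idx2word[i] for i in ids]

print(ids)
assert decoded == snippet  # round-trip is lossless for in-vocabulary words
```

Because word2idx and idx2word are exact inverses, this round-trip can only fail for words missing from the vocabulary, which is precisely the OOV problem discussed at the end.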
```python
import numpy as np

# Pick 10 random token IDs
random_tokens = np.random.randint(0, len(vocab), 10)

# Decode them!
decoded_text = ' '.join([idx2word[i] for i in random_tokens])
```
Decoded: mahogany through after however though for again before above after
4. Token Density Visualization
Visualizing the density of tokens throughout the book helps identify repetitive patterns or unique sections. This is a common practice in exploratory data analysis (EDA).
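One simple way to sketch such an EDA pass: split the token stream into equal-sized chunks and measure how often a target word appears in each, giving a "density" profile across the book. The words list below is a toy stand-in; with the real words list from the steps above, the same code applies, and the resulting density list can be plotted (e.g. with matplotlib) to see where a word clusters.

```python
import numpy as np

# Toy token stream standing in for the book's `words` list
words = (['time'] * 3 + ['machine', 'the', 'dim'] * 4) * 25

# Split the stream into equal chunks and count a target word in each
target = 'time'
n_chunks = 10
chunks = np.array_split(np.array(words), n_chunks)
density = [float(np.mean(chunk == target)) for chunk in chunks]

# Each entry is the fraction of tokens in that chunk equal to `target`
print([round(d, 2) for d in density])
```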
💡 Key Takeaway
By scaling up to a full book, we start to see the limitations of simple Word Tokenization:
- Vocabulary Size: The lexicon already holds thousands of unique words for a single book, and it keeps growing with every new text.
- Out-of-Vocabulary (OOV): New text will almost certainly contain words we haven't seen.
- Efficiency: Storing every unique word as a distinct ID becomes computationally expensive as the corpus grows.
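The OOV point is easy to demonstrate: any word absent from the training vocabulary simply has no ID. A toy sketch, using a hypothetical reserved &lt;unk&gt; ID as a fallback (this fallback is an illustration, not part of the code above):

```python
# Toy vocabulary built from a tiny "training" corpus
vocab = sorted({'the', 'time', 'machine'})
word2idx = {w: i for i, w in enumerate(vocab)}

# Reserve one extra ID for unknown words (a common workaround)
UNK_ID = len(word2idx)

# 'morlocks' never appeared in training, so it falls back to <unk>
new_text = ['the', 'time', 'morlocks']
ids = [word2idx.get(w, UNK_ID) for w in new_text]
print(ids)
```

Mapping every unseen word to a single &lt;unk&gt; ID discards information, which is one motivation for the subword methods introduced next.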
In the next module, we will explore Subword Tokenization (BPE) to solve these scaling issues!