
Coding Challenge: Tokenizing The Time Machine 🕰️

To understand the scale of modern language modeling, we must move beyond short phrases and process a full-length book. In this challenge, we'll tokenize H.G. Wells' The Time Machine.

Mar 2025 · 10 min read

In this challenge, we step up our game by processing a full-length book: H.G. Wells' "The Time Machine". This exercise covers advanced text cleaning, building a larger vocabulary, and a fun experiment: decoding a "random walk" through the book.

🌍 References & Disclaimer

This content is adapted from A deep understanding of AI language model mechanisms. It has been curated and organized for educational purposes on this portfolio. No copyright infringement is intended.


🚀 The Core Concept

Processing a real-world corpus introduces several complexities that short examples hide:

  1. Massive Scale: Managing thousands of tokens and ensuring memory efficiency.
  2. Advanced Cleaning: Handling non-ASCII characters, formatting artifacts, and specialized punctuation.
  3. Lexicon Building: Constructing a robust word2idx mapping for an entire literary work.
  4. Random Decoding: Verifying our token mapping through random sampling (the "Random Walk").

1. Fetching & Cleaning the Text

We'll fetch the raw text directly from Project Gutenberg and apply a series of standard pre-processing steps, including removing non-ASCII characters and normalizing whitespace.

import requests
import re
import string
 
# Fetch raw text
url = 'https://www.gutenberg.org/files/35/35-0.txt'
text = requests.get(url).text
 
# Advanced cleaning: strip formatting artifacts and mis-encoded curly quotes
# ('\xe2\x80\x9c' / '\xe2\x80\x9d' are UTF-8 quote bytes misread as Latin-1)
strings2replace = ['\r\n\r\n', '\r\n', '_', '\xe2\x80\x9c', '\xe2\x80\x9d']
for str2match in strings2replace:
  text = re.sub(str2match, ' ', text)
 
# Strip any remaining non-ASCII characters, then remove digits and lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', text)
text = re.sub(r'\d+', '', text).lower()
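The same pipeline can be sanity-checked without a network call. Here is a minimal sketch on a made-up sample string (the sample text is an assumption; it just mimics the artifacts the book's raw file contains: CRLF line breaks, underscores used for italics, mis-encoded curly quotes, and digits):

```python
import re

# Hypothetical sample mimicking the raw Project Gutenberg text
sample = 'Chapter 1\r\n\r\nThe _Time Traveller_ said \xe2\x80\x9cso\xe2\x80\x9d in 1895.'

for pattern in ['\r\n\r\n', '\r\n', '_', '\xe2\x80\x9c', '\xe2\x80\x9d']:
    sample = re.sub(pattern, ' ', sample)

sample = re.sub(r'[^\x00-\x7F]+', ' ', sample)  # strip leftover non-ASCII
sample = re.sub(r'\d+', '', sample).lower()     # drop digits, lowercase

print(repr(sample))
```

After cleaning, the sample contains only lowercase ASCII words and spaces, which is exactly the invariant the lexicon-building step below relies on.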

2. Building the Lexicon

With our text cleaned, we can now extract all unique "tokens" (words) and create our mapping dictionaries (word2idx and idx2word).

# Split on punctuation and whitespace
# (re.escape keeps regex metacharacters in string.punctuation literal)
words = re.split(fr'[{re.escape(string.punctuation)}\s]+', text)
words = [w.strip() for w in words if len(w.strip()) > 1]
 
# Create the Lexicon
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
 
Execution Output
Total words: 32742
Unique tokens (Lexicon): 6134
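With `word2idx` and `idx2word` in hand, encoding is just a dictionary lookup per word, and decoding is the reverse. A minimal round trip on a toy corpus (the sentence is illustrative, not taken from the book) looks like:

```python
import re
import string

# Toy corpus standing in for the cleaned book text
text = 'the time traveller smiled and the machine flickered'

words = re.split(fr'[{re.escape(string.punctuation)}\s]+', text)
words = [w for w in words if len(w) > 1]

vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

# Encode a sentence to token IDs, then decode it back
ids = [word2idx[w] for w in ['the', 'machine', 'smiled']]
decoded = ' '.join(idx2word[i] for i in ids)
print(ids, '->', decoded)
```

Because `vocab` is sorted, the IDs are stable across runs for the same corpus, which makes results easy to reproduce.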

3. Decoding & Random Walk

Once the text is tokenized, we can manipulate it mathematically. A "random walk" picks random token IDs and decodes them, showing the breadth of the book's vocabulary.

import numpy as np
 
# Pick 10 random token IDs
random_tokens = np.random.randint(0, len(vocab), 10)
 
# Decode them!
decoded_text = ' '.join(idx2word[i] for i in random_tokens)
print('Decoded:', decoded_text)
Execution Output
Decoded: mahogany through after however though for again before above after
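The random walk above changes on every run. If you want an identical walk each time (useful when comparing notes with others), seed NumPy's generator. A sketch on a tiny stand-in vocabulary (the word list is an assumption; any lexicon works the same way):

```python
import numpy as np

# Toy stand-in for the book's lexicon
vocab = ['after', 'again', 'before', 'for', 'however', 'mahogany', 'though', 'through']
idx2word = dict(enumerate(vocab))

rng = np.random.default_rng(42)              # fixed seed -> identical walk every run
random_tokens = rng.integers(0, len(vocab), size=10)
decoded = ' '.join(idx2word[int(i)] for i in random_tokens)
print('Decoded:', decoded)
```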

4. Token Density Visualization

Visualizing the density of tokens throughout the book helps identify repetitive patterns or unique sections. This is a common practice in exploratory data analysis (EDA).

[Figure: token density across The Time Machine]
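The figure itself isn't reproduced here, but the underlying signal is easy to compute: mark each position where a word of interest occurs, then smooth with a sliding window to get a density curve you can plot. A minimal sketch on a toy word stream (the stream, word choices, and window size are illustrative assumptions, not the original analysis):

```python
import numpy as np

# Toy word stream: 'the' appears only in the first half
words = ['the', 'time', 'machine'] * 10 + ['weena', 'morlocks'] * 15

def token_density(words, target, window=10):
    """Fraction of each sliding window occupied by `target`."""
    hits = np.array([w == target for w in words], dtype=float)
    kernel = np.ones(window) / window
    return np.convolve(hits, kernel, mode='same')

d = token_density(words, 'the')
# High density early in the stream, zero at the end where 'the' never occurs
```

Plotting `d` against word position (e.g. with matplotlib) yields a density curve like the one in the figure above.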


💡 Key Takeaway

By scaling up to a full book, we start to see the limitations of simple Word Tokenization:

  1. Vocabulary Size: The lexicon can grow extremely large.
  2. Out-of-Vocabulary (OOV): New text will almost certainly contain words we haven't seen.
  3. Efficiency: Storing every unique word as a distinct ID becomes computationally expensive as the corpus grows.
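The OOV problem in particular is easy to trigger: any word absent from the training vocabulary simply has no ID. A toy illustration (the tiny lexicon and the `-1` sentinel are assumptions for demonstration, not the book's real mapping):

```python
# Tiny stand-in lexicon
word2idx = {'the': 0, 'time': 1, 'machine': 2}

sentence = 'the time dilation machine'.split()
ids = [word2idx.get(w, -1) for w in sentence]  # -1 marks out-of-vocabulary words
print(ids)  # 'dilation' was never seen during vocabulary building
```

Real systems handle this with a dedicated `<unk>` token or, better, with subword units that can compose unseen words.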

In the next module, we will explore Subword Tokenization (BPE) to solve these scaling issues!

ยฉ 2026 Driptanil Datta. All rights reserved.

Software Developer & Engineer

Disclaimer: The content provided on this blog is for educational and informational purposes only. While I strive for accuracy, all information is provided "as is" without any warranties of completeness, reliability, or accuracy. Any action you take upon the information found on this website is strictly at your own risk.

Copyright & IP: Certain technical content, interview questions, and datasets are curated from external educational sources to provide a centralized learning resource. Respect for original authorship is maintained; no copyright infringement is intended. All trademarks, logos, and brand names are the property of their respective owners.

Built with Love ❤️ | Last updated: Mar 16 2026