
Preparing Text for Tokens
In real-world applications, data rarely arrives in a clean, perfectly formatted state. Before we can convert text to numbers, we must perform several pre-processing steps to standardize the input and remove noise.
This content is adapted from "A deep understanding of AI language model mechanisms" and has been curated and organized for educational purposes on this portfolio; no copyright infringement is intended.
The Core Concept
Pre-processing is the critical bridge between raw, messy data and numerical token IDs:
- Data Fetching: Retrieving raw text from digital repositories and handling connection states.
- Cleaning & Regex: Using regular expressions to systematically replace formatting artifacts and special quotes.
- Normalization: Standardizing the text by removing non-essential information (numbers, non-ASCII) and unifying the case.
- Robust Mapping: Wrapping logic into reusable functions to define a clear boundary for the model's input.
1. Fetching Raw Data
Most LLM training pipelines start by web-scraping or fetching text from large repositories like Project Gutenberg. Here, we fetch the complete text of The Time Machine.
import requests

# Fetching 'The Time Machine' by H. G. Wells
response = requests.get('https://www.gutenberg.org/files/35/35-0.txt')
response.raise_for_status()  # Fail early on a bad connection or HTTP error
raw_text = response.text

2. Cleaning Up the Text
Raw text often contains formatting artifacts, special characters, and non-ASCII encoding that can confuse a simple tokenizer. We'll use a series of steps to sanitize our data.
1. Removing Special Characters
We use regular expressions to replace newlines, tabs, and specialized quotes with standard spaces.
import re
# Character strings to replace: Windows line endings, curly quotes, underscores
strings_to_replace = ['\r\n', '“', '_', '”', '’']
for s in strings_to_replace:
    raw_text = re.sub(s, ' ', raw_text)

2. Stripping Non-ASCII & Numbers
To keep our initial vocabulary manageable, we often remove complex characters and numerical digits.
# Remove non-ASCII characters
text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
# Remove numbers
text = re.sub(r'\d+', '', text)

3. Case Normalization
Converting everything to lowercase prevents the model from treating "Machine" and "machine" as two separate concepts.
text = text.lower()

3. Advanced Parsing
While splitting by whitespace is a start, we must also handle punctuation. We don't want "machine." (with a period) to be seen as a different token than "machine" (without).
import re
import string

text = " Hello, world! "

# Build a pattern matching runs of punctuation and whitespace
pattern = fr'[{re.escape(string.punctuation)}\s]+'

# Split on that pattern, keeping only words longer than one character
words = [w.strip() for w in re.split(pattern, text.lower()) if len(w.strip()) > 1]

4. Robust Encoding Functions
As our workflow becomes more complex, it's helpful to wrap our encoding and decoding logic into dedicated functions for reusability and clarity.
import numpy as np

def encoder(word_list, word2idx):
    # Initialize a numerical vector and fill it with each word's index
    indices = np.zeros(len(word_list), dtype=int)
    for i, word in enumerate(word_list):
        indices[i] = word2idx[word]
    return indices

def decoder(indices, idx2word):
    # Reconstruct the string, skipping any unknown indices
    return ' '.join([idx2word[i] for i in indices if i in idx2word])

Testing the Pipeline
# Create maps
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}
# Encode-then-Decode check
original_phrase = ['the', 'time', 'machine']
encoded = encoder(original_phrase, word2idx)
decoded = decoder(encoded, idx2word)
# encoded -> [4042 4109 2416]
# decoded -> 'the time machine'
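Putting all of the steps together, the pipeline can be sketched end to end. This is a minimal, self-contained sketch that uses a short inline sample in place of the full Gutenberg download, so the token IDs will differ from those shown above:

```python
import re
import string

import numpy as np

# A short inline sample standing in for the downloaded book text
raw_text = "The Time Machine, by H. G. Wells. Chapter 1: the time machine."

# Clean: strip non-ASCII, drop digits, lowercase
text = re.sub(r'[^\x00-\x7F]+', ' ', raw_text)
text = re.sub(r'\d+', '', text)
text = text.lower()

# Split on runs of punctuation and whitespace, keeping words longer than one character
pattern = fr'[{re.escape(string.punctuation)}\s]+'
words = [w for w in re.split(pattern, text) if len(w) > 1]

# Build the vocabulary maps
vocab = sorted(set(words))
word2idx = {w: i for i, w in enumerate(vocab)}
idx2word = {i: w for i, w in enumerate(vocab)}

def encoder(word_list, word2idx):
    # Map each word to its integer index
    return np.array([word2idx[w] for w in word_list], dtype=int)

def decoder(indices, idx2word):
    # Reconstruct the string, skipping any unknown indices
    return ' '.join(idx2word[i] for i in indices if i in idx2word)

# Round-trip check: encoding then decoding should recover the phrase
phrase = ['the', 'time', 'machine']
assert decoder(encoder(phrase, word2idx), idx2word) == 'the time machine'
```

Because the sample text is tiny, the vocabulary here has only a handful of entries; the same code scales directly to the full book text fetched earlier.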
<Callout type="tip">
By cleaning the text first, we've reduced the complexity the model has to
learn, allowing it to focus on the semantic relationships between words.
</Callout>