Text Preprocessing for NLP
Text preprocessing is the fundamental first step in Natural Language Processing (NLP) that transforms raw text into a structured format suitable for machine learning. This chapter demonstrates the core preprocessing techniques: tokenization, stop words filtering, and stemming.
Theory
The NLP Preprocessing Pipeline
Raw text data is noisy and unstructured. A typical preprocessing pipeline includes:
- Tokenization: Split text into individual units (words, characters)
- Normalization: Convert to lowercase, handle punctuation
- Stop Words Filtering: Remove common words with little semantic value
- Stemming/Lemmatization: Reduce words to their root form
- Vectorization: Convert text to numerical features (TF-IDF, embeddings)
Tokenization
Definition: The process of breaking text into smaller units called tokens.
Tokenization Strategies:
- Whitespace Tokenization: Split on Unicode whitespace (spaces, tabs, newlines)
  "Hello, world!" → ["Hello,", "world!"]
- Word Tokenization: Split on whitespace and separate punctuation
  "Hello, world!" → ["Hello", ",", "world", "!"]
- Character Tokenization: Split into individual characters
  "NLP" → ["N", "L", "P"]
Stop Words Filtering
Stop words are common words (e.g., "the", "is", "at", "on") that:
- Appear frequently in text
- Carry minimal semantic meaning
- Can be removed to reduce noise and computational cost
Example:
Input: "The quick brown fox jumps over the lazy dog"
Output: ["quick", "brown", "fox", "jumps", "lazy", "dog"]
Benefits:
- Often reduces the token count by roughly 30-50%, depending on the corpus
- Improves signal-to-noise ratio
- Speeds up downstream ML algorithms
- Focuses on content words (nouns, verbs, adjectives)
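As a library-independent sketch of the idea, a stop-word filter can be built from a `HashSet` in the standard library. The tiny stop list and the `filter_stop_words` helper are illustrative only; Example 2 below uses the aprender `StopWordsFilter`.

```rust
use std::collections::HashSet;

// Minimal stop-word filter sketch (std only); illustrative, not the aprender API.
fn filter_stop_words(tokens: &[&str], stop_words: &HashSet<&str>) -> Vec<String> {
    tokens
        .iter()
        .filter(|t| !stop_words.contains(t.to_lowercase().as_str())) // case-insensitive match
        .map(|t| t.to_string())
        .collect()
}

fn main() {
    // Deliberately tiny stop list; real English lists contain on the order of 100+ words
    let stop_words: HashSet<&str> = ["the", "is", "at", "on", "over", "in"].into_iter().collect();
    let tokens = ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"];
    let filtered = filter_stop_words(&tokens, &stop_words);
    println!("{:?}", filtered); // ["quick", "brown", "fox", "jumps", "lazy", "dog"]
}
```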
Stemming
Stemming reduces words to their root form by removing suffixes using heuristic rules.
Porter Stemming Algorithm: Applies sequential rules to strip common English suffixes:
- Plural removal: "cats" → "cat"
- Gerund removal: "running" → "run"
- Comparative removal: "happier" → "happi"
- Derivational endings: "happiness" → "happi"
Characteristics:
- Fast and simple (rule-based)
- May produce non-words ("studies" → "studi")
- Good enough for information retrieval and search
- Language-specific rules
vs. Lemmatization: Lemmatization uses dictionaries to return actual words ("running" → "run", "better" → "good"), but stemming is faster and often sufficient for ML tasks.
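To show the flavor of rule-based suffix stripping, here is a deliberately naive sketch in plain Rust. It is not the Porter algorithm (which applies ordered rule steps with conditions on the remaining stem, e.g. turning "running" into "run" rather than "runn"), and not the aprender `PorterStemmer` used in Example 3.

```rust
// Naive suffix stripping, only to illustrate the idea behind rule-based stemming.
fn naive_stem(word: &str) -> String {
    let w = word.to_lowercase();
    // Try a few suffix rules in order and apply the first one that matches.
    for (suffix, replacement) in [("ies", "i"), ("ness", ""), ("ing", ""), ("s", "")] {
        if let Some(stem) = w.strip_suffix(suffix) {
            if stem.len() >= 3 {
                return format!("{stem}{replacement}");
            }
        }
    }
    w
}

fn main() {
    for word in ["cats", "running", "studies", "happiness"] {
        println!("{word} → {}", naive_stem(word));
    }
    // cats → cat, running → runn, studies → studi, happiness → happi
    // A real stemmer also handles doubled consonants ("runn" → "run") and many more rules.
}
```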
Example 1: Tokenization Strategies
Comparing different tokenization approaches for the same text.
use aprender::text::tokenize::{WhitespaceTokenizer, WordTokenizer, CharTokenizer};
use aprender::text::Tokenizer;
fn main() {
    let text = "Hello, world! Natural Language Processing is amazing.";

    // Whitespace tokenization
    let whitespace_tokenizer = WhitespaceTokenizer::new();
    let tokens = whitespace_tokenizer.tokenize(text).unwrap();
    println!("Whitespace: {:?}", tokens);
    // ["Hello,", "world!", "Natural", "Language", "Processing", "is", "amazing."]

    // Word tokenization
    let word_tokenizer = WordTokenizer::new();
    let tokens = word_tokenizer.tokenize(text).unwrap();
    println!("Word: {:?}", tokens);
    // ["Hello", ",", "world", "!", "Natural", "Language", "Processing", "is", "amazing", "."]

    // Character tokenization
    let char_tokenizer = CharTokenizer::new();
    let tokens = char_tokenizer.tokenize("NLP").unwrap();
    println!("Character: {:?}", tokens);
    // ["N", "L", "P"]
}
Output:
Whitespace: ["Hello,", "world!", "Natural", "Language", "Processing", "is", "amazing."]
Word: ["Hello", ",", "world", "!", "Natural", "Language", "Processing", "is", "amazing", "."]
Character: ["N", "L", "P"]
Analysis:
- Whitespace: 7 tokens, preserves punctuation
- Word: 10 tokens, separates punctuation
- Character: 3 tokens, character-level analysis
Example 2: Stop Words Filtering
Removing common words to reduce noise and improve signal.
use aprender::text::stopwords::StopWordsFilter;
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::Tokenizer;
fn main() {
    let text = "The quick brown fox jumps over the lazy dog in the garden";

    // Tokenize
    let tokenizer = WhitespaceTokenizer::new();
    let tokens = tokenizer.tokenize(text).unwrap();
    println!("Original: {:?}", tokens);
    // ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "in", "the", "garden"]

    // Filter English stop words
    let filter = StopWordsFilter::english();
    let filtered = filter.filter(&tokens).unwrap();
    println!("Filtered: {:?}", filtered);
    // ["quick", "brown", "fox", "jumps", "lazy", "dog", "garden"]

    let reduction = 100.0 * (1.0 - filtered.len() as f64 / tokens.len() as f64);
    println!("Reduction: {:.1}%", reduction); // 41.7%

    // Custom stop words
    let custom_filter = StopWordsFilter::new(vec!["fox", "dog", "garden"]);
    let custom_filtered = custom_filter.filter(&filtered).unwrap();
    println!("Custom filtered: {:?}", custom_filtered);
    // ["quick", "brown", "jumps", "lazy"]
}
Output:
Original: ["The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog", "in", "the", "garden"]
Filtered: ["quick", "brown", "fox", "jumps", "lazy", "dog", "garden"]
Reduction: 41.7%
Custom filtered: ["quick", "brown", "jumps", "lazy"]
Analysis:
- Removed 5 stop-word tokens: three occurrences of "the", plus "over" and "in"
- 41.7% reduction in token count
- Custom filtering enables domain-specific preprocessing
Example 3: Stemming (Word Normalization)
Reducing words to their root form using the Porter stemmer.
use aprender::text::stem::{PorterStemmer, Stemmer};
fn main() {
    let stemmer = PorterStemmer::new();

    // Single word stemming
    println!("running → {}", stemmer.stem("running").unwrap());     // "run"
    println!("studies → {}", stemmer.stem("studies").unwrap());     // "studi"
    println!("happiness → {}", stemmer.stem("happiness").unwrap()); // "happi"
    println!("easily → {}", stemmer.stem("easily").unwrap());       // "easili"

    // Batch stemming
    let words = vec!["running", "jumped", "flying", "studies", "cats", "quickly"];
    let stemmed = stemmer.stem_tokens(&words).unwrap();
    println!("Original: {:?}", words);
    println!("Stemmed: {:?}", stemmed);
    // ["run", "jump", "flying", "studi", "cat", "quickli"]
}
Output:
running → run
studies → studi
happiness → happi
easily → easili
Original: ["running", "jumped", "flying", "studies", "cats", "quickly"]
Stemmed: ["run", "jump", "flying", "studi", "cat", "quickli"]
Analysis:
- Normalizes word variations: "running"/"run", "studies"/"studi"
- May produce non-words: "happiness" → "happi"
- Groups semantically similar words together
- Reduces vocabulary size for ML models
Example 4: Complete Preprocessing Pipeline
End-to-end pipeline combining tokenization, normalization, filtering, and stemming.
use aprender::text::stem::{PorterStemmer, Stemmer};
use aprender::text::stopwords::StopWordsFilter;
use aprender::text::tokenize::WordTokenizer;
use aprender::text::Tokenizer;
fn main() {
    let document = "The students are studying machine learning algorithms. \
                    They're analyzing different classification models and \
                    comparing their performances on various datasets.";

    // Step 1: Tokenization
    let tokenizer = WordTokenizer::new();
    let tokens = tokenizer.tokenize(document).unwrap();
    println!("Tokens: {} items", tokens.len()); // 21 tokens

    // Step 2: Lowercase normalization
    let lowercase_tokens: Vec<String> = tokens
        .iter()
        .map(|t| t.to_lowercase())
        .collect();

    // Step 3: Stop words filtering
    let filter = StopWordsFilter::english();
    let filtered_tokens = filter.filter(&lowercase_tokens).unwrap();
    println!("After filtering: {} items", filtered_tokens.len()); // 16 tokens

    // Step 4: Stemming
    let stemmer = PorterStemmer::new();
    let stemmed_tokens = stemmer.stem_tokens(&filtered_tokens).unwrap();
    println!("Final: {:?}", stemmed_tokens);
    // ["stud", "studi", "machin", "learn", "algorithm", ".", "they'r",
    //  "analyz", "differ", "classif", "model", "compar", "perform",
    //  "variou", "dataset", "."]

    let reduction = 100.0 * (1.0 - stemmed_tokens.len() as f64 / tokens.len() as f64);
    println!("Total reduction: {:.1}%", reduction); // 23.8%
}
Output:
Tokens: 21 items
After filtering: 16 items
Final: ["stud", "studi", "machin", "learn", "algorithm", ".", "they'r", "analyz", "differ", "classif", "model", "compar", "perform", "variou", "dataset", "."]
Total reduction: 23.8%
Pipeline Analysis:
| Stage | Token Count | Change |
|---|---|---|
| Original | 21 | - |
| Lowercase | 21 | 0% |
| Stop words | 16 | -23.8% |
| Stemming | 16 | 0% |
Key Transformations:
- "students" → "stud"
- "studying" → "studi"
- "machine" → "machin"
- "learning" → "learn"
- "algorithms" → "algorithm"
- "analyzing" → "analyz"
- "classification" → "classif"
Best Practices
When to Use Each Technique
Tokenization:
- Whitespace: Quick analysis, sentiment analysis
- Word: Most NLP tasks, classification, named entity recognition
- Character: Character-level models, language modeling
Stop Words Filtering:
- ✅ Information retrieval, topic modeling, keyword extraction
- ❌ Sentiment analysis (negation words like "not" matter)
- ❌ Question answering (question words like "what" and "where" matter)
Stemming:
- ✅ Search engines, information retrieval
- ✅ Text classification with large vocabularies
- ❌ Tasks requiring exact word meaning
- Consider lemmatization for better quality (at cost of speed)
Pipeline Recommendations
Fast & Simple (Search/Retrieval):
Text → Whitespace → Lowercase → Stop words → Stemming
High Quality (Classification):
Text → Word tokenization → Lowercase → Stop words → Lemmatization
Character-Level (Language Models):
Text → Character tokenization → No further preprocessing
Running the Example
cargo run --example text_preprocessing
The example demonstrates four scenarios:
- Tokenization strategies - Comparing whitespace, word, and character tokenizers
- Stop words filtering - English and custom stop word removal
- Stemming - Porter algorithm for word normalization
- Full pipeline - Complete preprocessing workflow
Key Takeaways
- Preprocessing is crucial: Directly impacts ML model performance
- Pipeline matters: Order of operations affects results
- Trade-offs exist: Speed vs. quality, simplicity vs. accuracy
- Domain-specific: Customize for your task (sentiment vs. search)
- Reproducibility: Same pipeline for training and inference
Next Steps
After preprocessing, text is ready for:
- Vectorization: Bag of Words, TF-IDF, word embeddings (a minimal bag-of-words sketch follows this list)
- Feature engineering: N-grams, POS tags, named entities
- Model training: Classification, clustering, topic modeling
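As a glimpse of the simplest vectorization step, a bag-of-words count can be built over preprocessed tokens with a `HashMap`. This is a std-only sketch assuming tokens that have already passed through a pipeline like Example 4; it is not tied to any aprender API.

```rust
use std::collections::HashMap;

// Minimal bag-of-words sketch: count how often each preprocessed token occurs.
fn bag_of_words(tokens: &[&str]) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in tokens {
        *counts.entry(token.to_string()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    // Tokens as they might look after tokenization, stop-word filtering, and stemming
    let tokens = ["machin", "learn", "model", "learn", "dataset"];
    let counts = bag_of_words(&tokens);
    println!("{counts:?}"); // e.g. {"learn": 2, "machin": 1, "model": 1, "dataset": 1}
}
```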
References
- Porter, M.F. (1980). "An algorithm for suffix stripping." Program, 14(3), 130-137.
- Manning, C.D., Raghavan, P., Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.
- Jurafsky, D., Martin, J.H. (2023). Speech and Language Processing (3rd ed.).