Text Classification with TF-IDF

Text classification is the task of assigning predefined categories to text documents. Combined with TF-IDF vectorization, it enables practical applications like sentiment analysis, spam detection, and topic classification.

Theory

The Text Classification Pipeline

A complete text classification system consists of the following stages (a minimal end-to-end sketch follows the list):

  1. Text Preprocessing: Tokenization, stop-word removal, stemming
  2. Feature Extraction: Convert text to numerical features
  3. Model Training: Learn patterns from labeled data
  4. Prediction: Classify new documents
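
To make the flow concrete, here is a library-free toy that wires the four stages together using only the standard library: a fixed vocabulary, word counts as features, and a nearest-centroid stand-in for a real classifier. It is purely illustrative and does not use the aprender API.

// Toy end-to-end pipeline (illustrative only; not the aprender API).

// 1. Preprocessing: lowercase + whitespace tokenization.
fn preprocess(doc: &str) -> Vec<String> {
    doc.to_lowercase().split_whitespace().map(String::from).collect()
}

// 2. Feature extraction: word counts over a fixed vocabulary.
fn extract(tokens: &[String], vocab: &[&str]) -> Vec<f64> {
    vocab
        .iter()
        .map(|w| tokens.iter().filter(|t| t.as_str() == *w).count() as f64)
        .collect()
}

// 3. Training: per-class mean feature vector (a nearest-centroid "model").
fn train(features: &[Vec<f64>], labels: &[usize], n_classes: usize) -> Vec<Vec<f64>> {
    let dim = features[0].len();
    let mut centroids = vec![vec![0.0; dim]; n_classes];
    let mut counts = vec![0.0; n_classes];
    for (x, &y) in features.iter().zip(labels.iter()) {
        for (c, v) in centroids[y].iter_mut().zip(x.iter()) {
            *c += *v;
        }
        counts[y] += 1.0;
    }
    for (centroid, n) in centroids.iter_mut().zip(counts.iter()) {
        for c in centroid.iter_mut() {
            *c /= *n;
        }
    }
    centroids
}

// 4. Prediction: label of the closest centroid (squared Euclidean distance).
fn predict(x: &[f64], centroids: &[Vec<f64>]) -> usize {
    fn dist(a: &[f64], b: &[f64]) -> f64 {
        a.iter().zip(b.iter()).map(|(p, q)| (p - q) * (p - q)).sum()
    }
    let mut best = 0;
    for i in 1..centroids.len() {
        if dist(x, &centroids[i]) < dist(x, &centroids[best]) {
            best = i;
        }
    }
    best
}

fn main() {
    let vocab = ["good", "great", "bad", "awful", "movie"];
    let docs = ["Good movie great acting", "Bad movie awful plot"];
    let labels = [1, 0]; // 1 = positive, 0 = negative

    let features: Vec<Vec<f64>> = docs
        .iter()
        .map(|d| extract(&preprocess(d), &vocab))
        .collect();
    let centroids = train(&features, &labels, 2);

    let x = extract(&preprocess("great movie good fun"), &vocab);
    println!("Predicted class: {}", predict(&x, &centroids)); // 1 (positive)
}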

Feature Extraction Methods

Bag of Words (BoW):

  • Represents documents as word count vectors
  • Simple and effective baseline
  • Ignores word order and context
"cat dog cat" → [cat: 2, dog: 1]

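A bag-of-words vector is nothing more than a word-count table. A minimal standard-library sketch of the example above:

use std::collections::HashMap;

// Count each whitespace-separated token, lowercased.
fn bag_of_words(doc: &str) -> HashMap<String, usize> {
    let mut counts = HashMap::new();
    for token in doc.split_whitespace() {
        *counts.entry(token.to_lowercase()).or_insert(0) += 1;
    }
    counts
}

fn main() {
    let counts = bag_of_words("cat dog cat");
    println!("{:?}", counts); // {"cat": 2, "dog": 1} (iteration order not guaranteed)
}
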
TF-IDF (Term Frequency-Inverse Document Frequency):

  • Weights words by importance
  • Down-weights common words, up-weights rare words
  • Better performance than raw counts

TF-IDF Formula:

tfidf(t, d) = tf(t, d) × idf(t)
where:
  tf(t, d) = count of term t in document d
  idf(t) = log(N / df(t))   (natural logarithm in the examples below)
  N = total documents
  df(t) = documents containing term t

Example:

Document 1: "cat dog"
Document 2: "cat bird"
Document 3: "dog bird bird"

Term "cat": appears in 2/3 documents
  IDF = log(3/2) = 0.405

Term "bird": appears in 2/3 documents
  IDF = log(3/2) = 0.405

Term "dog": appears in 2/3 documents
  IDF = log(3/2) = 0.405

Because every term here appears in exactly 2 of the 3 documents, all three share the same IDF; it is the term-frequency factor that separates them. For example, "bird" occurs twice in Document 3, so tfidf(bird, d3) = 2 × 0.405 = 0.81, while tfidf(dog, d3) = 1 × 0.405 = 0.405.
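
The arithmetic above can be checked with a short standard-library sketch that applies the basic formula tf(t, d) × log(N / df(t)) to the three documents (illustrative only, not aprender's implementation):

use std::collections::{HashMap, HashSet};

fn main() {
    let docs = ["cat dog", "cat bird", "dog bird bird"];
    let n = docs.len() as f64;

    // Document frequency: number of documents containing each term.
    let mut df: HashMap<&str, f64> = HashMap::new();
    for doc in &docs {
        let unique: HashSet<&str> = doc.split_whitespace().collect();
        for term in unique {
            *df.entry(term).or_insert(0.0) += 1.0;
        }
    }

    // IDF with the basic formula idf(t) = ln(N / df(t)).
    for (term, d) in &df {
        println!("idf({}) = ln({} / {}) = {:.3}", term, n, d, (n / d).ln());
    }
    // Every term appears in 2 of 3 documents, so each gets ln(3/2) = 0.405.

    // TF-IDF for Document 3: "dog bird bird".
    let doc3 = "dog bird bird";
    let mut tf: HashMap<&str, f64> = HashMap::new();
    for term in doc3.split_whitespace() {
        *tf.entry(term).or_insert(0.0) += 1.0;
    }
    for (term, count) in &tf {
        let idf = (n / df[term]).ln();
        println!("tfidf({}, doc3) = {} x {:.3} = {:.3}", term, count, idf, count * idf);
    }
    // bird: 2 x 0.405 = 0.811, dog: 1 x 0.405 = 0.405
}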

Classification Algorithms

Gaussian Naive Bayes:

  • Assumes features are conditionally independent given the class (the "naive" assumption)
  • Probabilistic classifier using Bayes' theorem
  • Fast training and prediction
  • Works well with high-dimensional sparse data
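
Under the hood, a Gaussian NB model stores a class prior plus a per-feature mean and variance for each class, and scores a sample as the log prior plus the summed Gaussian log-likelihoods. A minimal sketch of that computation (not the aprender implementation):

// Minimal illustration of Gaussian NB scoring (not the aprender implementation).

// One class of the model: prior + per-feature mean/variance.
struct ClassStats {
    log_prior: f64,
    means: Vec<f64>,
    vars: Vec<f64>,
}

// Log of the Gaussian density N(x | mean, var).
fn gaussian_log_pdf(x: f64, mean: f64, var: f64) -> f64 {
    let var = var.max(1e-9); // variance smoothing avoids division by zero
    -0.5 * (2.0 * std::f64::consts::PI * var).ln() - (x - mean).powi(2) / (2.0 * var)
}

// Predict the class with the highest log prior + summed log-likelihood.
fn predict(x: &[f64], classes: &[ClassStats]) -> usize {
    let score = |c: &ClassStats| -> f64 {
        c.log_prior
            + x.iter()
                .enumerate()
                .map(|(j, &xj)| gaussian_log_pdf(xj, c.means[j], c.vars[j]))
                .sum::<f64>()
    };
    let mut best = 0;
    for i in 1..classes.len() {
        if score(&classes[i]) > score(&classes[best]) {
            best = i;
        }
    }
    best
}

fn main() {
    // Two hand-filled classes over 2 features, equal priors.
    let classes = vec![
        ClassStats { log_prior: (0.5f64).ln(), means: vec![2.0, 0.0], vars: vec![1.0, 1.0] },
        ClassStats { log_prior: (0.5f64).ln(), means: vec![0.0, 2.0], vars: vec![1.0, 1.0] },
    ];
    println!("{}", predict(&[1.8, 0.1], &classes)); // 0
    println!("{}", predict(&[0.2, 1.9], &classes)); // 1
}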

Logistic Regression:

  • Linear classifier with sigmoid activation
  • Learns feature weights via gradient descent
  • Produces probability estimates
  • Robust and interpretable
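
At prediction time this amounts to a weighted sum of features pushed through the sigmoid. A minimal sketch (not the aprender API):

// Illustrative logistic-regression prediction (not the aprender API).

// sigmoid(z) maps any real score to a probability in (0, 1).
fn sigmoid(z: f64) -> f64 {
    1.0 / (1.0 + (-z).exp())
}

// P(y = 1 | x) = sigmoid(w . x + b); predict class 1 when it exceeds 0.5.
fn predict_proba(x: &[f64], w: &[f64], b: f64) -> f64 {
    let z: f64 = x.iter().zip(w.iter()).map(|(xi, wi)| xi * wi).sum::<f64>() + b;
    sigmoid(z)
}

fn main() {
    // Hand-picked weights: feature 0 pushes toward class 1, feature 1 away from it.
    let (w, b) = (vec![2.0, -1.5], -0.5);
    let p = predict_proba(&[1.0, 0.0], &w, b);
    println!("P(class 1) = {:.3}, predicted = {}", p, (p > 0.5) as u8); // ~0.818, 1
}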

Example 1: Sentiment Classification with Bag of Words

Binary sentiment analysis (positive/negative) using word counts.

use aprender::classification::GaussianNB;
use aprender::text::vectorize::CountVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::traits::Estimator;

fn main() {
    // Training data: movie reviews
    let train_docs = vec![
        "this movie was excellent and amazing",  // Positive
        "great film with wonderful acting",      // Positive
        "fantastic movie loved every minute",    // Positive
        "terrible movie waste of time",          // Negative
        "awful film boring and disappointing",   // Negative
        "horrible acting very bad movie",        // Negative
    ];

    let train_labels = vec![1, 1, 1, 0, 0, 0]; // 1 = positive, 0 = negative

    // Vectorize with CountVectorizer
    let mut vectorizer = CountVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()))
        .with_max_features(20);

    let X_train = vectorizer.fit_transform(&train_docs).unwrap();
    println!("Vocabulary size: {}", vectorizer.vocabulary_size());  // 20 words

    // Train Gaussian Naive Bayes
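    // `convert_to_f32` is a small helper defined elsewhere in the example file
    // (not shown here); it casts the f64 feature matrix into the f32 matrix
    // that the classifier consumes.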
    let X_train_f32 = convert_to_f32(&X_train);  // Convert f64 to f32
    let mut classifier = GaussianNB::new();
    classifier.fit(&X_train_f32, &train_labels).unwrap();

    // Predict on new reviews
    let test_docs = vec![
        "excellent movie great acting",   // Should predict positive
        "terrible film very bad",         // Should predict negative
    ];

    let X_test = vectorizer.transform(&test_docs).unwrap();
    let X_test_f32 = convert_to_f32(&X_test);
    let predictions = classifier.predict(&X_test_f32).unwrap();

    println!("Predictions: {:?}", predictions);  // [1, 0] = [positive, negative]
}

Output:

Vocabulary size: 20
Predictions: [1, 0]

Analysis:

  • Bag of Words: Simple word count features
  • 20 features: Limited vocabulary (max_features=20)
  • 100% accuracy: Overfitting on small dataset, but demonstrates concept
  • Fast training: Naive Bayes trains in O(n×m) where n=docs, m=features

Example 2: Topic Classification with TF-IDF

Multi-class classification (tech vs sports) using TF-IDF weighting.

use aprender::classification::LogisticRegression;
use aprender::text::vectorize::TfidfVectorizer;
use aprender::text::tokenize::WhitespaceTokenizer;

fn main() {
    // Training data: tech vs sports articles
    let train_docs = vec![
        "python programming language machine learning",    // Tech
        "artificial intelligence neural networks deep",    // Tech
        "software development code rust programming",      // Tech
        "basketball game score team championship",         // Sports
        "football soccer match goal tournament",           // Sports
        "tennis player serves match competition",          // Sports
    ];

    let train_labels = vec![0, 0, 0, 1, 1, 1]; // 0 = tech, 1 = sports

    // TF-IDF vectorization
    let mut vectorizer = TfidfVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()));

    let X_train = vectorizer.fit_transform(&train_docs).unwrap();
    println!("Vocabulary: {} terms", vectorizer.vocabulary_size());  // 28 terms

    // Show IDF values
    let vocab: Vec<_> = vectorizer.vocabulary().iter().collect();
    for (word, &idx) in vocab.iter().take(3) {
        println!("{}: IDF = {:.3}", word, vectorizer.idf_values()[idx]);
    }
    // basketball: IDF = 2.253 (rare, important)
    // programming: IDF = 1.847 (less rare)

    // Train Logistic Regression
    let X_train_f32 = convert_to_f32(&X_train);
    let mut classifier = LogisticRegression::new()
        .with_learning_rate(0.1)
        .with_max_iter(100);

    classifier.fit(&X_train_f32, &train_labels).unwrap();

    // Test predictions
    let test_docs = vec![
        "programming code algorithm",  // Should predict tech
        "basketball score game",       // Should predict sports
    ];

    let X_test = vectorizer.transform(&test_docs).unwrap();
    let X_test_f32 = convert_to_f32(&X_test);
    let predictions = classifier.predict(&X_test_f32);

    println!("Predictions: {:?}", predictions);  // [0, 1] = [tech, sports]
}

Output:

Vocabulary: 28 terms
basketball: IDF = 2.253
programming: IDF = 1.847
Predictions: [0, 1]

Analysis:

  • TF-IDF weighting: Highlights discriminative words
  • IDF values: "basketball" appears in only one training document, so it gets a high IDF (2.253)
  • Widespread words: "programming" appears in two documents and gets a lower IDF (1.847); these printed values match a smoothed IDF variant, roughly ln((1 + N) / (1 + df(t))) + 1, rather than the basic formula above (reproduced in the snippet after this list)
  • Logistic Regression: Learns linear decision boundary
  • 100% accuracy: Perfect separation on training data
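
The printed values can be reproduced by hand if the vectorizer uses a smoothed IDF, an assumption inferred from the numbers rather than from documented aprender behavior:

fn main() {
    let n = 6.0; // training documents
    // Assumed smoothed IDF: ln((1 + N) / (1 + df)) + 1
    let smoothed_idf = |df: f64| ((1.0 + n) / (1.0 + df)).ln() + 1.0;
    println!("basketball  (df = 1): {:.3}", smoothed_idf(1.0)); // 2.253
    println!("programming (df = 2): {:.3}", smoothed_idf(2.0)); // 1.847
}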

Example 3: Full Preprocessing Pipeline

Complete workflow from raw text to predictions.

use aprender::classification::GaussianNB;
use aprender::text::stem::{PorterStemmer, Stemmer};
use aprender::text::stopwords::StopWordsFilter;
use aprender::text::tokenize::WhitespaceTokenizer;
use aprender::text::vectorize::TfidfVectorizer;
use aprender::text::Tokenizer;

fn main() {
    let raw_docs = vec![
        "The machine learning algorithms are improving rapidly",
        "The team scored three goals in the championship match",
    ];
    let labels = vec![0, 1]; // 0 = tech, 1 = sports

    // Step 1: Tokenization
    let tokenizer = WhitespaceTokenizer::new();
    let tokenized: Vec<Vec<String>> = raw_docs
        .iter()
        .map(|doc| tokenizer.tokenize(doc).unwrap())
        .collect();

    // Step 2: Lowercase + Stop words filtering
    let filter = StopWordsFilter::english();
    let filtered: Vec<Vec<String>> = tokenized
        .iter()
        .map(|tokens| {
            let lower: Vec<String> = tokens.iter().map(|t| t.to_lowercase()).collect();
            filter.filter(&lower).unwrap()
        })
        .collect();

    // Step 3: Stemming
    let stemmer = PorterStemmer::new();
    let stemmed: Vec<Vec<String>> = filtered
        .iter()
        .map(|tokens| stemmer.stem_tokens(tokens).unwrap())
        .collect();

    println!("After preprocessing: {:?}", stemmed[0]);
    // ["machin", "learn", "algorithm", "improv", "rapid"]

    // Step 4: Rejoin and vectorize
    let processed: Vec<String> = stemmed
        .iter()
        .map(|tokens| tokens.join(" "))
        .collect();

    let mut vectorizer = TfidfVectorizer::new()
        .with_tokenizer(Box::new(WhitespaceTokenizer::new()));
    let X = vectorizer.fit_transform(&processed).unwrap();

    // Step 5: Classification
    let X_f32 = convert_to_f32(&X);
    let mut classifier = GaussianNB::new();
    classifier.fit(&X_f32, &labels).unwrap();

    let predictions = classifier.predict(&X_f32).unwrap();
    println!("Predictions: {:?}", predictions);  // [0, 1] = [tech, sports]
}

Output:

After preprocessing: ["machin", "learn", "algorithm", "improv", "rapid"]
Predictions: [0, 1]

Pipeline Analysis:

Stage                   | Input                      | Output                    | Effect
Tokenization            | "The machine learning..."  | ["The", "machine", ...]   | Split into words
Lowercase + stop words  | 11 tokens                  | 8 tokens                  | Remove "the", "are", "in"
Stemming                | ["machine", "learning"]    | ["machin", "learn"]       | Normalize to roots
TF-IDF                  | Text tokens                | 31-dimensional vectors    | Numerical features
Classification          | Feature vectors            | Class labels              | Predictions

Key Benefits:

  • Vocabulary reduction: 27% fewer tokens after stop words
  • Normalization: "improving" → "improv", "algorithms" → "algorithm"
  • Generalization: Stemming helps match "learn", "learning", "learned"
  • Discriminative features: TF-IDF highlights important words

Model Selection Guidelines

Gaussian Naive Bayes

Best for:

  • Text classification with sparse features
  • Large vocabularies (thousands of features)
  • Fast training required
  • Probabilistic predictions needed

Advantages:

  • Extremely fast (O(n×m) training)
  • Works well with high-dimensional data
  • No hyperparameter tuning needed
  • Probabilistic outputs

Limitations:

  • Assumes feature independence (rarely true)
  • Less accurate than discriminative models
  • Gaussian likelihood is only a rough fit for sparse count/TF-IDF features

Logistic Regression

Best for:

  • When you need interpretable models
  • Feature importance analysis
  • Balanced datasets
  • Reliable probability estimates

Advantages:

  • Learns feature weights (interpretable)
  • Robust to correlated features
  • Regularization prevents overfitting
  • Well-calibrated probabilities

Limitations:

  • Slower training than Naive Bayes
  • Requires hyperparameter tuning (learning rate, iterations)
  • Sensitive to feature scaling

Best Practices

Feature Extraction

CountVectorizer (Bag of Words):

  • ✅ Simple baseline, easy to understand
  • ✅ Fast computation
  • ❌ Ignores word importance
  • Use when: Starting a project, small datasets

TfidfVectorizer:

  • ✅ Weights by importance
  • ✅ Better performance than BoW
  • ✅ Down-weights common words
  • Use when: Production systems, larger datasets

Preprocessing

Always include:

  1. Tokenization (WhitespaceTokenizer or WordTokenizer)
  2. Lowercase normalization
  3. Stop words filtering (unless sentiment analysis needs "not", "no")

Optional but recommended:

  4. Stemming (PorterStemmer) for English
  5. Max features limit (1000-5000 for efficiency)

Evaluation

Train/Test Split:

// Split data 80/20 (shuffle first if documents are ordered by class)
let split_idx = (docs.len() * 4) / 5;
let (train_docs, test_docs) = docs.split_at(split_idx);
let (train_labels, test_labels) = labels.split_at(split_idx);

Metrics:

  • Accuracy: Overall correctness
  • Precision/Recall: Class-specific performance
  • Confusion matrix: Error analysis (see the sketch below)
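
For a binary problem these metrics can be computed directly; a standard-library sketch (illustrative only; aprender may ship its own metric helpers):

// Confusion-matrix counts for a binary problem (positive class = 1).
fn confusion(y_true: &[usize], y_pred: &[usize]) -> (f64, f64, f64, f64) {
    let (mut tp, mut fp, mut fn_, mut tn) = (0.0, 0.0, 0.0, 0.0);
    for (&t, &p) in y_true.iter().zip(y_pred.iter()) {
        match (t, p) {
            (1, 1) => tp += 1.0,
            (0, 1) => fp += 1.0,
            (1, 0) => fn_ += 1.0,
            _ => tn += 1.0,
        }
    }
    (tp, fp, fn_, tn)
}

fn main() {
    let y_true = [1, 0, 1, 1, 0, 0];
    let y_pred = [1, 0, 0, 1, 0, 1];

    let (tp, fp, fn_, tn) = confusion(&y_true, &y_pred);
    let accuracy = (tp + tn) / (tp + tn + fp + fn_);
    let precision = tp / (tp + fp);
    let recall = tp / (tp + fn_);

    println!("confusion: tp={} fp={} fn={} tn={}", tp, fp, fn_, tn);
    println!("accuracy = {:.2}, precision = {:.2}, recall = {:.2}", accuracy, precision, recall);
    // accuracy = 0.67, precision = 0.67, recall = 0.67
}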

Running the Example

cargo run --example text_classification

The example demonstrates three scenarios:

  1. Sentiment classification - Bag of Words with Gaussian NB
  2. Topic classification - TF-IDF with Logistic Regression
  3. Full pipeline - Complete preprocessing workflow

Key Takeaways

  1. TF-IDF > Bag of Words: Almost always better performance
  2. Preprocessing matters: Stop words + stemming improve generalization
  3. Naive Bayes: Fast baseline, good for high-dimensional data
  4. Logistic Regression: More accurate, interpretable weights
  5. Pipeline is crucial: Consistent preprocessing for train/test

Real-World Applications

  • Spam Detection: Email → [spam, not spam]
  • Sentiment Analysis: Review → [positive, negative, neutral]
  • Topic Classification: News article → [politics, sports, tech, ...]
  • Language Detection: Text → [English, Spanish, French, ...]
  • Intent Classification: User query → [question, command, statement]

Next Steps

After text classification, explore:

  • Word embeddings: Word2Vec, GloVe for semantic similarity
  • Deep learning: RNNs, Transformers for contextual understanding
  • Multi-label classification: Documents with multiple categories
  • Active learning: Efficiently label new training data
