Case Study: Apriori Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's Apriori algorithm for association rule mining (GitHub Issue #21).

Background

GitHub Issue #21: Implement Apriori Algorithm for Association Rule Mining

Requirements:

  • Frequent itemset mining with Apriori algorithm
  • Association rule generation
  • Support, confidence, and lift metrics
  • Configurable min_support and min_confidence thresholds
  • Builder pattern for ergonomic API

Initial State:

  • Tests: 652 passing (after t-SNE implementation)
  • No pattern mining module
  • Need new src/mining/mod.rs module

Implementation Summary

RED Phase

Created 15 comprehensive tests covering:

  • Constructor and builder pattern (3 tests)
  • Basic fitting and frequent itemset discovery
  • Association rule generation
  • Support calculation (static method)
  • Confidence calculation
  • Lift calculation
  • Minimum support filtering
  • Minimum confidence filtering
  • Edge cases: empty transactions, single-item transactions
  • Error handling: get_rules/get_itemsets before fit
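
As a concrete illustration, here is a sketch of one such test in the spirit of the list above. The exact signatures are assumptions: fit() is assumed to take &[Vec<usize>] of item-ID transactions, and get_frequent_itemsets() to return a Result over (itemset, support) pairs, consistent with the before-fit error tests.

use aprender::mining::Apriori;

#[test]
fn test_min_support_filters_itemsets() {
    // Four baskets over item IDs 0..3; {0} appears in 3 of 4.
    let transactions = vec![vec![0, 1, 2], vec![0, 1], vec![0, 2], vec![1, 2]];

    let mut apriori = Apriori::new().with_min_support(0.5);
    apriori.fit(&transactions);

    // Every reported itemset must meet the 50% threshold.
    for (_, support) in apriori.get_frequent_itemsets().unwrap() {
        assert!(*support >= 0.5);
    }
}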

GREEN Phase

Implemented complete Apriori algorithm (~400 lines):

Core Components:

  1. Apriori: Public API with builder pattern

    • new(), with_min_support(), with_min_confidence()
    • fit(), get_frequent_itemsets(), get_rules()
    • calculate_support() - static method
  2. AssociationRule: Rule representation (a struct sketch follows this list)

    • antecedent: Vec<usize> - items on left side
    • consequent: Vec<usize> - items on right side
    • support: f64 - P(antecedent ∪ consequent)
    • confidence: f64 - P(consequent | antecedent)
    • lift: f64 - confidence / P(consequent)
  3. Frequent Itemset Mining:

    • find_frequent_1_itemsets(): Initial scan for individual items
    • generate_candidates(): Join step (combine k-1 itemsets)
    • has_infrequent_subset(): Prune step (Apriori property)
    • prune_candidates(): Filter by minimum support
  4. Association Rule Generation:

    • generate_rules(): Extract rules from frequent itemsets
    • generate_subsets(): Power set generation for antecedents
    • Confidence and lift calculation
  5. Helper Methods:

    • calculate_support(): Count transactions containing itemset
    • Sorting: itemsets by support, rules by confidence
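
The rule representation in item 2 can be pictured as the following struct sketch (the field names and semantics come from the list above; the derives are an assumption):

#[derive(Debug, Clone)]
pub struct AssociationRule {
    pub antecedent: Vec<usize>, // items on the left side
    pub consequent: Vec<usize>, // items on the right side
    pub support: f64,           // P(antecedent ∪ consequent)
    pub confidence: f64,        // P(consequent | antecedent)
    pub lift: f64,              // confidence / P(consequent)
}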

Key Algorithm Steps:

1. Find frequent 1-itemsets (items with support >= min_support)
2. For k = 2, 3, 4, ...:
   a. Generate candidate k-itemsets from (k-1)-itemsets
   b. Prune candidates with infrequent subsets (Apriori property)
   c. Count support in database
   d. Keep itemsets with support >= min_support
   e. If no frequent k-itemsets, stop
3. Generate association rules:
   a. For each frequent itemset with size >= 2
   b. Generate all non-empty proper subsets as antecedents
   c. Calculate confidence = support(itemset) / support(antecedent)
   d. Keep rules with confidence >= min_confidence
   e. Calculate lift = confidence / support(consequent)
4. Sort itemsets by support (descending)
5. Sort rules by confidence (descending)
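
Steps 1-2 condense into a short level-wise loop. The sketch below uses the helper names from the component list; their exact signatures and the surrounding impl block are assumptions:

fn fit_loop_sketch(&mut self, transactions: &[Vec<usize>]) {
    // Step 1: frequent 1-itemsets seed the loop.
    let mut frequent = self.find_frequent_1_itemsets(transactions);

    // Step 2: grow itemsets level by level until none survive.
    while !frequent.is_empty() {
        self.frequent_itemsets.extend(frequent.iter().cloned());
        let candidates = self.generate_candidates(&frequent);        // 2a + 2b
        frequent = self.prune_candidates(&candidates, transactions); // 2c + 2d
    }

    // Step 3: rules come from the accumulated frequent itemsets.
    self.generate_rules();
}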

REFACTOR Phase

  • Added Apriori to prelude
  • Zero clippy warnings
  • Comprehensive documentation with examples
  • Example demonstrating 8 real-world scenarios

Final State:

  • Tests: 652 → 667 (+15)
  • Zero warnings
  • All quality gates passing

Algorithm Details

Time Complexity

Theoretical worst case: O(2^n · |D| · |T|)

  • n = number of unique items
  • |D| = number of transactions
  • |T| = average transaction size

Practical: O(n^k · |D|) where k is max frequent itemset size

  • k typically < 5 in real data
  • Apriori pruning dramatically reduces candidates

Space Complexity

O(n + |F|)

  • n = unique items (for counting)
  • |F| = number of frequent itemsets (usually small)

Candidate Generation Strategy

Join step: Combine two (k-1)-itemsets that differ by exactly one item

fn generate_candidates(&self, prev_itemsets: &[(HashSet<usize>, f64)]) -> Vec<HashSet<usize>> {
    let mut candidates = Vec::new();
    for i in 0..prev_itemsets.len() {
        for j in (i + 1)..prev_itemsets.len() {
            let set1 = &prev_itemsets[i].0;
            let set2 = &prev_itemsets[j].0;
            let union: HashSet<usize> = set1.union(set2).copied().collect();

            if union.len() == set1.len() + 1 {
                // Valid k-itemset candidate; skip duplicates, since
                // different pairs can produce the same union.
                if !candidates.contains(&union)
                    && !self.has_infrequent_subset(&union, prev_itemsets)
                {
                    candidates.push(union);
                }
            }
        }
    }
    candidates
}

Prune step: Remove candidates with infrequent (k-1)-subsets

fn has_infrequent_subset(&self, itemset: &HashSet<usize>, prev_itemsets: &[(HashSet<usize>, f64)]) -> bool {
    for &item in itemset {
        let mut subset = itemset.clone();
        subset.remove(&item);

        let is_frequent = prev_itemsets.iter().any(|(freq_set, _)| freq_set == &subset);
        if !is_frequent {
            return true; // Prune this candidate
        }
    }
    false
}

Parameters

Minimum Support

Default: 0.1 (10%)

Effect:

  • Higher (50%+): Finds common, reliable patterns; faster; fewer results
  • Lower (5-10%): Discovers niche patterns; slower; more results
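
Support here is the fraction of transactions containing the itemset, i.e. support(X) = (transactions containing X) / (total transactions), which is exactly what the calculate_support() helper counts.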

Example:

use aprender::mining::Apriori;
let apriori = Apriori::new().with_min_support(0.3); // 30%

Minimum Confidence

Default: 0.5 (50%)

Effect:

  • Higher (80%+): High-quality, actionable rules; fewer results
  • Lower (30-50%): More exploratory insights; more rules

Example:

use aprender::mining::Apriori;
let apriori = Apriori::new()
    .with_min_support(0.2)
    .with_min_confidence(0.7); // 70%
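
Putting both thresholds together, a minimal end-to-end run might look like the sketch below. The fit() signature and the shape of the values returned by get_rules() are assumptions based on the API summary earlier in this chapter:

use aprender::mining::Apriori;

// Three baskets over item IDs.
let transactions = vec![vec![1, 2], vec![1, 2, 3], vec![2, 4]];

let mut apriori = Apriori::new()
    .with_min_support(0.2)
    .with_min_confidence(0.7);
apriori.fit(&transactions);

// Rules arrive sorted by confidence (descending).
for rule in apriori.get_rules().unwrap() {
    println!(
        "{:?} => {:?}  support {:.2}, confidence {:.2}, lift {:.2}",
        rule.antecedent, rule.consequent, rule.support, rule.confidence, rule.lift
    );
}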

Example Highlights

The example (market_basket_apriori.rs) demonstrates:

  1. Basic grocery transactions - 10 transactions, 5 items
  2. Support threshold effects - 20% vs 50%
  3. Breakfast category analysis - Domain-specific patterns
  4. Lift interpretation - Positive/negative correlation
  5. Confidence vs support trade-off - Parameter tuning
  6. Product placement - Business recommendations
  7. Item frequency analysis - Popularity rankings
  8. Cross-selling opportunities - Sorted by lift

Output excerpt:

Frequent itemsets (support >= 30%):
  [2] -> support: 90.00%  (Bread - most popular)
  [1] -> support: 70.00%  (Milk)
  [3] -> support: 60.00%  (Butter)
  [1, 2] -> support: 60.00%  (Milk + Bread)

Association rules (confidence >= 60%):
  [4] => [2]  (Eggs => Bread)
    Support: 50.00%
    Confidence: 100.00%
    Lift: 1.11  (11% uplift)
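
The lift value follows directly from the definition lift = confidence / support(consequent): Bread appears in 90% of transactions, so 1.00 / 0.90 ≈ 1.11.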

Key Takeaways

  1. Apriori Property: Monotonicity enables efficient pruning
  2. Support vs Confidence: Trade-off between frequency and reliability
  3. Lift > 1.0: Actual association, not just popularity
  4. Exponential growth: Itemset count grows with k (but pruning helps)
  5. Interpretable: Rules are human-readable business insights

Comparison: Apriori vs FP-Growth

Feature               Apriori                            FP-Growth
Data structure        Horizontal (transactions)          Prefix tree (FP-tree)
Database scans        Multiple (k scans for k-itemsets)  Two (build tree, mine)
Candidate generation  Yes (explicit)                     No (implicit)
Memory                O(n + |F|)                         O(n + tree size)
Speed                 Moderate                           2-10x faster
Implementation        Simple                             Complex

When to use Apriori:

  • Moderate-size datasets (< 100K transactions)
  • Educational/prototyping
  • Need simplicity and interpretability
  • Many sparse transactions (few items per transaction)

When to use FP-Growth:

  • Large datasets (> 100K transactions)
  • Production systems requiring speed
  • Dense transactions (many items per transaction)

Use Cases

1. Retail Market Basket Analysis

Rule: {diapers} => {beer}
  Support: 8% (common enough to act on)
  Confidence: 75% (reliable pattern)
  Lift: 2.5 (strong positive correlation)

Action: Place beer near diapers, bundle promotions
Result: 10-20% sales increase

2. E-commerce Recommendations

Rule: {laptop} => {laptop bag}
  Support: 12%
  Confidence: 68%
  Lift: 3.2

Action: "Customers who bought this also bought..."
Result: Higher average order value

3. Medical Diagnosis Support

Rule: {fever, cough} => {flu}
  Support: 15%
  Confidence: 82%
  Lift: 4.1

Action: Suggest flu test when symptoms present
Result: Earlier diagnosis

4. Web Analytics

Rule: {homepage, product_page} => {cart}
  Support: 6%
  Confidence: 45%
  Lift: 1.8

Action: Optimize product page conversion flow
Result: Increased checkout rate

Testing Strategy

Unit Tests (15 implemented):

  • Correctness: Algorithm finds all frequent itemsets
  • Parameters: Support/confidence thresholds work correctly
  • Metrics: Support, confidence, lift calculated correctly
  • Edge cases: Empty data, single items, no rules
  • Sorting: Results sorted by support/confidence

Property-Based Tests (future work):

  • Apriori property: All subsets of frequent itemsets are frequent
  • Monotonicity: Higher support => fewer itemsets
  • Rule count: More itemsets => more rules
  • Confidence bounds: All rules meet min_confidence
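
As a sketch of how the monotonicity property could be checked with the proptest crate (the crate choice and the Apriori call signatures are assumptions):

use aprender::mining::Apriori;
use proptest::prelude::*;

proptest! {
    // Raising min_support can never admit more itemsets.
    #[test]
    fn higher_min_support_finds_no_more_itemsets(
        transactions in prop::collection::vec(
            prop::collection::vec(0usize..10, 1..6), 1..50),
    ) {
        let mut low = Apriori::new().with_min_support(0.1);
        let mut high = Apriori::new().with_min_support(0.5);
        low.fit(&transactions);
        high.fit(&transactions);
        prop_assert!(
            high.get_frequent_itemsets().unwrap().len()
                <= low.get_frequent_itemsets().unwrap().len()
        );
    }
}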

Integration Tests:

  • Full pipeline: fit → get_itemsets → get_rules
  • Large datasets: 1000+ transactions
  • Many items: 100+ unique items

Technical Challenges Solved

Challenge 1: Efficient Candidate Generation

Problem: Naively combining all (k-1)-itemsets yields O(n^k) candidates. Solution: Join only pairs whose union adds exactly one item, and use HashSet for O(1) membership checks.

Challenge 2: Apriori Pruning

Problem: Every (k-1)-subset of a candidate must be verified as frequent. Solution: Keep the previous level's frequent itemsets and test each of the candidate's k size-(k-1) subsets against them, as has_infrequent_subset() above shows.

Challenge 3: Rule Generation from Itemsets

Problem: Generate all non-empty proper subsets as antecedents. Solution: Bit masking enumerates them in O(2^k), where k is the itemset size (usually < 5).

fn generate_subsets(&self, items: &[usize]) -> Vec<Vec<usize>> {
    let mut subsets = Vec::new();
    let n = items.len();

    // Masks 1..(2^n - 1): every non-empty proper subset; the full mask
    // is excluded so the consequent is never empty.
    for mask in 1..((1 << n) - 1) {
        let mut subset = Vec::new();
        for (i, &item) in items.iter().enumerate() {
            if (mask & (1 << i)) != 0 {
                subset.push(item);
            }
        }
        subsets.push(subset);
    }
    subsets
}

Challenge 4: Sorting Heterogeneous Collections

Problem: Need to sort itemsets (HashSet) for display. Solution: Convert to Vec, sort descending by support using partial_cmp.

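// Descending order; unwrap() is safe because support and confidence
// are finite ratios of counts, never NaN.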
self.frequent_itemsets.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
self.rules.sort_by(|a, b| b.confidence.partial_cmp(&a.confidence).unwrap());

Performance Optimizations

  1. HashSet for itemsets: O(1) membership testing
  2. Early termination: Stop when no frequent k-itemsets found
  3. Prune before database scan: Remove candidates with infrequent subsets
  4. Single pass per k: Count all candidates in one database scan

References

  1. Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. VLDB.
  2. Han, J., et al. (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD.
  3. Tan, P., et al. (2006). Introduction to Data Mining. Pearson.
  4. Berry, M., & Linoff, G. (2004). Data Mining Techniques. Wiley.