Case Study: Apriori Implementation

This chapter documents the complete EXTREME TDD implementation of aprender's Apriori algorithm for association rule mining (GitHub Issue #21).

Background

GitHub Issue #21: Implement Apriori Algorithm for Association Rule Mining

Requirements:

  • Frequent itemset mining with Apriori algorithm
  • Association rule generation
  • Support, confidence, and lift metrics
  • Configurable min_support and min_confidence thresholds
  • Builder pattern for ergonomic API

Initial State:

  • Tests: 652 passing (after t-SNE implementation)
  • No pattern mining module
  • Need new src/mining/mod.rs module

Implementation Summary

RED Phase

Created 15 comprehensive tests covering:

  • Constructor and builder pattern (3 tests)
  • Basic fitting and frequent itemset discovery
  • Association rule generation
  • Support calculation (static method)
  • Confidence calculation
  • Lift calculation
  • Minimum support filtering
  • Minimum confidence filtering
  • Edge cases: empty transactions, single-item transactions
  • Error handling: get_rules/get_itemsets before fit
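
As a concrete illustration, here is a sketch of one such test in the spirit of the list above. The exact signatures are assumptions: fit() is assumed to take &[Vec<usize>] of item-ID transactions, and get_frequent_itemsets() to return a Result over (itemset, support) pairs, consistent with the before-fit error tests.

use aprender::mining::Apriori;

#[test]
fn test_min_support_filters_itemsets() {
    // Four baskets over item IDs 0..3; {0} appears in 3 of 4.
    let transactions = vec![vec![0, 1, 2], vec![0, 1], vec![0, 2], vec![1, 2]];

    let mut apriori = Apriori::new().with_min_support(0.5);
    apriori.fit(&transactions);

    // Every reported itemset must meet the 50% threshold.
    for (_, support) in apriori.get_frequent_itemsets().unwrap() {
        assert!(*support >= 0.5);
    }
}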

GREEN Phase

Implemented complete Apriori algorithm (~400 lines):

Core Components:

  1. Apriori: Public API with builder pattern

    • new(), with_min_support(), with_min_confidence()
    • fit(), get_frequent_itemsets(), get_rules()
    • calculate_support() - static method
  2. AssociationRule: Rule representation (a struct sketch follows this list)

    • antecedent: Vec<usize> - items on left side
    • consequent: Vec<usize> - items on right side
    • support: f64 - P(antecedent ∪ consequent)
    • confidence: f64 - P(consequent | antecedent)
    • lift: f64 - confidence / P(consequent)
  3. Frequent Itemset Mining:

    • find_frequent_1_itemsets(): Initial scan for individual items
    • generate_candidates(): Join step (combine k-1 itemsets)
    • has_infrequent_subset(): Prune step (Apriori property)
    • prune_candidates(): Filter by minimum support
  4. Association Rule Generation:

    • generate_rules(): Extract rules from frequent itemsets
    • generate_subsets(): Power set generation for antecedents
    • Confidence and lift calculation
  5. Helper Methods:

    • calculate_support(): Count transactions containing itemset
    • Sorting: itemsets by support, rules by confidence
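
The rule representation in item 2 can be pictured as the following struct sketch (the field names and semantics come from the list above; the derives are an assumption):

#[derive(Debug, Clone)]
pub struct AssociationRule {
    pub antecedent: Vec<usize>, // items on the left side
    pub consequent: Vec<usize>, // items on the right side
    pub support: f64,           // P(antecedent ∪ consequent)
    pub confidence: f64,        // P(consequent | antecedent)
    pub lift: f64,              // confidence / P(consequent)
}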

Key Algorithm Steps:

1. Find frequent 1-itemsets (items with support >= min_support)
2. For k = 2, 3, 4, ...:
   a. Generate candidate k-itemsets from (k-1)-itemsets
   b. Prune candidates with infrequent subsets (Apriori property)
   c. Count support in database
   d. Keep itemsets with support >= min_support
   e. If no frequent k-itemsets, stop
3. Generate association rules:
   a. For each frequent itemset with size >= 2
   b. Generate all non-empty proper subsets as antecedents
   c. Calculate confidence = support(itemset) / support(antecedent)
   d. Keep rules with confidence >= min_confidence
   e. Calculate lift = confidence / support(consequent)
4. Sort itemsets by support (descending)
5. Sort rules by confidence (descending)
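
Steps 1-2 condense into a short level-wise loop. The sketch below uses the helper names from the component list; their exact signatures and the surrounding impl block are assumptions:

fn fit_loop_sketch(&mut self, transactions: &[Vec<usize>]) {
    // Step 1: frequent 1-itemsets seed the loop.
    let mut frequent = self.find_frequent_1_itemsets(transactions);

    // Step 2: grow itemsets level by level until none survive.
    while !frequent.is_empty() {
        self.frequent_itemsets.extend(frequent.iter().cloned());
        let candidates = self.generate_candidates(&frequent);        // 2a + 2b
        frequent = self.prune_candidates(&candidates, transactions); // 2c + 2d
    }

    // Step 3: rules come from the accumulated frequent itemsets.
    self.generate_rules();
}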

REFACTOR Phase

  • Added Apriori to prelude
  • Zero clippy warnings
  • Comprehensive documentation with examples
  • Example demonstrating 8 real-world scenarios

Final State:

  • Tests: 652 → 667 (+15)
  • Zero warnings
  • All quality gates passing

Algorithm Details

Time Complexity

Theoretical worst case: O(2^n · |D| · |T|)

  • n = number of unique items
  • |D| = number of transactions
  • |T| = average transaction size

Practical: O(n^k · |D|) where k is max frequent itemset size

  • k typically < 5 in real data
  • Apriori pruning dramatically reduces candidates

Space Complexity

O(n + |F|)

  • n = unique items (for counting)
  • |F| = number of frequent itemsets (usually small)

Candidate Generation Strategy

Join step: Combine two (k-1)-itemsets that differ by exactly one item

fn generate_candidates(&self, prev_itemsets: &[(HashSet<usize>, f64)]) -> Vec<HashSet<usize>> {
    let mut candidates = Vec::new();
    for i in 0..prev_itemsets.len() {
        for j in (i + 1)..prev_itemsets.len() {
            let set1 = &prev_itemsets[i].0;
            let set2 = &prev_itemsets[j].0;
            let union: HashSet<usize> = set1.union(set2).copied().collect();

            if union.len() == set1.len() + 1 {
                // Valid k-itemset candidate; skip duplicates, since
                // different pairs can produce the same union.
                if !candidates.contains(&union)
                    && !self.has_infrequent_subset(&union, prev_itemsets)
                {
                    candidates.push(union);
                }
            }
        }
    }
    candidates
}

Prune step: Remove candidates with infrequent (k-1)-subsets

fn has_infrequent_subset(&self, itemset: &HashSet<usize>, prev_itemsets: &[(HashSet<usize>, f64)]) -> bool {
    for &item in itemset {
        let mut subset = itemset.clone();
        subset.remove(&item);

        let is_frequent = prev_itemsets.iter().any(|(freq_set, _)| freq_set == &subset);
        if !is_frequent {
            return true; // Prune this candidate
        }
    }
    false
}

Parameters

Minimum Support

Default: 0.1 (10%)

Effect:

  • Higher (50%+): Finds common, reliable patterns; faster; fewer results
  • Lower (5-10%): Discovers niche patterns; slower; more results
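
Support here is the fraction of transactions containing the itemset, i.e. support(X) = (transactions containing X) / (total transactions), which is exactly what the calculate_support() helper counts.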

Example:

use aprender::mining::Apriori;
let apriori = Apriori::new().with_min_support(0.3); // 30%

Minimum Confidence

Default: 0.5 (50%)

Effect:

  • Higher (80%+): High-quality, actionable rules; fewer results
  • Lower (30-50%): More exploratory insights; more rules

Example:

use aprender::mining::Apriori;
let apriori = Apriori::new()
    .with_min_support(0.2)
    .with_min_confidence(0.7); // 70%
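
Putting both thresholds together, a minimal end-to-end run might look like the sketch below. The fit() signature and the shape of the values returned by get_rules() are assumptions based on the API summary earlier in this chapter:

use aprender::mining::Apriori;

// Three baskets over item IDs.
let transactions = vec![vec![1, 2], vec![1, 2, 3], vec![2, 4]];

let mut apriori = Apriori::new()
    .with_min_support(0.2)
    .with_min_confidence(0.7);
apriori.fit(&transactions);

// Rules arrive sorted by confidence (descending).
for rule in apriori.get_rules().unwrap() {
    println!(
        "{:?} => {:?}  support {:.2}, confidence {:.2}, lift {:.2}",
        rule.antecedent, rule.consequent, rule.support, rule.confidence, rule.lift
    );
}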

Example Highlights

The example (market_basket_apriori.rs) demonstrates:

  1. Basic grocery transactions - 10 transactions, 5 items
  2. Support threshold effects - 20% vs 50%
  3. Breakfast category analysis - Domain-specific patterns
  4. Lift interpretation - Positive/negative correlation
  5. Confidence vs support trade-off - Parameter tuning
  6. Product placement - Business recommendations
  7. Item frequency analysis - Popularity rankings
  8. Cross-selling opportunities - Sorted by lift

Output excerpt:

Frequent itemsets (support >= 30%):
  [2] -> support: 90.00%  (Bread - most popular)
  [1] -> support: 70.00%  (Milk)
  [3] -> support: 60.00%  (Butter)
  [1, 2] -> support: 60.00%  (Milk + Bread)

Association rules (confidence >= 60%):
  [4] => [2]  (Eggs => Bread)
    Support: 50.00%
    Confidence: 100.00%
    Lift: 1.11  (11% uplift)
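
The lift value follows directly from the definition lift = confidence / support(consequent): Bread appears in 90% of transactions, so 1.00 / 0.90 ≈ 1.11.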

Key Takeaways

  1. Apriori Property: Monotonicity enables efficient pruning
  2. Support vs Confidence: Trade-off between frequency and reliability
  3. Lift > 1.0: Actual association, not just popularity
  4. Exponential growth: Itemset count grows with k (but pruning helps)
  5. Interpretable: Rules are human-readable business insights

Comparison: Apriori vs FP-Growth

Feature               Apriori                            FP-Growth
Data structure        Horizontal (transactions)          Prefix tree (FP-tree)
Database scans        Multiple (k scans for k-itemsets)  Two (build tree, mine)
Candidate generation  Yes (explicit)                     No (implicit)
Memory                O(n + |F|)                         O(n + tree size)
Speed                 Moderate                           2-10x faster
Implementation        Simple                             Complex

When to use Apriori:

  • Moderate-size datasets (< 100K transactions)
  • Educational/prototyping
  • Need simplicity and interpretability
  • Many sparse transactions (few items per transaction)

When to use FP-Growth:

  • Large datasets (> 100K transactions)
  • Production systems requiring speed
  • Dense transactions (many items per transaction)

Use Cases

1. Retail Market Basket Analysis

Rule: {diapers} => {beer}
  Support: 8% (common enough to act on)
  Confidence: 75% (reliable pattern)
  Lift: 2.5 (strong positive correlation)

Action: Place beer near diapers, bundle promotions
Result: 10-20% sales increase

2. E-commerce Recommendations

Rule: {laptop} => {laptop bag}
  Support: 12%
  Confidence: 68%
  Lift: 3.2

Action: "Customers who bought this also bought..."
Result: Higher average order value

3. Medical Diagnosis Support

Rule: {fever, cough} => {flu}
  Support: 15%
  Confidence: 82%
  Lift: 4.1

Action: Suggest flu test when symptoms present
Result: Earlier diagnosis

4. Web Analytics

Rule: {homepage, product_page} => {cart}
  Support: 6%
  Confidence: 45%
  Lift: 1.8

Action: Optimize product page conversion flow
Result: Increased checkout rate

Testing Strategy

Unit Tests (15 implemented):

  • Correctness: Algorithm finds all frequent itemsets
  • Parameters: Support/confidence thresholds work correctly
  • Metrics: Support, confidence, lift calculated correctly
  • Edge cases: Empty data, single items, no rules
  • Sorting: Results sorted by support/confidence

Property-Based Tests (future work):

  • Apriori property: All subsets of frequent itemsets are frequent
  • Monotonicity: Higher support => fewer itemsets
  • Rule count: More itemsets => more rules
  • Confidence bounds: All rules meet min_confidence
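
As a sketch of how the monotonicity property could be checked with the proptest crate (the crate choice and the Apriori call signatures are assumptions):

use aprender::mining::Apriori;
use proptest::prelude::*;

proptest! {
    // Raising min_support can never admit more itemsets.
    #[test]
    fn higher_min_support_finds_no_more_itemsets(
        transactions in prop::collection::vec(
            prop::collection::vec(0usize..10, 1..6), 1..50),
    ) {
        let mut low = Apriori::new().with_min_support(0.1);
        let mut high = Apriori::new().with_min_support(0.5);
        low.fit(&transactions);
        high.fit(&transactions);
        prop_assert!(
            high.get_frequent_itemsets().unwrap().len()
                <= low.get_frequent_itemsets().unwrap().len()
        );
    }
}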

Integration Tests:

  • Full pipeline: fit → get_itemsets → get_rules
  • Large datasets: 1000+ transactions
  • Many items: 100+ unique items

Technical Challenges Solved

Challenge 1: Efficient Candidate Generation

Problem: Naively combining all (k-1)-itemsets yields O(n^k) candidates. Solution: Join only pairs whose union adds exactly one item, and use HashSet for O(1) membership checks.

Challenge 2: Apriori Pruning

Problem: Every (k-1)-subset of a candidate must be verified as frequent. Solution: Keep the previous level's frequent itemsets and test each of the candidate's k size-(k-1) subsets against them, as has_infrequent_subset() above shows.

Challenge 3: Rule Generation from Itemsets

Problem: Generate all non-empty proper subsets as antecedents. Solution: Bit masking enumerates them in O(2^k), where k is the itemset size (usually < 5).

fn generate_subsets(&self, items: &[usize]) -> Vec<Vec<usize>> {
    let mut subsets = Vec::new();
    let n = items.len();

    // Masks 1..(2^n - 1): every non-empty proper subset; the full mask
    // is excluded so the consequent is never empty.
    for mask in 1..((1 << n) - 1) {
        let mut subset = Vec::new();
        for (i, &item) in items.iter().enumerate() {
            if (mask & (1 << i)) != 0 {
                subset.push(item);
            }
        }
        subsets.push(subset);
    }
    subsets
}

Challenge 4: Sorting Heterogeneous Collections

Problem: Need to sort itemsets (HashSet) for display. Solution: Convert to Vec, sort descending by support using partial_cmp.

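// Descending order; unwrap() is safe because support and confidence
// are finite ratios of counts, never NaN.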
self.frequent_itemsets.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap());
self.rules.sort_by(|a, b| b.confidence.partial_cmp(&a.confidence).unwrap());

Performance Optimizations

  1. HashSet for itemsets: O(1) membership testing
  2. Early termination: Stop when no frequent k-itemsets found
  3. Prune before database scan: Remove candidates with infrequent subsets
  4. Single pass per k: Count all candidates in one database scan

References

  1. Agrawal, R., & Srikant, R. (1994). Fast Algorithms for Mining Association Rules. VLDB.
  2. Han, J., et al. (2000). Mining Frequent Patterns without Candidate Generation. SIGMOD.
  3. Tan, P., et al. (2006). Introduction to Data Mining. Pearson.
  4. Berry, M., & Linoff, G. (2004). Data Mining Techniques. Wiley.