Apriori Algorithm Theory
The Apriori algorithm is a classic data mining technique for discovering frequent itemsets and association rules in transactional databases. It's widely used in market basket analysis, recommendation systems, and pattern discovery.
Problem Statement
Given a database of transactions, where each transaction contains a set of items:
- Find frequent itemsets: sets of items that appear together frequently
- Generate association rules: patterns like "if customers buy {A, B}, they likely buy {C}"
Key Concepts
1. Support
Support measures how frequently an itemset appears in the database:
Support(X) = (Transactions containing X) / (Total transactions)
Example: If {milk, bread} appears in 60 out of 100 transactions:
Support({milk, bread}) = 60/100 = 0.6 (60%)
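As a concrete illustration, support can be computed in a few lines of Python. This is a minimal sketch; the support() helper and the toy transactions are illustrative, not from any library:

```python
def support(itemset, transactions):
    """Fraction of transactions containing every item in `itemset`."""
    itemset = set(itemset)
    return sum(1 for t in transactions if itemset <= t) / len(transactions)

transactions = [{"milk", "bread", "butter"}, {"milk", "bread"},
                {"bread", "butter"}, {"milk", "butter"}]
print(support({"milk", "bread"}, transactions))  # 2/4 = 0.5
```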
2. Confidence
Confidence measures the reliability of an association rule:
Confidence(X => Y) = Support(X ∪ Y) / Support(X)
Example: For rule {milk} => {bread}:
Confidence = P(bread | milk) = Support({milk, bread}) / Support({milk})
If 60 transactions have {milk, bread} and 80 have {milk}:
Confidence = 60/80 = 0.75 (75%)
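In code, confidence is just a ratio of two support values (reusing the support() helper and toy transactions from the sketch above):

```python
def confidence(antecedent, consequent, transactions):
    """Support of the combined itemset divided by support of the antecedent."""
    return (support(set(antecedent) | set(consequent), transactions)
            / support(antecedent, transactions))

print(confidence({"milk"}, {"bread"}, transactions))  # 0.5 / 0.75 ≈ 0.67
```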
3. Lift
Lift measures how much more likely items are bought together than independently:
Lift(X => Y) = Confidence(X => Y) / Support(Y)
- Lift > 1.0: Positive correlation (bought together)
- Lift = 1.0: Independent (no relationship)
- Lift < 1.0: Negative correlation (substitutes)
Example: For rule {milk} => {bread}, with Confidence = 0.75 (from above) and Support({bread}) = 0.70:
Lift = 0.75 / 0.70 ≈ 1.07
Customers who buy milk are about 7% more likely to buy bread than the average customer.
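And lift divides confidence by the consequent's baseline support (again reusing the helpers above):

```python
def lift(antecedent, consequent, transactions):
    """How much the antecedent raises the consequent's probability vs. its baseline."""
    return (confidence(antecedent, consequent, transactions)
            / support(consequent, transactions))

# ≈ 0.89: slightly below 1, a mild negative correlation in this toy data.
print(lift({"milk"}, {"bread"}, transactions))
```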
The Apriori Algorithm
Core Principle: Apriori Property
If an itemset is frequent, all of its subsets must also be frequent.
Contrapositive: If an itemset is infrequent, all of its supersets must also be infrequent.
This enables efficient pruning of the search space.
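The pruning side of the property can be applied before touching the database: a candidate k-itemset is worth counting only if every one of its (k-1)-subsets is already known to be frequent. A minimal sketch of that check:

```python
from itertools import combinations

def has_infrequent_subset(candidate, frequent_prev):
    """True if any (k-1)-subset of `candidate` is not a known frequent itemset."""
    return any(frozenset(sub) not in frequent_prev
               for sub in combinations(candidate, len(candidate) - 1))

frequent_2 = {frozenset(s) for s in [{"milk", "bread"}, {"milk", "butter"}]}
# True: {bread, butter} is not frequent, so the candidate would be pruned.
print(has_infrequent_subset({"milk", "bread", "butter"}, frequent_2))
```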
Algorithm Steps
1. Find all frequent 1-itemsets (individual items)
- Scan database, count item occurrences
- Keep items with support >= min_support
2. For k = 2, 3, 4, ...:
a. Generate candidate k-itemsets from frequent (k-1)-itemsets
- Join step: Merge pairs of frequent (k-1)-itemsets that share all but one item
- Prune step: Remove candidates with infrequent (k-1)-subsets
b. Scan database to count candidate support
c. Keep candidates with support >= min_support
d. If no frequent k-itemsets found, stop
3. Generate association rules from frequent itemsets:
- For each frequent itemset I with |I| >= 2:
- For each non-empty subset A of I:
- Generate rule A => (I \ A)
- Keep rules with confidence >= min_confidence
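The steps above translate into a compact implementation. This is a sketch in plain Python (function names are illustrative; a production version would more likely use an existing library such as mlxtend):

```python
from collections import defaultdict
from itertools import combinations

def apriori(transactions, min_support):
    """Return {itemset: support} for every frequent itemset (sketch)."""
    n = len(transactions)
    transactions = [frozenset(t) for t in transactions]

    # Step 1: frequent 1-itemsets from a single counting pass.
    counts = defaultdict(int)
    for t in transactions:
        for item in t:
            counts[frozenset([item])] += 1
    frequent = {s: c / n for s, c in counts.items() if c / n >= min_support}
    result = dict(frequent)

    # Step 2: grow itemsets level by level until nothing survives.
    k = 2
    while frequent:
        prev = set(frequent)
        # Join: unions of two frequent (k-1)-itemsets that form a k-itemset.
        candidates = {a | b for a in prev for b in prev if len(a | b) == k}
        # Prune: every (k-1)-subset of a candidate must itself be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k - 1))}
        # Count candidate support in one pass over the database.
        counts = defaultdict(int)
        for t in transactions:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {c: cnt / n for c, cnt in counts.items() if cnt / n >= min_support}
        result.update(frequent)
        k += 1
    return result

def generate_rules(frequent, min_confidence):
    """Step 3: yield (antecedent, consequent, confidence) rules from frequent itemsets."""
    for itemset, supp in frequent.items():
        if len(itemset) < 2:
            continue
        for r in range(1, len(itemset)):
            for antecedent in map(frozenset, combinations(itemset, r)):
                conf = supp / frequent[antecedent]  # subsets are guaranteed frequent
                if conf >= min_confidence:
                    yield antecedent, itemset - antecedent, conf
```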
Example Execution
Transactions:
T1: {milk, bread, butter}
T2: {milk, bread}
T3: {bread, butter}
T4: {milk, butter}
Step 1: Frequent 1-itemsets (min_support = 50%)
{milk}: 3/4 = 75% ✓
{bread}: 3/4 = 75% ✓
{butter}: 3/4 = 75% ✓
Step 2: Generate candidate 2-itemsets
Candidates: {milk, bread}, {milk, butter}, {bread, butter}
Step 3: Count support
{milk, bread}: 2/4 = 50% ✓
{milk, butter}: 2/4 = 50% ✓
{bread, butter}: 2/4 = 50% ✓
Step 4: Generate candidate 3-itemsets
Candidate: {milk, bread, butter}
Support: 1/4 = 25% ✗ (below threshold)
Frequent itemsets: {milk}, {bread}, {butter}, {milk, bread}, {milk, butter}, {bread, butter}
Association rules (min_confidence = 60%):
{milk} => {bread} Conf: 2/3 = 67% ✓
{bread} => {milk} Conf: 2/3 = 67% ✓
{milk} => {butter} Conf: 2/3 = 67% ✓
{butter} => {milk} Conf: 2/3 = 67% ✓
{bread} => {butter} Conf: 2/3 = 67% ✓
{butter} => {bread} Conf: 2/3 = 67% ✓
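Running the sketch from the previous section on these four transactions reproduces the numbers above:

```python
transactions = [{"milk", "bread", "butter"},  # T1
                {"milk", "bread"},            # T2
                {"bread", "butter"},          # T3
                {"milk", "butter"}]           # T4

frequent = apriori(transactions, min_support=0.5)
for itemset in sorted(frequent, key=lambda s: (len(s), sorted(s))):
    print(set(itemset), f"{frequent[itemset]:.0%}")

for a, c, conf in generate_rules(frequent, min_confidence=0.6):
    print(f"{set(a)} => {set(c)}  confidence={conf:.0%}")
```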
Complexity Analysis
Time Complexity
Worst case: O(2^n · |D| · |T|)
- n = number of unique items
- |D| = number of transactions
- |T| = average transaction size
In practice: Much better due to pruning
- Typical: O(n^k · |D|) where k is max frequent itemset size (usually < 5)
Space Complexity
O(n + |F|)
- n = unique items
- |F| = number of frequent itemsets (exponential worst case, but usually small)
Parameters
Minimum Support
Higher support (e.g., 50%):
- Pros: Find common, reliable patterns
- Cons: Miss rare but important associations
Lower support (e.g., 10%):
- Pros: Discover niche patterns
- Cons: Many spurious associations, slower
Rule of thumb: Start with 10-30% for exploratory analysis
Minimum Confidence
Higher confidence (e.g., 80%):
- Pros: High-quality, actionable rules
- Cons: Miss weaker but still meaningful patterns
Lower confidence (e.g., 50%):
- Pros: More exploratory insights
- Cons: Less reliable rules
Rule of thumb: 60-70% for actionable business insights
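Since the right thresholds are data-dependent, a quick sweep shows how sensitive the output is. This sketch reuses the apriori() and generate_rules() functions and the toy transactions from the example above:

```python
for min_support in (0.1, 0.25, 0.5, 0.75):
    frequent = apriori(transactions, min_support)
    rules = list(generate_rules(frequent, min_confidence=0.6))
    print(f"min_support={min_support:.0%}: "
          f"{len(frequent)} frequent itemsets, {len(rules)} rules")
```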
Strengths
- Simplicity: Easy to understand and implement
- Completeness: Finds all frequent itemsets (no false negatives)
- Pruning: Apriori property enables efficient search
- Interpretability: Rules are human-readable
Limitations
- Multiple database scans: One scan per itemset size
- Candidate generation: Exponential in worst case
- Low support problem: Misses rare but important patterns
- Binary transactions: Doesn't handle quantities or sequences
Improvements and Variants
- FP-Growth: Avoids candidate generation using FP-tree (2x-10x faster)
- Eclat: Vertical data format (item-TID lists)
- AprioriTID: Reduces database scans
- Weighted Apriori: Assigns weights to items
- Multi-level Apriori: Handles concept hierarchies (e.g., "dairy" → "milk")
Applications
1. Market Basket Analysis
- Cross-selling: "Customers who bought X also bought Y"
- Product placement: Put related items near each other
- Promotions: Bundle frequently bought items
2. Recommendation Systems
- Collaborative filtering: Users who liked X also liked Y
- Content discovery: Articles often read together
3. Medical Diagnosis
- Symptom patterns: Patients with X often have Y
- Drug interactions: Medications prescribed together
4. Web Mining
- Clickstream analysis: Pages visited together
- Session patterns: User navigation paths
5. Bioinformatics
- Gene co-expression: Genes activated together
- Protein interactions: Proteins that interact
Best Practices
- Data preprocessing (see the sketch after this list):
- Remove duplicates
- Filter noise (very rare items)
- Group similar items (e.g., "2% milk" and "whole milk" → "milk")
- Parameter tuning:
- Start with balanced parameters (support=20-30%, confidence=60-70%)
- Increase support if too many rules
- Lower confidence to explore weak patterns
- Rule filtering (see the sketch after this list):
- Focus on high lift rules (> 1.2)
- Remove obvious rules (e.g., "butter => milk" if everyone buys milk)
- Check rule support (avoid rare but high-confidence spurious rules)
- Validation:
- Test rules on holdout data
- A/B test recommendations
- Monitor business metrics (sales lift, conversion rate)
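A sketch of the preprocessing and rule-filtering practices above. The synonym map, thresholds, and rule-dict layout are illustrative assumptions, not a fixed interface:

```python
from collections import Counter

def preprocess(transactions, synonyms=None, min_item_count=5):
    """Merge synonymous items, drop very rare items, and remove duplicate baskets."""
    synonyms = synonyms or {}  # e.g. {"2% milk": "milk", "whole milk": "milk"}
    baskets = {frozenset(synonyms.get(i, i) for i in t) for t in transactions}
    item_counts = Counter(i for t in baskets for i in t)
    return [{i for i in t if item_counts[i] >= min_item_count} for t in baskets]

def filter_rules(rules, min_lift=1.2, min_rule_support=0.01):
    """Keep high-lift rules with enough support to rule out spurious matches."""
    return [r for r in rules
            if r["lift"] >= min_lift and r["support"] >= min_rule_support]
```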
Common Pitfalls
- Support too low: Millions of spurious rules
- Support too high: Miss important niche patterns
- Ignoring lift: High confidence ≠ useful (e.g., everyone buys bread)
- Confusing correlation with causation: Apriori finds associations, not causes
Example Use Case: Grocery Store
Goal: Increase basket size through cross-selling
Data: 10,000 transactions, 500 unique items
Parameters: support=5%, confidence=60%
Results:
Rule: {diapers} => {beer}
Support: 8% (800 transactions)
Confidence: 75%
Lift: 2.5
Interpretation:
- 8% of all transactions contain both diapers and beer
- 75% of diaper buyers also buy beer
- Diaper buyers are 2.5x more likely to buy beer than average
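These three numbers are mutually consistent, which is a sanity check worth running on any reported rule; the definitions let you back out the marginal supports:

```python
support_both = 0.08   # Support({diapers, beer})
conf = 0.75           # Confidence({diapers} => {beer}) = support_both / Support({diapers})
lift_value = 2.5      # Lift = conf / Support({beer})

print(support_both / conf)   # Support({diapers}) ≈ 0.107, i.e. roughly 1,070 of 10,000 baskets
print(conf / lift_value)     # Support({beer}) = 0.30, i.e. beer in 30% of all baskets
```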
Action:
- Place beer near diapers
- Offer "diaper + beer" bundle discount
- Target diaper buyers with beer promotions
Expected Result: 10-20% increase in beer sales among diaper buyers
Mathematical Foundations
Set Theory
Frequent itemset mining is fundamentally about:
- Power set: All 2^n possible itemsets from n items
- Subset lattice: Hierarchical structure of itemsets
- Anti-monotonicity: Apriori property (subset frequency ≥ superset frequency)
Probability
Association rules encode conditional probabilities:
- Support: P(X)
- Confidence: P(Y|X) = P(X ∩ Y) / P(X)
- Lift: P(Y|X) / P(Y)
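Substituting the confidence formula into lift makes one property explicit:
Lift(X => Y) = P(X ∩ Y) / (P(X) · P(Y))
This form is symmetric in X and Y, and it equals 1 exactly when the two itemsets are statistically independent, which is why 1.0 is the neutral point in the interpretation given earlier.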
Information Theory
- Mutual information: Measures dependence between itemsets
- Entropy: Quantifies uncertainty in item distributions
Further Reading
- Original Apriori Paper: Agrawal & Srikant (1994) - "Fast Algorithms for Mining Association Rules"
- FP-Growth: Han et al. (2000) - "Mining Frequent Patterns without Candidate Generation"
- Market Basket Analysis: Berry & Linoff (2004) - "Data Mining Techniques"
- Advanced Topics: Tan et al. (2006) - "Introduction to Data Mining"