Case Study: t-SNE Implementation
This chapter documents the complete EXTREME TDD implementation of aprender's t-SNE algorithm for dimensionality reduction and visualization, from GitHub Issue #18.
Background
GitHub Issue #18: Implement t-SNE for Dimensionality Reduction and Visualization
Requirements:
- Non-linear dimensionality reduction (2D/3D)
- Perplexity-based similarity computation
- KL divergence minimization via gradient descent
- Parameters: n_components, perplexity, learning_rate, n_iter
- Reproducibility with random_state
Initial State:
- Tests: 640 passing
- Existing dimensionality reduction: PCA (linear)
- No non-linear dimensionality reduction
Implementation Summary
RED Phase
Created 12 comprehensive tests covering:
- Constructor and basic fitting
- Transform and fit_transform methods
- Perplexity parameter effects
- Learning rate and iteration count
- 2D and 3D embeddings
- Reproducibility with random_state
- Error handling (transform before fit)
- Local structure preservation
- Embeddings contain only finite values
GREEN Phase
Implemented complete t-SNE algorithm (~400 lines):
Core Components:
- TSNE: Public API with builder pattern
  - with_perplexity, with_learning_rate, with_n_iter, with_random_state
  - fit/transform/fit_transform methods
- Pairwise Distances (compute_pairwise_distances):
  - Squared Euclidean distances in high-D
  - O(n²) computation
- Conditional Probabilities (compute_p_conditional):
  - Binary search for sigma to match perplexity
  - Gaussian kernel: P(j|i) ∝ exp(-||x_i - x_j||² / (2σ_i²))
  - Target entropy: H = log₂(perplexity)
- Joint Probabilities (compute_p_joint):
  - Symmetrize: P_{ij} = (P(j|i) + P(i|j)) / (2N)
  - Numerical stability with max(1e-12)
- Q Matrix (compute_q):
  - Student's t-distribution in low-D
  - Q_{ij} ∝ (1 + ||y_i - y_j||²)^{-1}
  - Heavy-tailed distribution avoids crowding
- Gradient Computation (compute_gradient):
  - ∂KL(P||Q)/∂y_i = 4 Σ_j (P_{ij} - Q_{ij}) · (y_i - y_j) / (1 + ||y_i - y_j||²)
- Optimization:
  - Gradient descent with momentum (0.5 → 0.8)
  - Small random initialization (±0.00005)
  - Reproducible LCG random number generator
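Of these, the distance step is the most mechanical. A minimal sketch, assuming a flat row-major layout for the input (the actual signature in aprender may differ):

```rust
/// Squared Euclidean distances between all rows of `x` (n × d, row-major).
/// O(n²·d) time and an O(n²) output matrix, as noted in Algorithm Details.
fn compute_pairwise_distances(x: &[f32], n: usize, d: usize) -> Vec<f32> {
    let mut dist = vec![0.0f32; n * n];
    for i in 0..n {
        for j in (i + 1)..n {
            let d2: f32 = (0..d)
                .map(|k| {
                    let diff = x[i * d + k] - x[j * d + k];
                    diff * diff
                })
                .sum();
            dist[i * n + j] = d2;
            dist[j * n + i] = d2; // the matrix is symmetric
        }
    }
    dist
}
```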
Key Algorithm Steps:
1. Compute pairwise distances in high-D
2. Binary search for sigma to match perplexity
3. Compute conditional probabilities P(j|i)
4. Symmetrize to joint probabilities P_{ij}
5. Initialize embedding randomly (small values)
6. For each iteration:
a. Compute Q matrix (Student's t in low-D)
b. Compute gradient of KL divergence
c. Update embedding with momentum
7. Return final embedding
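Steps 6a-c form the inner loop. The following self-contained sketch shows one way to write it, assuming a precomputed joint probability matrix p and row-major flat arrays; it illustrates the update rule, not aprender's exact code:

```rust
/// One possible shape of the optimization loop (steps 6-7).
/// `p`: symmetrized joint probabilities (n × n, row-major).
/// `y`: embedding coordinates (n × dim, row-major), randomly initialized.
fn optimize(p: &[f32], y: &mut [f32], n: usize, dim: usize, n_iter: usize, lr: f32) {
    let mut velocity = vec![0.0f32; n * dim];

    for iter in 0..n_iter {
        // 6a. Student's t kernel: num_ij = 1 / (1 + ||y_i - y_j||²), Q = num / Σ num.
        let mut num = vec![0.0f32; n * n];
        let mut q_sum = 0.0f32;
        for i in 0..n {
            for j in 0..n {
                if i == j {
                    continue;
                }
                let d2: f32 = (0..dim)
                    .map(|k| {
                        let d = y[i * dim + k] - y[j * dim + k];
                        d * d
                    })
                    .sum();
                num[i * n + j] = 1.0 / (1.0 + d2);
                q_sum += num[i * n + j];
            }
        }

        // Momentum schedule from the text: 0.5 for the first 250 iterations, then 0.8.
        let momentum = if iter < 250 { 0.5 } else { 0.8 };

        // 6b + 6c. Gradient of KL(P||Q), then momentum update of the embedding.
        for i in 0..n {
            for k in 0..dim {
                let mut grad = 0.0f32;
                for j in 0..n {
                    if i == j {
                        continue;
                    }
                    let q = (num[i * n + j] / q_sum).max(1e-12);
                    // 4 · (P_ij - Q_ij) · (y_i - y_j) / (1 + ||y_i - y_j||²)
                    grad += 4.0
                        * (p[i * n + j] - q)
                        * (y[i * dim + k] - y[j * dim + k])
                        * num[i * n + j];
                }
                let idx = i * dim + k;
                velocity[idx] = momentum * velocity[idx] - lr * grad;
                y[idx] += velocity[idx];
            }
        }
    }
}
```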
REFACTOR Phase
- Fixed legacy numeric constants (std::f32::INFINITY → f32::INFINITY)
- Zero clippy warnings
- Exported TSNE in prelude
- Comprehensive documentation
- Example demonstrating all key features
Final State:
- Tests: 640 → 652 (+12)
- Zero warnings
- All quality gates passing
Algorithm Details
Time Complexity: O(n² · iterations)
- Dominated by pairwise distance computation each iteration
- Typical runs perform 1000 iterations of O(n²) work, which becomes impractical for n > 10,000
Space Complexity: O(n²)
- Distance matrix, P matrix, Q matrix all n×n
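For scale: at n = 10,000, each n×n f32 matrix already takes 10,000² × 4 bytes ≈ 400 MB, so the three matrices together approach 1.2 GB.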
Binary Search for Perplexity:
- Target: H(P_i) = log₂(perplexity)
- Search for beta = 1/(2σ²) in range [0, ∞)
- 50 iterations max for convergence
- Tolerance: |H - target| < 1e-5
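A minimal sketch of this search, assuming one row of squared distances per point (names here are illustrative, not aprender's internals):

```rust
/// Binary-search beta = 1/(2σ²) so the entropy of P(·|i) hits log₂(perplexity).
/// `dist_row` holds squared distances from point i; index i itself is skipped.
fn search_beta(dist_row: &[f32], i: usize, perplexity: f32) -> f32 {
    let target = perplexity.log2();
    let (mut beta, mut lo, mut hi) = (1.0f32, 0.0f32, f32::INFINITY);

    for _ in 0..50 {
        // Unnormalized Gaussian affinities p_j = exp(-beta · d_j).
        let mut sum_p = 0.0f32;
        let mut weighted = 0.0f32; // Σ p_j · beta · d_j
        for (j, &d) in dist_row.iter().enumerate() {
            if j == i {
                continue;
            }
            let p = (-beta * d).exp();
            sum_p += p;
            weighted += p * beta * d;
        }
        // Shannon entropy in bits: H = (ln Σp + (Σ p·β·d) / Σp) / ln 2.
        let sum_p = sum_p.max(1e-12);
        let entropy = (sum_p.ln() + weighted / sum_p) / std::f32::consts::LN_2;

        let diff = entropy - target;
        if diff.abs() < 1e-5 {
            break;
        }
        if diff > 0.0 {
            lo = beta; // entropy too high → sharpen the kernel
            beta = if hi.is_finite() { (beta + hi) / 2.0 } else { beta * 2.0 };
        } else {
            hi = beta; // entropy too low → flatten the kernel
            beta = (beta + lo) / 2.0;
        }
    }
    beta
}
```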
Momentum Optimization:
- Initial momentum: 0.5 (first 250 iterations)
- Final momentum: 0.8 (after iteration 250)
- Helps escape local minima and speed convergence
Parameters
- n_components (default: 2): Output dimensions (usually 2 or 3)
- perplexity (default: 30.0): Balances local vs. global structure
  - Low (5-10): Very local, reveals fine clusters
  - Medium (20-30): Balanced (recommended)
  - High (50+): More global structure
  - Rule of thumb: perplexity < n_samples / 3
- learning_rate (default: 200.0): Gradient descent step size
  - Too low: slow convergence
  - Too high: unstable or divergent
  - Typical range: 10-1000
- n_iter (default: 1000): Number of gradient descent iterations
  - Minimum: 250 for reasonable results
  - Recommended: 1000 for convergence
  - More iterations: better embedding, slower runtime
- random_state (default: None): Random seed for reproducibility
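Putting the defaults and builder methods together, a typical call looks like this sketch; the builder names come from the implementation above, while Matrix::from_vec and the synthetic input are assumptions for illustration:

```rust
use aprender::prelude::*; // TSNE is exported in the prelude

fn main() {
    // Synthetic 100 × 4 input; Matrix::from_vec is an assumed constructor.
    let data: Vec<f32> = (0..400).map(|v| (v as f32 * 0.1).sin()).collect();
    let x = Matrix::from_vec(100, 4, data);

    let embedding = TSNE::new()    // n_components defaults to 2
        .with_perplexity(30.0)     // balanced local/global structure
        .with_learning_rate(200.0) // gradient descent step size
        .with_n_iter(1000)         // enough iterations for convergence
        .with_random_state(42)     // reproducible embedding
        .fit_transform(&x);        // n_samples × n_components output

    let _ = embedding; // plot or inspect the 2-D coordinates here
}
```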
Example Highlights
The example demonstrates:
- Basic 4D → 2D reduction
- Perplexity effects (2.0 vs 5.0)
- 3D embedding
- Learning rate effects (50.0 vs 500.0)
- Reproducibility with random_state
- t-SNE vs PCA comparison
Key Takeaways
- Non-Linear: Captures manifolds that PCA cannot
- Local Preservation: Excellent at preserving neighborhoods
- Visualization: Best for 2D/3D plots
- Perplexity Critical: Try multiple values (5, 10, 30, 50)
- Stochastic: Different runs give different embeddings unless random_state is fixed
- Slow: O(n²) limits scalability
- No Out-of-Sample Transform: Cannot embed data points that were not in the fitted set
Comparison: t-SNE vs PCA
| Feature | t-SNE | PCA |
|---|---|---|
| Type | Non-linear | Linear |
| Preserves | Local structure | Global variance |
| Complexity | O(n²·iter) | O(n·d·k) |
| Transform New Data | No | Yes |
| Deterministic | No (stochastic) | Yes |
| Best For | Visualization | Preprocessing |
When to use t-SNE:
- Visualizing high-dimensional data
- Exploratory data analysis
- Finding hidden clusters
- Presentations (2D plots)
When to use PCA:
- Feature reduction before modeling
- Large datasets (n > 10,000)
- Need to transform new data
- Need deterministic results
Use Cases
- MNIST Visualization: Visualize 784D digit images in 2D
- Word Embeddings: Explore word2vec/GloVe embeddings
- Single-Cell RNA-seq: Cluster cell types
- Image Features: Visualize CNN features
- Customer Segmentation: Explore behavioral clusters
Testing Strategy
Unit Tests (12 implemented):
- Correctness: Embeddings have correct shape
- Reproducibility: Same random_state → same result
- Parameters: Perplexity, learning rate, n_iter effects
- Edge cases: Transform before fit, finite values
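The reproducibility case, for instance, can be written as a straightforward equality test (a sketch; Matrix::from_vec and the exact assertion API are assumptions):

```rust
#[test]
fn tsne_is_reproducible_with_random_state() {
    // Identical seeds must yield identical embeddings.
    let data: Vec<f32> = (0..80).map(|v| (v as f32 * 0.3).cos()).collect();
    let x = Matrix::from_vec(20, 4, data);

    let a = TSNE::new().with_random_state(7).fit_transform(&x);
    let b = TSNE::new().with_random_state(7).fit_transform(&x);
    assert_eq!(a, b);
}
```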
Property-Based Tests (future work):
- Local structure: Nearby points in high-D → nearby in low-D
- Perplexity monotonicity: Higher perplexity → smoother embedding
- Convergence: More iterations → lower KL divergence
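With the proptest crate, the finite-values invariant could be expressed along these lines (a sketch under the same assumed API as above):

```rust
use proptest::prelude::*;

proptest! {
    #[test]
    fn embedding_stays_finite(seed in 0u64..1_000, n in 5usize..30) {
        // Deterministic synthetic input; Matrix::from_vec is assumed.
        let data: Vec<f32> = (0..n * 4).map(|v| (v as f32 * 0.37).sin()).collect();
        let x = Matrix::from_vec(n, 4, data);

        let emb = TSNE::new()
            .with_perplexity((n as f32 / 4.0).max(2.0)) // keep perplexity < n/3
            .with_random_state(seed)
            .fit_transform(&x);

        // Invariant from the test list above: every coordinate is finite.
        // (element iteration via `iter` is an assumed API)
        prop_assert!(emb.iter().all(|v: &f32| v.is_finite()));
    }
}
```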
Technical Challenges Solved
Challenge 1: Perplexity Matching
Problem: Finding sigma to match target perplexity. Solution: Binary search on beta = 1/(2σ²) with entropy target.
Challenge 2: Numerical Stability
Problem: Very small probabilities cause log(0) errors. Solution: Clamp probabilities to max(p, 1e-12).
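In code this is a single floor applied wherever a probability is produced (sketch):

```rust
/// Floor a probability so that downstream log() and division stay finite.
#[inline]
fn stabilize(p: f32) -> f32 {
    p.max(1e-12)
}
```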
Challenge 3: Reproducibility
Problem: std::random is non-deterministic. Solution: Custom LCG random generator with seed.
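A seeded LCG of the following shape is sufficient; the constants shown are Knuth's well-known MMIX parameters, and aprender's actual choice may differ:

```rust
/// Minimal seeded linear congruential generator for reproducible runs.
struct Lcg {
    state: u64,
}

impl Lcg {
    fn new(seed: u64) -> Self {
        Lcg { state: seed }
    }

    /// Next pseudo-random f32 in [0, 1).
    fn next_f32(&mut self) -> f32 {
        // Knuth MMIX constants; modulus 2^64 via wrapping arithmetic.
        self.state = self
            .state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        // Take the top 24 bits for an evenly spaced mantissa.
        (self.state >> 40) as f32 / (1u64 << 24) as f32
    }
}

// Small symmetric initialization around zero, e.g. ±0.00005:
// let y0 = (lcg.next_f32() - 0.5) * 1e-4;
```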
Challenge 4: Large Embedding Values
Problem: Embeddings can have very large absolute values. Solution: This is expected - t-SNE preserves relative distances, not absolute positions.