Chapter 35: Semantic Search and Code Clustering

PMAT provides semantic search, topic modeling, and code clustering, all implemented in pure Rust. No API keys or external services are required.

Overview

The semantic search stack uses three core libraries:

| Library | Purpose | Key Algorithms |
|---|---|---|
| aprender | Machine Learning | TF-IDF, LDA, K-means, DBSCAN |
| trueno-rag | RAG Pipeline | Hybrid retrieval, RRF fusion |
| trueno-graph | Graph Database | PageRank, BFS, Louvain |

Quick Start

Topic Extraction

Extract semantic topics from your codebase using Latent Dirichlet Allocation (LDA):

# Extract 5 topics
pmat analyze topics --num-topics 5

# Filter by language
pmat analyze topics --num-topics 5 --language rust

# JSON output for CI/CD
pmat analyze topics --num-topics 5 --format json

Example Output:

📊 Topic Extraction Results:
   Documents: 1270
   Topics: 5

   Topic 0 (854 documents):
     Top terms:
       - pub (0.039)
       - fn (0.025)
       - let (0.021)
       - impl (0.015)

   Topic 1 (412 documents):
     Top terms:
       - test (0.045)
       - assert (0.032)
       - #[test] (0.028)

Code Clustering

Group similar code files using various clustering algorithms:

# K-means clustering (specify number of clusters)
pmat analyze cluster --method kmeans --k 8

# DBSCAN (density-based, automatic cluster count)
pmat analyze cluster --method dbscan

# Hierarchical clustering
pmat analyze cluster --method hierarchical --k 5

Example Output:

📊 Clustering Results (kmeans):
   Documents: 1270
   Clusters: 8

   Cluster 0 (425 files):
     - src/services/analysis.rs
     - src/services/complexity.rs
     - src/services/dead_code.rs
     ... and 422 more

   Cluster 1 (312 files):
     - src/cli/commands.rs
     - src/cli/handlers/mod.rs
     ... and 309 more

Algorithms

TF-IDF Vectorization

Converts code files to numerical vectors by weighting each term's frequency in a file against how common that term is across the whole corpus:

  • TF (Term Frequency): How often a term appears in a document
  • IDF (Inverse Document Frequency): Penalizes common terms across all documents

// Internally uses aprender's TfidfVectorizer
let vectorizer = TfidfVectorizer::new()
    .with_max_features(1000)
    .with_min_df(2);

Citation: Manning, C. D., Raghavan, P., & Schütze, H. (2008). “Introduction to Information Retrieval.”
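
To make the TF and IDF definitions above concrete, here is a minimal std-only sketch of the weighting. It is not aprender's implementation: the `tokenize` and `tfidf` helpers are hypothetical, and the unsmoothed formula (term count divided by document length, times ln(N / df)) is an assumption, since libraries differ in smoothing and normalization.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Hypothetical tokenizer: split on anything that is not alphanumeric or `_`.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !(c.is_alphanumeric() || c == '_'))
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

/// Returns the sorted vocabulary and one TF-IDF row per document.
/// Terms that appear in fewer than `min_df` documents are dropped.
/// Assumed weighting: tf = count / doc_len, idf = ln(N / df), no smoothing.
fn tfidf(docs: &[&str], min_df: usize) -> (Vec<String>, Vec<Vec<f64>>) {
    let tokenized: Vec<Vec<String>> = docs.iter().map(|d| tokenize(d)).collect();

    // Document frequency: number of documents that contain each term.
    let mut df: BTreeMap<String, usize> = BTreeMap::new();
    for tokens in &tokenized {
        let unique: BTreeSet<&String> = tokens.iter().collect();
        for term in unique {
            *df.entry(term.clone()).or_insert(0) += 1;
        }
    }
    let vocab: Vec<String> = df
        .iter()
        .filter(|(_, count)| **count >= min_df)
        .map(|(term, _)| term.clone())
        .collect();

    let n = docs.len() as f64;
    let rows: Vec<Vec<f64>> = tokenized
        .iter()
        .map(|tokens| {
            vocab
                .iter()
                .map(|term| {
                    let count = tokens.iter().filter(|t| *t == term).count() as f64;
                    let tf = if tokens.is_empty() {
                        0.0
                    } else {
                        count / tokens.len() as f64
                    };
                    let idf = (n / df[term] as f64).ln();
                    tf * idf
                })
                .collect()
        })
        .collect();
    (vocab, rows)
}

fn main() {
    let docs = ["fn parse(input)", "fn render(output)", "fn parse(tokens)"];
    let (vocab, rows) = tfidf(&docs, 1);
    for (i, row) in rows.iter().enumerate() {
        println!("doc {i}:");
        for (term, weight) in vocab.iter().zip(row) {
            println!("  {term}: {weight:.3}");
        }
    }
}
```

In this toy corpus `fn` occurs in every document, so its IDF is ln(1) = 0 and it contributes nothing to any vector, while a term unique to one file gets the largest weight. This is the sense in which IDF penalizes common terms.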

LDA Topic Modeling

Discovers latent topics in your codebase:

  • Each document is a mixture of topics
  • Each topic is a distribution over terms
  • Uses Gibbs sampling for inference

Citation: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” JMLR.
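
For readers who want the sampling step spelled out, the textbook collapsed Gibbs update for LDA is given below. The symmetric Dirichlet priors α and β and the count notation are assumptions of this sketch; the source does not state which priors or sampler variant PMAT uses.

```latex
p\left(z_{d,i} = k \mid \mathbf{z}_{\neg(d,i)}, \mathbf{w}\right)
\;\propto\;
\left(n_{d,k}^{\neg(d,i)} + \alpha\right)
\cdot
\frac{n_{k,\,w_{d,i}}^{\neg(d,i)} + \beta}{n_{k}^{\neg(d,i)} + V\beta}
```

Here n_{d,k} is the number of tokens in document d assigned to topic k, n_{k,w} is the number of times term w is assigned to topic k, n_k is the total number of tokens assigned to topic k, V is the vocabulary size, and the superscript ¬(d,i) means the current token is excluded from the counts. The first factor reflects "each document is a mixture of topics"; the second reflects "each topic is a distribution over terms".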

K-means Clustering

Partitions code into k clusters by minimizing within-cluster variance:

  1. Initialize k centroids randomly
  2. Assign each document to nearest centroid
  3. Recompute centroids as cluster means
  4. Repeat until convergence

Citation: MacQueen, J. (1967). “Some Methods for Classification and Analysis of Multivariate Observations.”
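
The four steps above can be sketched in plain Rust as follows. This is an illustration of the algorithm, not aprender's code: the `kmeans` and `sq_dist` functions are hypothetical, and the sketch seeds centroids from the first k points (rather than randomly, as in step 1) so that it stays deterministic and needs no random-number crate.

```rust
/// Squared Euclidean distance between two equal-length vectors.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Lloyd's algorithm. Returns (assignments, centroids).
/// Assumption: centroids are seeded from the first `k` points, a
/// deterministic stand-in for the random initialization in step 1.
fn kmeans(points: &[Vec<f64>], k: usize, max_iter: usize) -> (Vec<usize>, Vec<Vec<f64>>) {
    assert!(k > 0 && k <= points.len(), "need 1 <= k <= number of points");
    let dim = points[0].len();
    let mut centroids: Vec<Vec<f64>> = points[..k].to_vec();
    let mut labels = vec![usize::MAX; points.len()];

    for _ in 0..max_iter {
        // Step 2: assign each point to its nearest centroid.
        let mut changed = false;
        for (i, p) in points.iter().enumerate() {
            let mut best = 0;
            for c in 1..k {
                if sq_dist(p, &centroids[c]) < sq_dist(p, &centroids[best]) {
                    best = c;
                }
            }
            if labels[i] != best {
                labels[i] = best;
                changed = true;
            }
        }
        // Step 4: converged once no assignment changes.
        if !changed {
            break;
        }
        // Step 3: recompute each centroid as the mean of its members.
        let mut sums = vec![vec![0.0; dim]; k];
        let mut counts = vec![0usize; k];
        for (p, &label) in points.iter().zip(&labels) {
            counts[label] += 1;
            for (s, x) in sums[label].iter_mut().zip(p) {
                *s += x;
            }
        }
        for c in 0..k {
            // An empty cluster keeps its previous centroid.
            if counts[c] > 0 {
                centroids[c] = sums[c].iter().map(|s| s / counts[c] as f64).collect();
            }
        }
    }
    (labels, centroids)
}

fn main() {
    let points = vec![
        vec![0.0, 0.0],
        vec![10.0, 10.0],
        vec![0.0, 1.0],
        vec![1.0, 0.0],
        vec![10.0, 11.0],
        vec![11.0, 10.0],
    ];
    let (labels, centroids) = kmeans(&points, 2, 100);
    println!("labels: {labels:?}");
    println!("centroids: {centroids:?}");
}
```

In PMAT the points would be TF-IDF vectors rather than 2-D coordinates, but the loop is the same. Because the result depends on the starting centroids, production implementations typically use random or k-means++ seeding and may run several restarts.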

DBSCAN Clustering

Density-based clustering that finds arbitrarily shaped clusters:

  • eps: Maximum distance at which two points count as neighbors
  • min_samples: Minimum number of points within eps of a point for it to seed or extend a cluster
  • Automatically identifies outliers (noise points)

Citation: Ester, M., Kriegel, H. P., Sander, J., & Xu, X. (1996). “A Density-Based Algorithm for Discovering Clusters.”
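
A minimal sketch of how eps and min_samples interact is shown below. It is a brute-force O(n²) illustration, not aprender's implementation; the `Label` enum and the `dbscan`, `neighbors`, and `dist` functions are hypothetical names, and Euclidean distance is an assumption.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq)]
enum Label {
    Noise,
    Cluster(usize),
}

/// Euclidean distance (assumed metric for this sketch).
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter()
        .zip(b)
        .map(|(x, y)| (x - y) * (x - y))
        .sum::<f64>()
        .sqrt()
}

/// Indices of all points within `eps` of point `i`, including `i` itself.
fn neighbors(points: &[Vec<f64>], i: usize, eps: f64) -> Vec<usize> {
    (0..points.len())
        .filter(|&j| dist(&points[i], &points[j]) <= eps)
        .collect()
}

fn dbscan(points: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<Label> {
    let mut labels: Vec<Option<Label>> = vec![None; points.len()];
    let mut next_cluster = 0;

    for i in 0..points.len() {
        if labels[i].is_some() {
            continue; // already visited
        }
        let seeds = neighbors(points, i, eps);
        if seeds.len() < min_samples {
            // Not a core point; may still be claimed later as a border point.
            labels[i] = Some(Label::Noise);
            continue;
        }
        // `i` is a core point: grow a new cluster outward from it.
        labels[i] = Some(Label::Cluster(next_cluster));
        let mut queue: VecDeque<usize> = seeds.into();
        while let Some(j) = queue.pop_front() {
            if matches!(labels[j], Some(Label::Cluster(_))) {
                continue; // already belongs to a cluster
            }
            // Unvisited or previously noise: becomes a member of this cluster.
            labels[j] = Some(Label::Cluster(next_cluster));
            let reach = neighbors(points, j, eps);
            if reach.len() >= min_samples {
                queue.extend(reach); // only core points expand the cluster
            }
        }
        next_cluster += 1;
    }
    labels
        .into_iter()
        .map(|l| l.unwrap_or(Label::Noise))
        .collect()
}

fn main() {
    let points = vec![
        vec![0.0, 0.0],
        vec![0.0, 1.0],
        vec![1.0, 0.0],
        vec![10.0, 10.0],
        vec![10.0, 11.0],
        vec![11.0, 10.0],
        vec![50.0, 50.0], // isolated point
    ];
    for (p, label) in points.iter().zip(dbscan(&points, 1.5, 3)) {
        println!("{p:?} -> {label:?}");
    }
}
```

The number of clusters falls out of the data: with eps = 1.5 and min_samples = 3 the two dense groups become two clusters, and the isolated point has too few neighbors to join either, so it is reported as noise.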

CI/CD Integration

JSON Output

pmat analyze topics --num-topics 5 --format json > topics.json
pmat analyze cluster --method kmeans --k 8 --format json > clusters.json

GitHub Actions Example

name: Code Analysis
on: [push]

jobs:
  semantic-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install PMAT
        run: cargo install pmat

      - name: Extract Topics
        run: pmat analyze topics --num-topics 10 --format json > topics.json

      - name: Cluster Code
        run: pmat analyze cluster --method kmeans --k 5 --format json > clusters.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: semantic-analysis
          path: |
            topics.json
            clusters.json

Use Cases

1. Understanding New Codebases

# What are the main themes in this codebase?
pmat analyze topics --num-topics 10

# How is the code organized?
pmat analyze cluster --method hierarchical --k 5

2. Finding Similar Code

# Group similar files together
pmat analyze cluster --method kmeans --k 10

# Find outliers (unusual code)
pmat analyze cluster --method dbscan

3. Identifying Code Modules

# Discover natural module boundaries
pmat analyze cluster --method hierarchical --k 8

# Topics often align with modules
pmat analyze topics --num-topics 8

4. Code Review Preparation

# Understand code themes before review
pmat analyze topics --num-topics 5 --language rust

Performance

| Operation | Files | Time |
|---|---|---|
| Topic Extraction (5 topics) | 1,000 | ~2s |
| K-means (k=5) | 1,000 | ~1s |
| DBSCAN | 1,000 | ~3s |
| Hierarchical | 1,000 | ~2s |

Comparison with API-Based Solutions

| Factor | PMAT (Local) | API-Based |
|---|---|---|
| Latency | 10-50ms | 200-500ms |
| Cost | $0 | $0.10-$1/project |
| Privacy | Code stays local | Code sent to cloud |
| Offline | Works offline | Requires internet |
| Reproducibility | Deterministic | Model versions change |

Specification

For implementation details, see:

  • docs/specifications/semantic-search-feature.md
  • 10 peer-reviewed citations supporting algorithm choices

Example

Run the demo example:

cargo run --example semantic_search_demo

Summary

PMAT’s semantic search provides:

  • Topic modeling with LDA for understanding code themes
  • Clustering with K-means, DBSCAN, and hierarchical methods
  • Pure Rust implementation with no dependence on external services
  • Offline operation - no API keys required
  • CI/CD ready with JSON output support