Chapter 35: Semantic Search and Code Clustering

PMAT provides semantic search, topic modeling, and code clustering through pure Rust implementations. Everything runs locally: no API keys or external services are required.

Overview

The semantic search stack uses three core libraries:

Library	Purpose	Key Algorithms
aprender	Machine Learning	TF-IDF, LDA, K-means, DBSCAN
trueno-rag	RAG Pipeline	Hybrid retrieval, RRF fusion
trueno-graph	Graph Database	PageRank, BFS, Louvain

Quick Start

Topic Extraction

Extract semantic topics from your codebase using Latent Dirichlet Allocation (LDA):

# Extract 5 topics
pmat analyze topics --num-topics 5

# Filter by language
pmat analyze topics --num-topics 5 --language rust

# JSON output for CI/CD
pmat analyze topics --num-topics 5 --format json

Example Output:

📊 Topic Extraction Results:
   Documents: 1270
   Topics: 5

   Topic 0 (854 documents):
     Top terms:
       - pub (0.039)
       - fn (0.025)
       - let (0.021)
       - impl (0.015)

   Topic 1 (412 documents):
     Top terms:
       - test (0.045)
       - assert (0.032)
       - #[test] (0.028)

Code Clustering

Group similar code files using various clustering algorithms:

# K-means clustering (specify number of clusters)
pmat analyze cluster --method kmeans --k 8

# DBSCAN (density-based, automatic cluster count)
pmat analyze cluster --method dbscan

# Hierarchical clustering
pmat analyze cluster --method hierarchical --k 5

Example Output:

📊 Clustering Results (kmeans):
   Documents: 1270
   Clusters: 8

   Cluster 0 (425 files):
     - src/services/analysis.rs
     - src/services/complexity.rs
     - src/services/dead_code.rs
     ... and 422 more

   Cluster 1 (312 files):
     - src/cli/commands.rs
     - src/cli/handlers/mod.rs
     ... and 309 more

trueno-rag Integration (Sprint 76+)

PMAT now uses trueno-rag for enhanced RAG pipeline performance:

BM25 Keyword Search

BM25 scores each document's relevance to the query directly, replacing the rank-based RRF heuristic:

// Internally uses trueno-rag's BM25Index.
// chunk_id, document_text, and query are assumed to be defined by the caller.
use trueno_rag::index::BM25Index;

let mut index = BM25Index::new();
index.add(chunk_id, &document_text);
let results = index.search(&query, 10);  // top 10 results

Benefits over RRF:

IDF-weighted term importance (rare terms score higher)
Term frequency saturation (BM25’s k1 parameter)
True relevance vs rank-based fusion
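To make the two BM25 properties above concrete, here is a minimal, standard-library-only sketch of BM25 scoring. It is not trueno-rag's `BM25Index`: the `Bm25Sketch` type, its tokenizer, and the defaults `k1 = 1.2` and `b = 0.75` are assumptions chosen for illustration.

```rust
use std::collections::HashMap;

/// Illustrative in-memory BM25 index (hypothetical; not trueno-rag's BM25Index).
pub struct Bm25Sketch {
    k1: f64, // term-frequency saturation
    b: f64,  // document-length normalization
    docs: Vec<HashMap<String, usize>>, // per-document term counts
    doc_lens: Vec<usize>,              // per-document token counts
    doc_freq: HashMap<String, usize>,  // number of documents containing each term
}

/// Lowercases and splits on anything that is not alphanumeric or '_'.
fn tokenize(text: &str) -> Vec<String> {
    text.split(|c: char| !c.is_alphanumeric() && c != '_')
        .filter(|t| !t.is_empty())
        .map(|t| t.to_lowercase())
        .collect()
}

impl Bm25Sketch {
    pub fn new() -> Self {
        Self {
            k1: 1.2,
            b: 0.75,
            docs: Vec::new(),
            doc_lens: Vec::new(),
            doc_freq: HashMap::new(),
        }
    }

    /// Adds a document and returns its id (its insertion index).
    pub fn add(&mut self, text: &str) -> usize {
        let tokens = tokenize(text);
        let mut counts: HashMap<String, usize> = HashMap::new();
        for t in &tokens {
            *counts.entry(t.clone()).or_insert(0) += 1;
        }
        for term in counts.keys() {
            *self.doc_freq.entry(term.clone()).or_insert(0) += 1;
        }
        self.doc_lens.push(tokens.len());
        self.docs.push(counts);
        self.docs.len() - 1
    }

    /// Returns up to `top_k` (document id, score) pairs, best first.
    pub fn search(&self, query: &str, top_k: usize) -> Vec<(usize, f64)> {
        if self.docs.is_empty() {
            return Vec::new();
        }
        let n = self.docs.len() as f64;
        let avg_len = self.doc_lens.iter().sum::<usize>() as f64 / n;
        let q_terms = tokenize(query);
        let mut scored: Vec<(usize, f64)> = Vec::new();
        for (id, counts) in self.docs.iter().enumerate() {
            let len = self.doc_lens[id] as f64;
            let mut score = 0.0;
            for term in &q_terms {
                let tf = *counts.get(term).unwrap_or(&0) as f64;
                if tf == 0.0 {
                    continue;
                }
                let df = self.doc_freq[term] as f64;
                // Rare terms (small df) receive a larger IDF weight.
                let idf = ((n - df + 0.5) / (df + 0.5) + 1.0).ln();
                // The tf term saturates as tf grows; k1 controls how quickly.
                let norm = 1.0 - self.b + self.b * len / avg_len;
                score += idf * tf * (self.k1 + 1.0) / (tf + self.k1 * norm);
            }
            if score > 0.0 {
                scored.push((id, score));
            }
        }
        scored.sort_by(|x, y| y.1.total_cmp(&x.1));
        scored.truncate(top_k);
        scored
    }
}
```

In this sketch, a document that repeats a common term many times gains little from the repetition, while a single occurrence of a rare query term contributes a comparatively large score.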

SIMD Cosine Similarity

Cosine similarity uses 4-way loop unrolling so that LLVM can auto-vectorize the loop:

// 2-4x speedup on AVX2, 4-8x on AVX-512
// v1 and v2 are equal-length vectors assumed to be defined by the caller.
let similarity = TursoVectorDB::cosine_similarity_simd(&v1, &v2);
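The unrolling idea can be sketched without any intrinsics. The function below is a hypothetical stand-in, not PMAT's `cosine_similarity_simd`: it keeps four independent accumulators per sum so the compiler is free to map them onto SIMD lanes, and it makes no claim about the speedups quoted above.

```rust
/// Cosine similarity using four independent accumulators per running sum.
/// Illustrative only; the name and signature are assumptions.
pub fn cosine_similarity_unrolled(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "vectors must have equal length");
    let mut dot = [0.0f32; 4];
    let mut norm_a = [0.0f32; 4];
    let mut norm_b = [0.0f32; 4];

    let chunks_a = a.chunks_exact(4);
    let chunks_b = b.chunks_exact(4);
    // Elements left over when the length is not a multiple of 4.
    let (rem_a, rem_b) = (chunks_a.remainder(), chunks_b.remainder());

    // Main loop: each lane i accumulates independently of the others.
    for (x, y) in chunks_a.zip(chunks_b) {
        for i in 0..4 {
            dot[i] += x[i] * y[i];
            norm_a[i] += x[i] * x[i];
            norm_b[i] += y[i] * y[i];
        }
    }

    // Combine the lanes, then fold in the scalar tail.
    let mut d: f32 = dot.iter().sum();
    let mut sa: f32 = norm_a.iter().sum();
    let mut sb: f32 = norm_b.iter().sum();
    for (x, y) in rem_a.iter().zip(rem_b) {
        d += x * y;
        sa += x * x;
        sb += y * y;
    }

    let denom = sa.sqrt() * sb.sqrt();
    if denom == 0.0 {
        0.0 // convention chosen here for zero vectors
    } else {
        d / denom
    }
}
```

Whether the loop is actually vectorized depends on the compiler version, optimization level, and target features, so any speedup should be measured rather than assumed.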

RecursiveChunker

Text chunking with overlap for RAG retrieval:

use trueno_rag::chunk::{Chunker, RecursiveChunker};

let chunker = RecursiveChunker::new(512, 64)  // chunk_size, overlap
    .with_separators(vec!["\n\n", "\n", ". ", " "]);
// `?` assumes an enclosing function that returns a Result;
// `document` is assumed to be defined by the caller.
let chunks = chunker.chunk(&document)?;
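The role of the overlap parameter is easiest to see in a simplified form. The sketch below is a plain fixed-size sliding window over characters, not the separator-aware recursive strategy of `RecursiveChunker`; the function name and behavior are assumptions for illustration.

```rust
/// Splits `text` into windows of at most `chunk_size` characters, where each
/// window starts `chunk_size - overlap` characters after the previous one.
/// Simplified illustration: separators are ignored entirely.
pub fn chunk_with_overlap(text: &str, chunk_size: usize, overlap: usize) -> Vec<String> {
    assert!(
        chunk_size > 0 && overlap < chunk_size,
        "overlap must be smaller than chunk_size"
    );
    let chars: Vec<char> = text.chars().collect();
    let step = chunk_size - overlap;
    let mut chunks = Vec::new();
    let mut start = 0;
    while start < chars.len() {
        let end = (start + chunk_size).min(chars.len());
        chunks.push(chars[start..end].iter().collect());
        if end == chars.len() {
            break; // the last window reached the end of the text
        }
        start += step;
    }
    chunks
}
```

Because consecutive windows share `overlap` characters, a sentence that straddles a chunk boundary still appears intact in at least one chunk when it is short enough, which is the property that helps retrieval.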

LSH Index for Duplicate Detection

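Approximate nearest-neighbor lookup in expected O(1) time per query, using MinHash signatures split into bands:

// fragment_id, minhash_signature, and query_signature are assumed to be defined by the caller.
let mut lsh = LshIndex::new(20, 5);  // bands, rows_per_band
lsh.insert(fragment_id, minhash_signature);
let candidates = lsh.query(&query_signature);  // O(1) vs O(n)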
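The banding technique behind such an index can be sketched as follows. This is a hypothetical `LshSketch`, not PMAT's `LshIndex`: it assumes a signature is a slice of `u64` hash values of length bands × rows, and it treats two signatures as candidates when they agree on every row of at least one band.

```rust
use std::collections::{HashMap, HashSet};

/// Illustrative banded LSH index over MinHash-style signatures.
pub struct LshSketch {
    bands: usize,
    rows: usize,
    // One hash table per band: band contents -> ids stored under them.
    buckets: Vec<HashMap<Vec<u64>, Vec<usize>>>,
}

impl LshSketch {
    pub fn new(bands: usize, rows: usize) -> Self {
        assert!(bands > 0 && rows > 0, "bands and rows must be positive");
        Self {
            bands,
            rows,
            buckets: vec![HashMap::new(); bands],
        }
    }

    /// Stores `id` in one bucket per band.
    pub fn insert(&mut self, id: usize, signature: &[u64]) {
        assert_eq!(signature.len(), self.bands * self.rows);
        for (band, rows) in signature.chunks(self.rows).enumerate() {
            self.buckets[band].entry(rows.to_vec()).or_default().push(id);
        }
    }

    /// Returns every id that shares at least one complete band with `signature`.
    pub fn query(&self, signature: &[u64]) -> HashSet<usize> {
        assert_eq!(signature.len(), self.bands * self.rows);
        let mut candidates = HashSet::new();
        for (band, rows) in signature.chunks(self.rows).enumerate() {
            if let Some(ids) = self.buckets[band].get(rows) {
                candidates.extend(ids.iter().copied());
            }
        }
        candidates
    }
}
```

A query performs one hash lookup per band regardless of how many fragments are indexed; the returned candidates still need an exact similarity check, since banding only filters.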

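Collision Probability: P = 1 - (1 - s^r)^b, where s is the Jaccard similarity of two fragments, r is the number of rows per band, and b is the number of bands. With b = 20 and r = 5 as above:

s=0.9 → P≈1.0 (high similarity → always candidates)
s=0.5 → P≈0.47 (medium similarity)
s=0.2 → P≈0.006 (low similarity → rarely candidates)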
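The formula is cheap to evaluate directly, which helps when choosing the number of bands and rows. The helper below is a hypothetical name introduced for illustration and simply computes P = 1 - (1 - s^r)^b.

```rust
/// Probability that two items with similarity `s` share at least one band,
/// given `bands` bands of `rows` rows each: 1 - (1 - s^rows)^bands.
pub fn candidate_probability(s: f64, bands: u32, rows: u32) -> f64 {
    1.0 - (1.0 - s.powi(rows as i32)).powi(bands as i32)
}
```

Evaluating it over a range of s values for a given (bands, rows) pair shows the characteristic S-curve: more rows per band push the threshold toward higher similarity, and more bands push it lower.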

Algorithms

TF-IDF Vectorization

Converts code files to numerical vectors in which each term is weighted by how often it occurs in a file and how rare it is across the codebase:

TF (Term Frequency): How often a term appears in a document
IDF (Inverse Document Frequency): Penalizes common terms across all documents

// Internally uses aprender's TfidfVectorizer
let vectorizer = TfidfVectorizer::new()
    .with_max_features(1000)
    .with_min_df(2);
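As a rough illustration of the weighting itself, the sketch below computes sparse TF-IDF vectors with the standard library only. It assumes the textbook variant, tf = count / document length and idf = ln(N / df); aprender's `TfidfVectorizer` may use a different smoothing or normalization, and the function name here is hypothetical.

```rust
use std::collections::HashMap;

/// Returns one sparse TF-IDF vector (term -> weight) per tokenized document.
/// Textbook variant: tf = count / doc length, idf = ln(N / df).
pub fn tfidf(docs: &[Vec<&str>]) -> Vec<HashMap<String, f64>> {
    let n = docs.len() as f64;

    // Document frequency: in how many documents each term occurs.
    let mut df: HashMap<&str, usize> = HashMap::new();
    for doc in docs {
        let mut unique: Vec<&str> = doc.clone();
        unique.sort_unstable();
        unique.dedup();
        for t in unique {
            *df.entry(t).or_insert(0) += 1;
        }
    }

    docs.iter()
        .map(|doc| {
            let mut counts: HashMap<&str, usize> = HashMap::new();
            for &t in doc {
                *counts.entry(t).or_insert(0) += 1;
            }
            counts
                .into_iter()
                .map(|(t, c)| {
                    let tf = c as f64 / doc.len() as f64;
                    // A term present in every document gets idf = 0.
                    let idf = (n / df[t] as f64).ln();
                    (t.to_string(), tf * idf)
                })
                .collect()
        })
        .collect()
}
```

With this variant, a keyword such as `fn` that appears in every file receives weight zero, so the vectors are dominated by terms that distinguish one file from another.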

Citation: Manning, C. D., Raghavan, P., & Schütze, H. (2008). “Introduction to Information Retrieval.”

LDA Topic Modeling

Discovers latent topics in your codebase:

Each document is a mixture of topics
Each topic is a distribution over terms
Uses Gibbs sampling for inference
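For readers who want the sampling step spelled out: assuming the collapsed form of Gibbs sampling with symmetric Dirichlet priors α and β (the text above does not say which variant PMAT uses), the topic of the i-th token, which is word w_i in document d, is resampled from

```latex
p(z_i = k \mid \mathbf{z}_{-i}, \mathbf{w}) \;\propto\;
\left(n_{d,k}^{-i} + \alpha\right)
\frac{n_{k,w_i}^{-i} + \beta}{n_{k}^{-i} + V\beta}
```

Here n_{d,k} counts the tokens in document d assigned to topic k, n_{k,w_i} counts the assignments of word w_i to topic k, n_k is the total number of tokens assigned to topic k, V is the vocabulary size, and the superscript -i means the current token is excluded from the counts. The first factor reflects "each document is a mixture of topics" and the second "each topic is a distribution over terms".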

Citation: Blei, D. M., Ng, A. Y., & Jordan, M. I. (2003). “Latent Dirichlet Allocation.” JMLR.

K-means Clustering

Partitions code into k clusters by minimizing within-cluster variance:

Initialize k centroids randomly
Assign each document to nearest centroid
Recompute centroids as cluster means
Repeat until convergence
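The four steps above can be sketched as follows. This is not aprender's implementation: for reproducibility the sketch seeds the centroids with the first k points instead of a random draw, and the function name and signature are assumptions.

```rust
/// Squared Euclidean distance between two points of equal dimension.
fn sq_dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum()
}

/// Lloyd's algorithm. Returns the cluster index assigned to each point.
/// Assumption: centroids start at the first `k` points (deterministic),
/// not at random positions.
pub fn kmeans(points: &[Vec<f64>], k: usize, max_iters: usize) -> Vec<usize> {
    assert!(k > 0 && k <= points.len(), "need 1 <= k <= number of points");
    let dim = points[0].len();
    let mut centroids: Vec<Vec<f64>> = points[..k].to_vec();
    let mut labels = vec![0usize; points.len()];

    for _ in 0..max_iters {
        // Assignment step: attach each point to its nearest centroid.
        let mut changed = false;
        for (i, p) in points.iter().enumerate() {
            let mut best = 0;
            for c in 1..k {
                if sq_dist(p, &centroids[c]) < sq_dist(p, &centroids[best]) {
                    best = c;
                }
            }
            if labels[i] != best {
                labels[i] = best;
                changed = true;
            }
        }

        // Update step: move each centroid to the mean of its members.
        let mut sums = vec![vec![0.0; dim]; k];
        let mut counts = vec![0usize; k];
        for (p, &l) in points.iter().zip(&labels) {
            counts[l] += 1;
            for (s, x) in sums[l].iter_mut().zip(p) {
                *s += x;
            }
        }
        for c in 0..k {
            if counts[c] > 0 {
                centroids[c] = sums[c].iter().map(|s| s / counts[c] as f64).collect();
            }
        }

        // Converged: no assignment changed in this pass.
        if !changed {
            break;
        }
    }
    labels
}
```

In PMAT's setting the points would be the TF-IDF vectors of the files. The result of k-means depends on the starting centroids, which is why implementations with random initialization usually expose a seed or run several restarts.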

Citation: MacQueen, J. (1967). “Some Methods for Classification and Analysis of Multivariate Observations.”

DBSCAN Clustering

Density-based clustering that finds arbitrarily shaped clusters:

eps: Maximum distance between neighbors
min_samples: Minimum points to form a cluster
Automatically identifies outliers (noise points)
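A compact sketch shows how the two parameters interact. It is a brute-force O(n²) version written for clarity, not aprender's implementation; the function name is hypothetical, and it assumes Euclidean distance and that a point counts as its own neighbor when compared against `min_samples`.

```rust
/// Euclidean distance between two points of equal dimension.
fn dist(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| (x - y).powi(2)).sum::<f64>().sqrt()
}

/// Brute-force DBSCAN. Returns Some(cluster id) per point, or None for noise.
pub fn dbscan(points: &[Vec<f64>], eps: f64, min_samples: usize) -> Vec<Option<usize>> {
    let n = points.len();
    // All points within `eps` of point i, including i itself.
    let neighbors = |i: usize| -> Vec<usize> {
        (0..n).filter(|&j| dist(&points[i], &points[j]) <= eps).collect()
    };

    let mut labels: Vec<Option<usize>> = vec![None; n];
    let mut visited = vec![false; n];
    let mut cluster = 0;

    for i in 0..n {
        if visited[i] {
            continue;
        }
        visited[i] = true;
        let seeds = neighbors(i);
        if seeds.len() < min_samples {
            continue; // not a core point: noise unless a later cluster claims it
        }

        // i is a core point: start a new cluster and expand it.
        labels[i] = Some(cluster);
        let mut queue = seeds;
        while let Some(j) = queue.pop() {
            if labels[j].is_none() {
                labels[j] = Some(cluster); // core or border point joins the cluster
            }
            if visited[j] {
                continue;
            }
            visited[j] = true;
            let reach = neighbors(j);
            if reach.len() >= min_samples {
                queue.extend(reach); // j is also a core point: keep expanding
            }
        }
        cluster += 1;
    }
    labels
}
```

Points that end with `None` are the outliers: they are not core points themselves and do not lie within `eps` of any core point.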

Citation: Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). “A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise.” KDD.

CI/CD Integration

JSON Output

pmat analyze topics --num-topics 5 --format json > topics.json
pmat analyze cluster --method kmeans --k 8 --format json > clusters.json

GitHub Actions Example

name: Code Analysis
on: [push]

jobs:
  semantic-analysis:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install PMAT
        run: cargo install pmat

      - name: Extract Topics
        run: pmat analyze topics --num-topics 10 --format json > topics.json

      - name: Cluster Code
        run: pmat analyze cluster --method kmeans --k 5 --format json > clusters.json

      - name: Upload Results
        uses: actions/upload-artifact@v4
        with:
          name: semantic-analysis
          path: |
            topics.json
            clusters.json

Use Cases

1. Understanding New Codebases

# What are the main themes in this codebase?
pmat analyze topics --num-topics 10

# How is the code organized?
pmat analyze cluster --method hierarchical --k 5

2. Finding Similar Files and Outliers

# Group similar files together
pmat analyze cluster --method kmeans --k 10

# Find outliers (unusual code)
pmat analyze cluster --method dbscan

3. Identifying Code Modules

# Discover natural module boundaries
pmat analyze cluster --method hierarchical --k 8

# Topics often align with modules
pmat analyze topics --num-topics 8

4. Code Review Preparation

# Understand code themes before review
pmat analyze topics --num-topics 5 --language rust

Performance

Operation	Files	Time
Topic Extraction (5 topics)	1,000	~2s
K-means (k=5)	1,000	~1s
DBSCAN	1,000	~3s
Hierarchical	1,000	~2s

Comparison with API-Based Solutions

Factor	PMAT (Local)	API-Based
Latency	10-50ms	200-500ms
Cost	$0	$0.10-$1/project
Privacy	Code stays local	Code sent to cloud
Offline	Works offline	Requires internet
Reproducibility	Deterministic	Model versions change

Specification

For implementation details, see:

docs/specifications/semantic-search-feature.md
10 peer-reviewed citations supporting algorithm choices

Example

Run the demo example:

cargo run --example semantic_search_demo

Summary

PMAT’s semantic search provides:

Topic modeling with LDA for understanding code themes
Clustering with K-means, DBSCAN, and hierarchical methods
Pure Rust implementation with no dependency on external services
Offline operation - no API keys required
CI/CD ready with JSON output support

PMAT: The PAIML MCP Agent Toolkit