Graph Algorithms Theory

Graph algorithms are fundamental tools for analyzing relationships and structures in networked data. This chapter covers the theory behind aprender's graph module, focusing on efficient representations and centrality measures.

Graph Representation

Adjacency List vs CSR

Graphs can be represented in multiple ways, each with different performance characteristics:

Adjacency List (HashMap-based):

HashMap<NodeId, Vec<NodeId>>
Pros: Easy to modify, intuitive API
Cons: Poor cache locality, 50-70% memory overhead from pointers

Compressed Sparse Row (CSR):

Two flat arrays: row_ptr (offsets) and col_indices (neighbors)
Pros: 50-70% memory reduction, sequential access (3-5x fewer cache misses)
Cons: Immutable structure, slightly more complex construction

Aprender uses CSR for production workloads, optimizing for read-heavy analytics.

CSR Format Details

For a graph with n nodes and m edges:

row_ptr: [0, 2, 5, 7, ...]  # length = n + 1
col_indices: [1, 3, 0, 2, 4, ...]  # length = m (undirected: 2m)

Neighbors of node v are stored in:

col_indices[row_ptr[v] .. row_ptr[v+1]]

Memory comparison (1M nodes, 5M edges):

HashMap: ~240 MB (pointers + Vec overhead)
CSR: ~84 MB (two flat arrays)

Degree Centrality

Definition

Degree centrality measures the number of edges connected to a node. It identifies the most "popular" nodes in a network.

Unnormalized degree:

C_D(v) = deg(v)

Freeman normalization (for comparability across graphs):

C_D(v) = deg(v) / (n - 1)

where n is the number of nodes.

Implementation

use aprender::graph::Graph;

let edges = vec![(0, 1), (1, 2), (2, 3), (0, 2)];
let graph = Graph::from_edges(&edges, false);

let centrality = graph.degree_centrality();
for (node, score) in centrality.iter() {
    println!("Node {}: {:.3}", node, score);
}

Time Complexity

Construction: O(n + m) to build CSR
Query: O(1) per node (subtract adjacent row_ptr values)
All nodes: O(n)

Applications

Social networks: Find influencers by connection count
Protein interaction networks: Identify hub proteins
Transportation: Find major transit hubs

PageRank

Theory

PageRank models the probability that a random surfer lands on a node. Originally developed by Google for web page ranking, it considers both quantity and quality of connections.

Iterative formula:

PR(v) = (1-d)/n + d * Σ[PR(u) / outdeg(u)]

where:

d = damping factor (typically 0.85)
n = number of nodes
Sum over all nodes u with edges to v

Dangling Nodes

Nodes with no outgoing edges (dangling nodes) require special handling to preserve the probability distribution:

dangling_sum = Σ PR(v) for all dangling v
PR_new(v) += d * dangling_sum / n

Without this correction, rank "leaks" out of the system and Σ PR(v) ≠ 1.

Numerical Stability

Naive summation accumulates O(n·ε) floating-point error on large graphs. Aprender uses Kahan compensated summation:

let mut sum = 0.0;
let mut c = 0.0;  // Compensation term

for value in values {
    let y = value - c;
    let t = sum + y;
    c = (t - sum) - y;  // Recover low-order bits
    sum = t;
}

Result: Σ PR(v) = 1.0 within 1e-10 precision (vs 1e-5 naive).

Implementation

use aprender::graph::Graph;

let edges = vec![(0, 1), (1, 2), (2, 3), (3, 0)];
let graph = Graph::from_edges(&edges, true);  // directed

let ranks = graph.pagerank(0.85, 100, 1e-6).unwrap();
println!("PageRank scores: {:?}", ranks);

Time Complexity

Per iteration: O(n + m)
Convergence: Typically 20-50 iterations
Total: O(k(n + m)) where k = iteration count

Applications

Web search: Rank pages by importance
Social networks: Identify influential users (considers network structure)
Citation analysis: Find seminal papers

Betweenness Centrality

Theory

Betweenness centrality measures how often a node appears on shortest paths between other nodes. High betweenness indicates bridging role in the network.

Formula:

C_B(v) = Σ[σ_st(v) / σ_st]

where:

σ_st = number of shortest paths from s to t
σ_st(v) = number of those paths passing through v
Sum over all pairs s ≠ t ≠ v

Brandes' Algorithm

Naive computation is O(n³). Brandes' algorithm reduces this to O(nm) using two phases:

Phase 1: Forward BFS from each source

Compute shortest path counts
Build predecessor lists

Phase 2: Backward accumulation

Propagate dependencies from leaves to root
Accumulate betweenness scores

Parallel Implementation

The outer loop (BFS from each source) is embarrassingly parallel:

use rayon::prelude::*;

let partial_scores: Vec<Vec<f64>> = (0..n)
    .into_par_iter()  // Parallel iterator
    .map(|source| brandes_bfs_from_source(source))
    .collect();

// Reduce (single-threaded, fast)
let mut centrality = vec![0.0; n];
for partial in partial_scores {
    for (i, &score) in partial.iter().enumerate() {
        centrality[i] += score;
    }
}

Expected speedup: ~8x on 8-core CPU for graphs with >1K nodes.

Normalization

For undirected graphs, each path is counted twice:

if !is_directed {
    for score in &mut centrality {
        *score /= 2.0;
    }
}

Implementation

use aprender::graph::Graph;

let edges = vec![
    (0, 1), (1, 2), (2, 3),  // Linear chain
    (1, 4), (4, 3),          // Shortcut
];
let graph = Graph::from_edges(&edges, false);

let betweenness = graph.betweenness_centrality();
println!("Node 1 betweenness: {:.2}", betweenness[1]);  // High (bridge)

Time Complexity

Serial: O(nm) for unweighted graphs
Parallel: O(nm / p) where p = number of cores
Space: O(n + m) per thread

Applications

Social networks: Find connectors between communities
Transportation: Identify critical junctions
Epidemiology: Find super-spreaders in contact networks

Performance Characteristics

Memory Usage (1M nodes, 10M edges)

Representation	Memory	Cache Misses
HashMap adjacency	480 MB	High (pointer chasing)
CSR adjacency	168 MB	Low (sequential)

Runtime Benchmarks (Intel i7-8700K, 6 cores)

Algorithm	10K nodes	100K nodes	1M nodes
Degree centrality	<1 ms	8 ms	95 ms
PageRank (50 iter)	12 ms	180 ms	2.4 s
Betweenness (serial)	450 ms	52 s	timeout
Betweenness (parallel)	95 ms	8.7 s	89 s

Parallelization benefit: 4.7x speedup on 6-core CPU.

Real-World Applications

Problem: Identify influential users in a social network.

Approach:

Build graph from friendship/follower edges
Compute PageRank for overall influence
Compute betweenness to find community bridges
Compute degree for local popularity

Example: Twitter influencer detection, LinkedIn connection recommendations.

Supply Chain Optimization

Problem: Find critical nodes in a logistics network.

Approach:

Model warehouses/suppliers as nodes
Compute betweenness centrality
High-betweenness nodes are single points of failure
Add redundancy or buffer inventory

Example: Amazon warehouse placement, manufacturing supply chains.

Epidemiology

Problem: Prioritize vaccination in contact networks.

Approach:

Build contact network from tracing data
Compute betweenness centrality
Vaccinate high-betweenness individuals first
Reduces R₀ by breaking transmission paths

Example: COVID-19 contact tracing, hospital infection control.

10K nodes: error ~1e-7
100K nodes: error ~1e-5
1M nodes: error ~1e-4

With Kahan summation, error consistently <1e-10.

Degree: Quick analysis, local importance only
PageRank: Global influence, considers network structure
Betweenness: Find bridges, critical paths

Graph Construction Tips

// Build graph once, query many times
let graph = Graph::from_edges(&edges, false);

// Reuse for multiple algorithms
let degree = graph.degree_centrality();
let pagerank = graph.pagerank(0.85, 100, 1e-6).unwrap();
let betweenness = graph.betweenness_centrality();

Choosing PageRank Parameters

Damping factor (d): 0.85 standard, higher = more weight to links
Max iterations: 100 usually sufficient (convergence ~20-50 iterations)
Tolerance: 1e-6 balances precision vs speed

Summary

CSR format: 50-70% memory reduction, 3-5x cache improvement
PageRank: Global influence with Kahan summation for numerical stability
Betweenness: Identifies bridges with parallel Brandes algorithm
Performance: Scales to 1M+ nodes with parallel algorithms
Toyota Way: Eliminates waste (CSR), prevents errors (Kahan), balances load (Rayon)

EXTREME TDD - The Aprender Guide to Zero-Defect Machine Learning