Code Analysis with Code2Vec and MPNN

This chapter demonstrates aprender's code analysis capabilities using Code2Vec embeddings and Message Passing Neural Networks (MPNN).

Overview

The aprender::code module provides tools for:

AST Representation: Lightweight AST node types for code structures
Path Extraction: Code2Vec-style paths between terminal nodes
Code Embeddings: Dense vector representations of code
Graph Neural Networks: MPNN for type/lifetime propagation

Use Cases

Application	Description
Code Similarity	Find similar functions across codebases
Function Naming	Predict meaningful function names
Type Inference	Propagate types through data flow
Bug Detection	Identify anomalous code patterns

Quick Start

use aprender::code::{AstNode, AstNodeType, Code2VecEncoder, PathExtractor};

// Build an AST
let mut func = AstNode::new(AstNodeType::Function, "add");
func.add_child(AstNode::new(AstNodeType::Parameter, "x"));
func.add_child(AstNode::new(AstNodeType::Parameter, "y"));
func.add_child(AstNode::new(AstNodeType::Return, "result"));

// Extract Code2Vec paths
let extractor = PathExtractor::new(8);
let paths = extractor.extract(&func);

// Generate embedding
let encoder = Code2VecEncoder::new(128);
let embedding = encoder.aggregate_paths(&paths);
println!("Embedding dimension: {}", embedding.dim());

AST Representation

The module provides 24 AST node types covering common code constructs:

Node Types

Category	Types
Definitions	`Function`, `Struct`, `Enum`, `Trait`, `Impl`, `Module`
Statements	`Variable`, `Assignment`, `Return`, `Conditional`, `Loop`, `Match`
Expressions	`BinaryOp`, `UnaryOp`, `Call`, `Literal`, `Index`, `FieldAccess`
Types	`TypeAnnotation`, `Generic`, `Parameter`
Other	`Block`, `MatchArm`, `Import`

Token Types

Type	Description
`Identifier`	Variable/function names
`Number`	Numeric literals
`String`	String literals
`TypeName`	Type names
`Operator`	Operators (+, -, *, /)
`Keyword`	Language keywords

Code2Vec Path Extraction

Each path connects two terminal nodes (leaves) by climbing from one leaf to their lowest common ancestor and descending to the other. For example:

fn add(x, y) -> x + y

Paths extracted:
  x → Param ↑ Func ↓ Param → y
  x → Param ↑ Func ↓ Return ↓ BinaryOp → result
  ...
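
The climb-then-descend idea can be sketched without aprender at all. The block below is a minimal illustration under stated assumptions: `Node`, `leaf_chains`, and `extract_paths` are hypothetical names, not aprender's API, and the length limit counts edges, which may differ from how `PathExtractor` measures path length.

```rust
/// Minimal tree node for illustration only; this is NOT aprender's `AstNode`.
struct Node {
    label: String,
    children: Vec<Node>,
}

impl Node {
    fn new(label: &str) -> Self {
        Node { label: label.to_string(), children: Vec::new() }
    }

    fn with_children(label: &str, children: Vec<Node>) -> Self {
        Node { label: label.to_string(), children }
    }
}

/// Record the root-to-leaf chain of nodes for every leaf, left to right.
fn leaf_chains<'a>(node: &'a Node, prefix: &mut Vec<&'a Node>, out: &mut Vec<Vec<&'a Node>>) {
    prefix.push(node);
    if node.children.is_empty() {
        out.push(prefix.clone());
    } else {
        for child in &node.children {
            leaf_chains(child, prefix, out);
        }
    }
    prefix.pop();
}

/// Build one path per leaf pair: up from the first leaf to the lowest
/// common ancestor, then down to the second leaf. Pairs whose path has
/// more than `max_len` edges are skipped (edge counting is an assumption).
fn extract_paths(root: &Node, max_len: usize) -> Vec<String> {
    let mut chains: Vec<Vec<&Node>> = Vec::new();
    leaf_chains(root, &mut Vec::new(), &mut chains);

    let mut paths = Vec::new();
    for i in 0..chains.len() {
        for j in (i + 1)..chains.len() {
            let (a, b) = (&chains[i], &chains[j]);
            // Shared prefix length, compared by node identity rather than
            // label, so two distinct `Param` nodes are never confused.
            let k = a
                .iter()
                .zip(b.iter())
                .take_while(|(x, y)| std::ptr::eq(**x, **y))
                .count();
            if (a.len() - k) + (b.len() - k) > max_len {
                continue;
            }
            let mut s = String::new();
            for n in a[k..].iter().rev() {
                s.push_str(&n.label);
                s.push_str(" ↑ ");
            }
            s.push_str(&a[k - 1].label); // the lowest common ancestor
            for n in &b[k..] {
                s.push_str(" ↓ ");
                s.push_str(&n.label);
            }
            paths.push(s);
        }
    }
    paths
}
```

For a `Func` node with two `Param` children holding the leaves `x` and `y`, `extract_paths(&tree, 8)` yields the single path `x ↑ Param ↑ Func ↓ Param ↓ y`, while a limit of 3 drops it because the path has four edges.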

Path Extractor Configuration

let extractor = PathExtractor::new(8)  // Max path length
    .with_max_paths(200);              // Max paths per method

let paths = extractor.extract(&ast);
let contexts = extractor.extract_with_context(&ast);  // With position info

Code Embeddings

The Code2VecEncoder generates dense vector representations:

let encoder = Code2VecEncoder::new(128)  // Embedding dimension
    .with_seed(42);                      // Reproducible

// Single path embedding
let path_emb = encoder.encode_path(&path);

// Aggregate all paths with attention
let code_emb = encoder.aggregate_paths(&paths);

// Access attention weights for interpretability
if let Some(weights) = code_emb.attention_weights() {
    println!("Attention weight of first path: {:.3}", weights[0]);
}
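
To make the attention step concrete, here is a simplified sketch of attention-weighted aggregation. It assumes each path is already a vector and scores it by a dot product with a single attention vector; the function names are hypothetical, and aprender's `aggregate_paths` may compute its scores differently.

```rust
/// Dot product of two equally sized slices.
fn dot(a: &[f64], b: &[f64]) -> f64 {
    a.iter().zip(b).map(|(x, y)| x * y).sum()
}

/// Softmax over raw scores; subtracting the maximum keeps `exp` stable.
fn softmax(scores: &[f64]) -> Vec<f64> {
    let max = scores.iter().cloned().fold(f64::NEG_INFINITY, f64::max);
    let exps: Vec<f64> = scores.iter().map(|s| (s - max).exp()).collect();
    let sum: f64 = exps.iter().sum();
    exps.iter().map(|e| e / sum).collect()
}

/// Score every path vector against `attention`, normalise the scores with
/// softmax, and return (weights, weighted sum of the path vectors).
fn aggregate(path_vecs: &[Vec<f64>], attention: &[f64]) -> (Vec<f64>, Vec<f64>) {
    let scores: Vec<f64> = path_vecs.iter().map(|p| dot(p, attention)).collect();
    let weights = softmax(&scores);
    let mut out = vec![0.0; attention.len()];
    for (p, w) in path_vecs.iter().zip(&weights) {
        for (o, x) in out.iter_mut().zip(p) {
            *o += w * x;
        }
    }
    (weights, out)
}

/// Index of the path with the largest attention weight, if any.
fn most_attended(weights: &[f64]) -> Option<usize> {
    weights
        .iter()
        .enumerate()
        .max_by(|a, b| a.1.total_cmp(b.1))
        .map(|(i, _)| i)
}
```

Because the weights sum to 1, the aggregate is a convex combination of the path vectors, and `most_attended` points at the path that contributed most, which is what makes the weights useful for interpretation.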

Code Similarity

let emb1 = encoder.aggregate_paths(&paths1);
let emb2 = encoder.aggregate_paths(&paths2);

let similarity = emb1.cosine_similarity(&emb2);
println!("Similarity: {:.4}", similarity);
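
For reference, cosine similarity is the dot product of two vectors divided by the product of their norms, so it always lies in [-1, 1]. The standalone sketch below works on plain slices; the function is illustrative and is not aprender's `cosine_similarity` method, whose handling of zero vectors or mismatched lengths is not documented here.

```rust
/// Cosine similarity of two vectors, or `None` when the lengths differ
/// or either vector has zero norm (the ratio is undefined in that case).
fn cosine_similarity(a: &[f64], b: &[f64]) -> Option<f64> {
    if a.len() != b.len() {
        return None;
    }
    let dot: f64 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f64>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f64>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}
```

A value near 1 means the embeddings point the same way, 0 means they are orthogonal, and a negative value means they point in opposing directions.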

Code Graph Neural Networks

For analysis that depends on control flow, data flow, or ownership rather than syntax alone, run an MPNN over a code graph with typed edges.

Edge Types

Edge Type	Description
`ControlFlow`	CFG edges
`DataFlow`	Def-use chains
`AstChild`	AST parent-child
`TypeAnnotation`	Type relationships
`Ownership`	Borrow/ownership
`Call`	Function calls
`Return`	Return edges

Building a Code Graph

use aprender::code::{
    CodeGraph, CodeGraphNode, CodeGraphEdge, CodeEdgeType,
};

let mut graph = CodeGraph::new();

// Add nodes with features
graph.add_node(CodeGraphNode::new(0, vec![1.0, 0.0, 0.0], "variable"));
graph.add_node(CodeGraphNode::new(1, vec![0.0, 1.0, 0.0], "variable"));
graph.add_node(CodeGraphNode::new(2, vec![0.0, 0.0, 1.0], "function"));

// Add typed edges
graph.add_edge(CodeGraphEdge::new(0, 2, CodeEdgeType::DataFlow));
graph.add_edge(CodeGraphEdge::new(1, 2, CodeEdgeType::DataFlow));

MPNN Forward Pass

use aprender::code::{CodeMPNN, pooling};

// Create MPNN with layer dimensions
let mpnn = CodeMPNN::new(&[3, 16, 8, 4]);  // 3 -> 16 -> 8 -> 4

// Forward pass
let node_embeddings = mpnn.forward(&graph);

// Graph-level embedding via pooling
let graph_emb = pooling::mean_pool(&node_embeddings);
// Also available: max_pool, sum_pool
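
As a rough picture of what one message-passing round and mean pooling do, the sketch below uses plain vectors and no learned weights. It is a deliberate simplification with hypothetical function names: a real MPNN layer, presumably including `CodeMPNN`, applies learned message and update functions and changes the feature dimension between layers.

```rust
/// One round of message passing: each node keeps its own features, adds
/// the features of every node with an edge pointing at it, then applies
/// ReLU. Edges are (source, target) index pairs; out-of-range indices panic.
fn message_pass(features: &[Vec<f64>], edges: &[(usize, usize)]) -> Vec<Vec<f64>> {
    let mut next: Vec<Vec<f64>> = features.to_vec();
    for &(src, dst) in edges {
        for (acc, x) in next[dst].iter_mut().zip(&features[src]) {
            *acc += x;
        }
    }
    for row in &mut next {
        for v in row.iter_mut() {
            *v = v.max(0.0);
        }
    }
    next
}

/// Graph-level embedding: the element-wise mean of all node embeddings.
fn mean_pool(nodes: &[Vec<f64>]) -> Vec<f64> {
    if nodes.is_empty() {
        return Vec::new();
    }
    let mut out = vec![0.0; nodes[0].len()];
    for row in nodes {
        for (o, x) in out.iter_mut().zip(row) {
            *o += x;
        }
    }
    let n = nodes.len() as f64;
    out.iter_mut().for_each(|o| *o /= n);
    out
}
```

Applied to the three-node graph built earlier (two variables with `DataFlow` edges into the function node), one round leaves the variable nodes unchanged and gives the function node the sum of all three one-hot vectors, so information has flowed along the edges.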

Complete Example

use aprender::code::{
    pooling, AstNode, AstNodeType, Code2VecEncoder, CodeEdgeType,
    CodeGraph, CodeGraphEdge, CodeGraphNode, CodeMPNN, PathExtractor,
};

fn main() {
    // 1. Build AST for: fn add(x, y) -> x + y
    let mut func = AstNode::new(AstNodeType::Function, "add");
    func.add_child(AstNode::new(AstNodeType::Parameter, "x"));
    func.add_child(AstNode::new(AstNodeType::Parameter, "y"));

    let mut body = AstNode::new(AstNodeType::Block, "body");
    let mut op = AstNode::new(AstNodeType::BinaryOp, "+");
    op.add_child(AstNode::new(AstNodeType::Variable, "x"));
    op.add_child(AstNode::new(AstNodeType::Variable, "y"));

    let mut ret = AstNode::new(AstNodeType::Return, "return");
    ret.add_child(op);
    body.add_child(ret);
    func.add_child(body);

    // 2. Extract paths and generate embedding
    let extractor = PathExtractor::new(8);
    let paths = extractor.extract(&func);
    println!("Extracted {} paths", paths.len());

    let encoder = Code2VecEncoder::new(64);
    let embedding = encoder.aggregate_paths(&paths);
    println!("Function embedding: {} dimensions", embedding.dim());

    // 3. Build code graph for MPNN
    let mut graph = CodeGraph::new();
    graph.add_node(CodeGraphNode::new(0, vec![1.0, 0.0], "param_x"));
    graph.add_node(CodeGraphNode::new(1, vec![0.0, 1.0], "param_y"));
    graph.add_node(CodeGraphNode::new(2, vec![0.5, 0.5], "add_op"));

    graph.add_edge(CodeGraphEdge::new(0, 2, CodeEdgeType::DataFlow));
    graph.add_edge(CodeGraphEdge::new(1, 2, CodeEdgeType::DataFlow));

    // 4. Run MPNN
    let mpnn = CodeMPNN::new(&[2, 8, 4]);
    let node_embs = mpnn.forward(&graph);
    let graph_emb = pooling::mean_pool(&node_embs);

    println!("Graph embedding: {:?}", &graph_emb[..4]);
}

Running the Example

cargo run --example code_analysis

Output:

=== Code Analysis with Code2Vec and MPNN ===

1. Building AST for a simple function
   Function: fn add(x: i32, y: i32) -> i32 { x + y }

   AST Structure:
   Func: add
     Param: x
       Type: i32
     Param: y
       Type: i32
     Type: i32
     Block: body
       Ret: return
         BinOp: +
           Var: x
           Var: y

2. Extracting Code2Vec Paths
   Found 10 paths between terminal nodes

3. Generating Code Embeddings
   Function embedding dim: 64
   Attention weights (first 3): [0.111, 0.115, 0.086]

4. Computing Code Similarity
   add() vs sum():      0.3964 (similar structure)
   add() vs multiply(): -0.5212 (different operation)
...

References

Alon et al. (2019), "code2vec: Learning distributed representations of code"
Allamanis et al. (2018), "A survey of machine learning for big code and naturalness"
Gilmer et al. (2017), "Neural Message Passing for Quantum Chemistry"
