Chapter 54: Function Boundary Extraction (`pmat extract`)

Overview

The `pmat extract` command parses a single file directly with tree-sitter and outputs its function boundaries as structured JSON. Unlike `pmat query`, which requires an index, `extract` works on any single file with zero setup, which makes it well suited to editor integrations, CI scripts, automated file splitting, and quick structural analysis.

Since v3.3.0, the output also includes file-level metadata (imports, test boundaries) and per-item visibility, which is the information a tool needs to split a large file into parts that still compile.

Usage

# Extract all function/struct/enum/trait boundaries from a file
pmat extract --list src/main.rs

# Pipe to jq for analysis
pmat extract --list src/handlers.rs | jq '.items[] | select(.type == "function")'

# Count items by type
pmat extract --list src/lib.rs | jq '.items | group_by(.type) | map({type: .[0].type, count: length})'

# Find largest functions (by line count)
pmat extract --list src/parser.rs | jq '.items | sort_by(-.lines) | .[0:5]'

# List all imports
pmat extract --list src/lib.rs | jq '.imports[]'

# Find public functions only
pmat extract --list src/lib.rs | jq '.items[] | select(.visibility == "pub" and .type == "function")'

# Get test module boundary
pmat extract --list src/lib.rs | jq '.cfg_test_line'

Output Format

The output is a JSON object with file-level metadata and an items array:

Top-Level Fields

Field	Type	Description
`file`	string	File path as provided
`language`	string	Detected language (`rust`, `typescript`, `python`, `c`, `cpp`, `go`, `lua`)
`imports`	string[]	Top-level import/use statements (full text)
`cfg_test_line`	number?	Line where `#[cfg(test)]` appears (Rust only, absent if none)
`items`	object[]	Extracted code items

Item Fields

Field	Type	Description
`name`	string	Name of the extracted item (function, struct, enum, trait, and so on)
`type`	string	One of: `function`, `struct`, `enum`, `trait`, `impl`, `class`, `module`, `type_alias`
`start_line`	number	First line (1-indexed)
`end_line`	number	Last line (inclusive)
`lines`	number	Total line count (`end_line - start_line + 1`)
`visibility`	string	Visibility: `pub`, `pub(crate)`, `pub(super)`, `export`, or `""` (private)
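
The relationship between `start_line`, `end_line`, and `lines` in the table above can be checked mechanically. The following is a minimal sketch, not part of `pmat`: the helper name `check_item` is hypothetical, and the sample record is copied from the example output in this chapter.

```python
def check_item(item):
    """Return a list of problems found in one extracted item (empty if consistent)."""
    problems = []
    start, end, lines = item["start_line"], item["end_line"], item["lines"]
    if start < 1:
        problems.append("start_line must be 1-indexed")
    if end < start:
        problems.append("end_line precedes start_line")
    # Both bounds are inclusive, hence the +1.
    if lines != end - start + 1:
        problems.append(f"lines is {lines}, expected {end - start + 1}")
    return problems


sample = {
    "name": "get",
    "type": "function",
    "start_line": 19,
    "end_line": 23,
    "lines": 5,
    "visibility": "pub",
}
print(check_item(sample))  # → []
```

A check like this is mainly useful in tests for tools that consume the JSON, where an off-by-one in line arithmetic is easy to introduce.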

Visibility by Language

Language	Public	Crate- or module-scoped	Private
Rust	`"pub"`	`"pub(crate)"`, `"pub(super)"`	`""`
TypeScript	`"export"`	—	`""`
Go	`"pub"` (uppercase name)	—	`""` (lowercase name)
Python, C, C++, Lua	—	—	`""` (always)
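
A consumer that handles several languages may want to collapse these strings into a few categories. The sketch below assumes only the values listed in the table; the function name `classify_visibility` and the category labels are hypothetical. Note that for Python, C, C++, and Lua the empty string means "not reported" rather than truly private, so the `"private"` label should be read with that caveat.

```python
PUBLIC = {"pub", "export"}
RESTRICTED = {"pub(crate)", "pub(super)"}


def classify_visibility(visibility):
    """Map a raw `visibility` string onto a coarse category."""
    if visibility in PUBLIC:
        return "public"
    if visibility in RESTRICTED:
        return "restricted"
    # Empty string: private in Rust/TypeScript/Go, unreported elsewhere.
    return "private"


for raw in ("pub", "export", "pub(crate)", "pub(super)", ""):
    print(repr(raw), "->", classify_visibility(raw))
```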

Example Output

$ pmat extract --list src/cache.rs

{
  "file": "src/cache.rs",
  "language": "rust",
  "imports": [
    "use std::collections::HashMap;"
  ],
  "cfg_test_line": 42,
  "items": [
    {
      "name": "Cache",
      "type": "struct",
      "start_line": 4,
      "end_line": 8,
      "lines": 5,
      "visibility": "pub"
    },
    {
      "name": "Cache",
      "type": "impl",
      "start_line": 10,
      "end_line": 30,
      "lines": 21,
      "visibility": ""
    },
    {
      "name": "new",
      "type": "function",
      "start_line": 11,
      "end_line": 17,
      "lines": 7,
      "visibility": "pub"
    },
    {
      "name": "get",
      "type": "function",
      "start_line": 19,
      "end_line": 23,
      "lines": 5,
      "visibility": "pub"
    },
    {
      "name": "evict_expired",
      "type": "function",
      "start_line": 25,
      "end_line": 29,
      "lines": 5,
      "visibility": ""
    }
  ]
}
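
In this example the `items` array is flat, yet the item spanning lines 10 to 30 contains the three functions that follow it. A consumer that needs the nesting (for an outline view, say) can recover it from the line ranges. The following is a sketch under that assumption; `enclosing_item` is a hypothetical helper, and the sample data keeps only the fields it uses.

```python
def enclosing_item(items, target):
    """Return the smallest item whose line range strictly contains `target`, or None."""
    target_span = target["end_line"] - target["start_line"]
    candidates = [
        it for it in items
        if it is not target
        and it["start_line"] <= target["start_line"]
        and target["end_line"] <= it["end_line"]
        and it["end_line"] - it["start_line"] > target_span
    ]
    # The tightest container is the direct parent.
    return min(candidates, key=lambda it: it["end_line"] - it["start_line"], default=None)


# Line ranges taken from the example output above.
items = [
    {"name": "Cache", "start_line": 4, "end_line": 8},
    {"name": "Cache", "start_line": 10, "end_line": 30},
    {"name": "new", "start_line": 11, "end_line": 17},
    {"name": "get", "start_line": 19, "end_line": 23},
    {"name": "evict_expired", "start_line": 25, "end_line": 29},
]

for it in items:
    parent = enclosing_item(items, it)
    where = f"inside lines {parent['start_line']}-{parent['end_line']}" if parent else "top level"
    print(f"{it['name']} ({it['start_line']}-{it['end_line']}): {where}")
```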

Supported Languages

Language	Extensions	Imports Detected	Visibility
Rust	`.rs`	`use`, `extern crate`	`pub`, `pub(crate)`, `pub(super)`
TypeScript/JavaScript	`.ts`, `.tsx`, `.js`, `.jsx`, `.mjs`	`import`	`export`
Python	`.py`, `.pyi`	`import`, `from ... import`	—
C	`.c`, `.h`	`#include`	—
C++	`.cpp`, `.cc`, `.cxx`, `.hpp`, `.hxx`	`#include`	—
Go	`.go`	`import`	Uppercase = exported
Lua	`.lua`	—	—
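
When walking a directory tree, it can help to filter paths by extension before invoking `pmat extract` on each one. The sketch below only restates the extensions from the table; `is_supported` is a hypothetical helper, and exact, case-sensitive matching is an assumption about how the tool treats extensions.

```python
from pathlib import Path

# Extensions copied from the "Supported Languages" table.
SUPPORTED_EXTENSIONS = {
    ".rs",
    ".ts", ".tsx", ".js", ".jsx", ".mjs",
    ".py", ".pyi",
    ".c", ".h",
    ".cpp", ".cc", ".cxx", ".hpp", ".hxx",
    ".go",
    ".lua",
}


def is_supported(path):
    """True if the file extension appears in the supported-languages table."""
    return Path(path).suffix in SUPPORTED_EXTENSIONS


paths = ["src/main.rs", "web/app.mjs", "README.md", "scripts/build.lua"]
print([p for p in paths if is_supported(p)])
```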

Use Cases

Automated File Splitting

Splitting large files while keeping every part compilable is the primary motivation for the richer metadata.

# Extract boundaries for a large file
pmat extract --list src/big_module.rs > boundaries.json

# A splitting tool can:
# 1. Read imports → prepend to each split part
# 2. Read cfg_test_line → separate test code from production code
# 3. Read visibility → determine which items belong in the public API
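
The three steps above can be sketched as a small planning function over the parsed JSON. This is an illustration rather than an actual splitting tool: `plan_split` is a hypothetical name, the sketch assumes that everything at or after `cfg_test_line` is test code, and the `tests` module item in the sample data is invented for the demonstration.

```python
PUBLIC = ("pub", "export")


def plan_split(extract):
    """Group item names into the buckets a splitting tool would act on."""
    test_line = extract.get("cfg_test_line")  # key is absent when there is no #[cfg(test)]
    production, tests = [], []
    for item in extract["items"]:
        # Assumption: items starting at or after the #[cfg(test)] line are test code.
        if test_line is not None and item["start_line"] >= test_line:
            tests.append(item)
        else:
            production.append(item)
    return {
        "prelude": list(extract.get("imports", [])),  # prepend to each split part
        "public": [i["name"] for i in production if i["visibility"] in PUBLIC],
        "internal": [i["name"] for i in production if i["visibility"] not in PUBLIC],
        "tests": [i["name"] for i in tests],
    }


# In practice `extract` would be json.loads() of `pmat extract --list <file>` output.
extract = {
    "file": "src/cache.rs",
    "language": "rust",
    "imports": ["use std::collections::HashMap;"],
    "cfg_test_line": 42,
    "items": [
        {"name": "Cache", "type": "struct", "start_line": 4, "end_line": 8, "lines": 5, "visibility": "pub"},
        {"name": "new", "type": "function", "start_line": 11, "end_line": 17, "lines": 7, "visibility": "pub"},
        {"name": "evict_expired", "type": "function", "start_line": 25, "end_line": 29, "lines": 5, "visibility": ""},
        # Hypothetical test module, not part of the chapter's example output.
        {"name": "tests", "type": "module", "start_line": 43, "end_line": 60, "lines": 18, "visibility": ""},
    ],
}

for bucket, names in plan_split(extract).items():
    print(bucket, names)
```

A real splitter would also need to move container items (such as an `impl` block) together with the functions nested inside them. This sketch lists names only and leaves that decision to the caller.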

Editor Integration

Extract function boundaries for jump-to-definition or outline views:

# Get function list for editor sidebar
pmat extract --list "$FILE" | jq '[.items[] | {name, type, line: .start_line, visibility}]'

CI Pipeline — File Complexity Gate

# Fail if any function exceeds 100 lines
MAX_LINES=100
pmat extract --list src/handler.rs | \
  jq --argjson max "$MAX_LINES" '[.items[] | select(.type == "function" and .lines > $max)]' | \
  jq -e 'length == 0' || { echo "Functions exceed $MAX_LINES lines"; exit 1; }
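
The shell gate above reports only that a limit was exceeded. If the CI log should also name the offending functions, the same filter can be applied to the parsed JSON. This is a sketch with a hypothetical helper name, `oversized_functions`, and made-up sample data.

```python
def oversized_functions(extract, max_lines=100):
    """Return (name, lines) pairs for functions longer than `max_lines`."""
    return [
        (item["name"], item["lines"])
        for item in extract["items"]
        if item["type"] == "function" and item["lines"] > max_lines
    ]


# Hypothetical data; in CI this would come from `pmat extract --list <file>`.
extract = {
    "items": [
        {"name": "handle_request", "type": "function", "lines": 140},
        {"name": "parse_header", "type": "function", "lines": 35},
        {"name": "Config", "type": "struct", "lines": 120},
    ]
}

offenders = oversized_functions(extract)
for name, lines in offenders:
    print(f"{name}: {lines} lines")
# A CI script would exit non-zero here when `offenders` is non-empty.
```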

API Surface Analysis

# List all public functions across a crate
# (bash: enable globstar so that ** recurses into subdirectories)
shopt -s globstar
for f in src/**/*.rs; do
  pmat extract --list "$f" | \
    jq --arg file "$f" '.items[] | select(.visibility == "pub" and .type == "function") | {file: $file, name}'
done

Compare File Structure Across Versions

# Before refactoring
pmat extract --list src/old.rs > before.json

# After refactoring
pmat extract --list src/new.rs > after.json

# Diff item names
diff <(jq '.items[].name' before.json) <(jq '.items[].name' after.json)
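
The line-based `diff` above is sensitive to item order, so moving a function within a file shows up as a change. An order-insensitive comparison can be sketched with sets keyed on type and name. The helper `diff_items` is hypothetical, and the sample data is made up. One limitation: items sharing both type and name (for example two `new` methods on different types) collapse into a single entry.

```python
def diff_items(before, after):
    """Compare two extract results by (type, name), ignoring order and line numbers."""
    old = {(i["type"], i["name"]) for i in before["items"]}
    new = {(i["type"], i["name"]) for i in after["items"]}
    return {"removed": sorted(old - new), "added": sorted(new - old)}


# Hypothetical contents of before.json and after.json.
before = {"items": [
    {"type": "function", "name": "parse"},
    {"type": "function", "name": "render"},
    {"type": "struct", "name": "Config"},
]}
after = {"items": [
    {"type": "struct", "name": "Config"},
    {"type": "function", "name": "parse"},
    {"type": "function", "name": "render_html"},
]}

print(diff_items(before, after))
```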

Feed into `pmat context --format json`

The `extract` command complements `context --format json` (see Chapter 2). While `context` provides project-wide structure with quality metrics, `extract` provides per-file granularity with exact line boundaries:

# Project-level overview
pmat context --format json -p . | jq '.files | length'

# File-level detail
pmat extract --list src/main.rs | jq '.items | length'

Running the Example

cargo run --example extract_demo

This demonstrates extraction across Rust, Python, and TypeScript files with imports, visibility, and test boundary detection.

PMAT: The PAIML MCP Agent Toolkit