# Matrix Operations

This chapter demonstrates Trueno's matrix operations using the `matrix_operations` example.

## Running the Example

```bash
cargo run --example matrix_operations
```
## Basic Matrix Operations

### Creating Matrices

```rust
use trueno::Matrix;

// Create from row-major data
let a = Matrix::from_vec(2, 3, vec![
    1.0, 2.0, 3.0, // Row 0
    4.0, 5.0, 6.0, // Row 1
])?;

// Identity matrix
let identity = Matrix::identity(3);

// Zero matrix
let zeros = Matrix::zeros(2, 3);
```
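Since `from_vec` takes row-major data, element (i, j) of an m×n matrix sits at flat index `i * n + j`. A minimal plain-Rust sketch of that mapping (independent of Trueno; the `at` helper is illustrative, not library API):

```rust
// Row-major layout: element (i, j) of a matrix with n_cols columns
// lives at flat index i * n_cols + j.
fn at(data: &[f32], n_cols: usize, i: usize, j: usize) -> f32 {
    data[i * n_cols + j]
}

fn main() {
    // The same 2×3 matrix as above, flattened row by row.
    let a = vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0];
    assert_eq!(at(&a, 3, 0, 2), 3.0); // row 0, col 2
    assert_eq!(at(&a, 3, 1, 0), 4.0); // row 1, col 0
    println!("ok");
}
```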
### Matrix Multiplication

```rust
// C = A × B
let a = Matrix::from_vec(2, 3, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0])?;
let b = Matrix::from_vec(3, 2, vec![7.0, 8.0, 9.0, 10.0, 11.0, 12.0])?;
let c = a.matmul(&b)?; // Result: 2×2 matrix
```
### Matrix-Vector Multiplication

```rust
use trueno::{Matrix, Vector};

let weights = Matrix::from_vec(3, 4, weight_data)?;
let input = Vector::from_slice(&[1.0, 2.0, 3.0, 4.0]);
let output = weights.matvec(&input)?; // Result: Vector of length 3
```
### Transpose

```rust
let a = Matrix::from_vec(2, 3, vec![1.0, 2.0, 3.0, 4.0, 5.0, 6.0])?;
let at = a.transpose(); // Result: 3×2 matrix
```
## Neural Network Layers

### Linear Layer (Dense)

```rust
fn linear_layer(
    input: &Vector,
    weights: &Matrix,
    bias: &Vector,
) -> Result<Vector, TruenoError> {
    let output = weights.matvec(input)?;
    output.add(bias)
}
```
### Batch Processing

```rust
// Process multiple samples through the same layer
let samples = vec![
    Vector::from_slice(&[0.2, -0.3, 0.5]),
    Vector::from_slice(&[0.3, 0.0, 0.1]),
    Vector::from_slice(&[0.0, 0.3, 0.4]),
];

for sample in &samples {
    let output = weights.matvec(sample)?;
    println!("Output: {:?}", output.as_slice());
}
```
## Mathematical Properties

The example verifies key mathematical properties:

### Identity Property

```rust
// I × v = v
let identity = Matrix::identity(3);
let v = Vector::from_slice(&[1.0, 2.0, 3.0]);
let result = identity.matvec(&v)?;
assert_eq!(result.as_slice(), v.as_slice());
```
### Transpose Property

The example verifies that `(A × v)^T = v^T × A^T`.
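The property can also be checked numerically without Trueno. A small sketch using plain row-major slices (the helper names here are illustrative, not Trueno API):

```rust
// Multiply an m×n row-major matrix by a length-n vector: A × v.
fn matvec(a: &[f32], m: usize, n: usize, v: &[f32]) -> Vec<f32> {
    (0..m).map(|i| (0..n).map(|j| a[i * n + j] * v[j]).sum()).collect()
}

// Multiply a length-m row vector by an m×n row-major matrix: v^T × A.
fn vecmat(v: &[f32], a: &[f32], m: usize, n: usize) -> Vec<f32> {
    (0..n).map(|j| (0..m).map(|i| v[i] * a[i * n + j]).sum()).collect()
}

// Transpose an m×n row-major matrix into an n×m one.
fn transpose(a: &[f32], m: usize, n: usize) -> Vec<f32> {
    let mut t = vec![0.0; m * n];
    for i in 0..m {
        for j in 0..n {
            t[j * m + i] = a[i * n + j];
        }
    }
    t
}

fn main() {
    let a = [1.0f32, 2.0, 3.0, 4.0, 5.0, 6.0]; // 2×3
    let v = [1.0f32, -1.0, 2.0];

    let lhs = matvec(&a, 2, 3, &v);  // A × v (a column; its transpose is a row)
    let at = transpose(&a, 2, 3);    // A^T, 3×2
    let rhs = vecmat(&v, &at, 3, 2); // v^T × A^T

    assert_eq!(lhs, rhs);
    println!("{:?}", lhs); // [5.0, 11.0]
}
```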
### Zero Property

```rust
// A × 0 = 0
let zeros = Vector::from_slice(&[0.0, 0.0, 0.0, 0.0]);
let result = weights.matvec(&zeros)?;
// All elements should be 0
```
## Batched Matrix Multiplication

### 3D Tensors (Batch Processing)

Process multiple matrix multiplications in a single call:

```rust
use trueno::Matrix;

// Shape: [batch, m, k] @ [batch, k, n] -> [batch, m, n]
let batch = 4;
let m = 32;
let k = 64;
let n = 32;

let a_data: Vec<f32> = vec![0.1; batch * m * k];
let b_data: Vec<f32> = vec![0.2; batch * k * n];

let result = Matrix::batched_matmul(&a_data, &b_data, batch, m, k, n)?;
```
### 4D Tensors (Multi-Head Attention)

The exact pattern used in transformer attention (Q @ K^T and attn @ V):

```rust
// Simulate multi-head attention: Q @ K^T
// Shape: [batch, heads, seq, head_dim] @ [batch, heads, head_dim, seq]
let batch = 1;
let heads = 12;
let seq_len = 512;
let head_dim = 64;

let q_data: Vec<f32> = vec![0.0; batch * heads * seq_len * head_dim];
let kt_data: Vec<f32> = vec![0.0; batch * heads * head_dim * seq_len];

let attn_scores = Matrix::batched_matmul_4d(
    &q_data,
    &kt_data,
    batch,
    heads,
    seq_len,  // m
    head_dim, // k
    seq_len,  // n
)?;
// Output: [batch, heads, seq_len, seq_len] attention scores
```

This is critical for transformer inference performance: each (batch, head) pair is processed independently with the SIMD matmul, achieving roughly 50 GFLOPS versus ~0.1 GFLOPS for a naive implementation.
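The semantics of the 4D call can be pinned down with a plain-Rust reference implementation: each (batch, head) pair is a completely independent m×k by k×n matmul over a contiguous slice of the flat buffers. This sketch is illustrative only; Trueno's actual implementation replaces the inner loops with SIMD kernels:

```rust
// Reference semantics of a 4D batched matmul:
// [batch, heads, m, k] @ [batch, heads, k, n] -> [batch, heads, m, n].
fn batched_matmul_4d_ref(
    a: &[f32], b: &[f32],
    batch: usize, heads: usize, m: usize, k: usize, n: usize,
) -> Vec<f32> {
    let mut out = vec![0.0; batch * heads * m * n];
    for bi in 0..batch {
        for h in 0..heads {
            // Offsets of this (batch, head) pair's contiguous blocks.
            let ao = (bi * heads + h) * m * k;
            let bo = (bi * heads + h) * k * n;
            let co = (bi * heads + h) * m * n;
            for i in 0..m {
                for j in 0..n {
                    let mut acc = 0.0;
                    for p in 0..k {
                        acc += a[ao + i * k + p] * b[bo + p * n + j];
                    }
                    out[co + i * n + j] = acc;
                }
            }
        }
    }
    out
}

fn main() {
    // Two independent 2×2 products (batch = 1, heads = 2):
    // head 0: identity × B0, head 1: all-ones × B1.
    let a = [1.0f32, 0.0, 0.0, 1.0,  1.0, 1.0, 1.0, 1.0];
    let b = [5.0f32, 6.0, 7.0, 8.0,  1.0, 2.0, 3.0, 4.0];
    let c = batched_matmul_4d_ref(&a, &b, 1, 2, 2, 2, 2);
    assert_eq!(&c[..4], &[5.0, 6.0, 7.0, 8.0]); // I × B0 = B0
    assert_eq!(&c[4..], &[4.0, 6.0, 4.0, 6.0]); // ones × B1: column sums
    println!("ok");
}
```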
## Performance Considerations

### Blocking for Cache Efficiency

Trueno uses blocked matrix multiplication for better cache utilization:

```rust
// Automatic blocking for large matrices
let large_a = Matrix::from_vec(1024, 1024, data_a)?;
let large_b = Matrix::from_vec(1024, 1024, data_b)?;
let c = large_a.matmul(&large_b)?; // Uses tiled algorithm internally
```
### SIMD Acceleration

Matrix operations automatically use SIMD when beneficial:

- AVX2: Process 8 `f32` values per instruction
- AVX-512: Process 16 `f32` values per instruction
- Automatic fallback to scalar for small matrices
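Backend selection of this kind is typically done via runtime CPU feature detection. A minimal sketch using the standard library's detection macro (this is an illustration of the general mechanism, not Trueno's internal dispatch code):

```rust
// Pick the widest SIMD backend the running CPU supports.
fn pick_backend() -> &'static str {
    #[cfg(target_arch = "x86_64")]
    {
        // Runtime checks: these succeed only if the CPU reports the feature.
        if is_x86_feature_detected!("avx512f") {
            return "avx512";
        }
        if is_x86_feature_detected!("avx2") {
            return "avx2";
        }
    }
    "scalar" // portable fallback for small matrices and non-x86 targets
}

fn main() {
    println!("selected backend: {}", pick_backend());
}
```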
### GPU Acceleration

For large matrices, enable GPU acceleration:

```bash
cargo run --release --features gpu --example matrix_operations
```

The GPU backend is automatically selected for matrices above the threshold (typically 256×256).
## Benchmark Suite

Run the matrix benchmark suite:

```bash
cargo run --release --example benchmark_matrix_suite
```

This compares:

- Naive O(n³) multiplication
- SIMD-optimized blocked multiplication
- Parallel (rayon) multiplication
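To interpret the benchmark numbers, it helps to know the work involved: an m×k by k×n matmul performs one multiply and one add per inner-product term, i.e. 2·m·k·n floating-point operations. A small sketch relating that count to the throughputs quoted earlier in this chapter:

```rust
// FLOP count for an m×k by k×n matmul: 2 * m * k * n
// (one multiply + one add per inner-product term).
fn matmul_flops(m: usize, k: usize, n: usize) -> u64 {
    2 * (m as u64) * (k as u64) * (n as u64)
}

fn main() {
    let flops = matmul_flops(1024, 1024, 1024);
    println!("1024×1024 matmul: {:.2} GFLOP", flops as f64 / 1e9);
    // Rough runtimes at the throughputs quoted above:
    println!("@ 50 GFLOPS:  {:.0} ms", flops as f64 / 50e9 * 1e3);
    println!("@ 0.1 GFLOPS: {:.1} s", flops as f64 / 0.1e9);
}
```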
## See Also

- Eigendecomposition - Symmetric eigenvalue decomposition
- GPU Backend - GPU acceleration details
- Performance Targets - Expected speedups