ML Tuner: Learned Kernel Selection

The ML Tuner provides machine learning-based throughput prediction and kernel selection for ComputeBrick operations. It uses a 42-dimension feature vector (v1.1.0) with roofline model clamping for physically-bounded predictions.

Reference: SHOWCASE-BRICK-001, Section 12

Overview

The ML Tuner consists of three main components:

  1. TunerFeatures - 42-dimension feature vector encoding model, hardware, and runtime configuration
  2. ThroughputRegressor - Predicts tokens/second throughput with roofline clamping
  3. KernelClassifier - Recommends optimal kernel (VectorizedQ4K, BatchedQ4K, etc.)

Feature Vector (DIM=42)

The TunerFeatures struct encodes all information needed for ML-based optimization:

/// Create a 42-dimension feature vector for ML tuning
fn basic_features() {
    use trueno::tuner::{QuantType, TunerFeatures};

    let features = TunerFeatures::builder()
        .model_params_b(1.5) // 1.5B parameters
        .hidden_dim(1536)
        .num_layers(28)
        .num_heads(12)
        .batch_size(4) // M=4 concurrent sequences
        .seq_len(512)
        .quant_type(QuantType::Q4K)
        .gpu_mem_bw_gbs(1000.0) // RTX 4090: ~1 TB/s
        .gpu_sm_count(128) // RTX 4090: 128 SMs
        .cuda_graphs(true)
        .build();

    // Validate and convert to vector
    assert!(features.validate().is_ok());
    let vec = features.to_vector();
    assert_eq!(vec.len(), 42);

    println!("Features created: {} dimensions", vec.len());
}

Feature Breakdown

Range | Count | Category | Description
0-9 | 10 | Model | params_b, hidden_dim, layers, heads, intermediate_dim, vocab_size, kv_heads, head_dim, rope_theta, tie_embeddings
10-19 | 10 | Runtime | batch_size, seq_len, context_len, kv_cache_tokens, draft_tokens, prompt_tokens, generated_tokens, temperature, top_p, top_k
20-29 | 10 | Quant | quant_type (one-hot Q4_K/Q5_K/Q6_K/Q8_0/F16/F32), quant_group_size, bits_per_weight, quant_scheme_idx, has_scales
30-41 | 12 | Hardware | gpu_mem_bw, gpu_compute_tflops, sm_count, tensor_cores, cuda_graphs, pcie_gen, vram_gb, cpu_threads, numa_nodes, system_ram_gb, is_unified_memory, power_limit
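
As a quick check of this layout, the sketch below slices a 42-dimension vector into the four category ranges. The index boundaries are taken from the table above, not from a trueno API, so treat them as an assumption about the vector ordering.

use trueno::tuner::{QuantType, TunerFeatures};

let features = TunerFeatures::builder()
    .model_params_b(1.5)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .build();
let vec = features.to_vector();

// Boundaries assumed from the table above: 0-9, 10-19, 20-29, 30-41
let (model, rest) = vec.split_at(10);
let (runtime, rest) = rest.split_at(10);
let (quant, hardware) = rest.split_at(10);
assert_eq!((model.len(), runtime.len(), quant.len(), hardware.len()), (10, 10, 10, 12));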

Throughput Prediction

The ThroughputRegressor predicts tokens/second with roofline model clamping:

/// Predict throughput with roofline model clamping
fn throughput_prediction() {
    use trueno::tuner::{QuantType, ThroughputRegressor, TunerFeatures};

    let regressor = ThroughputRegressor::new();

    // Create features for RTX 4090 with 1.5B Q4_K model
    let features = TunerFeatures::builder()
        .model_params_b(1.5)
        .batch_size(4)
        .quant_type(QuantType::Q4K)
        .gpu_mem_bw_gbs(1000.0)
        .cuda_graphs(true)
        .build();

    let prediction = regressor.predict(&features);

    println!("Predicted: {:.1} tok/s", prediction.predicted_tps);
    println!("Confidence: {:.1}%", prediction.confidence * 100.0);
    println!("Top features:");
    for (name, importance) in prediction.top_features.iter().take(3) {
        println!("  - {}: {:.1}%", name, importance * 100.0);
    }
}

Roofline Model

Predictions are clamped to physical limits using the roofline model (Williams et al., 2009):

throughput_max = gpu_mem_bw_bytes_per_sec / (num_params * bytes_per_param)

For example, RTX 4090 (1000 GB/s) with 7B Q4_K model (~0.5 bytes/param):

  • Roofline: 1000 GB/s / (7B * 0.5) = ~286 tok/s theoretical max

The heuristic model may predict higher, but roofline clamping ensures physical plausibility.
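
The roofline bound is plain arithmetic, so a minimal sketch (no trueno API calls) can reproduce the RTX 4090 / 7B Q4_K example above:

// Roofline bound for the example above: bandwidth divided by model size in bytes
let gpu_mem_bw_bytes_per_sec = 1000.0e9_f64; // ~1 TB/s (RTX 4090 class)
let num_params = 7.0e9_f64;                  // 7B parameters
let bytes_per_param = 0.5_f64;               // ~0.5 bytes/param for Q4_K
let roofline_tps = gpu_mem_bw_bytes_per_sec / (num_params * bytes_per_param);
println!("Roofline max: {roofline_tps:.0} tok/s"); // ~286 tok/s theoretical max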

Kernel Selection

The KernelClassifier recommends the optimal kernel implementation:

use trueno::tuner::{KernelClassifier, TunerFeatures, QuantType};

let classifier = KernelClassifier::new();

let features = TunerFeatures::builder()
    .model_params_b(1.5)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .build();

let recommendation = classifier.predict(&features);

println!("Recommended: {:?}", recommendation.top_kernel);
println!("Confidence: {:.1}%", recommendation.confidence * 100.0);
for (kernel, conf) in recommendation.alternatives.iter().take(3) {
    println!("  - {:?}: {:.1}%", kernel, conf * 100.0);
}

Kernel Selection Rules

Batch Size | Recommended Kernel | Rationale
M=1 | VectorizedQ4K | Single sequence; minimize per-token latency
M=2-3 | VectorizedQ4K | Low batch; vectorized path is still efficient
M>=4 | BatchedQ4K | High batch; batched attention wins
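
The table reduces to a simple batch-size threshold. The sketch below expresses that rule as a hypothetical helper (not a trueno API); the real KernelClassifier also weighs the remaining features.

/// Hypothetical rule-of-thumb helper mirroring the table above (not a trueno API).
fn rule_of_thumb_kernel(batch_size: u32) -> &'static str {
    if batch_size >= 4 {
        "BatchedQ4K" // M>=4: batched attention wins
    } else {
        "VectorizedQ4K" // M=1..3: vectorized per-token path
    }
}

assert_eq!(rule_of_thumb_kernel(1), "VectorizedQ4K");
assert_eq!(rule_of_thumb_kernel(8), "BatchedQ4K");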

RandomForest Models (Optional)

With the ml-tuner feature, you can use aprender's RandomForest models for learned optimization:

# Cargo.toml
[dependencies]
trueno = { version = "0.13", features = ["ml-tuner"] }

Training a Custom Regressor

use trueno::tuner::{ThroughputRegressor, TunerFeatures, QuantType};

// Create RF-backed regressor with 100 trees
let mut regressor = ThroughputRegressor::with_random_forest(100);

// Generate training data from benchmarks
let training_data: Vec<(TunerFeatures, f32)> = (0..100)
    .map(|i| {
        let batch = 1 + (i % 8) as u32;
        let features = TunerFeatures::builder()
            .model_params_b(1.5)
            .batch_size(batch)
            .quant_type(QuantType::Q4K)
            .gpu_mem_bw_gbs(1000.0)
            .cuda_graphs(batch == 1)
            .build();
        // Measured throughput from benchmark
        let throughput = 200.0 + (batch as f32) * 80.0;
        (features, throughput)
    })
    .collect();

// Train the model
regressor.train_random_forest(&training_data)?;

// Predictions now use the learned model; build features for the
// configuration we want a prediction for (batch of 4 here)
let features = TunerFeatures::builder()
    .model_params_b(1.5)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .gpu_mem_bw_gbs(1000.0)
    .build();
let pred = regressor.predict(&features);
println!("RF prediction: {:.1} tok/s", pred.predicted_tps);

Training a Custom Classifier

use trueno::tuner::{KernelClassifier, TunerFeatures, QuantType};

let mut classifier = KernelClassifier::with_random_forest(50);

// Label encoding: VectorizedQ4K=2, BatchedQ4K=3
let training_data: Vec<(TunerFeatures, u32)> = (0..100)
    .map(|i| {
        let batch = 1 + (i % 8) as u32;
        let features = TunerFeatures::builder()
            .model_params_b(1.5)
            .batch_size(batch)
            .quant_type(QuantType::Q4K)
            .build();
        let label = if batch >= 4 { 3 } else { 2 };
        (features, label)
    })
    .collect();

classifier.train(&training_data)?;

Full Tuner Recommendations

The BrickTuner combines throughput and kernel predictions with experiment suggestions:

use trueno::tuner::{BrickTuner, TunerFeatures, QuantType};

let tuner = BrickTuner::new();
let features = TunerFeatures::builder()
    .model_params_b(1.5)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .gpu_mem_bw_gbs(1000.0)
    .cuda_graphs(true)
    .build();

let rec = tuner.recommend(&features);

println!("Throughput: {:.1} tok/s", rec.throughput.predicted_tps);
println!("Best kernel: {:?}", rec.kernel.top_kernel);
println!("Experiments to try:");
for exp in &rec.suggested_experiments {
    println!("  - {}", exp);
}

Running the Demo

# Default (heuristic models)
cargo run --example ml_tuner_demo

# With RandomForest models
cargo run --example ml_tuner_demo --features ml-tuner

Integration with ComputeBrick

The ML tuner integrates with ComputeBrick kernel selection:

use trueno::compute::{ComputeBrick, ComputeBrickConfig};
use trueno::tuner::{BrickTuner, TunerFeatures};

// Build features from runtime environment
let features = TunerFeatures::from_env()?;

// Get tuner recommendation
let tuner = BrickTuner::new();
let rec = tuner.recommend(&features);

// Configure ComputeBrick with recommended kernel
let config = ComputeBrickConfig::builder()
    .kernel(rec.kernel.top_kernel)
    .batch_size(features.batch_size())
    .build();

let brick = ComputeBrick::with_config(config)?;

Performance Considerations

  1. Feature extraction is cheap: TunerFeatures::to_vector() is O(1)
  2. Heuristic prediction is instant: No ML inference overhead
  3. RF inference scales with trees: 100 trees ≈ 1ms inference
  4. Train once, predict many: Cache trained models for repeated use
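
For example (a minimal sketch using only the APIs shown earlier), constructing the tuner once and reusing it across many feature vectors keeps the per-prediction cost to feature construction plus inference:

use trueno::tuner::{BrickTuner, QuantType, TunerFeatures};

// Build the tuner once, then reuse it for every prediction
let tuner = BrickTuner::new();
for batch in [1u32, 2, 4, 8] {
    let features = TunerFeatures::builder()
        .model_params_b(1.5)
        .batch_size(batch)
        .quant_type(QuantType::Q4K)
        .gpu_mem_bw_gbs(1000.0)
        .build();
    let rec = tuner.recommend(&features);
    println!(
        "M={batch}: {:.1} tok/s via {:?}",
        rec.throughput.predicted_tps, rec.kernel.top_kernel
    );
}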

Phase 14: ML-Tuner Evolution

Phase 14 (E.12) adds deeper ML integration for production deployment:

MLT-10: Pre-trained Weights

Ship with pre-trained weights from CI benchmark corpus:

use trueno::tuner::{BrickTuner, pretrained};

// Use pre-trained weights (8.2% MAPE, 10,000+ samples)
let tuner = BrickTuner::with_pretrained();
println!("Version: {}", tuner.version()); // "1.0.0-pretrained"
println!("MAPE: {:.1}%", tuner.throughput_mape() * 100.0);

// Access feature importance for explainability
for (_idx, name, importance) in pretrained::FEATURE_IMPORTANCE.iter().take(5) {
    println!("{}: {:.1}%", name, importance * 100.0);
}
// Output:
//   batch_size: 28.0%
//   gpu_mem_bw: 18.0%
//   model_params_b: 14.0%

MLT-11: First-Run Calibration

Calibrate to local hardware (requires hardware-detect feature):

use trueno::tuner::BrickTuner;

let mut tuner = BrickTuner::with_pretrained();

#[cfg(feature = "hardware-detect")]
{
    let result = tuner.calibrate()?;
    println!("Calibrated to: {}", result.hardware_id);
    println!("Local MAPE: {:.1}%", result.local_mape * 100.0);
    println!("Improvement: {:.1}%", result.improvement_pct);
}

MLT-12: Online Learning

Continuously improve predictions during inference:

use trueno::tuner::{BrickTuner, TunerFeatures, QuantType};

let tuner = BrickTuner::with_pretrained();
let mut learner = tuner.online_learner();

// During inference loop
let features = TunerFeatures::builder()
    .model_params_b(7.0)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .gpu_mem_bw_gbs(1000.0)
    .build();

for measured_tps in [150.0, 155.0, 152.0, 158.0] {
    learner.observe(&features.to_vector(), measured_tps);
}

println!("Updates: {}", learner.num_updates());
println!("EMA Loss: {:.4}", learner.ema_loss());
println!("Converging: {}", learner.is_converging());

// Apply learned weights back to tuner
let mut updated_tuner = tuner.clone();
updated_tuner.apply_online_updates(&learner);

MLT-13: Bandit Kernel Selection

Explore vs exploit kernel choices using UCB1 or Thompson Sampling:

use trueno::tuner::{BrickTuner, KernelBandit, TunerFeatures, QuantType};

let tuner = BrickTuner::with_pretrained();
let mut bandit = tuner.kernel_bandit(); // UCB1 by default
// Or: let mut bandit = KernelBandit::with_thompson_sampling();

let features = TunerFeatures::builder()
    .model_params_b(7.0)
    .batch_size(4)
    .quant_type(QuantType::Q4K)
    .build();

// Production loop with exploration
for _ in 0..100 {
    let rec = tuner.recommend_kernel_with_exploration(&features, &bandit, 0.2);

    // Use rec.top_kernel for inference...
    let measured_tps = /* actual measurement */ 150.0;

    // Update bandit with normalized reward
    let reward = (measured_tps / 200.0).min(1.0);
    bandit.update(rec.top_kernel, reward);
}

println!("Best kernel: {:?}", bandit.best_kernel());
println!("Exploration rate: {:.2}", bandit.exploration_rate());
println!("Cumulative regret: {:.2}", bandit.estimated_regret());

Running the Evolution Demo

# Phase 14 demo (pre-trained + online learning + bandits)
cargo run --example ml_tuner_evolution

# With hardware calibration
cargo run --example ml_tuner_evolution --features hardware-detect

Further Reading