Repartir: Distributed Computing

repartir is the Sovereign AI Stack’s distributed computing library, providing CPU, GPU, and remote task execution with work-stealing scheduling.

Overview

Attribute    Value
Version      1.1.x
crates.io    repartir
docs.rs      repartir
License      MIT

Key Features

  • 100% Rust, Zero C/C++: Complete auditability for sovereign AI
  • Work-Stealing Scheduler: Based on Blumofe & Leiserson (1999)
  • Multi-Backend Execution: CPU, GPU, and Remote executors
  • Iron Lotus Quality: 95% coverage, 80% mutation score

Architecture

┌─────────────────────────────────────────────────────────────┐
│                    repartir Pool                            │
├─────────────────────────────────────────────────────────────┤
│                      Scheduler                              │
│              (Work-Stealing, Task Queue)                    │
├─────────────────────────────────────────────────────────────┤
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────────────┐  │
│  │ CpuExecutor │  │ GpuExecutor │  │   RemoteExecutor    │  │
│  │             │  │             │  │                     │  │
│  │  Rayon-like │  │    wgpu     │  │   TCP/TLS           │  │
│  │  AVX2/512   │  │ Vulkan/Metal│  │  Multi-Node         │  │
│  │    NEON     │  │ DX12/WebGPU │  │  Distributed        │  │
│  └─────────────┘  └─────────────┘  └─────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
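
As a rough mental model (a conceptual sketch only, not repartir's actual definitions; the real trait lives in repartir::executor), each backend sits behind a common executor interface, and the work-stealing scheduler dispatches each task to the executor matching its requested Backend:

// Conceptual sketch, NOT repartir's API: a common interface
// fronting the three backends shown in the diagram above.
enum Backend {
    Cpu,
    Gpu,
    Remote,
}

struct Task {
    binary: String,
    backend: Backend,
}

trait ComputeExecutor {
    /// Number of tasks this executor can run concurrently.
    fn capacity(&self) -> usize;
    /// Run one task to completion, returning its exit code.
    fn execute(&self, task: &Task) -> i32;
}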

Feature Flags

Feature         Description
cpu (default)   Local multi-core execution with work-stealing
gpu             wgpu GPU compute (Vulkan/Metal/DX12/WebGPU)
remote          TCP-based distributed execution
remote-tls      TLS-secured remote execution
tensor          trueno SIMD tensor integration
checkpoint      trueno-db + Parquet state persistence
tui             Job flow TUI visualization
full            All features enabled

Quick Start

Installation

[dependencies]
repartir = { version = "1.1", features = ["cpu"] }

# With GPU support
repartir = { version = "1.1", features = ["cpu", "gpu"] }

# Full distributed with all features
repartir = { version = "1.1", features = ["full"] }

Basic CPU Pool

use repartir::{Pool, task::{Task, Backend}};

#[tokio::main]
async fn main() -> repartir::error::Result<()> {
    // Create pool with 8 CPU workers
    let pool = Pool::builder()
        .cpu_workers(8)
        .build()?;

    // Submit a task
    let task = Task::builder()
        .binary("./worker")
        .arg("--input").arg("data.csv")
        .backend(Backend::Cpu)
        .build()?;

    let result = pool.submit(task).await?;

    if result.is_success() {
        println!("Output: {}", result.stdout_str()?);
    }

    pool.shutdown().await;
    Ok(())
}

GPU Execution

use repartir::executor::gpu::GpuExecutor;
use repartir::executor::Executor;

#[tokio::main]
async fn main() -> repartir::error::Result<()> {
    // Initialize GPU executor (auto-selects best GPU)
    let gpu = GpuExecutor::new().await?;

    println!("GPU: {}", gpu.device_name());
    println!("Compute units: {}", gpu.capacity());

    // GPU selection priority:
    // 1. Discrete GPU (dedicated graphics)
    // 2. Integrated GPU (CPU-integrated)
    // 3. Software rasterizer (fallback)

    Ok(())
}

Multi-Machine Distribution

Step 1: Start workers on each node

# On node1 (192.168.1.10)
repartir-worker --bind 0.0.0.0:9000

# On node2 (192.168.1.11)
repartir-worker --bind 0.0.0.0:9000

# On node3 (192.168.1.12)
repartir-worker --bind 0.0.0.0:9000

Step 2: Connect from coordinator

use repartir::executor::remote::RemoteExecutor;
use repartir::task::{Task, Backend};

#[tokio::main]
async fn main() -> repartir::error::Result<()> {
    // Connect to remote workers
    let executor = RemoteExecutor::builder()
        .add_worker("192.168.1.10:9000")
        .add_worker("192.168.1.11:9000")
        .add_worker("192.168.1.12:9000")
        .build()
        .await?;

    // Task distributed to available worker
    let task = Task::builder()
        .binary("./gpu-workload")
        .arg("--shard=0")
        .backend(Backend::Gpu)
        .build()?;

    let result = executor.execute(task).await?;
    println!("Result: {:?}", result.stdout_str()?);

    Ok(())
}

TLS-Secured Remote Execution

use repartir::executor::tls::TlsRemoteExecutor;

let executor = TlsRemoteExecutor::builder()
    .add_worker("node1.internal:9443")
    .cert_path("./certs/client.pem")
    .key_path("./certs/client.key")
    .ca_path("./certs/ca.pem")
    .build()
    .await?;

SIMD Tensor Operations

With the tensor feature, repartir integrates with trueno for SIMD-accelerated operations:

use repartir::tensor::{TensorExecutor, Tensor};
use repartir::task::Backend;

#[tokio::main]
async fn main() -> repartir::error::Result<()> {
    let executor = TensorExecutor::builder()
        .backend(Backend::Cpu)  // Uses AVX2/AVX-512/NEON
        .build()?;

    let a = Tensor::from_slice(&[1.0, 2.0, 3.0, 4.0]);
    let b = Tensor::from_slice(&[5.0, 6.0, 7.0, 8.0]);

    // SIMD-accelerated operations
    let sum = executor.add(&a, &b).await?;
    let product = executor.mul(&a, &b).await?;
    let dot = executor.dot(&a, &b).await?;

    println!("Sum: {:?}", sum.as_slice());
    println!("Product: {:?}", product.as_slice());
    println!("Dot product: {}", dot);

    Ok(())
}

Checkpointing

With the checkpoint feature, repartir can persist state using trueno-db and Parquet:

use repartir::checkpoint::CheckpointManager;

let checkpoint = CheckpointManager::new("./checkpoints")?;

// Save state (model_state: any serializable application state)
checkpoint.save("training_epoch_10", &model_state).await?;

// Restore on failure
let state = checkpoint.load("training_epoch_10").await?;

Job Flow TUI

Monitor distributed jobs with the TUI dashboard:

cargo run --bin job-flow --features tui,remote

┌─ Job Flow Monitor ─────────────────────────────────────────┐
│ Workers: 3 active   │  Tasks: 45 pending / 120 completed   │
├─────────────────────┴──────────────────────────────────────┤
│ Node                 │ Status  │ Load │ Tasks │ Uptime     │
├──────────────────────┼─────────┼──────┼───────┼────────────┤
│ 192.168.1.10:9000    │ Active  │ 78%  │ 15    │ 2h 34m     │
│ 192.168.1.11:9000    │ Active  │ 65%  │ 18    │ 2h 34m     │
│ 192.168.1.12:9000    │ Active  │ 82%  │ 12    │ 2h 30m     │
└──────────────────────┴─────────┴──────┴───────┴────────────┘

Integration with Batuta

Batuta uses repartir for distributed orchestration:

use batuta::backend::{select_backend, to_repartir_backend};
use batuta::oracle::types::HardwareSpec;
// OpComplexity and DataSize are assumed to live in the same module:
use batuta::oracle::types::{DataSize, OpComplexity};

// MoE router selects optimal backend
let backend = select_backend(
    OpComplexity::High,
    Some(DataSize::samples(1_000_000)),
    &HardwareSpec {
        has_gpu: true,
        is_distributed: true,
        node_count: Some(4),
        ..Default::default()
    },
);

// Map to repartir backend
let repartir_backend = to_repartir_backend(backend);

Backend Selection Criteria

Batuta’s MoE router uses the 5x PCIe rule (Gregg & Hazelwood, 2011):

Complexity            Scalar   SIMD       GPU
Low (O(n))            <1M      >1M        Never
Medium (O(n log n))   <10K     10K-100K   >100K
High (O(n³))          <1K      1K-10K     >10K

Thresholds are input sizes in elements.

GPU is beneficial when: compute_time > 5 × transfer_time
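
A minimal sketch of that rule as a predicate (the function name and microsecond timings are illustrative assumptions, not repartir or Batuta API):

// Illustrative only: the 5x PCIe rule as a predicate. GPU offload
// pays off when kernel compute time exceeds five times the
// host<->device transfer time.
fn gpu_worthwhile(compute_us: f64, transfer_us: f64) -> bool {
    compute_us > 5.0 * transfer_us
}

fn main() {
    // 2 ms compute vs 0.3 ms transfer: offload wins (2000 > 1500)
    assert!(gpu_worthwhile(2_000.0, 300.0));
    // 1 ms compute vs 0.5 ms transfer: stay on CPU (1000 < 2500)
    assert!(!gpu_worthwhile(1_000.0, 500.0));
}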

Performance Considerations

Work-Stealing Efficiency

The Blumofe & Leiserson work-stealing algorithm provides:

  • O(T₁/P + T∞) expected execution time on P processors, where T₁ is the total work and T∞ is the critical-path length (worked example after this list)
  • Near-linear speedup for embarrassingly parallel workloads
  • Low contention through randomized stealing
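
For example, a job with T₁ = 64 s of total work and a critical path of T∞ = 1 s has an expected completion time of roughly 64/8 + 1 = 9 s on P = 8 workers, close to the ideal 8x speedup (the figures are hypothetical, chosen only to illustrate the bound).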

GPU vs CPU Decision

use repartir::task::Backend;

// Illustrative heuristic for automatic backend selection; OpComplexity
// is Batuta's complexity rating (see "Integration with Batuta" above)
fn choose_backend(data_size: usize, complexity: OpComplexity) -> Backend {
    if data_size > 100_000 && complexity == OpComplexity::High {
        Backend::Gpu
    } else if data_size > 1_000 {
        Backend::Cpu // SIMD-accelerated (AVX2/AVX-512/NEON)
    } else {
        Backend::Cpu // scalar; too small to amortize dispatch overhead
    }
}

Remote Execution Overhead

  • Serialization: bincode (fast, compact)
  • Network: Length-prefixed TCP messages
  • Latency: ~1ms per task submission (local network)
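
To make the framing concrete, here is a minimal sketch of length-prefixed messaging (illustrative only; repartir's actual wire format is internal and may differ):

use std::io::{Cursor, Read, Write};

// Each frame: a u32 big-endian length header followed by the
// serialized payload (e.g., bincode-encoded task data).
fn write_frame<W: Write>(w: &mut W, payload: &[u8]) -> std::io::Result<()> {
    w.write_all(&(payload.len() as u32).to_be_bytes())?;
    w.write_all(payload)
}

fn read_frame<R: Read>(r: &mut R) -> std::io::Result<Vec<u8>> {
    let mut len = [0u8; 4];
    r.read_exact(&mut len)?;
    let mut buf = vec![0u8; u32::from_be_bytes(len) as usize];
    r.read_exact(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Round-trip one frame through an in-memory buffer
    let mut wire = Vec::new();
    write_frame(&mut wire, b"task payload")?;
    let echoed = read_frame(&mut Cursor::new(wire))?;
    assert_eq!(echoed, b"task payload");
    Ok(())
}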

Comparison with Alternatives

Feature         repartir     Rayon   tokio   Ray
Language        Rust         Rust    Rust    Python
GPU Support     Yes (wgpu)   No      No      Yes
Distributed     Yes          No      No      Yes
Work-Stealing   Yes          Yes     Yes     Yes
TLS             Yes          N/A     Yes     Yes
Pure Rust       Yes          Yes     Yes     No

Example: Distributed ML Training

use repartir::executor::remote::RemoteExecutor;
use repartir::task::{Task, Backend};

async fn distributed_training(
    nodes: &[&str],
    epochs: usize,
) -> repartir::error::Result<()> {
    let executor = RemoteExecutor::builder()
        .add_workers(nodes)
        .build()
        .await?;

    for epoch in 0..epochs {
        // Distribute training shards
        let tasks: Vec<_> = (0..nodes.len())
            .map(|shard| {
                Task::builder()
                    .binary("./train")
                    .arg("--epoch").arg(epoch.to_string())
                    .arg("--shard").arg(shard.to_string())
                    .arg("--total-shards").arg(nodes.len().to_string())
                    .backend(Backend::Gpu)
                    .build()
            })
            .collect::<Result<Vec<_>, _>>()?;

        // Execute shards in parallel (requires the `futures` crate)
        let results = futures::future::try_join_all(
            tasks.into_iter().map(|task| executor.execute(task)),
        )
        .await?;

        for result in &results {
            println!("Shard completed: {:?}", result.exit_code());
        }

        println!("Epoch {} complete", epoch);
    }

    Ok(())
}
