Introduction

Alimentar ("to feed" in Spanish) is a pure Rust data loading, transformation, and distribution library for the paiml sovereign AI stack. It provides HuggingFace-compatible functionality with a sovereignty-first design.

Why Alimentar?

The modern ML ecosystem often requires cloud connectivity, Python dependencies, and complex FFI bridges. Alimentar takes a different approach:

  • Sovereignty-first - Local storage by default, no mandatory cloud dependency
  • Pure Rust - No Python, no FFI (fully WASM-compatible)
  • Zero-copy - Arrow RecordBatch throughout for maximum efficiency
  • Ecosystem-aligned - Arrow 53, Parquet 53 (matches trueno, aprender)

Key Features

Data Loading

Load data from multiple sources with a unified API:

use alimentar::{ArrowDataset, DataLoader};

// Load from various formats
let csv_data = ArrowDataset::from_csv("data.csv", None)?;
let json_data = ArrowDataset::from_json("data.json", None)?;
let parquet_data = ArrowDataset::from_parquet("data.parquet")?;

Transformations

Apply chainable transformations to your data:

use alimentar::{ArrowDataset, Chain, Dataset, Normalize, Select};

let dataset = ArrowDataset::from_parquet("train.parquet")?
    .with_transform(Chain::new(vec![
        Box::new(Select::new(vec!["feature1", "feature2", "label"])),
        Box::new(Normalize::zscore(vec!["feature1", "feature2"])),
    ]));
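To make the semantics concrete: Normalize::zscore standardizes each named column to zero mean and unit variance. The arithmetic can be sketched in plain Rust (a simplified illustration, not alimentar's actual implementation, which operates on Arrow arrays):

```rust
/// Standardize values to zero mean and unit variance (z-score).
/// Simplified sketch; population (not sample) variance is used here.
fn zscore(values: &[f64]) -> Vec<f64> {
    let n = values.len() as f64;
    let mean = values.iter().sum::<f64>() / n;
    let var = values.iter().map(|v| (v - mean).powi(2)).sum::<f64>() / n;
    let std = var.sqrt();
    values.iter().map(|v| (v - mean) / std).collect()
}

fn main() {
    // Input with mean 3.0 and std sqrt(2); output sums to ~0
    let normalized = zscore(&[1.0, 2.0, 3.0, 4.0, 5.0]);
    println!("{normalized:?}");
}
```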

DataLoader

Iterate over batches with shuffling support:

let loader = DataLoader::new(dataset)
    .batch_size(32)
    .shuffle(true);

// Each batch is an Arrow RecordBatch
for batch in loader {
    // Process batch
    println!("Batch with {} rows", batch.num_rows());
}
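Conceptually, the loader splits the dataset into fixed-size batches, with a smaller final batch when the row count is not a multiple of the batch size. A toy illustration of that chunking in plain Rust (not alimentar's internals):

```rust
fn main() {
    // 70 "rows" at batch size 32 -> batches of 32, 32, and 6
    let rows: Vec<u32> = (0..70).collect();
    let batch_size = 32;

    // chunks() yields full batches plus one final partial batch
    let batch_sizes: Vec<usize> = rows.chunks(batch_size).map(|b| b.len()).collect();
    println!("{batch_sizes:?}"); // [32, 32, 6]
}
```

With shuffle(true), the row order is randomized before batching, so each epoch sees different batch compositions.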

Storage Backends

Store and retrieve datasets from multiple backends:

use alimentar::backend::{LocalBackend, S3Backend, MemoryBackend};

// Local filesystem
let local = LocalBackend::new("/data/datasets")?;

// S3-compatible storage
let s3 = S3Backend::builder()
    .bucket("my-datasets")
    .region("us-west-2")
    .build()
    .await?;

// In-memory (for WASM/testing)
let memory = MemoryBackend::new();

Registry

Publish and discover datasets:

use alimentar::Registry;

let registry = Registry::new("/data/registry")?;

// Publish a dataset
registry.publish("my-dataset", dataset, metadata)?;

// Pull a dataset
let dataset = registry.pull("my-dataset", None)?;

// Search datasets
let results = registry.search("classification")?;

Architecture

┌─────────────────────────────────────────────────────────────┐
│                        alimentar                            │
├─────────────────────────────────────────────────────────────┤
│  Importers          │  Core            │  Exporters         │
│  ─────────          │  ────            │  ─────────         │
│  • HuggingFace Hub  │  • Dataset       │  • Local FS        │
│  • Local files      │  • DataLoader    │  • S3-compatible   │
│  • S3-compatible    │  • Transforms    │  • Registry API    │
│  • HTTP/HTTPS       │  • Streaming     │                    │
└─────────────────────────────────────────────────────────────┘
                              │
        ┌─────────────────────┼─────────────────────┐
        ▼                     ▼                     ▼
   trueno                aprender              assetgen
   (SIMD/GPU)            (ML/DL)              (Content)

Quick Example

Here's a complete example of a typical ML data pipeline:

use alimentar::{
    ArrowDataset, DataLoader, Dataset,
    Select, FillNull, FillStrategy, Normalize, Chain,
};

fn main() -> alimentar::Result<()> {
    // Load training data
    let dataset = ArrowDataset::from_parquet("train.parquet")?;

    // Apply preprocessing transforms
    let processed = dataset.with_transform(Chain::new(vec![
        // Select relevant columns
        Box::new(Select::new(vec!["age", "income", "score", "label"])),
        // Handle missing values
        Box::new(FillNull::new("age", FillStrategy::Mean)),
        Box::new(FillNull::new("income", FillStrategy::Median)),
        // Normalize features
        Box::new(Normalize::zscore(vec!["age", "income", "score"])),
    ]));

    // Create data loader with batching and shuffling
    let loader = DataLoader::new(processed)
        .batch_size(64)
        .shuffle(true);

    // Iterate over batches for training
    for batch in loader {
        println!("Training on batch with {} rows", batch.num_rows());
        // Train your model here
    }

    Ok(())
}
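The FillNull steps above perform mean and median imputation. The idea behind FillStrategy::Mean can be sketched in plain Rust over optional values (a simplified illustration; alimentar works on nullable Arrow arrays):

```rust
/// Replace missing values with the mean of the present ones.
/// Simplified sketch of mean imputation; assumes at least one present value.
fn fill_mean(values: &[Option<f64>]) -> Vec<f64> {
    let present: Vec<f64> = values.iter().flatten().copied().collect();
    let mean = present.iter().sum::<f64>() / present.len() as f64;
    values.iter().map(|v| v.unwrap_or(mean)).collect()
}

fn main() {
    // The missing middle value is filled with the mean of 1.0 and 3.0
    let filled = fill_mean(&[Some(1.0), None, Some(3.0)]);
    println!("{filled:?}"); // [1.0, 2.0, 3.0]
}
```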

Next Steps