Publishing to HuggingFace Hub

alimentar is the only Rust crate with native HuggingFace Hub upload support. The official hf-hub crate only supports downloads.

Critical Warning: Data Quality Before Publishing

WARNING: Publishing low-quality datasets to HuggingFace is HARMFUL to the ML community.

Poor quality training data leads to:

  • Models that learn incorrect patterns
  • Wasted compute resources on garbage training
  • Propagation of errors across downstream models
  • Reduced trust in the dataset ecosystem

ALWAYS validate your data quality before publishing.

Data Quality Checklist

Before uploading ANY dataset to HuggingFace, verify:

1. Run Quality Score

# Check quality score - MINIMUM Grade B (85%) required
alimentar quality score my_dataset.parquet

# Example output:
# Quality Score: 92.3% (Grade A)
# - Completeness: 98% (no null values in critical columns)
# - Uniqueness: 95% (low duplicate rate)
# - Consistency: 89% (format validation passed)
# - Schema: 100% (all types valid)

2. Quality Grade Requirements

GradeScoreRecommendation
A95%+Excellent - safe to publish
B85-94%Good - review warnings before publishing
C70-84%DO NOT PUBLISH - fix issues first
D<70%REJECTED - major quality problems

3. Use Quality Profiles

# Apply domain-specific quality rules
alimentar quality score --profile ml-training data.parquet
alimentar quality score --profile doctest-corpus doctests.parquet
alimentar quality score --profile code-translation code.parquet

Improving Data Quality

Recipe 1: Clean with aprender

aprender provides ML-focused data cleaning:

# Install aprender
cargo install aprender

# Clean dataset with ML-aware transforms
aprender clean input.parquet --output cleaned.parquet \
    --remove-nulls \
    --deduplicate \
    --validate-types \
    --normalize-text

# Verify improvement
alimentar quality score cleaned.parquet

Recipe 2: Augment with entrenar

entrenar provides training-focused transforms:

# Install entrenar
cargo install entrenar

# Augment dataset for training
entrenar augment input.parquet --output augmented.parquet \
    --balance-classes \
    --add-noise 0.1 \
    --synthetic-samples 1000

# Verify quality maintained
alimentar quality score augmented.parquet

Recipe 3: Full Pipeline

#!/bin/bash
# quality_pipeline.sh - MANDATORY before HF Hub publishing

set -e  # Exit on any error

INPUT="$1"
OUTPUT="$2"
REPO="$3"

echo "=== Step 1: Initial Quality Check ==="
INITIAL=$(alimentar quality score "$INPUT" --json | jq '.score')
echo "Initial quality: $INITIAL%"

if (( $(echo "$INITIAL < 70" | bc -l) )); then
    echo "ERROR: Initial quality too low. Cleaning required."

    echo "=== Step 2: Clean with aprender ==="
    aprender clean "$INPUT" --output /tmp/cleaned.parquet

    echo "=== Step 3: Validate cleaning ==="
    CLEANED=$(alimentar quality score /tmp/cleaned.parquet --json | jq '.score')
    echo "After cleaning: $CLEANED%"

    INPUT="/tmp/cleaned.parquet"
fi

echo "=== Step 4: Final Quality Gate ==="
FINAL=$(alimentar quality score "$INPUT" --json | jq '.score')

if (( $(echo "$FINAL < 85" | bc -l) )); then
    echo "FATAL: Quality score $FINAL% below 85% threshold"
    echo "DO NOT PUBLISH - fix data quality issues first"
    exit 1
fi

echo "=== Step 5: Publish to HuggingFace ==="
alimentar hub push "$INPUT" "$REPO" \
    --readme /tmp/readme.md \
    --message "Quality-validated upload (score: $FINAL%)"

echo "SUCCESS: Published with quality score $FINAL%"

CLI Usage

Basic Upload

# Set your HuggingFace token
export HF_TOKEN="hf_xxxxx"

# Upload parquet file
alimentar hub push data.parquet paiml/my-dataset

# Upload with custom path
alimentar hub push train.parquet paiml/my-dataset \
    --path-in-repo data/train.parquet

# Upload with README
alimentar hub push data.parquet paiml/my-dataset \
    --readme README.md \
    --message "Initial upload with doctest corpus"

With Quality Enforcement

# Recommended: Check quality before publishing
alimentar quality score data.parquet && \
alimentar hub push data.parquet paiml/my-dataset

API Usage

use alimentar::hf_hub::HfPublisher;

// Create publisher
let publisher = HfPublisher::new("paiml/my-dataset")
    .with_token(std::env::var("HF_TOKEN").unwrap())
    .with_commit_message("Upload quality-validated corpus");

// Upload parquet (uses LFS for binary files)
publisher.upload_parquet_file_sync(
    Path::new("data.parquet"),
    "data/train.parquet"
)?;

// Upload README (validates dataset card)
publisher.upload_readme_validated_sync(&readme_content)?;

Technical Details

File Type Detection

File TypeMethodAPI
Text (.md, .json, .csv)Direct NDJSON/api/datasets/{repo}/commit/main
Binary (.parquet, .arrow, .png)LFS Batch/datasets/{repo}.git/info/lfs/objects/batch

LFS Upload Flow

  1. Compute SHA256 hash of file content
  2. POST to LFS batch API with object OID
  3. Extract presigned S3 URL from response
  4. PUT binary content to S3
  5. POST NDJSON commit with lfsFile reference

Common Issues

"Quality score below threshold"

ERROR: Quality score 72% below 85% threshold

Fix: Run aprender clean to address issues:
  - Remove null values: aprender clean --remove-nulls
  - Fix duplicates: aprender clean --deduplicate
  - Validate types: aprender clean --validate-types

"Invalid task_categories"

ERROR: Invalid 'task_categories': 'text2text-generation' is not valid

Fix: Use valid HuggingFace task category:
  - text-generation
  - translation
  - text-classification

See Dataset Card Validation for valid categories.