Whisper Transcription

Transcribe audio files using whisper.apr and the APR v2 model format.

Example

cargo run --example whisper_transcribe

Code

//! Whisper Transcription Example
//!
//! Demonstrates audio transcription with whisper.apr.

use apr_cookbook::prelude::*;

fn main() -> Result<()> {
    let mut ctx = RecipeContext::new("whisper_transcribe")?;

    // Create whisper model (simulated)
    let model = WhisperModel::new(WhisperConfig {
        size: ModelSize::Small,
        quantization: Quantization::Int8,
        language: None, // Auto-detect
    });

    println!("Model: {} ({})", model.name(), model.size_category());
    println!("Quantization: {:?}", model.quantization());

    // Transcribe audio
    let audio = generate_test_audio(16000, 3.0); // 3 seconds at 16kHz
    let result = model.transcribe(&audio, TranscribeOptions::default())?;

    println!("\n=== Transcription ===");
    println!("Text: {}", result.text);
    println!("Language: {} ({:.1}% confidence)",
        result.language, result.language_confidence * 100.0);

    if !result.segments.is_empty() {
        println!("\n=== Segments ===");
        for seg in &result.segments {
            println!("[{:.2}s - {:.2}s] {}", seg.start, seg.end, seg.text);
        }
    }

    ctx.record_float_metric("confidence", result.language_confidence as f64);
    ctx.report()?;

    Ok(())
}
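The `generate_test_audio` helper above is part of the cookbook crate; as an illustrative stand-in, a version can be sketched that produces a 440 Hz sine wave as f32 PCM in [-1.0, 1.0], the mono 16 kHz layout Whisper-style models expect (the exact signal the crate generates is an assumption):

```rust
// Hypothetical stand-in for the example's generate_test_audio helper:
// `secs` seconds of a 440 Hz sine wave sampled at `rate` Hz,
// as f32 PCM in [-1.0, 1.0].
fn generate_test_audio(rate: u32, secs: f32) -> Vec<f32> {
    let n = (rate as f32 * secs) as usize;
    (0..n)
        .map(|i| {
            let t = i as f32 / rate as f32;
            // Half-amplitude sine to stay well inside [-1.0, 1.0]
            (2.0 * std::f32::consts::PI * 440.0 * t).sin() * 0.5
        })
        .collect()
}

fn main() {
    let audio = generate_test_audio(16000, 3.0);
    println!("samples: {}", audio.len()); // 48000 samples = 3 s at 16 kHz
}
```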

Key Features

Language Detection

Whisper automatically detects the spoken language:

let result = model.transcribe(&audio, TranscribeOptions::default())?;
println!("Detected: {} ({:.1}%)", result.language, result.language_confidence * 100.0);

Timestamps

Enable word-level timestamps:

let options = TranscribeOptions {
    with_timestamps: true,
    word_timestamps: true,
    ..Default::default()
};

let result = model.transcribe(&audio, options)?;
for word in &result.words {
    println!("[{:.2}s] {}", word.start, word.text);
}
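A common use for segment timestamps is subtitle export. The sketch below converts timestamped segments into SRT-style blocks; the `Segment` struct mirrors the fields the example prints (`start`, `end`, `text`) but is a local stand-in, not the whisper.apr type:

```rust
// Local stand-in mirroring the segment fields used in the example.
struct Segment { start: f32, end: f32, text: String }

// Format seconds as an SRT timestamp: HH:MM:SS,mmm
fn srt_time(t: f32) -> String {
    let ms = (t * 1000.0).round() as u64;
    format!("{:02}:{:02}:{:02},{:03}",
        ms / 3_600_000, (ms / 60_000) % 60, (ms / 1000) % 60, ms % 1000)
}

// One numbered SRT block per segment.
fn to_srt(segments: &[Segment]) -> String {
    segments.iter().enumerate()
        .map(|(i, s)| format!("{}\n{} --> {}\n{}\n",
            i + 1, srt_time(s.start), srt_time(s.end), s.text))
        .collect::<Vec<_>>()
        .join("\n")
}

fn main() {
    let segs = vec![
        Segment { start: 0.0, end: 2.5, text: "Hello world.".into() },
        Segment { start: 2.5, end: 5.0, text: "Second line.".into() },
    ];
    println!("{}", to_srt(&segs));
}
```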

Quantized Models

Use Int4/Int8 quantization for smaller models:

// Load Int8 quantized model (4x smaller)
const MODEL: &[u8] = include_bytes!("whisper-small-int8.apr");
let model = WhisperModel::from_apr_bytes(MODEL)?;

// Load Int4 quantized model (8x smaller)
const MODEL_Q4: &[u8] = include_bytes!("whisper-small-int4.apr");
let model_q4 = WhisperModel::from_apr_bytes(MODEL_Q4)?;
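The size figures above follow from the storage cost per weight: 4 bytes for f32 versus 1 byte for Int8. As an illustration (not the APR v2 on-disk format), symmetric per-tensor quantization maps each f32 weight to one i8 plus a single shared scale:

```rust
// Illustrative symmetric Int8 quantization: each f32 becomes one i8,
// with a single shared scale factor, hence ~4x smaller storage.
fn quantize_int8(weights: &[f32]) -> (Vec<i8>, f32) {
    let max = weights.iter().fold(0.0f32, |m, w| m.max(w.abs()));
    let scale = if max == 0.0 { 1.0 } else { max / 127.0 };
    let q = weights.iter()
        .map(|w| (w / scale).round().clamp(-127.0, 127.0) as i8)
        .collect();
    (q, scale)
}

// Recover approximate f32 weights at inference time.
fn dequantize(q: &[i8], scale: f32) -> Vec<f32> {
    q.iter().map(|&v| v as f32 * scale).collect()
}

fn main() {
    let weights = vec![0.9f32, -0.45, 0.1, 0.0];
    let (q, scale) = quantize_int8(&weights);
    let restored = dequantize(&q, scale);
    println!("quantized: {:?}, scale: {}", q, scale);
    println!("restored:  {:?}", restored); // close to, not equal to, the input
}
```

Int4 halves the per-weight cost again (two weights per byte), which is where the 8x figure comes from, at a further precision cost.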

Falsifiable Claims

F5: whisper.apr Int8 model achieves WER <10% on LibriSpeech test-clean.

#[test]
fn f5_speech_recognition_wer() {
    // `reference` is the LibriSpeech test-clean transcript,
    // `hypothesis` the model output for the same audio.
    let wer = calculate_wer(reference, hypothesis);
    assert!(wer < 0.10, "FALSIFIED: WER {:.2}% >= 10%", wer * 100.0);
}
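The `calculate_wer` helper used in the test is not shown; a minimal sketch under the standard definition (word-level Levenshtein distance, i.e. substitutions + insertions + deletions, divided by the reference word count) could look like this. The function name matches the test, but the body is an assumption:

```rust
// Word error rate: word-level edit distance over reference length.
fn calculate_wer(reference: &str, hypothesis: &str) -> f64 {
    let r: Vec<&str> = reference.split_whitespace().collect();
    let h: Vec<&str> = hypothesis.split_whitespace().collect();
    // Standard dynamic-programming edit distance over words.
    let mut d = vec![vec![0usize; h.len() + 1]; r.len() + 1];
    for i in 0..=r.len() { d[i][0] = i; }
    for j in 0..=h.len() { d[0][j] = j; }
    for i in 1..=r.len() {
        for j in 1..=h.len() {
            let sub = if r[i - 1] == h[j - 1] { 0 } else { 1 };
            d[i][j] = (d[i - 1][j - 1] + sub)      // substitution / match
                .min(d[i - 1][j] + 1)               // deletion
                .min(d[i][j - 1] + 1);              // insertion
        }
    }
    d[r.len()][h.len()] as f64 / r.len().max(1) as f64
}

fn main() {
    let wer = calculate_wer("the cat sat on the mat", "the cat sat on a mat");
    println!("WER: {:.1}%", wer * 100.0); // one substitution out of six words
}
```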

Tests

#[test]
fn test_language_detection() {
    let model = WhisperModel::new(Default::default());
    let audio = generate_english_audio();

    let result = model.transcribe(&audio, Default::default()).unwrap();
    assert_eq!(result.language, "en");
}

#[test]
fn test_timestamp_consistency() {
    let model = WhisperModel::new(Default::default());
    let audio = generate_test_audio(16000, 5.0);

    let result = model.transcribe(&audio, TranscribeOptions {
        with_timestamps: true,
        ..Default::default()
    }).unwrap();

    // Segments should be ordered and non-overlapping
    for window in result.segments.windows(2) {
        assert!(window[0].end <= window[1].start);
    }
}