Case Study: Shell Completion Benchmarks

Sub-millisecond recommendation latency verification using trueno-style criterion benchmarks.

The 10ms UX Threshold

Human perception research (Nielsen, 1993) establishes response time thresholds:

LatencyUser Perception
< 100msInstant
100-1000msNoticeable delay
> 1000msFlow interruption

For shell completion, the bar is higher:

  • Users type 5-10 keystrokes per second
  • Each keystroke needs a suggestion update
  • Target: < 10ms for seamless experience

Benchmark Architecture

The recommendation_latency benchmark follows trueno-style patterns:

//! Performance targets:
//! - Small (~50 commands): <1ms train, <1ms suggest
//! - Medium (~500 commands): <5ms suggest
//! - Large (~5000 commands): <10ms suggest

criterion_group!(
    name = latency_benchmarks;
    config = Criterion::default()
        .sample_size(100)
        .measurement_time(Duration::from_secs(5));
    targets =
        bench_suggestion_latency,
        bench_partial_completion,
        bench_training_throughput,
        bench_cold_start,
        bench_serialization,
        bench_scalability,
        bench_paged_model,
);

Running Benchmarks

# Full benchmark suite
cargo bench --package aprender-shell --bench recommendation_latency

# Specific group
cargo bench --package aprender-shell -- suggestion_latency

# Quick validation (no stats)
cargo bench --package aprender-shell -- --test

Results Analysis

Suggestion Latency

Core metric—time from prefix input to suggestion output.

suggestion_latency/small/prefix/git
                        time:   [1.5345 µs 1.5419 µs 1.5497 µs]
suggestion_latency/small/prefix/kubectl
                        time:   [435.65 ns 437.51 ns 439.58 ns]
suggestion_latency/medium/prefix/git
                        time:   [10.586 µs 10.639 µs 10.694 µs]
suggestion_latency/large/prefix/git
                        time:   [14.399 µs 14.591 µs 14.840 µs]

Analysis:

  • Small model: 437 ns - 1.5 µs (6,500-22,000x under target)
  • Medium model: 1.8 - 10.6 µs (940-5,500x under target)
  • Large model: 671 ns - 14.6 µs (685-14,900x under target)

Scalability

How does latency grow with model size?

scalability/suggest_git/100    time: [1.2 µs]
scalability/suggest_git/500    time: [3.8 µs]
scalability/suggest_git/1000   time: [5.2 µs]
scalability/suggest_git/2000   time: [8.1 µs]
scalability/suggest_git/3000   time: [11.4 µs]
scalability/suggest_git/3790   time: [14.2 µs]

Growth pattern: Sub-linear O(log n), not linear O(n).

Training Throughput

Commands processed per second during model training.

training_throughput/small/46 cmds
                        throughput: [180,000 elem/s]
training_throughput/medium/265 cmds
                        throughput: [85,000 elem/s]
training_throughput/large/3790 cmds
                        throughput: [42,000 elem/s]

Analysis:

  • Small histories: 180K commands/second
  • Large histories: 42K commands/second
  • A 10K command history trains in ~240ms

Cold Start

Time from model load to first suggestion.

cold_start/load_and_suggest
                        time:   [2.8 ms 2.9 ms 3.0 ms]

Analysis: Under 3ms for load + suggest. Shell startup impact is negligible.

Serialization

Model persistence performance.

serialization/serialize_json
                        time:   [1.2 ms]
                        throughput: [450 KB/s]
serialization/deserialize_json
                        time:   [2.1 ms]

Analysis: JSON serialization is fast enough for export/import workflows.

Comparison with Other Tools

ToolSuggestion Latencyaprender-shell Speedup
GitHub Copilot100-500ms10,000-50,000x
TabNine50-200ms5,000-20,000x
Fish shell5-20ms500-2,000x
Zsh compinit10-50ms1,000-5,000x
Bash completion20-100ms2,000-10,000x
aprender-shell0.4-15 µsBaseline

Why Microsecond Latency?

1. Data Structure Choice

Trie (O(k) lookup, k = prefix length)
├── g─i─t─ ─s─t─a─t─u─s
├── c─a─r─g─o─ ─b─u─i─l─d
└── d─o─c─k─e─r─ ─p─s

vs. Linear scan (O(n), n = vocabulary size)

2. No Neural Network

OperationN-gramTransformer
Matrix multiply❌ None✅ O(n²)
Attention❌ None✅ O(n²)
Softmax❌ None✅ O(vocab)
Embedding lookup❌ None✅ O(1)

3. Memory Layout

// Hot path: single HashMap lookup + Trie traversal
let context = self.ngrams.get(&prefix);  // O(1)
let completions = self.trie.find(prefix); // O(k)

No pointer chasing, cache-friendly sequential access.

4. Zero Allocations

Suggestion hot path reuses pre-allocated buffers:

// Pre-allocated result vector
let mut suggestions: Vec<Suggestion> = Vec::with_capacity(5);

Fixture Design

Benchmarks use realistic developer history fixtures:

Small (46 commands)

git status
git add .
git commit -m "Initial commit"
cargo build
cargo test
docker ps
kubectl get pods

Medium (265 commands)

Full developer workflow: git, cargo, docker, kubectl, npm, python, aws, terraform, etc.

Large (3,790 commands)

Production-scale with repeated patterns:

  • 200 git workflow iterations
  • 150 cargo development cycles
  • 100 docker operations
  • 80 kubectl management commands

Adding Custom Benchmarks

Extend the benchmark suite:

fn bench_custom_prefix(c: &mut Criterion) {
    use aprender_shell::MarkovModel;

    let mut group = c.benchmark_group("custom");

    let cmds = parse_commands(MEDIUM_HISTORY);
    let mut model = MarkovModel::new(3);
    model.train(&cmds);

    // Add your prefix
    group.bench_function("my_prefix", |b| {
        b.iter(|| {
            model.suggest(black_box("my-custom-command "), 5)
        });
    });

    group.finish();
}

CI Integration

Add to .github/workflows/benchmark.yml:

- name: Run shell benchmarks
  run: |
    cargo bench --package aprender-shell -- --noplot

- name: Upload results
  uses: actions/upload-artifact@v3
  with:
    name: shell-benchmarks
    path: target/criterion

Key Takeaways

  1. 10ms target easily met: Worst case 14.6 µs = 685x headroom
  2. Scales sub-linearly: O(log n) not O(n)
  3. Cold start negligible: <3ms including model load
  4. No neural overhead: Simple data structures win for pattern matching
  5. Production ready: 5000+ command histories handled efficiently

References

  • Nielsen, J. (1993). Response Times: The 3 Important Limits
  • trueno benchmark patterns: ../trueno/benches/vector_ops.rs
  • Criterion documentation: https://bheisler.github.io/criterion.rs/