Interactive chat with language model

Usage: apr chat [OPTIONS] <FILE>

Arguments:
  <FILE>  Path to .apr model file

Options:
      --temperature <TEMPERATURE>  Sampling temperature (0 = greedy, higher = more random) [default: 0.7]
      --top-p <TOP_P>              Nucleus sampling threshold [default: 0.9]
      --max-tokens <MAX_TOKENS>    Maximum tokens to generate per response [default: 512]
      --system <SYSTEM>            System prompt to set model behavior
      --inspect                    Show inspection info (top-k probs, tokens/sec)
      --no-gpu                     Disable GPU acceleration (use CPU)
      --gpu                        Force GPU acceleration (requires CUDA)
      --trace                      Enable inference tracing (APR-TRACE-001)
      --trace-steps <TRACE_STEPS>  Trace specific steps only (comma-separated)
      --trace-verbose              Verbose tracing
      --trace-output <FILE>        Save trace output to JSON file
      --trace-level <LEVEL>        Trace detail level (none, basic, layer, payload) [default: basic]
      --profile                    Enable inline Roofline profiling (PMAT-SHOWCASE-METHODOLOGY-001)
      --backend <BACKEND>          PMAT-488: Compute backend override (cuda, cpu, wgpu)
      --json                       Output as JSON
  -v, --verbose                    Verbose output
  -q, --quiet                      Quiet mode (errors only)
      --offline                    Disable network access (Sovereign AI compliance, Section 9)
      --skip-contract              Skip tensor contract validation (PMAT-237: use with diagnostic tooling)
  -h, --help                       Print help
  -V, --version                    Print version
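For reference, a few example invocations sketched from the options above. The model path `./model.apr` and the system-prompt text are placeholders, and these assume an `apr` binary on your PATH; only the flags themselves come from the help text.

```shell
# Basic interactive chat with a local .apr model (path is a placeholder)
apr chat ./model.apr --system "You are a concise assistant." --temperature 0.2

# CPU-only run with layer-level tracing saved to a JSON file for later inspection
apr chat ./model.apr --no-gpu --trace --trace-level layer --trace-output trace.json

# Scriptable output: greedy sampling (temperature 0), JSON output, quiet mode
apr chat ./model.apr --temperature 0 --max-tokens 256 --json -q
```

Note that `--no-gpu` and `--gpu` are mutually exclusive in intent (force CPU vs. force CUDA); `--backend` offers the finer-grained override (cuda, cpu, wgpu).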