## First Server
Serve a model as an HTTP API:

```bash
apr serve model.gguf --port 8080
```
Then query it:

```bash
curl http://localhost:8080/v1/completions \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is 2+2?", "max_tokens": 32}'
```
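The same request can be made from Python. A minimal sketch using only the standard library, assuming the server from the step above is running on port 8080 (the `complete` helper name is illustrative, not part of the tool):

```python
import json
import urllib.request

def complete(prompt: str, max_tokens: int = 32,
             base_url: str = "http://localhost:8080") -> dict:
    """POST to /v1/completions, mirroring the curl example above.

    Assumes `apr serve` is running and listening on `base_url`.
    """
    payload = json.dumps({"prompt": prompt, "max_tokens": max_tokens}).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

# The request body is identical to the one sent by curl:
body = {"prompt": "What is 2+2?", "max_tokens": 32}
print(json.dumps(body))
```

Calling `complete("What is 2+2?")` returns the decoded JSON response once the server is up.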
## OpenAI-Compatible API
The server implements the OpenAI-compatible completions API, including chat completions:
```bash
# Chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello"}],
    "max_tokens": 100
  }'
```
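The chat endpoint takes a list of role-tagged messages rather than a bare prompt. A Python sketch of the same request, again stdlib-only and assuming the server is running locally (the `chat` helper and the response-shape comment reflect the standard OpenAI chat schema, not a documented guarantee of this server):

```python
import json
import urllib.request

def chat(messages: list, max_tokens: int = 100,
         base_url: str = "http://localhost:8080") -> dict:
    """POST to /v1/chat/completions (assumes the server is running)."""
    payload = json.dumps(
        {"messages": messages, "max_tokens": max_tokens}
    ).encode()
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

messages = [{"role": "user", "content": "Hello"}]
print(json.dumps({"messages": messages, "max_tokens": 100}))

# In the standard OpenAI chat schema, the reply text lives at
# choices[0]["message"]["content"]:
# reply = chat(messages)["choices"][0]["message"]["content"]
```

Multi-turn conversations work the same way: append each assistant reply and the next user message to `messages` before the next call.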