Text Inference

Helix supports any model available through Ollama or vLLM. The platform detects the runtime from the model name: if the name contains a colon (like qwen3:8b), the model runs through Ollama; if it contains a slash (like Qwen/Qwen2.5-VL-7B-Instruct), it runs through vLLM.
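
The naming convention alone is enough to pick a runtime. As a purely illustrative sketch of that heuristic (this is not Helix's internal code, and runtime_for_model is a hypothetical helper name):

```bash
# Illustrative only: route by model-name shape, slash -> vLLM, colon -> Ollama.
runtime_for_model() {
  case "$1" in
    */*) echo "vllm"    ;;  # HuggingFace-style path, e.g. Qwen/Qwen2.5-VL-7B-Instruct
    *:*) echo "ollama"  ;;  # Ollama tag, e.g. qwen3:8b
    *)   echo "unknown" ;;
  esac
}

runtime_for_model "qwen3:8b"                     # -> ollama
runtime_for_model "Qwen/Qwen2.5-VL-7B-Instruct"  # -> vllm
```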
Listing Available Models
Query the Helix API to see which models are configured on your instance:
```bash
curl -s https://your-helix-server/v1/models \
  -H "Authorization: Bearer $HELIX_API_KEY" | jq
```

Or using the CLI:

```bash
helix model list
```
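
If you only need the model identifiers, a jq filter works; this assumes the endpoint returns the usual OpenAI-style data array of objects with an id field:

```bash
# Print just the model IDs (assumes an OpenAI-style "data" array in the response)
curl -s https://your-helix-server/v1/models \
  -H "Authorization: Bearer $HELIX_API_KEY" | jq -r '.data[].id'
```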
Adding Models
Administrators can add any Ollama or vLLM model. For Ollama models, use the standard Ollama tag format (e.g., llama3:instruct, qwen3:32b, mixtral:instruct). For vLLM models, use the HuggingFace model path (e.g., Qwen/Qwen2.5-VL-7B-Instruct).
```bash
curl -X POST https://your-helix-server/api/v1/helix-models \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "qwen3:14b",
    "name": "Qwen3 14B",
    "type": "chat",
    "runtime": "ollama",
    "description": "Mid-size Qwen3 model with strong reasoning",
    "context_length": 40960,
    "enabled": true
  }'
```

Models can also be configured in the Helix admin interface. Each model can specify memory requirements, context length, concurrency limits, and whether to prewarm on runners.
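
Adding a vLLM model should follow the same shape, with the HuggingFace path as the ID and the runtime set to vllm; the values below are illustrative rather than a definitive configuration:

```bash
# Illustrative: register a vLLM model by its HuggingFace path (field values are examples)
curl -X POST https://your-helix-server/api/v1/helix-models \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "Qwen/Qwen2.5-VL-7B-Instruct",
    "name": "Qwen2.5 VL 7B Instruct",
    "type": "chat",
    "runtime": "vllm",
    "enabled": true
  }'
```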
Running Inference
Send a chat completion request to the OpenAI-compatible API:
```bash
curl https://your-helix-server/v1/chat/completions \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }'
```

The scheduler automatically routes requests to available runners with sufficient GPU memory for the requested model.
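
Because the endpoint is OpenAI-compatible, the reply text can be pulled straight out of the response with jq; this assumes the standard choices[0].message.content layout. Streaming with "stream": true should also behave as it does with the OpenAI API, though it is worth confirming on your Helix version.

```bash
# Extract just the reply text (assumes the standard OpenAI choices/message layout)
curl -s https://your-helix-server/v1/chat/completions \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }' | jq -r '.choices[0].message.content'
```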