Text Inference

Helix chat interface showing model selection and conversation

Helix supports any model available through Ollama or vLLM. The platform detects the runtime from the model name: if the name contains a colon (like qwen3:8b), the model runs through Ollama; if the name contains a slash (like Qwen/Qwen2.5-VL-7B-Instruct), it runs through vLLM.
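
As a rough sketch of that naming heuristic (illustration only, not Helix's actual detection code), the routing decision can be expressed as a shell case statement:

# Illustration of the colon/slash heuristic described above; not Helix source code.
model="qwen3:8b"
case "$model" in
  */*) echo "$model -> vLLM (HuggingFace model path)" ;;
  *:*) echo "$model -> Ollama (tag format)" ;;
esac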

Listing Available Models

Query the Helix API to see which models are configured on your instance:

curl -s https://your-helix-server/v1/models \
  -H "Authorization: Bearer $HELIX_API_KEY" | jq

Or using the CLI:

helix model list

Adding Models

Administrators can add any Ollama or vLLM model. For Ollama models, use the standard Ollama tag format (e.g., llama3:instruct, qwen3:32b, mixtral:instruct). For vLLM models, use the HuggingFace model path (e.g., Qwen/Qwen2.5-VL-7B-Instruct).

curl -X POST https://your-helix-server/api/v1/helix-models \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "qwen3:14b",
    "name": "Qwen3 14B",
    "type": "chat",
    "runtime": "ollama",
    "description": "Mid-size Qwen3 model with strong reasoning",
    "context_length": 40960,
    "enabled": true
  }'
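
A vLLM-backed model can be registered against the same endpoint. The example below is a sketch: it reuses the fields from the Ollama example, assumes "vllm" is the accepted runtime value, and the context_length shown is illustrative.

curl -X POST https://your-helix-server/api/v1/helix-models \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "Qwen/Qwen2.5-VL-7B-Instruct",
    "name": "Qwen2.5 VL 7B Instruct",
    "type": "chat",
    "runtime": "vllm",
    "description": "Vision-language model served via vLLM",
    "context_length": 32768,
    "enabled": true
  }'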

Models can also be configured in the Helix admin interface. Each model can specify memory requirements, context length, concurrency limits, and whether to prewarm on runners.
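
As a purely hypothetical sketch of such a configuration (the memory, concurrency, and prewarm keys below are placeholder names for the settings described above, not confirmed API fields):

# Hypothetical example: "memory", "concurrency", and "prewarm" are placeholder
# field names for the per-model settings described above, not confirmed API keys.
curl -X POST https://your-helix-server/api/v1/helix-models \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "id": "qwen3:32b",
    "name": "Qwen3 32B",
    "type": "chat",
    "runtime": "ollama",
    "context_length": 40960,
    "memory": 24000000000,
    "concurrency": 2,
    "prewarm": true,
    "enabled": true
  }'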

Running Inference

Send a chat completion request to the OpenAI-compatible API:

curl https://your-helix-server/v1/chat/completions \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }'
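
Assuming the response follows the standard OpenAI chat completion shape (which the OpenAI-compatible API implies), the assistant's reply can be extracted with jq:

curl -s https://your-helix-server/v1/chat/completions \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }' | jq -r '.choices[0].message.content'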

The scheduler automatically routes requests to available runners with sufficient GPU memory for the requested model.
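
For longer generations, OpenAI-compatible endpoints typically also accept the standard stream parameter to return tokens incrementally as server-sent events; this is assumed rather than confirmed for Helix:

curl -N https://your-helix-server/v1/chat/completions \
  -H "Authorization: Bearer $HELIX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "qwen3:8b",
    "stream": true,
    "messages": [{"role": "user", "content": "Explain quantum computing in simple terms"}]
  }'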
