Smart Inference

GPU API reference

Four endpoints for renting spot GPUs directly. All paths sit under `/v1/gpu`. Requests authenticate with the same Bearer token used for chat completions.

GET /v1/gpu/offers

Lists available spot GPU offers across all providers, with current pricing and interruption probabilities.

{
  "object": "list",
  "count": 79,
  "data": [
    {
      "gpu_model": "RTX 4090",
      "vram_gb": 24,
      "provider_id": "vast-ai",
      "region": "Quebec, CA",
      "price_per_gpu_hour": "0.287",
      "interruption_probability": 0.02,
      "spot": true
    }
  ]
}
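A minimal sketch of consuming this response in Python. In the sample above, `price_per_gpu_hour` is a string and `interruption_probability` is a number, so the sketch parses the price with `Decimal` before comparing. The helper name `cheapest_offer` and its `max_interruption` parameter are illustrative and not part of the API.

```python
import json
from decimal import Decimal

# Sample payload copied from the response shown above (one offer listed).
SAMPLE_RESPONSE = """
{
  "object": "list",
  "count": 79,
  "data": [
    {
      "gpu_model": "RTX 4090",
      "vram_gb": 24,
      "provider_id": "vast-ai",
      "region": "Quebec, CA",
      "price_per_gpu_hour": "0.287",
      "interruption_probability": 0.02,
      "spot": true
    }
  ]
}
"""


def cheapest_offer(payload, max_interruption=1.0):
    """Return the lowest-priced offer at or below the interruption cap, or None.

    Hypothetical client-side helper: the price field is parsed with Decimal
    because the sample response encodes it as a string.
    """
    candidates = [
        offer
        for offer in payload["data"]
        if offer["interruption_probability"] <= max_interruption
    ]
    return min(
        candidates,
        key=lambda offer: Decimal(offer["price_per_gpu_hour"]),
        default=None,
    )


offer = cheapest_offer(json.loads(SAMPLE_RESPONSE))
print(offer["gpu_model"], offer["provider_id"], offer["price_per_gpu_hour"])
```

Tightening `max_interruption` trades price for stability: with a cap of `0.01`, the single sample offer (at `0.02`) is excluded and the helper returns `None`.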

Query filters

Parameter	Description
`?gpu_model=<name>`	Filter by GPU model (e.g. `RTX 4090`, `H100`)
`?provider=<id>`	Only offers from a specific provider
`?min_vram=<gb>`	Minimum VRAM in GB
`?max_price=<hourly>`	Maximum hourly price per GPU

All filters stack: combine any of them in a single request to narrow the results.
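As a rough illustration, the filters can be combined into one query string like this. The base URL `https://api.example.com` and the helper `build_offers_url` are placeholders assumed for the sketch; only the path and the four parameter names come from the table above.

```python
from urllib.parse import quote, urlencode

# Hypothetical host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_offers_url(gpu_model=None, provider=None, min_vram=None, max_price=None):
    """Build a GET /v1/gpu/offers URL, including only the filters that are set."""
    params = {
        "gpu_model": gpu_model,
        "provider": provider,
        "min_vram": min_vram,
        "max_price": max_price,
    }
    # Drop unset filters, then percent-encode values (spaces become %20).
    query = urlencode(
        {name: value for name, value in params.items() if value is not None},
        quote_via=quote,
    )
    url = f"{BASE_URL}/v1/gpu/offers"
    return f"{url}?{query}" if query else url


print(build_offers_url(gpu_model="RTX 4090", min_vram=24, max_price=0.5))
```

Calling the helper with no arguments yields the unfiltered listing URL.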

POST /v1/gpu/rent

Starts a GPU rental. A credit hold is placed for the minimum session cost before provisioning begins.

Request body

{
  "gpu_model": "RTX 4090",
  "provider_id": "vast-ai",
  "region": "Quebec, CA",
  "model_id": "meta-llama/Llama-3.3-70B-Instruct",
  "docker_image": "vllm/vllm-openai:latest",
  "ssh_key": "ssh-ed25519 AAAA... user@host",
  "fine_tuned": false
}

Optional fields

Field	Description
`ssh_key`	Your public SSH key. If provided, the instance is provisioned with SSH access for debugging or custom setup.
`fine_tuned`	Set to `true` if deploying a fine-tuned model variant. The provisioner uses an appropriate serving configuration.
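The request can be assembled along these lines. This is a sketch under several assumptions: the base URL is a placeholder, the helper `build_rent_request` is hypothetical, the body is sent as JSON with a `Content-Type: application/json` header, and every field in the example other than `ssh_key` and `fine_tuned` is treated as required, which this section implies but does not state. The request is only constructed here, not sent.

```python
import json
from urllib.request import Request

# Hypothetical host; substitute the real API base URL.
BASE_URL = "https://api.example.com"


def build_rent_request(token, gpu_model, provider_id, region, model_id,
                       docker_image, ssh_key=None, fine_tuned=None):
    """Prepare (but do not send) a POST /v1/gpu/rent request."""
    body = {
        "gpu_model": gpu_model,
        "provider_id": provider_id,
        "region": region,
        "model_id": model_id,
        "docker_image": docker_image,
    }
    # Optional fields are included only when the caller supplies them.
    if ssh_key is not None:
        body["ssh_key"] = ssh_key
    if fine_tuned is not None:
        body["fine_tuned"] = fine_tuned

    return Request(
        f"{BASE_URL}/v1/gpu/rent",
        data=json.dumps(body).encode("utf-8"),
        headers={
            # Same Bearer token used for chat completions.
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",  # assumed, not documented here
        },
        method="POST",
    )


request = build_rent_request(
    token="YOUR_API_TOKEN",
    gpu_model="RTX 4090",
    provider_id="vast-ai",
    region="Quebec, CA",
    model_id="meta-llama/Llama-3.3-70B-Instruct",
    docker_image="vllm/vllm-openai:latest",
)
print(request.get_method(), request.full_url)
```

Passing the prepared object to `urllib.request.urlopen` would submit it. Since a credit hold is placed before provisioning begins, it is worth checking the body before sending.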