Smart Inference

GPU Spot Marketplace

Rent spot GPUs directly through the Smart Inference API. Browse live offers from multiple providers, spin up a GPU in one API call, and tear it down when you’re done. Same API key, same billing — no separate accounts needed.

  • One API call to browse offers and rent a GPU
  • Spot pricing — typically 60-80% cheaper than on-demand
  • Multi-provider — aggregated offers across GPU clouds
  • Same billing — deducted from your Smart Inference credit balance

When to use GPU rental

Most Smart Inference requests are routed to hosted API providers automatically — you never touch a GPU. GPU rental is for two cases:

  1. Cold models — you want to run a HuggingFace model that isn’t in the hosted catalogue. The first request triggers a spot GPU deployment, or you can rent one explicitly.
  2. Dedicated capacity — you need a GPU for a sustained workload (fine-tuning, batch inference, custom serving) and want to control the instance directly.

Availability tiers

Every model in the Smart Inference catalogue has an availability status. You can check it via GET /v1/models.

Tier        Meaning
instant     Served by hosted API providers or an active spot GPU. No wait time.
deploying   A spot GPU is being provisioned. estimated_deploy_seconds gives the ETA.
cold        Can be deployed on a spot GPU. The first request triggers deployment (3-10 min) and places a credit hold.

Models with instant availability don’t need GPU rental — the router handles them. Models marked cold or deploying involve spot GPUs.
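As a rough illustration of how a client might act on these tiers, the sketch below groups models from an already-parsed GET /v1/models response. It assumes an OpenAI-style list payload ({"data": [...]}) and a per-model field named "availability"; both are assumptions made for illustration, so check the actual response for the real field names. Only estimated_deploy_seconds is taken from the table above.

```python
from collections import defaultdict

# Tiers that involve a spot GPU, per the table above.
SPOT_TIERS = ("cold", "deploying")


def group_by_availability(models_response: dict) -> dict:
    """Group model ids by availability tier.

    ASSUMPTION: the payload looks like {"data": [{"id": ..., "availability": ...}]}.
    The "availability" key is hypothetical; adjust it to the real response.
    """
    groups = defaultdict(list)
    for model in models_response.get("data", []):
        tier = model.get("availability", "unknown")
        groups[tier].append(model["id"])
    return dict(groups)


def involves_spot_gpu(model: dict) -> bool:
    """True when the model is cold or deploying, i.e. not served instantly."""
    return model.get("availability") in SPOT_TIERS


# Hand-written sample payload, not real API output.
sample = {
    "data": [
        {"id": "model-a", "availability": "instant"},
        {"id": "model-b", "availability": "cold"},
        {"id": "model-c", "availability": "deploying", "estimated_deploy_seconds": 120},
    ]
}

groups = group_by_availability(sample)
waiting = [m["id"] for m in sample["data"] if involves_spot_gpu(m)]
```

Taking the parsed response as an argument keeps the sketch independent of any particular HTTP client or base URL.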

Cold-start flow

When you send a request to a cold model:

  1. A credit hold is placed for the minimum session cost (typically $0.20-$0.50)
  2. A spot GPU is provisioned and the model is loaded (3-10 minutes)
  3. While the model deploys, you receive a 202 response with a Retry-After header
  4. Once the GPU is ready, subsequent requests are served immediately
  5. After 15 minutes of inactivity, the GPU is automatically terminated

Example 202 response body returned while the model is deploying:

{
  "status": "deploying",
  "message": "Model 'meta-llama/Llama-3.3-70B-Instruct' is being deployed on a spot GPU. Estimated ready in ~5 minutes.",
  "type": "model_deploying",
  "estimated_ready_seconds": 300
}

Note: The OpenAI SDK’s default retry logic does not handle 202 responses. If you depend on cold-starting models, wrap your calls in a retry loop that honors the Retry-After header.

Next steps