Smart Inference

GPU Spot Marketplace

Rent spot GPUs directly through the Smart Inference API. Browse live offers from multiple providers, spin up a GPU in one API call, and tear it down when you’re done. Same API key, same billing — no separate accounts needed.

One API call to browse offers and rent a GPU
Spot pricing — typically 60-80% cheaper than on-demand
Multi-provider — aggregated offers across GPU clouds
Same billing — deducted from your Smart Inference credit balance

When to use GPU rental

Most Smart Inference requests are routed to hosted API providers automatically — you never touch a GPU. GPU rental is for two cases:

Cold models: you want to run a Hugging Face model that isn’t in the hosted catalogue. The first request triggers a spot GPU deployment, or you can rent one explicitly.
Dedicated capacity: you need a GPU for a sustained workload (fine-tuning, batch inference, custom serving) and want to control the instance directly.

Availability tiers

Every model in the Smart Inference catalogue has an availability status. You can check it via `GET /v1/models`.

Tier	Meaning
`instant`	Served by hosted API providers or an active spot GPU. No wait time.
`deploying`	A spot GPU is being provisioned. `estimated_deploy_seconds` gives the ETA.
`cold`	Can be deployed on a spot GPU. The first request triggers deployment (3-10 min) and places a credit hold.

Models with `instant` availability don’t need GPU rental because the router handles them. Models marked `cold` or `deploying` involve spot GPUs.
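
As a rough sketch of how a client might act on these tiers, the helper below groups model IDs by tier and flags the ones that may involve a wait. The payload shape is an assumption: this page does not document the `GET /v1/models` response, so the `data` wrapper and the `availability` key are hypothetical stand-ins. Only the three tier names and `estimated_deploy_seconds` come from the table above.

```python
from collections import defaultdict

# Tiers listed in the table above; anything else lands under "unknown".
KNOWN_TIERS = ("instant", "deploying", "cold")


def group_by_tier(models_response):
    """Group model IDs by availability tier.

    Assumes a payload shaped like {"data": [{"id": ..., "availability": ...}]}.
    Both the "data" wrapper and the "availability" key are hypothetical.
    """
    groups = defaultdict(list)
    for model in models_response.get("data", []):
        tier = model.get("availability")
        if tier not in KNOWN_TIERS:
            tier = "unknown"
        groups[tier].append(model.get("id"))
    return dict(groups)


def may_need_wait(tier):
    """`cold` and `deploying` models involve a spot GPU that may not be ready yet."""
    return tier in ("cold", "deploying")


# Illustrative payload only, not real API output.
sample = {
    "data": [
        {"id": "example/hosted-model", "availability": "instant"},
        {"id": "example/warming-model", "availability": "deploying",
         "estimated_deploy_seconds": 240},
        {"id": "example/cold-model", "availability": "cold"},
    ]
}

for tier, ids in group_by_tier(sample).items():
    print(tier, ids, "may need to wait" if may_need_wait(tier) else "no wait")
```

Checking the tier before sending a request lets a client decide up front whether to accept a cold start or pick a model that is already `instant`.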

Cold-start flow

When you send a request to a cold model:

A credit hold is placed for the minimum session cost (typically $0.20-$0.50)
You receive a `202` response with a `Retry-After` header
A spot GPU is provisioned and the model is loaded in the background (3-10 minutes)
Once the GPU is ready, subsequent requests are served immediately
After 15 minutes of inactivity, the GPU is automatically terminated

The `202` response body looks like this:

{
  "status": "deploying",
  "message": "Model 'meta-llama/Llama-3.3-70B-Instruct' is being deployed on a spot GPU. Estimated ready in ~5 minutes.",
  "type": "model_deploying",
  "estimated_ready_seconds": 300
}

Note: The OpenAI SDK’s default retry logic does not handle `202` responses. If you depend on cold-starting models, wrap your calls in a retry loop that honors the `Retry-After` header.
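
One way to write such a wrapper is sketched below. It is transport-agnostic: `send` is a hypothetical callable you supply that performs the request and returns `(status_code, headers, body)`. The 30-second default delay, the 120-second cap per wait, and the 900-second overall limit are assumptions chosen to fit the 3-10 minute deployment window, not documented values. The delay comes from `Retry-After` when it is a number of seconds and otherwise falls back to `estimated_ready_seconds` from the response body.

```python
import time


def wait_seconds(headers, body, default=30.0, cap=120.0):
    """Pick a polling delay: Retry-After first, then the body's ETA, then a default."""
    lowered = {k.lower(): v for k, v in headers.items()}  # header names are case-insensitive
    candidates = [lowered.get("retry-after")]
    if isinstance(body, dict):
        candidates.append(body.get("estimated_ready_seconds"))
    for value in candidates:
        try:
            return max(1.0, min(float(value), cap))
        except (TypeError, ValueError):
            continue  # absent, or Retry-After given as an HTTP date
    return default


def call_with_cold_start(send, max_wait=900.0, sleep=time.sleep):
    """Call `send()` until it returns a non-202 status or `max_wait` seconds of waiting is used up.

    `send` is a hypothetical callable returning (status_code, headers, body).
    """
    waited = 0.0
    while True:
        status, headers, body = send()
        if status != 202:
            return status, body
        delay = wait_seconds(headers, body)
        if waited + delay > max_wait:
            raise TimeoutError(f"model still deploying after {waited:.0f}s of waiting")
        sleep(delay)
        waited += delay
```

With an HTTP client such as `requests`, `send` could be a small function that posts your usual request and returns `(resp.status_code, resp.headers, resp.json())`. Capping each wait keeps the loop polling every couple of minutes even when the estimated time is long, so a model that becomes ready early is picked up sooner.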

Next steps

GPU API reference

Endpoints for browsing, renting, and managing GPUs

GPU billing

Credit holds, per-GPU pricing, and idle timeouts

Quickstart

Get your API key and send your first request