GPU Spot Marketplace
Rent spot GPUs directly through the Smart Inference API. Browse live offers from multiple providers, spin up a GPU in one API call, and tear it down when you’re done. Same API key, same billing — no separate accounts needed.
- One API call to browse offers and rent a GPU
- Spot pricing — typically 60-80% cheaper than on-demand
- Multi-provider — aggregated offers across GPU clouds
- Same billing — deducted from your Smart Inference credit balance
When to use GPU rental
Most Smart Inference requests are routed to hosted API providers automatically — you never touch a GPU. GPU rental is for two cases:
- Cold models — you want to run a HuggingFace model that isn’t in the hosted catalogue. The first request triggers a spot GPU deployment, or you can rent one explicitly.
- Dedicated capacity — you need a GPU for a sustained workload (fine-tuning, batch inference, custom serving) and want to control the instance directly.
Availability tiers
Every model in the Smart Inference catalogue has an availability status. You can check it via GET /v1/models.
| Tier | Meaning |
|---|---|
| instant | Served by hosted API providers or an active spot GPU. No wait time. |
| deploying | A spot GPU is being provisioned. estimated_deploy_seconds gives the ETA. |
| cold | Can be deployed on a spot GPU. The first request triggers deployment (3-10 min) and places a credit hold. |
Models with instant availability don’t need GPU rental — the router handles them. Models marked cold or deploying involve spot GPUs.
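As a rough sketch, the tiers can be used to decide up front whether a request will involve a spot GPU. The response shape below (a data list whose entries carry id, availability, and estimated_deploy_seconds) is an assumption modelled on OpenAI-style model listings, and the model IDs are made up for illustration. Check the field names against what GET /v1/models actually returns.

```python
from collections import defaultdict

# Tiers that involve a spot GPU, per the table above.
SPOT_TIERS = {"cold", "deploying"}


def group_by_availability(models_response):
    """Group model IDs by availability tier.

    Assumes a payload like {"data": [{"id": ..., "availability": ...}]};
    these field names are assumptions, not confirmed by the docs.
    """
    groups = defaultdict(list)
    for model in models_response.get("data", []):
        tier = model.get("availability", "unknown")
        groups[tier].append(model["id"])
    return dict(groups)


def involves_spot_gpu(model):
    """Return True if a request to this model would touch a spot GPU."""
    return model.get("availability") in SPOT_TIERS


# Hypothetical payload for illustration only.
sample = {
    "data": [
        {"id": "example/hosted-model", "availability": "instant"},
        {"id": "example/warming-model", "availability": "deploying",
         "estimated_deploy_seconds": 120},
        {"id": "example/cold-model", "availability": "cold"},
    ]
}

groups = group_by_availability(sample)
spot_ids = [m["id"] for m in sample["data"] if involves_spot_gpu(m)]
```

Filtering on the tier before sending traffic lets a client warn users about the cold-start wait and credit hold instead of discovering them on the first request.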
Cold-start flow
When you send a request to a cold model:
- A credit hold is placed for the minimum session cost (typically $0.20-$0.50)
- A spot GPU is provisioned and the model is loaded (3-10 minutes)
- You receive a 202 response with a Retry-After header
- Once the GPU is ready, subsequent requests are served immediately
- After 15 minutes of inactivity, the GPU is automatically terminated
{
"status": "deploying",
"message": "Model 'meta-llama/Llama-3.3-70B-Instruct' is being deployed on a spot GPU. Estimated ready in ~5 minutes.",
"type": "model_deploying",
"estimated_ready_seconds": 300
}
Note
The OpenAI SDK’s default retry logic does not handle 202 responses. If you depend on cold-starting models, add a wrapper with a retry loop that checks the Retry-After header.