AIVory · GPU Marketplace
Live spot pricing
NVIDIA B200: Blackwell firepower for frontier-scale AI.
The successor to the H100 throne. 180 GB of HBM3e, 8.0 TB/s of memory bandwidth, and a second-generation Transformer Engine let a single B200 hold a 200B+ parameter model once its weights are quantized to 4 bits; at FP16, the weights of models up to roughly 90B parameters fit. Spot rates start at $6.48/hr, a fraction of on-demand pricing for one of the most powerful GPUs on the market.
At a glance
B200 specifications.
Key hardware specs that determine what workloads this GPU handles.
peak throughput
thermal design power
NVIDIA GPU architecture
Spot pricing
B200: live hourly rates.
Every provider offering this GPU on the spot market, sorted cheapest first.
Prices in USD per GPU-hour · spot instances · sorted cheapest first
Recommended models
AI models that run well on B200.
Tested model-GPU pairings with notes on why each is a good fit.
Use cases
What the B200 is built for.
-
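Running 200B+ models on a single GPU
Before the B200, deploying Qwen 3 235B or similar frontier models required multi-GPU clusters with NVLink. At FP16 the weights of a 235B-parameter model take about 470 GB, far more than any single card holds, but quantized to 4 bits they shrink to roughly 118 GB. That fits in the B200's 180 GB of HBM3e with room left for the KV cache, cutting infrastructure complexity and inter-GPU communication overhead, while spot pricing at $6.48/hr keeps costs manageable. Qwen 3 235B is a mixture-of-experts model, so only a fraction of its parameters are active per token, but all of them still have to reside in memory.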
-
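Fine-tuning 70B models with BF16 weights
Llama 3.3 70B in BF16 occupies about 140 GB for the weights alone, which already demands at least 2x A100 80GB. A single B200 holds those weights with roughly 40 GB to spare, enough for parameter-efficient methods such as LoRA, where only small adapter matrices carry gradients and Adam optimizer states. That removes the coordination overhead of distributed training for these jobs. Full fine-tuning of all 70B parameters with Adam is a different matter: at roughly 16 bytes per parameter for weights, gradients, and optimizer states in mixed precision, it needs over 1 TB and still calls for multiple GPUs.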
-
Long-context inference at 128K+ tokens
Long-context models generate enormous KV caches, which grow linearly with context length and with the number of concurrent requests. The B200's 180 GB of VRAM and 8 TB/s of bandwidth sustain high-throughput serving at 128K+ context lengths, where GPUs with less memory run out of room for the cache.
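The three use cases above all come down to a memory budget, which can be sketched with simple arithmetic. The following is a rough estimate rather than a measurement: the bytes-per-parameter figures, the mixed-precision Adam layout, and the default model shape in `kv_cache_gb` (80 layers, 8 KV heads, head dimension 128) are illustrative assumptions for a 70B-class model, and activations and runtime overhead are left out.

```python
# Back-of-the-envelope GPU memory budget. 1 GB = 1e9 bytes throughout.
# Activations, fragmentation, and runtime overhead are ignored.

VRAM_GB = 180.0  # B200 capacity quoted on this page

# Bytes per parameter for weights stored at a given precision.
WEIGHT_BYTES = {"fp16": 2.0, "fp8": 1.0, "fp4": 0.5}

# Assumption: mixed-precision Adam keeps BF16 weights and gradients,
# FP32 master weights, and two FP32 moment tensors per trainable parameter.
ADAM_BYTES_PER_PARAM = 2 + 2 + 4 + 4 + 4


def weights_gb(params_billions: float, precision: str) -> float:
    """Memory for the weights alone, in GB."""
    return params_billions * WEIGHT_BYTES[precision]


def full_finetune_gb(params_billions: float) -> float:
    """Weights, gradients, and Adam state when every parameter is trained."""
    return params_billions * ADAM_BYTES_PER_PARAM


def kv_cache_gb(tokens: int, layers: int = 80, kv_heads: int = 8,
                head_dim: int = 128, bytes_per_value: int = 2) -> float:
    """KV cache for `tokens` cached tokens: one K and one V vector per layer.

    Defaults are hypothetical values for a 70B-class model with
    grouped-query attention and a 16-bit cache.
    """
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return tokens * per_token_bytes / 1e9


if __name__ == "__main__":
    for precision in WEIGHT_BYTES:
        gb = weights_gb(235, precision)
        print(f"235B weights @ {precision}: {gb:.1f} GB (fits: {gb <= VRAM_GB})")
    print(f"70B full fine-tune with Adam: {full_finetune_gb(70):.0f} GB")
    for tokens in (131_072, 1_000_000):
        print(f"KV cache @ {tokens} tokens: {kv_cache_gb(tokens):.1f} GB")
```

Under these assumptions, 235B parameters take 117.5 GB at 4 bits but 470 GB at FP16, a full Adam fine-tune of 70B parameters needs about 1,120 GB, and a single 128K-token sequence adds roughly 43 GB of KV cache on top of the weights.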
FAQ
Common questions.
How does the B200 compare to the H100 for LLM inference?
The B200 delivers roughly 2.5x the inference throughput of an H100 on transformer workloads, thanks to the second-generation Transformer Engine and FP4 support; the exact gain depends on the model, precision, and batch size. Memory capacity jumps from 80 GB to 180 GB, and bandwidth from 3.35 TB/s to 8.0 TB/s. For models that fit on a single H100, the B200 serves more concurrent requests; for models that required multi-H100 setups, the B200 may run them on one card.
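Part of that gap can be reasoned about from bandwidth alone. A common rule of thumb, used here as an assumption rather than a benchmark, is that single-stream token generation is limited by how fast the weights can be read from memory, so the upper bound on tokens per second is bandwidth divided by the size of the weights. The 60 GB model size below is a hypothetical example.

```python
def decode_tokens_per_s_upper_bound(bandwidth_tb_s: float, weights_gb: float) -> float:
    """Upper bound on single-stream decode speed for a bandwidth-bound model.

    Assumes every generated token requires reading all weights once and
    ignores KV cache reads, kernel overhead, and batching.
    """
    return bandwidth_tb_s * 1000 / weights_gb


if __name__ == "__main__":
    weights_gb = 60.0  # hypothetical model that fits on both GPUs
    h100 = decode_tokens_per_s_upper_bound(3.35, weights_gb)
    b200 = decode_tokens_per_s_upper_bound(8.0, weights_gb)
    print(f"H100: {h100:.0f} tok/s, B200: {b200:.0f} tok/s, ratio: {b200 / h100:.2f}x")
```

The ratio is simply 8.0 / 3.35, about 2.39x, regardless of model size; any gain beyond that has to come from compute features such as lower-precision formats or from larger batches.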
Is the B200 worth the price premium over the H200?
At $6.48/hr vs $4.71/hr for the H200, the B200 costs ~38% more per hour. In return it delivers roughly 2x the compute throughput and 28% more VRAM (180 vs 141 GB). If your workload is compute-bound (high batch sizes, production throughput targets), the B200's cost per token is lower. If your model already fits in the H200's 141 GB and you do not need the extra throughput, the H200 is the better value.
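The trade-off can be checked with a quick cost-per-token calculation. This is a sketch using the hourly rates quoted above; the throughput figures (1,000 and 2,000 tokens per second) are hypothetical placeholders that only encode the "roughly 2x" ratio and should be replaced with your own measurements.

```python
def cost_per_million_tokens(price_per_hour: float, tokens_per_s: float) -> float:
    """Dollars spent to generate one million tokens at a steady throughput."""
    return price_per_hour / (tokens_per_s * 3600) * 1_000_000


def break_even_speedup(price_fast: float, price_slow: float) -> float:
    """Speedup the pricier GPU needs before its cost per token is lower."""
    return price_fast / price_slow


if __name__ == "__main__":
    h200 = cost_per_million_tokens(4.71, tokens_per_s=1000)  # hypothetical throughput
    b200 = cost_per_million_tokens(6.48, tokens_per_s=2000)  # assumes ~2x the H200
    print(f"H200: ${h200:.2f} per 1M tokens")
    print(f"B200: ${b200:.2f} per 1M tokens")
    print(f"Break-even speedup: {break_even_speedup(6.48, 4.71):.2f}x")
```

At these rates the B200 only needs to be about 1.38x faster than the H200 on your workload to come out cheaper per token; below that, the H200 wins.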
What cooling and power infrastructure does the B200 require?
The B200 has a TDP of up to 1000 W, and many data center deployments use liquid cooling to handle it. Cloud providers take care of this: when you rent a B200 spot instance, the provider manages power delivery and cooling. You don't need to worry about infrastructure; you pay the hourly rate and get a ready-to-use GPU.
Can I use the B200 for training, not just inference?
Absolutely. The B200's FP8 and FP4 tensor cores are designed for both training and inference. A single B200 can run parameter-efficient fine-tuning of 70B models that previously needed multiple GPUs just to hold the weights. For pre-training at scale, B200 clusters with fifth-generation NVLink can scale close to linearly on well-parallelized workloads. Spot pricing makes short training runs economical: rent 8 B200s for a few hours instead of committing to reserved instances.
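To put a number on a short run, the cost is just GPUs times time times the hourly rate. The sketch below assumes the $6.48/hr starting rate holds for the whole run and that billing is per second; live spot rates move, so treat the result as an estimate.

```python
def spot_run_cost(gpus: int, seconds: int, price_per_gpu_hour: float = 6.48) -> float:
    """Estimated cost of a run billed per second at a fixed hourly rate."""
    return gpus * seconds * price_per_gpu_hour / 3600


if __name__ == "__main__":
    three_hours = 3 * 3600
    print(f"8 GPUs for 3 hours: ${spot_run_cost(8, three_hours):.2f}")
    print(f"8 GPUs for 2h 20m:  ${spot_run_cost(8, 2 * 3600 + 20 * 60):.2f}")
```

With per-second billing, a run that finishes in 2 hours 20 minutes is charged for exactly that time rather than being rounded up to three full hours.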
Rent a B200. Right now.
Spot pricing, per-second billing, no commitment.
Browse the live marketplace, pick your GPU, deploy in one click. Credits from $10.