Benchmarks

Real measurements on NVIDIA H200. No inflated claims, just verified data.

All tests conducted February 2026 on io.net H200 infrastructure. Results verified on v1.4.2.

Cold Start Performance

Time from process start to first token generation. Critical for serverless and auto-scaling.
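The cold-start numbers below can be reproduced with a simple wall-clock measurement: time from invoking the loader to the first generated token. This is a minimal sketch; `load_fn` and `generate_fn` are hypothetical stand-ins for your actual loader and generation call, not part of any real API.

```python
import time

def measure_cold_start(load_fn, generate_fn):
    """Wall-clock time from loader invocation to first token.

    load_fn:     hypothetical callable that loads the model (e.g. from a .zse file)
    generate_fn: hypothetical callable that generates tokens from the loaded model
    """
    t0 = time.perf_counter()
    model = load_fn()                  # load weights onto the GPU
    generate_fn(model, max_tokens=1)   # force generation of the first token
    return time.perf_counter() - t0
```

For a true cold start, run this in a fresh process with the OS page cache dropped, matching the test conditions described under Methodology.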

Qwen 2.5 7B Instruct (H200)

MethodLoad TimeVRAMThroughput
.zse v1.4.2 (bnb.matmul_4bit)9.1s5.9 GB58.7 tok/s
5.57 GB
File Size
5.9 GB
VRAM Usage
58.7
tok/s

Qwen 2.5 32B Instruct (H200)

MethodLoad TimeVRAMThroughput
.zse v1.4.2 (bnb.matmul_4bit)24.1s20.9 GB26.9 tok/s
19.23 GB
File Size
20.9 GB
VRAM Usage
26.9
tok/s

✓ Fits 24GB GPUs! Run 32B models on consumer RTX 3090/4090 cards.
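The fit claim above follows from the measured footprint: 20.9 GB of VRAM leaves roughly 3 GB of headroom on a 24 GB card. A quick sanity check, with a hypothetical `overhead_gb` margin for KV cache and activations (not a measured value):

```python
def fits(vram_gb, gpu_gb, overhead_gb=1.0):
    """Check whether a model's measured VRAM footprint fits a GPU,
    reserving a hypothetical safety margin (overhead_gb) for the
    KV cache and activations, which grow with context length."""
    return vram_gb + overhead_gb <= gpu_gb

fits(20.9, 24.0)  # 32B model on an RTX 3090/4090
fits(20.9, 16.0)  # too large for a 16 GB card
```

Note that the real margin depends on context length and batch size; long contexts can exceed a fixed 1 GB allowance.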

Methodology

Hardware

  • NVIDIA H200 141GB (primary benchmark GPU)
  • NVIDIA A100-80GB (secondary)
  • CPU: AMD EPYC / Intel Xeon
  • CUDA 12.1+, PyTorch 2.1+

Test Conditions

  • Cold start: fresh process, no cached weights
  • Warm cache: model weights on disk, GPU idle
  • Memory: PyTorch memory profiler
  • Throughput: average of 5 runs, 256 output tokens
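The throughput condition above (average of 5 runs, 256 output tokens) can be sketched as follows; `generate_fn` is a hypothetical callable standing in for the real generation API:

```python
import time
from statistics import mean

def measure_throughput(generate_fn, runs=5, output_tokens=256):
    """Average decode throughput in tokens/s, matching the test
    conditions above: 5 runs of 256 output tokens each.

    generate_fn: hypothetical callable producing `output_tokens` tokens.
    """
    rates = []
    for _ in range(runs):
        t0 = time.perf_counter()
        generate_fn(max_tokens=output_tokens)
        rates.append(output_tokens / (time.perf_counter() - t0))
    return mean(rates)
```

Averaging per-run rates (rather than dividing total tokens by total time) weights each run equally, which smooths out a single slow outlier run.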

Run Your Own Benchmarks

Reproduce these results on your hardware with our benchmark suite.