Benchmarks
Real measurements on NVIDIA H200. No inflated claims, just verified data.
All tests conducted February 2026 on io.net H200 infrastructure. Results verified on v1.4.2.
Cold Start Performance
Time from process start to first token generation. Critical for serverless and auto-scaling.
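Time-to-first-token can be measured with a timer captured at process entry and stopped when the first token arrives. A minimal sketch of that measurement; the streaming generator here is a hypothetical stand-in, not part of the .zse API:

```python
import time

# Captured as early as possible; in a real service this would be the very
# first statement the process executes.
_PROCESS_START = time.perf_counter()

def time_to_first_token(token_stream):
    """Seconds from process start until the first generated token arrives."""
    for _ in token_stream:            # consume only the first token
        return time.perf_counter() - _PROCESS_START
    raise RuntimeError("stream produced no tokens")

# Hypothetical stand-in for a real model's streaming generator:
def fake_stream():
    time.sleep(0.05)                  # pretend to load weights and prefill
    yield "Hello"

print(f"cold start: {time_to_first_token(fake_stream()):.2f}s")
```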
Qwen 2.5 7B Instruct (H200)
| Method | Load Time | File Size | VRAM | Throughput |
|---|---|---|---|---|
| .zse v1.4.2 (bnb.matmul_4bit) | 9.1s | 5.57 GB | 5.9 GB | 58.7 tok/s |
Qwen 2.5 32B Instruct (H200)
| Method | Load Time | File Size | VRAM | Throughput |
|---|---|---|---|---|
| .zse v1.4.2 (bnb.matmul_4bit) | 24.1s | 19.23 GB | 20.9 GB | 26.9 tok/s |
✓ Fits in 24 GB of VRAM: at 20.9 GB, the 32B model runs on consumer RTX 3090/4090 cards.
Methodology
Hardware
- NVIDIA H200 141GB (primary benchmark GPU)
- NVIDIA A100 80GB (secondary)
- CPU: AMD EPYC / Intel Xeon
- CUDA 12.1+, PyTorch 2.1+
Test Conditions
- Cold start: fresh process, no cached weights
- Warm cache: model weights on disk, GPU free
- Memory: measured with the PyTorch memory profiler
- Throughput: average of 5 runs, 256 output tokens each
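The throughput figure follows directly from the conditions above: tokens generated divided by wall-clock time, averaged over 5 runs. A minimal harness sketch; `generate_fn` is a hypothetical stand-in for a real model call:

```python
import statistics
import time

def measure_throughput(generate_fn, n_runs=5, n_tokens=256):
    """Average tokens/sec over n_runs, matching the methodology above."""
    rates = []
    for _ in range(n_runs):
        start = time.perf_counter()
        produced = generate_fn(n_tokens)   # returns number of tokens generated
        rates.append(produced / (time.perf_counter() - start))
    return statistics.mean(rates)

# Hypothetical stand-in; a real run would invoke the loaded model:
def fake_generate(n_tokens):
    time.sleep(0.01)
    return n_tokens

print(f"{measure_throughput(fake_generate):.1f} tok/s")
```

On CUDA hardware, peak VRAM for the same run can be read with `torch.cuda.max_memory_allocated()` after resetting stats with `torch.cuda.reset_peak_memory_stats()`.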
Run Your Own Benchmarks
Reproduce these results on your hardware with our benchmark suite.