Introduction
ZSE (Z Server Engine) is an ultra memory-efficient LLM inference engine designed for fast cold starts and low memory usage.
What is ZSE?
ZSE is an inference engine that loads large language models in seconds, not minutes. It achieves this through pre-quantized model formats that skip runtime quantization entirely.
Why ZSE?
The .zse format eliminates runtime quantization overhead, cutting cold starts from minutes to seconds. Whether you're building serverless AI endpoints, developing locally, or deploying to production, ZSE helps you iterate faster and reduce costs.
Key Features
- **zQuantize**: Pre-quantize models to INT4/NF4 format for instant loading
- **zServe**: OpenAI-compatible API server with streaming support
- **zInfer**: CLI tool for quick model testing and inference
- **zStream**: Layer streaming for running large models on limited VRAM
- **zKV**: Quantized KV cache for 4× memory savings
- **OpenAI API**: Drop-in replacement for OpenAI's chat completions API
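To see where the 4× figure for a quantized KV cache comes from, here is a back-of-envelope sizing sketch. The model shape below is an assumed typical 7B-class configuration, not ZSE's actual internals, and it ignores the small overhead of quantization scales:

```python
# Back-of-envelope KV cache sizing, illustrating the ~4x savings from
# quantizing the cache from FP16 to INT4. Model shape is an assumption
# for a typical 7B-class model (32 layers, 32 heads, head_dim 128).
layers, heads, head_dim, seq_len = 32, 32, 128, 4096

# K and V are each stored per layer, per token: 2 * hidden_size values.
values_per_token = 2 * layers * heads * head_dim

fp16_bytes = values_per_token * seq_len * 2   # 2 bytes per value
int4_bytes = values_per_token * seq_len // 2  # 0.5 bytes per value

print(fp16_bytes / 2**30, "GiB fp16")        # 2.0 GiB
print(int4_bytes / 2**30, "GiB int4")        # 0.5 GiB
print(fp16_bytes / int4_bytes, "x savings")  # 4.0
```

In practice quantized caches also store per-block scales and zero-points, so real savings land slightly under 4×.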
- 9.1s cold start for 7B models with embedded config/tokenizer
- 24.1s cold start for 32B models
- 58.7 tok/s (7B) and 26.9 tok/s (32B) with bnb.matmul_4bit
- 32B fits on 24GB consumer GPUs (RTX 3090/4090)
- Single .zse file format - no network calls needed
- Streaming token generation
Benchmarks
Performance benchmarks on H200 with Qwen 2.5 Instruct models (v1.2.0):
| Model | File Size | VRAM | Speed | Load Time |
|---|---|---|---|---|
| 7B | 5.57 GB | 5.9 GB | 58.7 tok/s | 9.1s |
| 32B | 19.23 GB | 20.9 GB | 26.9 tok/s | 24.1s |
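The 7B file size is consistent with 4-bit weights. A rough sanity check, assuming "GB" above means decimal gigabytes and a nominal 7e9 parameters (the file also embeds config/tokenizer, so this slightly overstates the weight encoding itself):

```python
# Effective bits per parameter for the 7B .zse file in the table above.
# Assumptions: decimal gigabytes, nominal 7e9 parameters.
file_bytes = 5.57e9
params = 7e9
bits_per_param = file_bytes * 8 / params
print(round(bits_per_param, 2))  # ~6.37: consistent with INT4/NF4 weights
                                 # plus quantization scales and metadata
```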
Quick Install
Install ZSE from PyPI:

```shell
pip install zllm-zse
```

For GGUF model support, install with the optional dependency:

```shell
pip install zllm-zse[gguf]
```

Start a server with a pre-trained model:

```shell
# Start the server
zse serve Qwen/Qwen2.5-7B-Instruct

# Or with custom settings
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
```
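Since zServe exposes an OpenAI-compatible API, a standard chat completions request should work against it. A minimal sketch, assuming the server from the commands above is running on port 8080 and follows OpenAI's `/v1/chat/completions` route:

```python
import json

# Build an OpenAI-style chat completions request for the local zServe
# endpoint. The URL/port match the `--port 8080` example above; the
# exact route is assumed to mirror OpenAI's API.
url = "http://localhost:8080/v1/chat/completions"

payload = {
    "model": "Qwen/Qwen2.5-7B-Instruct",
    "messages": [{"role": "user", "content": "Say hello in one sentence."}],
    "stream": True,  # zServe supports streaming token generation
}
body = json.dumps(payload)
print(body)

# To actually send it (requires the running server):
# import urllib.request
# req = urllib.request.Request(
#     url, data=body.encode(), headers={"Content-Type": "application/json"})
# with urllib.request.urlopen(req) as resp:
#     for line in resp:
#         print(line.decode(), end="")
```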