Ultra Memory-Efficient
LLM Inference
Run 32B models on 24GB GPUs. Multi-GPU TP & PP. Built-in RAG with .zpf format.
9.1s 7B cold start
58.7 tok/s (7B)
20.9 GB 32B VRAM
26.9 tok/s (32B)
Get Running in 3 Steps
From zero to serving models in under a minute
Install ZSE
One pip command to get started. No complex dependencies or configurations.
Pull a Pre-converted Model
Download ready-to-use .zse models from the hub. 13 models available.
Serve Your Model
Start the OpenAI-compatible API server with instant cold starts.
Query the API
Use the OpenAI-compatible API with any client library or framework.
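As a quick sketch of that last step from Python: the payload below follows the standard OpenAI chat-completions shape, and the URL assumes the default port used in the serve examples on this page (illustrative, not a documented ZSE client).

```python
import json
from urllib import request

# Assumed local endpoint from the serving step; adjust host/port to your setup.
URL = "http://localhost:8000/v1/chat/completions"

def build_chat_request(prompt, model="default"):
    """Build an OpenAI-style chat completion payload."""
    return {"model": model, "messages": [{"role": "user", "content": prompt}]}

payload = build_chat_request("Hello!")
# To actually send it (requires a running `zse serve` instance):
#   req = request.Request(URL, data=json.dumps(payload).encode(),
#                         headers={"Content-Type": "application/json"})
#   print(request.urlopen(req).read().decode())
print(json.dumps(payload))
```

Because the endpoint is OpenAI-compatible, the official OpenAI SDK or LangChain can point at the same URL by overriding their base URL.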
Why ZSE?
ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.
bitsandbytes (Standard)
Every time you load a model:
- Download FP16 weights (14GB for 7B model)
- Quantize to INT4 (takes 40+ seconds)
- Finally ready to use
.zse Format (Pre-quantized)
With ZSE, you quantize once, load instantly:
- One-time: zse quantize → .zse file
- Every load: read pre-quantized weights (instant)
- Ready in seconds, not minutes
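The trade-off above is simple arithmetic. A minimal sketch, using the ~40 s quantize-on-load figure quoted here and the 9.1 s 7B .zse load time from this page's benchmark table:

```python
# Illustrative cold-start arithmetic (figures from this page:
# ~40 s quantize-on-load for bitsandbytes, 9.1 s .zse load for Qwen 7B).
BNB_LOAD_S = 40.0
ZSE_LOAD_S = 9.1

def total_startup_s(n_loads, per_load_s):
    """Total startup time spent across n cold starts."""
    return n_loads * per_load_s

saved = total_startup_s(100, BNB_LOAD_S) - total_startup_s(100, ZSE_LOAD_S)
print(f"{saved:.0f} s saved over 100 cold starts")
```

The one-time quantization cost is paid once, so the savings grow linearly with every subsequent cold start.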
ZSE vs llama.cpp (GGUF)
On Qwen 72B, ZSE loads in 6.5s vs 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python ecosystem.
When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.
Verified Benchmarks
.zse v1.4.2 performance with bnb.matmul_4bit. Tested on NVIDIA H200.
| Model | File Size | Load Time | VRAM | Throughput |
|---|---|---|---|---|
| Qwen 7B | 5.57 GB | 9.1s | 5.9 GB | 58.7 tok/s |
| Qwen 32B | 19.23 GB | 24.1s | 20.9 GB | 26.9 tok/s |
32B Model Performance (v1.4.2)
Fits on 24GB consumer GPUs (RTX 3090/4090)
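A quick sanity check of that claim, using the measured VRAM figure from the table above (a rough sketch that ignores allocator fragmentation):

```python
# Headroom on a 24 GB consumer card for the 32B .zse model.
GPU_GB = 24.0     # RTX 3090/4090
MODEL_GB = 20.9   # measured 32B VRAM from the benchmark table

headroom_gb = GPU_GB - MODEL_GB
print(f"{headroom_gb:.1f} GB left for KV cache and activations")
```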
Built for Efficiency
Every feature designed for memory efficiency and fast cold starts
58.7 tok/s
Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.
32B in 21GB VRAM
Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.
Multi-GPU TP & PP
Tensor Parallelism, Pipeline Parallelism, or both. Split models across GPUs automatically.
Built-in RAG (.zpf)
Semantic chunking, hybrid retrieval, 25% fewer tokens with .zpf compressed format.
9s Cold Start
Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.
OpenAI Compatible
Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.
Perfect For
From local development to production deployments
Serverless Inference
Cold starts of a few seconds make ZSE a natural fit for serverless deployments, where every second of startup time costs money.
Local AI Development
Run large models on your laptop. Test and iterate without cloud costs or API rate limits.
Edge Deployment
Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.
Cost Optimization
Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.
Model Hub
13 pre-converted INT4 .zse models ready for instant download from huggingface.co/zse-zllm
Don't see your model? Use zse quantize <HuggingFace-ID> to convert any compatible model.
Built-in RAG with .zpf Format
Semantic chunking, hybrid retrieval, and compressed .zpf format — 25% fewer tokens with zero quality loss.
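ZSE's retrieval internals aren't documented on this page, but "hybrid retrieval" conventionally means blending a lexical score (e.g. BM25) with a dense-embedding similarity score. A minimal sketch of that fusion, with hypothetical per-document scores:

```python
def hybrid_score(lexical, dense, alpha=0.5):
    """Blend a lexical (BM25-style) score with a dense-embedding score.
    alpha weights the lexical side; both scores assumed normalized to [0, 1]."""
    return alpha * lexical + (1 - alpha) * dense

# Hypothetical per-document scores: (lexical, dense)
docs = {"deploy.md": (0.9, 0.7), "intro.md": (0.2, 0.8)}
ranked = sorted(docs, key=lambda d: hybrid_score(*docs[d]), reverse=True)
print(ranked)  # ['deploy.md', 'intro.md']
```

Lexical matching catches exact terms like "deployment guide", while the dense score catches paraphrases; the blend covers both.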
# Add documents to a knowledge base
$ zse rag add ./docs/ --collection my-kb
# Search with hybrid retrieval
$ zse rag search "deployment guide" --collection my-kb
# Export as compressed .zpf (25% smaller)
$ zse rag export --collection my-kb --format zpf
Multi-GPU: Tensor & Pipeline Parallelism
Split models across GPUs automatically. Use TP, PP, or both for maximum throughput and memory efficiency.
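As a rough sketch of the memory math behind these options, assuming weights split evenly across the TP × PP grid and ignoring activation and communication overhead (which is optimistic):

```python
MODEL_GB = 20.9  # 32B .zse VRAM from the benchmark table

def per_gpu_gb(total_gb, tp=1, pp=1):
    """Idealized per-GPU weight memory under tensor/pipeline parallelism."""
    return total_gb / (tp * pp)

for tp, pp in [(2, 1), (1, 4), (2, 2)]:
    print(f"-tp {tp} -pp {pp}: ~{per_gpu_gb(MODEL_GB, tp, pp):.1f} GB per GPU")
```

Either strategy shrinks the per-GPU footprint; TP also splits each layer's compute for lower latency, while PP tolerates slow interconnects.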
Tensor Parallelism (TP)
Split each layer across GPUs for lower latency. Best for 2-4 GPUs with fast interconnect.
zse serve ./model.zse -tp 2
Pipeline Parallelism (PP)
Stack layers across GPUs for better memory scaling. Works on any interconnect.
zse serve ./model.zse -pp 4
Hybrid TP+PP
Combine both strategies. TP within nodes, PP across nodes for large-scale deployment.
zse serve ./model.zse -tp 2 -pp 2
Simple, Powerful API
Start serving models with just a few lines
# Install ZSE
$ pip install zllm-zse
# Pull a pre-converted model (instant download)
$ zse pull qwen-7b
# Serve with Multi-GPU Tensor Parallelism
$ zse serve ./model.zse --port 8000 -tp 2
# Add docs for RAG
$ zse rag add ./docs/ --collection my-kb
# Query the OpenAI-compatible API
$ curl localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'
Ready to Try ZSE?
Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.