Ultra Memory-Efficient
LLM Inference

Run 32B models on 24GB GPUs. Multi-GPU TP & PP. Built-in RAG with .zpf format.

$ pip install zllm-zse
7B Cold Start: 9.1s
7B Throughput: 58.7 tok/s
32B VRAM: 20.9 GB
32B Throughput: 26.9 tok/s

Get Running in 4 Steps

From zero to serving models in under a minute

01

Install ZSE

One pip command to get started. No complex dependencies or configurations.

02

Pull a Pre-converted Model

Download ready-to-use .zse models from the hub. 13 models available.

03

Serve Your Model

Start the OpenAI-compatible API server with instant cold starts.

04

Query the API

Use the OpenAI-compatible API with any client library or framework.

terminal
$ pip install zllm-zse

Why ZSE?

ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.

bitsandbytes (Standard)

Every time you load a model:

  1. Download FP16 weights (14GB for 7B model)
  2. Quantize to INT4 (takes 40+ seconds)
  3. Finally ready to use
Qwen 7B Load Time: 45.4s

.zse Format (Pre-quantized)

With ZSE, you quantize once, load instantly:

  1. One-time: zse quantize → .zse file
  2. Every load: Read pre-quantized weights (instant)
  3. Ready in seconds, not minutes
Qwen 7B Load Time: 3.9s

ZSE vs llama.cpp (GGUF)

On Qwen 72B, ZSE loads in 6.5s versus 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python ecosystem.

When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.

Verified Benchmarks

.zse v1.4.2 performance with bnb.matmul_4bit. Tested on NVIDIA H200.

Model      File Size  Load Time  VRAM     Throughput
Qwen 7B    5.57 GB    9.1s       5.9 GB   58.7 tok/s
Qwen 32B   19.23 GB   24.1s      20.9 GB  26.9 tok/s
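As a quick sanity check on these numbers, end-to-end request latency is roughly cold-start load time plus generated tokens divided by throughput. A minimal sketch (my own arithmetic using the table's figures, not a ZSE measurement):

```python
# Estimate end-to-end request time from the benchmark table:
# cold-start load plus generation time at the measured throughput.
def request_seconds(load_s: float, tok_per_s: float, n_tokens: int,
                    cold: bool = True) -> float:
    generation = n_tokens / tok_per_s
    return (load_s if cold else 0.0) + generation

# Qwen 7B, 256-token reply, cold start: 9.1 + 256/58.7 ≈ 13.5 s
print(round(request_seconds(9.1, 58.7, 256), 1))
# Warm server (model already loaded): ≈ 4.4 s
print(round(request_seconds(9.1, 58.7, 256, cold=False), 1))
```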

32B Model Performance (v1.4.2)

Fits on 24GB consumer GPUs (RTX 3090/4090)

File Size: 19.23 GB
VRAM Usage: 20.9 GB
Throughput: 26.9 tok/s

Built for Efficiency

Every feature designed for memory efficiency and fast cold starts

58.7 tok/s

Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.

32B in 21GB VRAM

Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.

Multi-GPU TP & PP

Tensor Parallelism, Pipeline Parallelism, or both. Split models across GPUs automatically.

Built-in RAG (.zpf)

Semantic chunking, hybrid retrieval, 25% fewer tokens with .zpf compressed format.

9s Cold Start

Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.

OpenAI Compatible

Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.

Perfect For

From local development to production deployments

Serverless Inference

Sub-5s cold starts make ZSE perfect for serverless deployments where every millisecond of startup time costs money.

Local AI Development

Run large models on your laptop. Test and iterate without cloud costs or API rate limits.

Edge Deployment

Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.

Cost Optimization

Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.

Model Hub

13 pre-converted INT4 .zse models ready for instant download from huggingface.co/zse-zllm

Model                  Category   Size      Pull Command
Qwen2.5-0.5B-Instruct  Chat       0.69 GB   zse pull qwen-0.5b
TinyLlama-1.1B-Chat    Chat       0.71 GB   zse pull tinyllama-1.1b
Qwen2.5-1.5B-Instruct  Chat       1.51 GB   zse pull qwen-1.5b
Qwen2.5-Coder-1.5B     Code       1.51 GB   zse pull qwen-coder-1.5b
Qwen2.5-3B-Instruct    Chat       2.51 GB   zse pull qwen-3b
DeepSeek-Coder-6.7B    Code       3.61 GB   zse pull deepseek-6.7b
Mistral-7B-Instruct    Chat       3.86 GB   zse pull mistral-7b
Qwen2.5-7B-Instruct    Chat/Code  5.18 GB   zse pull qwen-7b
Qwen2.5-Coder-7B       Code       5.18 GB   zse pull qwen-coder-7b
Qwen2.5-14B-Instruct   Chat       9.26 GB   zse pull qwen-14b
Qwen2.5-32B-Instruct   Chat/Code  17.9 GB   zse pull qwen-32b
Mixtral-8x7B-Instruct  Chat       85.14 GB  zse pull mixtral-8x7b
Qwen2.5-72B-Instruct   Chat/Code  38.38 GB  zse pull qwen-72b

Don't see your model? Use zse quantize <HuggingFace-ID> to convert any compatible model.

Built-in RAG with .zpf Format

Semantic chunking, hybrid retrieval, and compressed .zpf format — 25% fewer tokens with zero quality loss.

25% Fewer Tokens: compressed .zpf format reduces token usage
10/10 Cost Benchmark: perfect score on cost-aware eval
12.5/15 Retrieval v3 Eval: high-quality relevant context
Hybrid Retrieval Mode: semantic + keyword search combined
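ZSE's retrieval internals aren't shown here, but the general idea behind hybrid retrieval can be sketched: blend a semantic similarity score with a keyword-overlap score and rank documents by the weighted sum. A toy illustration (bag-of-words cosine stands in for real embedding similarity, and `alpha` is an assumed weighting, not a ZSE parameter):

```python
# Toy hybrid retrieval: blend "semantic" similarity with keyword overlap.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine; a real system would use embedding vectors here.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.7) -> float:
    return alpha * cosine(query, doc) + (1 - alpha) * keyword_score(query, doc)

docs = ["how to deploy the server", "recipe for pancakes"]
best = max(docs, key=lambda d: hybrid_score("deployment guide for the server", d))
print(best)  # the deployment document ranks first
```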
RAG commands
# Add documents to a knowledge base
$ zse rag add ./docs/ --collection my-kb

# Search with hybrid retrieval
$ zse rag search "deployment guide" --collection my-kb

# Export as compressed .zpf (25% smaller)
$ zse rag export --collection my-kb --format zpf

Multi-GPU: Tensor & Pipeline Parallelism

Split models across GPUs automatically. Use TP, PP, or both for maximum throughput and memory efficiency.

Tensor Parallelism (TP)

Split each layer across GPUs for lower latency. Best for 2-4 GPUs with fast interconnect.

zse serve ./model.zse -tp 2

Pipeline Parallelism (PP)

Stack layers across GPUs for better memory scaling. Works on any interconnect.

zse serve ./model.zse -pp 4

Hybrid TP+PP

Combine both strategies. TP within nodes, PP across nodes for large-scale deployment.

zse serve ./model.zse -tp 2 -pp 2
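Roughly speaking, TP splits every layer across GPUs and PP stacks layer groups across GPUs, so each of the tp × pp GPUs holds about 1/(tp × pp) of the weights. A back-of-the-envelope sketch (my own arithmetic, not ZSE output; real per-GPU usage adds activations, KV cache, and framework overhead):

```python
# Approximate per-GPU weight memory under tensor and pipeline parallelism.
def per_gpu_weights_gb(model_gb: float, tp: int = 1, pp: int = 1) -> float:
    # TP shards each layer across tp GPUs; PP assigns layer groups to pp
    # GPUs; combined, each GPU holds roughly 1/(tp*pp) of the weights.
    return model_gb / (tp * pp)

# The 19.23 GB 32B .zse file from the benchmarks above:
print(per_gpu_weights_gb(19.23, tp=2))        # ≈ 9.6 GB per GPU
print(per_gpu_weights_gb(19.23, tp=2, pp=2))  # ≈ 4.8 GB per GPU
```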

Simple, Powerful API

Start serving models with just a few lines

terminal
# Install ZSE
$ pip install zllm-zse

# Pull a pre-converted model (instant download)
$ zse pull qwen-7b

# Serve with Multi-GPU Tensor Parallelism
$ zse serve ./model.zse --port 8000 -tp 2

# Add docs for RAG
$ zse rag add ./docs/ --collection my-kb

# Query the OpenAI-compatible API
$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'
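The same request can be made from Python. This sketch uses only the standard library and mirrors the JSON body of the curl example above (the host and the `default` model name come from that example; the official OpenAI SDK works the same way if you point `base_url` at http://localhost:8000/v1):

```python
import json
from urllib import request

def build_payload(prompt: str) -> dict:
    # Mirrors the JSON body in the curl example above.
    return {"model": "default",
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, host: str = "http://localhost:8000") -> str:
    # POST to the OpenAI-compatible chat completions endpoint.
    req = request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Hello!")` against a running `zse serve` instance to get the assistant's reply as a string.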

Ready to Try ZSE?

Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.

Apache 2.0 Licensed
Open Source
PyPI Published