Ultra Memory-Efficient
LLM Inference

Run 32B models on 24GB GPUs. Multi-GPU TP & PP. Built-in RAG with .zpf format.

$ pip install zllm-zse
7B Cold Start: 9.1s
7B Throughput: 58.7 tok/s
32B VRAM: 20.9 GB
32B Throughput: 26.9 tok/s

Get Running in 4 Steps

From zero to serving models in under a minute

01

Install ZSE

One pip command to get started. No complex dependencies or configurations.

02

Pull a Pre-converted Model

Download ready-to-use .zse models from the hub. 13 models available.

03

Serve Your Model

Start the OpenAI-compatible API server with instant cold starts.

04

Query the API

Use the OpenAI-compatible API with any client library or framework.

terminal
$ pip install zllm-zse

Why ZSE?

ZSE solves a real problem: loading large models with bitsandbytes is slow because it quantizes on every load.

bitsandbytes (Standard)

Every time you load a model:

  1. Download FP16 weights (14GB for 7B model)
  2. Quantize to INT4 (takes 40+ seconds)
  3. Finally ready to use
Qwen 7B Load Time: 45.4s

.zse Format (Pre-quantized)

With ZSE, you quantize once, load instantly:

  1. One-time: zse quantize → .zse file
  2. Every load: Read pre-quantized weights (instant)
  3. Ready in seconds, not minutes
Qwen 7B Load Time: 3.9s

ZSE vs llama.cpp (GGUF)

On Qwen 72B, ZSE loads in 6.5s versus 10.2s for llama.cpp GGUF, about 1.6× faster, while staying in the Python ecosystem.

When to use .zse: Production deployments, serverless functions, CI/CD pipelines, anywhere you need fast cold starts with the Python/HuggingFace ecosystem.

Verified Benchmarks

.zse v1.4.2 performance with bnb.matmul_4bit. Tested on NVIDIA H200.

Model      File Size  Load Time  VRAM     Throughput
Qwen 7B    5.57 GB    9.1s       5.9 GB   58.7 tok/s
Qwen 32B   19.23 GB   24.1s      20.9 GB  26.9 tok/s
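As a quick sanity check on these numbers, end-to-end request latency is roughly cold-start load time plus generated tokens divided by throughput. A minimal sketch (my own arithmetic using the table's figures, not a ZSE measurement):

```python
# Estimate end-to-end request time from the benchmark table:
# cold-start load plus generation time at the measured throughput.
def request_seconds(load_s: float, tok_per_s: float, n_tokens: int,
                    cold: bool = True) -> float:
    generation = n_tokens / tok_per_s
    return (load_s if cold else 0.0) + generation

# Qwen 7B, 256-token reply, cold start: 9.1 + 256/58.7 ≈ 13.5 s
print(round(request_seconds(9.1, 58.7, 256), 1))
# Warm server (model already loaded): ≈ 4.4 s
print(round(request_seconds(9.1, 58.7, 256, cold=False), 1))
```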

32B Model Performance (v1.4.2)

Fits on 24GB consumer GPUs (RTX 3090/4090)

File Size: 19.23 GB
VRAM Usage: 20.9 GB
Throughput: 26.9 tok/s

Built for Efficiency

Every feature designed for memory efficiency and fast cold starts

58.7 tok/s

Generate at 58.7 tokens/sec on 7B models. Real throughput using bitsandbytes CUDA kernels.

32B in 21GB VRAM

Run 32B models on 24GB consumer GPUs (RTX 3090/4090). True memory efficiency.

Multi-GPU TP & PP

Tensor Parallelism, Pipeline Parallelism, or both. Split models across GPUs automatically.

Built-in RAG (.zpf)

Semantic chunking, hybrid retrieval, 25% fewer tokens with .zpf compressed format.

9s Cold Start

Load 7B models in 9 seconds. Single .zse file with embedded config and tokenizer.

OpenAI Compatible

Drop-in replacement API. Works with LangChain, OpenAI SDK, and your existing code.

Perfect For

From local development to production deployments

Serverless Inference

Sub-5s cold starts make ZSE perfect for serverless deployments where every millisecond of startup time costs money.

Local AI Development

Run large models on your laptop. Test and iterate without cloud costs or API rate limits.

Edge Deployment

Memory-efficient enough for edge devices. Deploy AI at the edge without expensive hardware.

Cost Optimization

Fit larger models on smaller GPUs. Cut your cloud compute bills by up to 70%.

Model Hub

13 pre-converted INT4 .zse models ready for instant download from huggingface.co/zse-zllm

Model                  Category   Size      Pull Command
Qwen2.5-0.5B-Instruct  Chat       0.69 GB   zse pull qwen-0.5b
TinyLlama-1.1B-Chat    Chat       0.71 GB   zse pull tinyllama-1.1b
Qwen2.5-1.5B-Instruct  Chat       1.51 GB   zse pull qwen-1.5b
Qwen2.5-Coder-1.5B     Code       1.51 GB   zse pull qwen-coder-1.5b
Qwen2.5-3B-Instruct    Chat       2.51 GB   zse pull qwen-3b
DeepSeek-Coder-6.7B    Code       3.61 GB   zse pull deepseek-6.7b
Mistral-7B-Instruct    Chat       3.86 GB   zse pull mistral-7b
Qwen2.5-7B-Instruct    Chat/Code  5.18 GB   zse pull qwen-7b
Qwen2.5-Coder-7B       Code       5.18 GB   zse pull qwen-coder-7b
Qwen2.5-14B-Instruct   Chat       9.26 GB   zse pull qwen-14b
Qwen2.5-32B-Instruct   Chat/Code  17.9 GB   zse pull qwen-32b
Mixtral-8x7B-Instruct  Chat       85.14 GB  zse pull mixtral-8x7b
Qwen2.5-72B-Instruct   Chat/Code  38.38 GB  zse pull qwen-72b

Don't see your model? Use zse quantize <HuggingFace-ID> to convert any compatible model.

Built-in RAG with .zpf Format

Semantic chunking, hybrid retrieval, and compressed .zpf format — 25% fewer tokens with zero quality loss.

25% Fewer Tokens: compressed .zpf format reduces token usage
10/10 Cost Benchmark: perfect score on cost-aware eval
12.5/15 Retrieval v3 Eval: high-quality relevant context
Hybrid Retrieval Mode: semantic + keyword search combined
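ZSE's retrieval internals aren't shown here, but the general idea behind hybrid retrieval can be sketched: blend a semantic similarity score with a keyword-overlap score and rank documents by the weighted sum. A toy illustration (bag-of-words cosine stands in for real embedding similarity, and `alpha` is an assumed weighting, not a ZSE parameter):

```python
# Toy hybrid retrieval: blend "semantic" similarity with keyword overlap.
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    # Bag-of-words cosine; a real system would use embedding vectors here.
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[t] * cb[t] for t in ca)
    na = sqrt(sum(v * v for v in ca.values()))
    nb = sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query terms that appear in the document.
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / len(q) if q else 0.0

def hybrid_score(query: str, doc: str, alpha: float = 0.7) -> float:
    return alpha * cosine(query, doc) + (1 - alpha) * keyword_score(query, doc)

docs = ["how to deploy the server", "recipe for pancakes"]
best = max(docs, key=lambda d: hybrid_score("deployment guide for the server", d))
print(best)  # the deployment document ranks first
```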
RAG commands
# Add documents to a knowledge base
$ zse rag add ./docs/ --collection my-kb

# Search with hybrid retrieval
$ zse rag search "deployment guide" --collection my-kb

# Export as compressed .zpf (25% smaller)
$ zse rag export --collection my-kb --format zpf

Multi-GPU: Tensor & Pipeline Parallelism

Split models across GPUs automatically. Use TP, PP, or both for maximum throughput and memory efficiency.

Tensor Parallelism (TP)

Split each layer across GPUs for lower latency. Best for 2-4 GPUs with fast interconnect.

zse serve ./model.zse -tp 2

Pipeline Parallelism (PP)

Stack layers across GPUs for better memory scaling. Works on any interconnect.

zse serve ./model.zse -pp 4

Hybrid TP+PP

Combine both strategies. TP within nodes, PP across nodes for large-scale deployment.

zse serve ./model.zse -tp 2 -pp 2
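Roughly speaking, TP splits every layer across GPUs and PP stacks layer groups across GPUs, so each of the tp × pp GPUs holds about 1/(tp × pp) of the weights. A back-of-the-envelope sketch (my own arithmetic, not ZSE output; real per-GPU usage adds activations, KV cache, and framework overhead):

```python
# Approximate per-GPU weight memory under tensor and pipeline parallelism.
def per_gpu_weights_gb(model_gb: float, tp: int = 1, pp: int = 1) -> float:
    # TP shards each layer across tp GPUs; PP assigns layer groups to pp
    # GPUs; combined, each GPU holds roughly 1/(tp*pp) of the weights.
    return model_gb / (tp * pp)

# The 19.23 GB 32B .zse file from the benchmarks above:
print(per_gpu_weights_gb(19.23, tp=2))        # ≈ 9.6 GB per GPU
print(per_gpu_weights_gb(19.23, tp=2, pp=2))  # ≈ 4.8 GB per GPU
```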

Simple, Powerful API

Start serving models with just a few lines

terminal
# Install ZSE
$ pip install zllm-zse

# Pull a pre-converted model (instant download)
$ zse pull qwen-7b

# Serve with Multi-GPU Tensor Parallelism
$ zse serve ./model.zse --port 8000 -tp 2

# Add docs for RAG
$ zse rag add ./docs/ --collection my-kb

# Query the OpenAI-compatible API
$ curl localhost:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"model":"default","messages":[{"role":"user","content":"Hello!"}]}'
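The same request can be made from Python. This sketch uses only the standard library and mirrors the JSON body of the curl example above (the host and the `default` model name come from that example; the official OpenAI SDK works the same way if you point `base_url` at http://localhost:8000/v1):

```python
import json
from urllib import request

def build_payload(prompt: str) -> dict:
    # Mirrors the JSON body in the curl example above.
    return {"model": "default",
            "messages": [{"role": "user", "content": prompt}]}

def chat(prompt: str, host: str = "http://localhost:8000") -> str:
    # POST to the OpenAI-compatible chat completions endpoint.
    req = request.Request(
        f"{host}/v1/chat/completions",
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with request.urlopen(req) as resp:
        return json.load(resp)["choices"][0]["message"]["content"]
```

Call `chat("Hello!")` against a running `zse serve` instance to get the assistant's reply as a string.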

Ready to Try ZSE?

Get memory-efficient LLM inference with fast cold starts. Install and start serving in minutes.

Apache 2.0 Licensed
Open Source
PyPI Published