Getting Started

Introduction

ZSE (Z Server Engine) is an ultra memory-efficient LLM inference engine designed for fast cold starts and low memory usage.

What is ZSE?

ZSE is an inference engine that loads large language models in seconds, not minutes. It achieves this through pre-quantized model formats that skip runtime quantization entirely.

Why ZSE?

Traditional inference engines such as vLLM and Hugging Face Transformers can spend 30-60 seconds quantizing models at startup. ZSE's .zse format eliminates this overhead, cutting cold starts to roughly 9 seconds for 7B models (see Benchmarks below).

Whether you're building serverless AI endpoints, developing locally, or deploying to production, ZSE helps you iterate faster and reduce costs.

Key Features

zQuantize

Pre-quantize models to INT4/NF4 format for instant loading

zServe

OpenAI-compatible API server with streaming support

zInfer

CLI tool for quick model testing and inference

zStream

Layer streaming for running large models on limited VRAM

zKV

Quantized KV cache for 4× memory savings
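The 4× figure follows directly from the storage width: FP16 spends 2 bytes per cached value, an INT4 cache half a byte. A back-of-envelope sketch (the model-shape numbers below are illustrative assumptions, not ZSE internals):

```python
# Back-of-envelope sketch of the memory saving from a quantized KV cache.
# The layer/head/dim numbers are illustrative 7B-class values, not ZSE internals.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(28, 4, 128, seq_len=32768, bytes_per_value=2)
int4 = kv_cache_bytes(28, 4, 128, seq_len=32768, bytes_per_value=0.5)
print(f"FP16: {fp16/2**30:.2f} GiB, INT4: {int4/2**30:.2f} GiB ({fp16/int4:.0f}x)")
# → FP16: 1.75 GiB, INT4: 0.44 GiB (4x)
```

The ratio is exactly 2 bytes / 0.5 bytes = 4, independent of model shape or context length.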

OpenAI API

Drop-in replacement for OpenAI's chat completions API

  • 9.1s cold start for 7B models with embedded config/tokenizer
  • 24.1s cold start for 32B models
  • 58.7 tok/s (7B) and 26.9 tok/s (32B) with bnb.matmul_4bit
  • 32B fits on 24GB consumer GPUs (RTX 3090/4090)
  • Single .zse file format - no network calls needed
  • Streaming token generation
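Because the server speaks the OpenAI wire format, streamed tokens arrive as standard Server-Sent Events: `data: {...}` lines terminated by `data: [DONE]`. A minimal stdlib-only sketch of consuming them — the sample lines are illustrative, not captured ZSE output:

```python
# Sketch: extracting text deltas from OpenAI-style streaming SSE lines.
# The sample below is illustrative; it mirrors the OpenAI chunk format,
# not verbatim ZSE output.
import json

def extract_deltas(sse_lines):
    """Pull the text deltas out of chat-completion SSE lines."""
    out = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            out.append(delta["content"])
    return out

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # → Hello
```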

Benchmarks

Performance benchmarks on an NVIDIA H200 with Qwen 2.5 Instruct models (v1.2.0):

| Model | File Size | VRAM | Speed | Load Time |
|-------|-----------|------|-------|-----------|
| 7B | 5.57 GB | 5.9 GB | 58.7 tok/s | 9.1s |
| 32B | 19.23 GB | 20.9 GB | 26.9 tok/s | 24.1s |

32B models now fit on 24GB consumer GPUs (RTX 3090/4090) thanks to bnb.matmul_4bit integration.

Quick Install

Install ZSE from PyPI:

```bash
pip install zllm-zse
```

For GGUF model support, install with the optional dependency:

```bash
pip install zllm-zse[gguf]
```

Start a server with a pre-trained model:

```bash
# Start the server
zse serve Qwen/Qwen2.5-7B-Instruct

# Or with custom settings
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
```
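With the server running, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch — `localhost:8000` is an assumed default, so match it to whatever `--port` you chose; the endpoint path follows the OpenAI chat-completions convention:

```python
# Sketch: calling a running ZSE server's chat-completions endpoint with only
# the standard library. localhost:8000 is an assumption -- match your --port.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """JSON body for a POST to /v1/chat/completions (OpenAI-style)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Say hello in one sentence.")

# Uncomment once the server is up:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```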

Next Steps