Getting Started

Introduction

ZSE (Z Server Engine) is an ultra memory-efficient LLM inference engine designed for fast cold starts and low memory usage.

What is ZSE?

ZSE is an inference engine that loads large language models in seconds, not minutes. It achieves this through pre-quantized model formats that skip runtime quantization entirely.

Why ZSE?

Traditional inference engines such as vLLM and Hugging Face Transformers can spend 30-60 seconds quantizing models at startup. ZSE's .zse format eliminates this overhead, cutting cold starts to roughly 9 seconds for 7B models (see Benchmarks below).

Whether you're building serverless AI endpoints, developing locally, or deploying to production, ZSE helps you iterate faster and reduce costs.

Key Features

zQuantize

Pre-quantize models to INT4/NF4 format for instant loading

zServe

OpenAI-compatible API server with streaming support

zInfer

CLI tool for quick model testing and inference

zStream

Layer streaming for running large models on limited VRAM

zKV

Quantized KV cache for 4× memory savings
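The 4× figure follows directly from the storage width: FP16 spends 2 bytes per cached value, an INT4 cache half a byte. A back-of-envelope sketch (the model-shape numbers below are illustrative assumptions, not ZSE internals):

```python
# Back-of-envelope sketch of the memory saving from a quantized KV cache.
# The layer/head/dim numbers are illustrative 7B-class values, not ZSE internals.
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_value):
    # K and V each store layers * kv_heads * head_dim values per token.
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_value

fp16 = kv_cache_bytes(28, 4, 128, seq_len=32768, bytes_per_value=2)
int4 = kv_cache_bytes(28, 4, 128, seq_len=32768, bytes_per_value=0.5)
print(f"FP16: {fp16/2**30:.2f} GiB, INT4: {int4/2**30:.2f} GiB ({fp16/int4:.0f}x)")
# → FP16: 1.75 GiB, INT4: 0.44 GiB (4x)
```

The ratio is exactly 2 bytes / 0.5 bytes = 4, independent of model shape or context length.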

OpenAI API

Drop-in replacement for OpenAI's chat completions API

  • 9.1s cold start for 7B models with embedded config/tokenizer
  • 24.1s cold start for 32B models
  • 58.7 tok/s (7B) and 26.9 tok/s (32B) with bnb.matmul_4bit
  • 32B fits on 24GB consumer GPUs (RTX 3090/4090)
  • Single .zse file format - no network calls needed
  • Streaming token generation
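Because the server speaks the OpenAI wire format, streamed tokens arrive as standard Server-Sent Events: `data: {...}` lines terminated by `data: [DONE]`. A minimal stdlib-only sketch of consuming them — the sample lines are illustrative, not captured ZSE output:

```python
# Sketch: extracting text deltas from OpenAI-style streaming SSE lines.
# The sample below is illustrative; it mirrors the OpenAI chunk format,
# not verbatim ZSE output.
import json

def extract_deltas(sse_lines):
    """Pull the text deltas out of chat-completion SSE lines."""
    out = []
    for line in sse_lines:
        if not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload.strip() == "[DONE]":
            break
        delta = json.loads(payload)["choices"][0]["delta"]
        if "content" in delta:
            out.append(delta["content"])
    return out

sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
print("".join(extract_deltas(sample)))  # → Hello
```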

Benchmarks

Performance benchmarks on an NVIDIA H200 with Qwen 2.5 Instruct models (v1.2.0):

| Model | File Size | VRAM | Speed | Load Time |
|-------|-----------|------|-------|-----------|
| 7B | 5.57 GB | 5.9 GB | 58.7 tok/s | 9.1s |
| 32B | 19.23 GB | 20.9 GB | 26.9 tok/s | 24.1s |

32B models now fit on 24GB consumer GPUs (RTX 3090/4090) thanks to bnb.matmul_4bit integration.

Quick Install

Install ZSE from PyPI:

```bash
pip install zllm-zse
```

For GGUF model support, install with the optional dependency:

```bash
pip install zllm-zse[gguf]
```

Start a server with a pre-trained model:

```bash
# Start the server
zse serve Qwen/Qwen2.5-7B-Instruct

# Or with custom settings
zse serve Qwen/Qwen2.5-7B-Instruct --port 8080 --host 0.0.0.0
```
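With the server running, any OpenAI-compatible client can talk to it. A minimal stdlib-only sketch — `localhost:8000` is an assumed default, so match it to whatever `--port` you chose; the endpoint path follows the OpenAI chat-completions convention:

```python
# Sketch: calling a running ZSE server's chat-completions endpoint with only
# the standard library. localhost:8000 is an assumption -- match your --port.
import json
import urllib.request

def build_chat_request(model: str, prompt: str, stream: bool = False) -> dict:
    """JSON body for a POST to /v1/chat/completions (OpenAI-style)."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "stream": stream,
    }

body = build_chat_request("Qwen/Qwen2.5-7B-Instruct", "Say hello in one sentence.")

# Uncomment once the server is up:
# req = urllib.request.Request(
#     "http://localhost:8000/v1/chat/completions",
#     data=json.dumps(body).encode("utf-8"),
#     headers={"Content-Type": "application/json"},
# )
# with urllib.request.urlopen(req) as resp:
#     print(json.load(resp)["choices"][0]["message"]["content"])
```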

Next Steps