The Microsecond War — Optimizing Higgs Audio v3 TTS Inference

SGLang vs SGLang-Omni: TTS Is Not an LLM

Same Qwen3 backbone inside — but wrapped in a neural codec, a multi-codebook delay schedule, and a waveform decoder. A completely different serving problem.

SGLANG · LLM INFERENCE

One model, one pattern

text in → text out

TRANSFORMER → tok tok tok tok → user, directly

Single stage. Prefill + decode on one transformer; one scheduler owns the GPU.
The token is the product — a sampled token goes straight to the user.
Chunking is free. Stream by flushing tokens as they sample.
One head, one vocabulary — one token per step.
Latency is dominated by the backbone forward pass.

SGLANG-OMNI · TTS INFERENCE

A pipeline, not a model

text + reference voice in → waveform out

ENCODER ▸ AR ×8 HEADS ▸ VOCODER

Four heterogeneous stages — different compute, latency, and memory behavior.
The token is an intermediate — worthless until a second GPU model decodes it.
Chunking is dangerous — naive splits click audibly at every seam.
Eight codebooks per step, offset by a delay pattern — a state machine.
Latency is dominated by per-step overhead × 400–800 steps.

The consequence: bolting an encoder and vocoder onto an LLM server's request loop serializes everything. The fix is architectural: first-class stages, run asynchronously.

Why Multi-Stage Async Wins

Each stage becomes an independent process with its own scheduler. Same workload, two designs — the playhead is wall-clock time.

✕ MONOLITHIC / SYNCHRONOUS — B cannot start until A's whole pipeline drains

✓ MULTI-STAGE ASYNC — encoder, AR engine and vocoder each work on a different request

Legend: encoder · AR decode · vocoder · not yet run

Same three requests, same per-stage costs. The async pipeline finishes all three before the synchronous design finishes two — stage overlap is pure recovered wall-clock.
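The comparison above can be reproduced with a small timing model. This is a sketch under stated assumptions, not a measurement: the per-stage costs are made-up units, every request arrives at t=0, each stage handles one request at a time, and both function names are hypothetical.

```python
def synchronous_finish_times(costs, n_requests):
    """One request at a time: the next request starts only after the previous
    one has passed through every stage."""
    per_request = sum(costs)
    return [per_request * (i + 1) for i in range(n_requests)]


def pipelined_finish_times(costs, n_requests):
    """Each stage is its own worker; a stage starts a request as soon as the
    request has left the previous stage and the stage itself is free."""
    stage_free_at = [0] * len(costs)
    finish_times = []
    for _ in range(n_requests):
        ready_at = 0  # all requests are assumed to arrive at t=0
        for stage, cost in enumerate(costs):
            start = max(ready_at, stage_free_at[stage])
            ready_at = start + cost
            stage_free_at[stage] = ready_at
        finish_times.append(ready_at)
    return finish_times


# Hypothetical costs in arbitrary time units: encoder, AR decode, vocoder.
COSTS = [3, 4, 2]
sync_done = synchronous_finish_times(COSTS, 3)
async_done = pipelined_finish_times(COSTS, 3)
print("synchronous:", sync_done)   # [9, 18, 27]
print("pipelined:  ", async_done)  # [9, 13, 17]
```

With these toy numbers the pipeline finishes the third request at t=17, before the synchronous design finishes its second at t=18. In this one-at-a-time model the gap depends on how balanced the stages are; a batching AR stage, which the model leaves out, changes the picture further.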

BENEFIT / 01

Right scheduler per stage

The AR stage runs OmniScheduler with SGLang's continuous batching, KV cache and CUDA graphs; preprocessing gets a ThreadedSimpleScheduler thread pool; the encoder a single-threaded SimpleScheduler; the vocoder a StreamingVocoderScheduler.

BENEFIT / 02

Memory as a contract

Each stage on a GPU declares total_gpu_memory_fraction; placement validation sums the budget per GPU before startup. Encoder, AR engine, and vocoder coexist on one H100 without starving each other.
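A minimal sketch of what such a budget check could look like. Only the field name total_gpu_memory_fraction comes from the text; the StagePlacement class, the validate_placement function, and the example fractions are hypothetical and not the project's actual code.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class StagePlacement:
    name: str
    gpu_id: int
    total_gpu_memory_fraction: float  # share of that GPU the stage may use


def validate_placement(stages, limit=1.0):
    """Sum declared fractions per GPU and fail fast if any GPU is oversubscribed."""
    budget = defaultdict(float)
    for stage in stages:
        if not 0.0 < stage.total_gpu_memory_fraction <= limit:
            raise ValueError(f"{stage.name}: fraction must be in (0, {limit}]")
        budget[stage.gpu_id] += stage.total_gpu_memory_fraction
    for gpu_id, total in budget.items():
        if total > limit + 1e-9:  # small tolerance for float rounding
            raise ValueError(f"GPU {gpu_id} oversubscribed: {total:.2f} > {limit}")
    return dict(budget)


# Hypothetical split of one GPU between the three GPU stages.
plan = [
    StagePlacement("encoder", 0, 0.10),
    StagePlacement("ar_engine", 0, 0.75),
    StagePlacement("vocoder", 0, 0.10),
]
budget = validate_placement(plan)
```

Running the check before any stage allocates memory turns an out-of-memory crash at load time into a configuration error at startup.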

BENEFIT / 03

Streaming-aware data movement

Control messages travel a ZMQ/msgpack control plane; tensors move on a relay data plane (shm, nccl, nixl, mooncake), with CUDA IPC for same-GPU chunks.

The Four Stages — and What Each One Got

Follow the green pulse: four stations, four different bottlenecks, four different cures.

CPU · I/O BOUND

1 · Preprocessing

Tokenizes text, loads the reference audio. Negligible FLOPs — but every stage hop costs a queue and a serialization boundary.

text + ref audio → tensors

ThreadedSimpleScheduler

GPU · STATIC COMPUTE

2 · Audio Encoder

HiggsAudioCodec compresses the reference waveform into discrete tokens — the voice-cloning prompt. 50–100ms, deterministic per input.

waveform → [T, 8] @ 25fps

LRU cache · batched encoder · torch.compile

GPU · AUTOREGRESSIVE

3 · AR Engine

Qwen3-4B backbone emits 8 codebook tokens per step under the delay pattern. 400–800 steps per 10s of speech.

step-by-step → [B, 8, V]

CUDA graph · radix cache · paged attention · async decode

GPU · FAST BUT QUEUE-PRONE

4 · Vocoder

The DAC decoder inverts tokens back into waveform. ~10ms per call, but every AR loop converges here.

tokens → waveform @ 25fps

batched decode · windowed streaming

One structure rules them all: the delay pattern

Watch the cursor: at each step, codebook i only starts emitting i steps after codebook 0 (green — pitch and intonation). The staircase repeats forever.

It defines a four-phase per-request state machine — delay → active → wind-down → finished — that the encoder must emit, the AR engine must advance every step, and the vocoder must invert. Each deep dive below collides with it.
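The staircase and its four phases can be written down in a few lines. This is a sketch with plain Python lists: the PAD filler value, the function names, and the exact step boundaries between phases are assumptions for illustration (it also assumes at least K − 1 frames), not the project's implementation.

```python
PAD = -1  # hypothetical filler id for cells where a codebook is not emitting


def apply_delay(frames, pad=PAD):
    """frames: T rows of K codes -> T + K - 1 rows, codebook k shifted down by k."""
    T, K = len(frames), len(frames[0])
    rows = [[pad] * K for _ in range(T + K - 1)]
    for t, frame in enumerate(frames):
        for k, code in enumerate(frame):
            rows[t + k][k] = code
    return rows


def invert_delay(rows):
    """Undo apply_delay: realign every codebook so each output row is one frame."""
    K = len(rows[0])
    T = len(rows) - (K - 1)
    return [[rows[t + k][k] for k in range(K)] for t in range(T)]


def phase(step, T, K):
    """Per-request phase at a given step, for T frames and K codebooks."""
    if step < K - 1:
        return "delay"       # some codebooks have not started yet
    if step < T:
        return "active"      # every codebook emits
    if step < T + K - 1:
        return "wind-down"   # early codebooks are done, late ones catch up
    return "finished"


frames = [[10 * t + k for k in range(3)] for t in range(4)]  # T=4, K=3 toy codes
rows = apply_delay(frames)
for step, row in enumerate(rows):
    print(step, phase(step, 4, 3), row)
```

The encoder side corresponds to apply_delay, the vocoder side to invert_delay, and the AR engine is what moves each request through phase() one step at a time.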

Legend: codebook 0 · active · waiting on delay

Deep Dives, Stage by Stage

PR numbers link to the public roadmap, sglang-omni #478: "low latency and high throughput with no model performance reduction."

Preprocessing — CPU work gets a CPU scheduler

CPU · tokenization + reference-audio I/O · ThreadedSimpleScheduler

Tokenize text, load and decode the reference audio. Pure CPU + I/O work — it needs none of the GPU scheduler's machinery (batching, KV cache, CUDA graphs).

So it runs under a CPU-side ThreadedSimpleScheduler — a thread pool (max_concurrency) that preprocesses several requests in parallel while the GPU stages work on earlier ones.

work text → token ids · audio file → waveform tensor
cost I/O-bound, negligible FLOPs
scheduler ThreadedSimpleScheduler — CPU thread pool, no GPU machinery

SCHED · Pipelined off the GPU's critical path

The thread pool preprocesses requests in parallel; while the GPU serves request 1, requests 2–4 are already prepped — the GPU never waits on I/O.
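The idea can be sketched with the standard library's thread pool. Only max_concurrency is named in the text; the preprocess function, its toy "tokenizer", and the request layout are stand-ins invented for this example, not ThreadedSimpleScheduler itself.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(request):
    """Stand-in for tokenization plus reference-audio loading (both hypothetical)."""
    token_ids = [ord(ch) for ch in request["text"]]   # toy "tokenizer"
    waveform = list(request["audio_bytes"])           # toy "audio decoder"
    return {"id": request["id"], "token_ids": token_ids, "waveform": waveform}


def preprocess_all(requests, max_concurrency=4):
    """Run preprocessing for several requests in parallel on CPU threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() preserves input order even when workers finish out of order.
        return list(pool.map(preprocess, requests))


requests = [
    {"id": i, "text": f"hello {i}", "audio_bytes": bytes([i, i + 1])}
    for i in range(4)
]
prepped = preprocess_all(requests)
```

Threads fit here because the work is I/O-bound: file reads and decoding release the interpreter lock, so several requests make progress at once.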

Encoder — the fastest compute is no compute

LRU cache · batched encoding · torch.compile

The encoder is expensive (50–100ms), deterministic, and repetitive — users reuse a handful of voices across thousands of prompts. Most encoder work is recomputation of known answers, so: cache it.

arch DAC-style codec, weights shared with vocoder
out [T, 8] tokens → rearranged into delay pattern
key fact deterministic per input → cacheable

LRU · LRU cache keyed by reference audio

Seen this voice before? Return the precomputed tokens — 50–100ms saved per hit, and the GPU stays free for AR decode.

Request 2 reuses voice A: the 80ms encode collapses to a cache lookup. Voice B misses and pays full price.

#562 #563 #605
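A minimal sketch of an encoder cache along these lines. The text says only that the cache is keyed by reference audio; hashing the raw bytes with SHA-256, the EncoderLRU class, and the fake_encode placeholder are assumptions made for the example.

```python
import hashlib
from collections import OrderedDict


class EncoderLRU:
    def __init__(self, encode_fn, capacity=128):
        self.encode_fn = encode_fn      # the expensive encoder call
        self.capacity = capacity
        self.entries = OrderedDict()    # key -> tokens, oldest first
        self.hits = 0
        self.misses = 0

    def get_tokens(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self.entries:
            self.entries.move_to_end(key)     # mark as most recently used
            self.hits += 1
            return self.entries[key]
        self.misses += 1
        tokens = self.encode_fn(audio_bytes)
        self.entries[key] = tokens
        if len(self.entries) > self.capacity:
            self.entries.popitem(last=False)  # evict least recently used
        return tokens


def fake_encode(audio_bytes):
    """Placeholder for the real encoder: one row of 8 codes per input byte."""
    return [[b] * 8 for b in audio_bytes]


cache = EncoderLRU(fake_encode, capacity=2)
voice_a, voice_b = b"voice-a", b"voice-b"
cache.get_tokens(voice_a)   # miss: pays the full encode
cache.get_tokens(voice_a)   # hit: lookup only
cache.get_tokens(voice_b)   # miss
```

Because the encoder is deterministic per input, a hit returns exactly what a fresh encode would have produced.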

COMPILE · torch.compile on the encoder

Static shapes, a convolutional stack, no branching — the textbook compile workload. Inductor fuses the small kernels into fewer, larger ones.

Same math, fewer launches, earlier finish. On a cache miss this stacks with the LRU above.

#612

AR Engine — paged attention, radix cache, CUDA graphs

Qwen3-4B backbone · fused 8-codebook head · 400–800 steps per 10s of speech

Each step: backbone forward → fused 8-head projection [B, 8, V] → sample → advance the delay state machine → report to CPU.

At 400–800 steps per request the enemy is per-step fixed cost, not FLOPs: ~0.1ms of launch + sync overhead per step adds up to 40–80ms wasted per request.

per-step forward · 8-head projection · sample · D2H
state machine delay → active → wind-down → finished
enemy launch + sync overhead × ~600 steps
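As a worked version of that arithmetic, using only the figures quoted in this section (400–800 steps, about 0.1 ms of fixed cost per step). The symbols are introduced here for illustration.

```latex
t_{\text{overhead}} = n_{\text{steps}} \times c_{\text{step}}, \qquad
400 \times 0.1\,\text{ms} = 40\,\text{ms}, \quad
600 \times 0.1\,\text{ms} = 60\,\text{ms}, \quad
800 \times 0.1\,\text{ms} = 80\,\text{ms}
```

The cost is linear in the step count, so every microsecond shaved off a step is multiplied several hundred times per request.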

PAGED · Paged attention for free

KV lives in fixed-size pages: no external fragmentation, no over-reservation. That's what lets 16 AR loops share one H100 with the encoder and vocoder. Inherited for free by reusing OmniScheduler.

contiguous

paged

A finishes, D needs 10 blocks: contiguous rejects (free but fragmented); paged scatters D across whatever is free.

#476
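The figure's scenario can be replayed on a toy block pool. This is a sketch only: the 20-block layout and the allocator functions are hypothetical, and a real paged KV cache tracks far more than block ownership.

```python
def largest_free_run(owner):
    """Length of the longest run of consecutive free (None) blocks."""
    best = run = 0
    for slot in owner:
        run = run + 1 if slot is None else 0
        best = max(best, run)
    return best


def alloc_contiguous(owner, name, n):
    """Succeeds only if n consecutive free blocks exist."""
    run = 0
    for i, slot in enumerate(owner):
        run = run + 1 if slot is None else 0
        if run == n:
            for j in range(i - n + 1, i + 1):
                owner[j] = name
            return True
    return False


def alloc_paged(owner, name, n):
    """Succeeds if any n free blocks exist; returns the block table or None."""
    free = [i for i, slot in enumerate(owner) if slot is None]
    if len(free) < n:
        return None
    for i in free[:n]:
        owner[i] = name
    return free[:n]


# Hypothetical 20-block pool: A=0..5, B=6..11, free=12..15, C=16..19.
pool = ["A"] * 6 + ["B"] * 6 + [None] * 4 + ["C"] * 4
pool = [None if slot == "A" else slot for slot in pool]  # request A finishes

contiguous_ok = alloc_contiguous(list(pool), "D", 10)  # False: longest run is 6
block_table = alloc_paged(list(pool), "D", 10)         # scattered over both gaps
```

Ten blocks are free in total, but no run is longer than six, so only the paged allocator can place D.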

RADIX · Radix cache, partitioned by voice

RadixAttention keeps prior KV in a prefix tree; in TTS the shared prefix is the reference voice, namespaced via extra_key. It stacks with the encoder LRU: a repeated voice skips both the encode and the prefill.

Same shape as the LRU figure — that's the point: the two caches stack.
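A toy version of namespaced prefix matching, to make the partitioning concrete. The PrefixCache class and the token values are invented for this sketch; a real radix tree compresses edges and points at KV pages, and only the name extra_key is taken from the text.

```python
class PrefixCache:
    """Toy prefix tree over token ids, one independent tree per extra_key."""

    def __init__(self):
        self.roots = {}  # extra_key -> nested dict trie

    def insert(self, extra_key, tokens):
        node = self.roots.setdefault(extra_key, {})
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix(self, extra_key, tokens):
        """Return how many leading tokens already have cached state."""
        node = self.roots.get(extra_key, {})
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


cache = PrefixCache()
voice_prompt = [7, 7, 3, 9]                 # hypothetical reference-voice tokens
cache.insert("voice-A", voice_prompt + [100, 101])

reused = cache.match_prefix("voice-A", voice_prompt + [200, 201])
other_voice = cache.match_prefix("voice-B", voice_prompt + [200, 201])
```

The second request under the same key reuses the four voice-prompt tokens and prefills only its new text; the same tokens under a different key match nothing, which is the isolation the namespace provides.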

CONCEPT · What a CUDA graph actually is

Each kernel launch costs the CPU 5–10µs; a step of dozens of tiny kernels leaves the GPU starving between them. A CUDA graph records the sequence once and replays it as a single launch:

✕ EAGER — CPU launches every kernel; GPU starves between launches

✓ CUDA GRAPH — one launch, kernels packed back-to-back

The recording is frozen — so the delay-pattern state machine was rewritten as branchless, in-place tensor ops on fixed buffers: control flow turned into data flow. (#503)
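What "control flow turned into data flow" means can be shown without a GPU. The sketch below uses plain Python lists to stay dependency-free; in a tensor framework the per-codebook loop would be a single elementwise operation. The function name, the pad value, and the assumption that the frame count is known up front are all illustrative, not how #503 is written.

```python
def advance_step(codes, step, num_frames, sampled, pad):
    """In-place, branch-free update of one request's K-slot code buffer.

    codes:      fixed-size list of K ints, overwritten every step
    step:       current decode step (0-based)
    num_frames: number of frames this request emits (assumed known here)
    sampled:    K freshly sampled codes, one per codebook
    """
    for k in range(len(codes)):
        # 1 while codebook k is inside its emitting window [k, k + num_frames), else 0.
        active = int(step >= k) * int(step < k + num_frames)
        # Select by arithmetic instead of `if`: the same expression runs every step.
        codes[k] = active * sampled[k] + (1 - active) * pad


codes = [0, 0, 0]        # fixed buffer, reused for the whole request
history = []
for step in range(6):    # 4 frames, 3 codebooks -> 6 steps
    advance_step(codes, step, num_frames=4, sampled=[5, 6, 7], pad=-1)
    history.append(list(codes))
```

Every step executes the identical sequence of operations on the same buffer; only the values differ, which is the property a frozen recording needs.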

ASYNC · Async decode: launch-first one-step lookahead

Every step ends with a heavy host-side collect: sampler-state scatter, delay-pattern bookkeeping, EOC/finish handling, and a D2H of the codes snapshot — in Python, every step. The GPU used to sit idle through all of it.

The fix splits execute() in two: launch (forward + on-GPU sample + async D2H into a pinned ping-pong buffer + CUDA event) and resolve (the host collect). The loop runs launch(N) then resolve(N−1) — step N−1's CPU collect hides under step N's forward:

✕ SYNC — GPU idles through every step's host collect

✓ ASYNC LOOKAHEAD — forwards run back-to-back; collect(N−1) overlaps forward(N)
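The loop ordering can be sketched on its own. Here launch and resolve are stand-ins that only record call order, and the handle layout is hypothetical; the reordering is safe only because sampling stays on the GPU, so launching step N does not need the host-side result of step N−1.

```python
def run_sync(num_steps, launch, resolve):
    """Baseline: finish step N's host work before launching step N+1."""
    for n in range(num_steps):
        handle = launch(n)
        resolve(handle)


def run_lookahead(num_steps, launch, resolve):
    """Launch step N first, then do the host-side collect for step N-1."""
    pending = None
    for n in range(num_steps):
        handle = launch(n)          # enqueue GPU work for step N, returns at once
        if pending is not None:
            resolve(pending)        # host collect for step N-1 overlaps step N
        pending = handle
    if pending is not None:
        resolve(pending)            # drain the last step


def make_recorder():
    """Stand-ins that only log call order; buffers alternate like a ping-pong pair."""
    log = []

    def launch(n):
        log.append(f"launch({n})")
        return {"step": n, "buffer": n % 2}

    def resolve(handle):
        log.append(f"resolve({handle['step']})")

    return log, launch, resolve


log, launch, resolve = make_recorder()
run_lookahead(3, launch, resolve)
print(log)
```

Two buffers are enough because at most one step is ever waiting to be resolved while the next one is in flight.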

Output is bit-identical ON vs OFF; bs=1 takes a sync fast path (break-even by design — nothing to overlap). Full SeedTTS-EN @ concurrency 16: throughput +12.7%, mean latency −16.1%, RTF p99 −39% — the win grows with batch size and tail length.

#590

Vocoder — streaming without the clicks

DAC decoder · ~10ms per call · batched decode + windowed streaming

Queueing: every AR loop converges here — 16 requests decoded serially leave the last waiting 240ms.

Streaming: audio can't be chunked like text — the codec carries state across time (chunks click at every seam), and mid-stream delay rows are incomplete (decoding them injects noise).

stride 75 frames — accumulate, then decode
overlap 8 frames — re-decode boundary + crossfade
hold-back 4 frames — incomplete tail rows wait
floor ≥ N rows to invert the pattern → TTFB ≈ 300–400ms
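One way to read the stride / overlap / hold-back parameters is as a window planner. The sketch below is an interpretation, not the StreamingVocoderScheduler: the WindowPlanner class and its exact semantics (emit in strides, re-decode the previous 8 frames for the crossfade, keep the newest 4 frames back until the stream ends) are assumptions built from the three numbers above.

```python
class WindowPlanner:
    """Plans (decode_start, decode_end, emit_start, emit_end) frame ranges."""

    def __init__(self, stride=75, overlap=8, hold_back=4):
        self.stride = stride
        self.overlap = overlap
        self.hold_back = hold_back
        self.received = 0   # frames produced by the AR engine so far
        self.emitted = 0    # frames already turned into audio

    def feed(self, n_new, final=False):
        """Register n_new frames; return the windows that are ready to decode."""
        self.received += n_new
        # The newest rows are incomplete under the delay pattern until the end.
        usable = self.received if final else self.received - self.hold_back
        windows = []
        while usable - self.emitted >= self.stride:
            windows.append(self._window(self.emitted + self.stride))
        if final and usable > self.emitted:
            windows.append(self._window(usable))  # flush the tail
        return windows

    def _window(self, emit_end):
        # Re-decode a few frames before the seam so the boundary can be crossfaded.
        decode_start = max(0, self.emitted - self.overlap)
        window = (decode_start, emit_end, self.emitted, emit_end)
        self.emitted = emit_end
        return window


planner = WindowPlanner()
first = planner.feed(78)                # 74 usable frames: not enough yet
second = planner.feed(1)                # 75 usable frames: first window
third = planner.feed(100)               # second window, re-decoding from frame 67
last = planner.feed(0, final=True)      # flush, including the held-back tail
```

On the startup floor: at 25 fps one row covers 40 ms of audio. If N equals the 8 codebooks (an assumption; the text does not state N), the floor is 8 × 40 ms = 320 ms, which would sit inside the quoted 300–400 ms.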

BATCH · Batched vocoder decode

Collect requests in a 2ms window, zero-pad, decode as one decode_batch() call, trim to true length.

Serial: the last request queues behind five others. Batched: everyone leaves in one GPU call.

#569 #574
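The pad, decode, trim sequence can be sketched as follows. The name decode_batch comes from the text, but its signature is not given there, so it is passed in as a callable; the fake decoder, the toy samples_per_frame value, and the function decode_padded_batch are assumptions. The 2ms collection window is left out.

```python
def decode_padded_batch(requests, decode_batch, pad_code=0, samples_per_frame=4):
    """Zero-pad to a common length, decode once, trim each output to its true length.

    requests:     list of per-request frame lists (each frame is a list of codes)
    decode_batch: callable taking a rectangular batch and returning one
                  waveform (list of samples) per request
    """
    lengths = [len(frames) for frames in requests]
    max_len = max(lengths)
    width = len(requests[0][0])
    batch = [
        frames + [[pad_code] * width] * (max_len - len(frames))
        for frames in requests
    ]
    waveforms = decode_batch(batch)  # one call for the whole batch
    return [wave[: n * samples_per_frame] for wave, n in zip(waveforms, lengths)]


def fake_decode_batch(batch, samples_per_frame=4):
    """Placeholder decoder: repeats each frame's first code samples_per_frame times."""
    return [
        [frame[0] for frame in frames for _ in range(samples_per_frame)]
        for frames in batch
    ]


short = [[1] * 8, [2] * 8]              # 2 frames of 8 codes
long = [[3] * 8, [4] * 8, [5] * 8]      # 3 frames of 8 codes
audio = decode_padded_batch([short, long], fake_decode_batch)
```

Trimming by the true length matters: without it the shorter request would carry audio decoded from padding.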

RACE · Non-streaming vs streaming, same request

Measured against api.boson.ai: first audio at ~0.7s streaming vs ~1.5s non-streaming. Same generation cost — the difference is when sound starts.

STREAM · The window protocol, animated

Rows: codes arriving · decode window · emitted audio

Cyan leading edge = crossfade overlap; dashed cells = held-back tail. Real parameters 75 / 8 / 4, scaled for display.

#597 #614

Irreducible startup: the vocoder needs ≥ N rows to invert the pattern → streaming TTFB ≈ 300–400ms, under the 500ms conversational bar. Verify it in the demo below.

Live Demo: Clone Your Voice

Record your voice as the reference audio, type a sentence, and race non-streaming against streaming. Calls go directly from your browser to api.boson.ai.

POST /v1/audio/speech · model: higgs-audio-v3-tts

STEP 1 · REFERENCE VOICE · 3–15s of speech works best · skip to use the default voice

STEP 2 · TEXT TO SPEAK

STEP 3 · GENERATE — RACE THE TWO PATHS

NON-STREAMING

STREAMING (24kHz PCM)

// audio is sent to the Boson AI API for synthesis and not stored by this page · mic access stays in your browser

Overall Performance: One H100

Full Seed-TTS EN set (N=1088, mean of 3 runs), bf16 + CUDA graph, max_running_requests=16. RTF < 1 means faster than real time.

Concurrency	Throughput req/s	Avg latency	RTF / req	audio s/s
1	1.62	617 ms	0.147	6.89
2	2.70	742 ms	0.180	11.37
4	5.45	733 ms	0.177	22.84
8	8.91	898 ms	0.217	37.38
16	14.74	1079 ms	0.262	61.84

// 9× the audio throughput from 1→16 concurrency while per-request latency grows only 1.7× — the multi-stage async pipeline absorbing load.
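The ratios behind that summary, computed from the first and last rows of the table:

```latex
\frac{61.84}{6.89} \approx 8.98 \quad (\text{audio s/s}), \qquad
\frac{14.74}{1.62} \approx 9.10 \quad (\text{req/s}), \qquad
\frac{1079\,\text{ms}}{617\,\text{ms}} \approx 1.75 \quad (\text{avg latency})
```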

AUDIO THROUGHPUT — SECONDS OF AUDIO PRODUCED PER SECOND

concurrency 1: 6.89
concurrency 2: 11.37
concurrency 4: 22.84
concurrency 8: 37.38
concurrency 16: 61.84

RTF 0.147 → 0.262 · 16× the load, still far below real time

AND QUALITY HELD — WER/CER ACROSS PUBLIC BENCHMARKS

Seed-TTS · 2 languages: 1.11
CV3 · 9 languages: 4.41
MiniMax-Multilingual · 23 languages: 2.74
Higgs-Multilingual · 111 languages & dialects: 3.61

// Plus inline control tags: 20+ emotions, styles (singing, whispering, shouting), prosody (speed, pitch, pauses), and sound events (laughter, sigh, cough) — all composable.

"The best system optimizations come not from blindly applying popular techniques,
but from deep theoretical understanding, the desire to build elegant systems,
and clear-eyed engineering trade-offs."
— TTS OPTIMIZATION NOTES · SGLANG-OMNI