How Higgs Audio v3 TTS is served on SGLang-Omni. TTS inference looks like LLM inference but breaks every one of its assumptions — so the system was rebuilt as a multi-stage asynchronous pipeline, then each stage was attacked with its own weapon: LRU caching, CUDA graphs, radix caching, and windowed streaming vocoding. Every diagram on this page is animated — and there's a live voice-cloning demo at the end.
Same Qwen3 backbone inside — but wrapped in a neural codec, a multi-codebook delay schedule, and a waveform decoder. A completely different serving problem.
Each stage becomes an independent process with its own scheduler. Same workload, two designs — the playhead is wall-clock time.
Same three requests, same per-stage costs. The async pipeline finishes all three before the synchronous design finishes two — stage overlap is pure recovered wall-clock.
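The overlap claim can be checked with a small makespan calculation. The sketch below is illustrative only: the per-stage costs are made-up numbers, not measurements, and `sync_finish_times` and `pipelined_finish_times` are hypothetical helper names. It also assumes each stage handles one request at a time, which understates the real AR stage, since that one batches.

```python
def sync_finish_times(stage_costs, n_requests):
    """Synchronous design: a request runs every stage before the next one starts."""
    per_request = sum(stage_costs)
    return [per_request * (i + 1) for i in range(n_requests)]


def pipelined_finish_times(stage_costs, n_requests):
    """Async design: each stage is its own worker and starts a request as soon
    as the previous stage has handed it over and the stage itself is free."""
    stage_free_at = [0] * len(stage_costs)
    finish = []
    for _ in range(n_requests):
        ready = 0  # all requests arrive at t = 0
        for s, cost in enumerate(stage_costs):
            start = max(ready, stage_free_at[s])
            ready = start + cost
            stage_free_at[s] = ready
        finish.append(ready)
    return finish


# Assumed costs in ms: preprocess, encode, AR decode, vocode.
costs = [40, 80, 120, 60]
sync = sync_finish_times(costs, 3)            # [300, 600, 900]
pipelined = pipelined_finish_times(costs, 3)  # [300, 420, 540]
```

With these assumed costs the pipeline completes its third request at 540 ms, before the synchronous design completes its second at 600 ms.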
The AR stage runs OmniScheduler with SGLang's continuous batching, KV cache and CUDA graphs; preprocessing gets a ThreadedSimpleScheduler thread pool; the encoder a single-threaded SimpleScheduler; the vocoder a StreamingVocoderScheduler.
Each stage on a GPU declares total_gpu_memory_fraction; placement validation sums the budget per GPU before startup. Encoder, AR engine, and vocoder coexist on one H100 without starving each other.
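A minimal sketch of that placement check, assuming stages are described by plain dicts. Only the `total_gpu_memory_fraction` field name comes from the text above; the dict layout, the example fractions, and the `validate_placement` helper are hypothetical.

```python
from collections import defaultdict


def validate_placement(stages, limit=1.0):
    """Sum the declared fractions per GPU and reject over-subscription."""
    budget = defaultdict(float)
    for stage in stages:
        budget[stage["gpu"]] += stage["total_gpu_memory_fraction"]
    # Small tolerance so float rounding does not trigger a false rejection.
    over = {gpu: total for gpu, total in budget.items() if total > limit + 1e-9}
    if over:
        raise ValueError(f"GPU memory over-subscribed: {over}")
    return dict(budget)


# Assumed fractions for three stages sharing GPU 0.
stages = [
    {"name": "encoder", "gpu": 0, "total_gpu_memory_fraction": 0.10},
    {"name": "ar_engine", "gpu": 0, "total_gpu_memory_fraction": 0.70},
    {"name": "vocoder", "gpu": 0, "total_gpu_memory_fraction": 0.10},
]
budget = validate_placement(stages)
```

Running the check before startup turns an out-of-memory crash under load into a configuration error at launch.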
Control messages travel a ZMQ/msgpack control plane; tensors move on a relay data plane (shm, nccl, nixl, mooncake), with CUDA IPC for same-GPU chunks.
Follow the green pulse: four stations, four different bottlenecks, four different cures.
Tokenizes text, loads the reference audio. Negligible FLOPs, but every stage hop costs a queue and a serialization boundary.
text + ref audio → tensors
HiggsAudioCodec compresses the reference waveform into discrete tokens: the voice-cloning prompt. 50–100ms, deterministic per input.
waveform → [T, 8] @ 25fps
Qwen3-4B backbone emits 8 codebook tokens per step under the delay pattern. 400–800 steps per 10s of speech.
step-by-step → [B, 8, V]
The DAC decoder inverts tokens back into waveform. ~10ms per call, but every AR loop converges here.
tokens → waveform @ 25fps
Watch the cursor: at each step, codebook i only starts emitting i steps after codebook 0 (green: pitch and intonation). The staircase repeats forever.
It defines a four-phase per-request state machine — delay → active → wind-down → finished — that the encoder must emit, the AR engine must advance every step, and the vocoder must invert. Each deep dive below collides with it.
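The staircase and its inverse can be sketched on plain lists. This is an illustration of the delay pattern as described above, not the project's code: `apply_delay`, `undo_delay`, and the `PAD` filler value are hypothetical, and the example uses 3 codebooks instead of 8 to keep the output readable.

```python
PAD = -1  # hypothetical filler for cells that hold no real token


def apply_delay(codes, pad=PAD):
    """codes: T frames x K codebooks -> T + K - 1 steps x K codebooks.
    Codebook k is shifted k steps later than codebook 0."""
    T, K = len(codes), len(codes[0])
    steps = [[pad] * K for _ in range(T + K - 1)]
    for t in range(T):
        for k in range(K):
            steps[t + k][k] = codes[t][k]
    return steps


def undo_delay(steps):
    """Inverse shift: realign every codebook to its original frame."""
    K = len(steps[0])
    T = len(steps) - K + 1
    return [[steps[t + k][k] for k in range(K)] for t in range(T)]


# 4 frames, 3 codebooks; token value 10*t + k makes the shift visible.
codes = [[10 * t + k for k in range(3)] for t in range(4)]
delayed = apply_delay(codes)
# delayed[0] == [0, -1, -1]    ramp-up: only codebook 0 emits
# delayed[2] == [20, 11, 2]    all codebooks active
# delayed[5] == [-1, -1, 32]   wind-down: only the last codebook emits
restored = undo_delay(delayed)
```

The first and last K − 1 steps contain filler cells, which is why the ramp-up and wind-down phases need separate handling from the fully active middle.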
PR numbers link to the public roadmap, sglang-omni #478: "low latency and high throughput with no model performance reduction."
Tokenize text, load and decode the reference audio. Pure CPU + I/O work — it needs none of the GPU scheduler's machinery (batching, KV cache, CUDA graphs).
So it runs under a CPU-side ThreadedSimpleScheduler — a thread pool (max_concurrency) that preprocesses several requests in parallel while the GPU stages work on earlier ones.
The thread pool preprocesses requests in parallel; while the GPU serves request 1, requests 2–4 are already prepped — the GPU never waits on I/O.
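A rough sketch of thread-pool preprocessing with Python's standard library. It is not the ThreadedSimpleScheduler itself: `preprocess` is a stand-in for tokenization plus audio loading, and only the `max_concurrency` name is taken from the text.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(request):
    """Stand-in for tokenization and reference-audio loading (CPU + I/O)."""
    return {"id": request["id"], "tokens": request["text"].split()}


def preprocess_all(requests, max_concurrency=4):
    """Preprocess several requests in parallel on worker threads."""
    with ThreadPoolExecutor(max_workers=max_concurrency) as pool:
        # map() returns results in input order even if workers finish out of order.
        return list(pool.map(preprocess, requests))


requests = [{"id": i, "text": f"hello world {i}"} for i in range(4)]
prepped = preprocess_all(requests, max_concurrency=2)
```

Threads suit this stage because file reads and audio decoding spend most of their time outside the Python interpreter lock.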
The encoder is expensive (50–100ms), deterministic, and repetitive — users reuse a handful of voices across thousands of prompts. Most encoder work is recomputation of known answers, so: cache it.
Seen this voice before? Return the precomputed tokens — 50–100ms saved per hit, and the GPU stays free for AR decode.
Request 2 reuses voice A: the 80ms encode collapses to a cache lookup. Voice B misses and pays full price.
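A minimal LRU sketch for the encoder output, keyed by a hash of the reference audio bytes. The class name, the key choice, and the default capacity are assumptions made for illustration; the text above only states that an LRU cache sits in front of the encoder.

```python
import hashlib
from collections import OrderedDict


class VoiceTokenCache:
    """LRU cache: reference audio bytes -> encoder tokens."""

    def __init__(self, encode_fn, capacity=128):
        self.encode_fn = encode_fn      # the expensive encoder call
        self.capacity = capacity
        self._entries = OrderedDict()   # oldest entry first
        self.hits = 0
        self.misses = 0

    def get(self, audio_bytes):
        key = hashlib.sha256(audio_bytes).hexdigest()
        if key in self._entries:
            self._entries.move_to_end(key)  # mark as most recently used
            self.hits += 1
            return self._entries[key]
        self.misses += 1
        tokens = self.encode_fn(audio_bytes)
        self._entries[key] = tokens
        if len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return tokens


encode_calls = []


def fake_encode(audio_bytes):
    """Stand-in encoder that records each call."""
    encode_calls.append(audio_bytes)
    return [len(audio_bytes)]


cache = VoiceTokenCache(fake_encode, capacity=2)
for voice in (b"voice-A", b"voice-B", b"voice-A"):
    cache.get(voice)
```

Caching is safe here because the encoder is deterministic per input: the same audio bytes always produce the same tokens.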
Static shapes, a convolutional stack, no branching: the textbook torch.compile workload. TorchInductor fuses the small kernels into fewer, larger ones.
Same math, fewer launches, earlier finish. On a cache miss this stacks with the LRU above.
Each step: backbone forward → fused 8-head projection [B, 8, V] → sample → advance the delay state machine → report to CPU.
At 400–800 steps per request the enemy is per-step fixed cost, not FLOPs: ~0.1ms of launch + sync overhead per step adds up to 40–80ms wasted per request.
KV lives in fixed-size pages: no fragmentation, no over-reservation. That's what lets 16 AR loops share one H100 with the encoder and vocoder. Inherited for free by reusing OmniScheduler.
A finishes, D needs 10 blocks: a contiguous allocator rejects D (enough blocks are free, but they are fragmented); a paged allocator scatters D across whatever is free.
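The A/D scenario can be reproduced with a toy block map. This is a sketch of the two allocation strategies in general, not SGLang's allocator; the pool size, the block counts for A, B, and C, and the helper names are assumptions.

```python
def alloc_contiguous(owner, name, n):
    """Claim the first run of n adjacent free blocks; return False if none exists."""
    run = 0
    for i, o in enumerate(owner):
        run = run + 1 if o is None else 0
        if run == n:
            for j in range(i - n + 1, i + 1):
                owner[j] = name
            return True
    return False


def alloc_paged(owner, name, n):
    """Claim any n free blocks, adjacent or not; return their indices or None."""
    free = [i for i, o in enumerate(owner) if o is None]
    if len(free) < n:
        return None
    pages = free[:n]
    for i in pages:
        owner[i] = name
    return pages


# 24 blocks: A holds 0-5, B holds 6-13, C holds 14-19, 20-23 are free.
pool = ["A"] * 6 + ["B"] * 8 + ["C"] * 6 + [None] * 4
pool = [None if o == "A" else o for o in pool]  # A finishes: 10 blocks free in total

contiguous_ok = alloc_contiguous(list(pool), "D", 10)  # False: longest free run is 6
paged_pages = alloc_paged(list(pool), "D", 10)         # blocks 0-5 and 20-23
```

The paged version needs a per-request page table to map logical positions to physical blocks, which is the price of ignoring adjacency.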
RadixAttention keeps prior KV in a prefix tree; in TTS the shared prefix is the reference voice, namespaced via extra_key. It stacks with the encoder LRU: a repeated voice skips both the encode and the prefill.
Same shape as the LRU figure — that's the point: the two caches stack.
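A simplified view of prefix reuse with a namespace key. The sketch uses a plain token trie, whereas RadixAttention uses a compressed radix tree over KV pages with eviction. Only the name `extra_key` is taken from the text; the way it is used here, and the `PrefixCache` class, are assumptions.

```python
class PrefixCache:
    """Token-level trie with one root per namespace (extra_key)."""

    def __init__(self):
        self._roots = {}

    def insert(self, tokens, extra_key=None):
        """Record that KV for this token sequence is now cached."""
        node = self._roots.setdefault(extra_key, {})
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match(self, tokens, extra_key=None):
        """Return how many leading tokens already have cached KV."""
        node = self._roots.get(extra_key, {})
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


prefix_cache = PrefixCache()
voice_prefix = [1, 2, 3]  # stands in for the reference-voice prompt tokens
prefix_cache.insert(voice_prefix + [9, 9], extra_key="voice-A")

same_voice = prefix_cache.match(voice_prefix + [7], extra_key="voice-A")   # 3
other_voice = prefix_cache.match(voice_prefix + [7], extra_key="voice-B")  # 0
```

Only the tokens after the matched prefix need a prefill pass, so a repeated voice pays for the new text alone.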
Each kernel launch costs the CPU 5–10µs; a step of dozens of tiny kernels leaves the GPU starving between them. A CUDA graph records the sequence once and replays it as a single launch.
The recording is frozen — so the delay-pattern state machine was rewritten as branchless, in-place tensor ops on fixed buffers: control flow turned into data flow. (#503)
Every step ends with a heavy host-side collect: sampler-state scatter, delay-pattern bookkeeping, EOC/finish handling, and a D2H of the codes snapshot — in Python, every step. The GPU used to sit idle through all of it.
The fix splits execute() in two: launch (forward + on-GPU sample + async D2H into a pinned ping-pong buffer + CUDA event) and resolve (the host collect). The loop runs launch(N) then resolve(N−1), so step N−1's CPU collect hides under step N's forward.
Output is bit-identical with the overlap on or off; batch size 1 takes a synchronous fast path (break-even by design, since there is nothing to overlap). On the full Seed-TTS EN set at concurrency 16: throughput +12.7%, mean latency −16.1%, RTF p99 −39%. The win grows with batch size and tail length.
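The ordering of the split loop can be sketched without a GPU. Here `launch` and `resolve` are stand-in callables that only record when they run; in the real system launch would enqueue GPU work and resolve would wait on the CUDA event before reading the pinned buffer. The function names and the handle format are hypothetical.

```python
def run_overlapped(num_steps, launch, resolve):
    """Software-pipelined loop: launch(N), then resolve(N-1)."""
    pending = None
    for step in range(num_steps):
        # Two buffers alternate so step N never overwrites the slot
        # that step N-1 has not been resolved from yet.
        handle = launch(step, step % 2)
        if pending is not None:
            resolve(pending)  # host collect for the previous step
        pending = handle
    if pending is not None:
        resolve(pending)      # drain the final step


trace = []


def fake_launch(step, slot):
    trace.append(f"launch {step} -> buffer {slot}")
    return (step, slot)


def fake_resolve(handle):
    step, slot = handle
    trace.append(f"resolve {step} <- buffer {slot}")


run_overlapped(3, fake_launch, fake_resolve)
```

The pattern only works because the next step's input is sampled on the GPU: launch(N) must not depend on anything resolve(N−1) computes on the host.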
Queueing: every AR loop converges here — 16 requests decoded serially leave the last waiting 240ms.
Streaming: audio can't be chunked like text — the codec carries state across time (chunks click at every seam), and mid-stream delay rows are incomplete (decoding them injects noise).
Collect requests in a 2ms window, zero-pad, decode as one decode_batch() call, trim to true length.
Serial: the last request queues behind five others. Batched: everyone leaves in one GPU call.
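The pad, decode, trim sequence can be sketched on plain lists. The 2ms collection window is not modeled, `fake_decode_batch` is a stand-in that is not the real `decode_batch()`, and `SAMPLES_PER_FRAME` is an arbitrary illustrative value.

```python
SAMPLES_PER_FRAME = 2  # illustrative; a real codec emits far more samples per frame
decoder_calls = []


def fake_decode_batch(padded):
    """Stand-in decoder: one call over a rectangular batch of token rows."""
    decoder_calls.append(len(padded))
    return [[tok for tok in row for _ in range(SAMPLES_PER_FRAME)] for row in padded]


def batched_decode(token_seqs, decode_batch_fn, pad_token=0):
    """Zero-pad to a common length, decode once, trim each result to true length."""
    max_len = max(len(seq) for seq in token_seqs)
    padded = [seq + [pad_token] * (max_len - len(seq)) for seq in token_seqs]
    waves = decode_batch_fn(padded)  # single call for the whole batch
    return [
        wave[: len(seq) * SAMPLES_PER_FRAME]
        for seq, wave in zip(token_seqs, waves)
    ]


waves = batched_decode([[1, 2, 3], [4]], fake_decode_batch)
# waves == [[1, 1, 2, 2, 3, 3], [4, 4]]
```

Trimming matters because the padded positions decode to real samples, which would otherwise be appended to the shorter requests as spurious audio.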
Measured against api.boson.ai: first audio at ~0.7s streaming vs ~1.5s non-streaming. Same generation cost — the difference is when sound starts.
Cyan leading edge = crossfade overlap; dashed cells = held-back tail. Real parameters 75 / 8 / 4, scaled for display.
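For intuition about the crossfade overlap, here is a generic linear crossfade between consecutive audio chunks. It is a textbook technique shown under assumptions, not the vocoder scheduler's implementation: it does not model the held-back tail, does not interpret the 75 / 8 / 4 parameters, and `crossfade` and `stitch` are hypothetical names.

```python
def crossfade(prev_tail, next_head):
    """Blend two equal-length sample runs with linearly changing weights."""
    n = len(prev_tail)
    if len(next_head) != n:
        raise ValueError("overlap regions must have equal length")
    out = []
    for i in range(n):
        w = (i + 1) / (n + 1)  # weight of the new chunk rises across the overlap
        out.append((1 - w) * prev_tail[i] + w * next_head[i])
    return out


def stitch(chunks, overlap):
    """Join chunks whose first `overlap` samples repeat the previous chunk's end."""
    if overlap <= 0:
        raise ValueError("overlap must be positive")
    out = list(chunks[0])
    for chunk in chunks[1:]:
        tail = out[-overlap:]
        del out[-overlap:]
        out.extend(crossfade(tail, chunk[:overlap]))
        out.extend(chunk[overlap:])
    return out


fade = crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])  # [0.75, 0.5, 0.25]
joined = stitch([[1.0] * 5, [1.0] * 5], overlap=3)  # 7 samples, no seam
```

Blending across the overlap removes the discontinuity at the seam, which is what would otherwise be heard as a click.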
Record your voice as the reference audio, type a sentence, and race non-streaming against streaming. Calls go directly from your browser to api.boson.ai.
Full Seed-TTS EN set (N=1088, mean of 3 runs), bf16 + CUDA graph, max_running_requests=16. RTF < 1 means faster than real time.
| Concurrency | Throughput (req/s) | Avg latency | RTF per request | Audio throughput (audio s per wall-clock s) |
|---|---|---|---|---|
| 1 | 1.62 | 617 ms | 0.147 | 6.89 |
| 2 | 2.70 | 742 ms | 0.180 | 11.37 |
| 4 | 5.45 | 733 ms | 0.177 | 22.84 |
| 8 | 8.91 | 898 ms | 0.217 | 37.38 |
| 16 | 14.74 | 1079 ms | 0.262 | 61.84 |
// 9× the audio throughput from 1→16 concurrency while per-request latency grows only 1.7× — the multi-stage async pipeline absorbing load.
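Both headline ratios follow from the first and last table rows. A quick check using only the numbers in the table (the variable names are for illustration):

```python
# concurrency: (req/s, avg latency in ms, RTF per request, audio s per wall-clock s)
rows = {
    1: (1.62, 617, 0.147, 6.89),
    2: (2.70, 742, 0.180, 11.37),
    4: (5.45, 733, 0.177, 22.84),
    8: (8.91, 898, 0.217, 37.38),
    16: (14.74, 1079, 0.262, 61.84),
}

audio_gain = rows[16][3] / rows[1][3]      # 61.84 / 6.89, about 9x
latency_growth = rows[16][1] / rows[1][1]  # 1079 / 617, about 1.7x

# Audio produced per request: audio throughput divided by request throughput.
audio_per_request = {c: r[3] / r[0] for c, r in rows.items()}
```

Audio per request stays near 4.2 s at every concurrency level, so the throughput gain comes from serving more requests at once, not from a change in utterance length.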
// Plus inline control tags: 20+ emotions, styles (singing, whispering, shouting), prosody (speed, pitch, pauses), and sound events (laughter, sigh, cough) — all composable.
"The best system optimizations come not from blindly applying popular techniques,
but from deep theoretical understanding, the desire to build elegant systems,
and clear-eyed engineering trade-offs."