Boson AI · TTS Inference Optimization · 2026

The Microsecond War: Making Speech Faster Than Sound

How Higgs Audio v3 TTS is served on SGLang-Omni. TTS inference looks like LLM inference but breaks every one of its assumptions — so the system was rebuilt as a multi-stage asynchronous pipeline, then each stage was attacked with its own weapon: LRU caching, CUDA graphs, radix caching, and windowed streaming vocoding. Every diagram on this page is animated — and there's a live voice-cloning demo at the end.

0.147 RTF · PER-REQUEST · FAR FASTER THAN REAL-TIME
61.8 audio s/s · SINGLE H100 · 16-WAY CONCURRENCY
~350ms · STREAMING TTFB · UNDER THE 500ms BAR
111 languages · SINGLE-DIGIT WER/CER ACROSS LANGUAGES
01

SGLang vs SGLang-Omni: TTS Is Not an LLM

Same Qwen3 backbone inside — but wrapped in a neural codec, a multi-codebook delay schedule, and a waveform decoder. A completely different serving problem.

SGLANG · LLM INFERENCE

One model, one pattern

Diagram: text in → TRANSFORMER → tok tok tok tok → user, directly
  • Single stage. Prefill + decode on one transformer; one scheduler owns the GPU.
  • The token is the product — a sampled token goes straight to the user.
  • Chunking is free. Stream by flushing tokens as they sample.
  • One head, one vocabulary — one token per step.
  • Latency is dominated by the backbone forward pass.
SGLANG-OMNI · TTS INFERENCE

A pipeline, not a model

Diagram: text + reference voice in → ENCODER → AR ×8 HEADS → VOCODER → waveform out
  • Four heterogeneous stages — different compute, latency, and memory behavior.
  • The token is an intermediate — worthless until a second GPU model decodes it.
  • Chunking is dangerous — naive splits click audibly at every seam.
  • Eight codebooks per step, offset by a delay pattern — a state machine.
  • Latency is dominated by per-step overhead × 400–800 steps.
The consequence: bolting an encoder and vocoder onto an LLM server's request loop serializes everything. The fix is architectural: first-class stages, run asynchronously.
02

Why Multi-Stage Async Wins

Each stage becomes an independent process with its own scheduler. Same workload, two designs — the playhead is wall-clock time.

✕ MONOLITHIC / SYNCHRONOUS — B cannot start until A's whole pipeline drains
✓ MULTI-STAGE ASYNC — encoder, AR engine and vocoder each work on a different request
Legend: encoder · AR decode · vocoder · not yet run

Same three requests, same per-stage costs. The async pipeline finishes all three before the synchronous design finishes two: the stage overlap is recovered wall-clock time.
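The overlap can be checked with a small scheduling sketch. The stage costs below (3, 4 and 3 time units) are made up for illustration, and each stage is modeled as serving one request at a time, which is simpler than the real AR stage with continuous batching.

```python
from typing import List

STAGE_COSTS = [3, 4, 3]  # hypothetical costs: encoder, AR decode, vocoder
NUM_REQUESTS = 3


def synchronous_finish_times(costs: List[int], n: int) -> List[int]:
    """Each request runs all stages before the next request may start."""
    per_request = sum(costs)
    return [per_request * (i + 1) for i in range(n)]


def pipelined_finish_times(costs: List[int], n: int) -> List[int]:
    """A stage starts a request once the request has left the previous
    stage and the stage itself has finished the previous request."""
    stage_free_at = [0] * len(costs)
    finish = []
    for _ in range(n):
        t = 0  # time at which this request is ready for the next stage
        for s, cost in enumerate(costs):
            start = max(t, stage_free_at[s])
            t = start + cost
            stage_free_at[s] = t
        finish.append(t)
    return finish


sync_times = synchronous_finish_times(STAGE_COSTS, NUM_REQUESTS)
async_times = pipelined_finish_times(STAGE_COSTS, NUM_REQUESTS)
print("synchronous:", sync_times)   # [10, 20, 30]
print("pipelined:  ", async_times)  # [10, 14, 18]
```

With these assumed costs the pipelined design finishes the third request at t=18, before the synchronous design finishes its second at t=20. The result depends on how evenly the stage costs are balanced.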

BENEFIT / 01

Right scheduler per stage

The AR stage runs OmniScheduler with SGLang's continuous batching, KV cache and CUDA graphs; preprocessing gets a ThreadedSimpleScheduler thread pool; the encoder a single-threaded SimpleScheduler; the vocoder a StreamingVocoderScheduler.

BENEFIT / 02

Memory as a contract

Each stage on a GPU declares total_gpu_memory_fraction; placement validation sums the budget per GPU before startup. Encoder, AR engine, and vocoder coexist on one H100 without starving each other.
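A minimal sketch of what such a placement check could look like. Only the field name total_gpu_memory_fraction comes from the text; the stage dictionaries, the fractions, and the validate_placement helper are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical stage declarations; only the field name
# total_gpu_memory_fraction comes from the text above.
STAGES: List[dict] = [
    {"name": "encoder", "gpu": 0, "total_gpu_memory_fraction": 0.10},
    {"name": "ar_engine", "gpu": 0, "total_gpu_memory_fraction": 0.70},
    {"name": "vocoder", "gpu": 0, "total_gpu_memory_fraction": 0.10},
]


def validate_placement(stages: List[dict], limit: float = 1.0) -> Dict[int, float]:
    """Sum the declared fractions per GPU and fail before startup if any
    GPU is over-committed."""
    budget: Dict[int, float] = defaultdict(float)
    for stage in stages:
        budget[stage["gpu"]] += stage["total_gpu_memory_fraction"]
    for gpu, total in budget.items():
        if total > limit + 1e-9:  # small tolerance for float rounding
            raise ValueError(f"GPU {gpu} over-committed: {total:.2f} > {limit:.2f}")
    return dict(budget)


print(validate_placement(STAGES))
```

Failing at startup rather than at the first out-of-memory error is what turns the fraction into a contract between stages.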

BENEFIT / 03

Streaming-aware data movement

Control messages travel a ZMQ/msgpack control plane; tensors move on a relay data plane (shm, nccl, nixl, mooncake), with CUDA IPC for same-GPU chunks.

03

The Four Stages — and What Each One Got

Follow the green pulse: four stations, four different bottlenecks, four different cures.

CPU · I/O BOUND

1 · Preprocessing

Tokenizes text, loads the reference audio. Negligible FLOPs — but every stage hop costs a queue and a serialization boundary.

text + ref audio → tensors
ThreadedSimpleScheduler
GPU · STATIC COMPUTE

2 · Audio Encoder

HiggsAudioCodec compresses the reference waveform into discrete tokens — the voice-cloning prompt. 50–100ms, deterministic per input.

waveform → [T, 8] @ 25fps
LRU cache · batched encoder · torch.compile
GPU · AUTOREGRESSIVE

3 · AR Engine

Qwen3-4B backbone emits 8 codebook tokens per step under the delay pattern. 400–800 steps per 10s of speech.

step-by-step → [B, 8, V]
CUDA graph · radix cache · paged attention · async decode
GPU · FAST BUT QUEUE-PRONE

4 · Vocoder

The DAC decoder inverts tokens back into waveform. ~10ms per call, but every AR loop converges here.

tokens → waveform @ 25fps
batched decode · windowed streaming

One structure rules them all: the delay pattern

Watch the cursor: codebook i starts emitting i steps after codebook 0 (green; pitch and intonation). The staircase repeats forever.

It defines a four-phase per-request state machine — delay → active → wind-down → finished — that the encoder must emit, the AR engine must advance every step, and the vocoder must invert. Each deep dive below collides with it.
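The staircase can be sketched as a shift of each codebook column. This is an illustration, not the project's code: the PAD value and the function names are assumptions, and plain Python lists stand in for tensors.

```python
from typing import List

NUM_CODEBOOKS = 8
PAD = -1  # hypothetical placeholder for "not emitting yet / already done"


def apply_delay(codes: List[List[int]]) -> List[List[int]]:
    """codes is [T][8]; returns [T + 7][8] where codebook k is shifted
    k steps later than codebook 0."""
    t_len = len(codes)
    rows = t_len + NUM_CODEBOOKS - 1
    delayed = [[PAD] * NUM_CODEBOOKS for _ in range(rows)]
    for t in range(t_len):
        for k in range(NUM_CODEBOOKS):
            delayed[t + k][k] = codes[t][k]
    return delayed


def invert_delay(delayed: List[List[int]]) -> List[List[int]]:
    """Undo apply_delay: frame t reads codebook k from row t + k."""
    t_len = len(delayed) - (NUM_CODEBOOKS - 1)
    return [[delayed[t + k][k] for k in range(NUM_CODEBOOKS)] for t in range(t_len)]


frames = [[10 * t + k for k in range(NUM_CODEBOOKS)] for t in range(3)]
staircase = apply_delay(frames)
for row in staircase:
    print(row)
assert invert_delay(staircase) == frames
```

In the printed rows, the leading PAD entries correspond to the delay phase and the trailing ones to wind-down. A complete frame t only exists once row t + 7 has arrived, which is why the vocoder cannot decode the newest rows mid-stream.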

Legend: codebook 0 · active · waiting on delay
04

Deep Dives, Stage by Stage

PR numbers link to the public roadmap, sglang-omni #478: "low latency and high throughput with no model performance reduction."

A

Preprocessing — CPU work gets a CPU scheduler

CPU · tokenization + reference-audio I/O · ThreadedSimpleScheduler

Tokenize text, load and decode the reference audio. Pure CPU + I/O work — it needs none of the GPU scheduler's machinery (batching, KV cache, CUDA graphs).

So it runs under a CPU-side ThreadedSimpleScheduler — a thread pool (max_concurrency) that preprocesses several requests in parallel while the GPU stages work on earlier ones.

work  text → token ids · audio file → waveform tensor
cost  I/O-bound, negligible FLOPs
scheduler  ThreadedSimpleScheduler — CPU thread pool, no GPU machinery
SCHED · Pipelined off the GPU's critical path

The thread pool preprocesses requests in parallel; while the GPU serves request 1, requests 2–4 are already prepped — the GPU never waits on I/O.
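A minimal sketch of the idea using Python's standard thread pool. It does not use ThreadedSimpleScheduler itself; the preprocess body is a placeholder, and MAX_CONCURRENCY only mirrors the max_concurrency knob named above.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple

MAX_CONCURRENCY = 4  # mirrors the max_concurrency knob named above


def preprocess(request: Tuple[str, bytes]) -> dict:
    """Stand-in for tokenization plus reference-audio decoding."""
    text, audio_bytes = request
    token_ids = [ord(ch) for ch in text]         # placeholder tokenizer
    waveform = [b / 255.0 for b in audio_bytes]  # placeholder audio decode
    return {"token_ids": token_ids, "waveform": waveform}


def preprocess_all(requests: List[Tuple[str, bytes]]) -> List[dict]:
    # map() preserves request order even though work runs concurrently.
    with ThreadPoolExecutor(max_workers=MAX_CONCURRENCY) as pool:
        return list(pool.map(preprocess, requests))


prepared = preprocess_all([("hi", b"\x00\xff"), ("ok", b"\x80")])
print(prepared[0]["token_ids"])  # [104, 105]
```

Threads suit this stage because file reads release the interpreter lock while they wait on I/O.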

B

Encoder — the fastest compute is no compute

LRU cache · batched encoding · torch.compile

The encoder is expensive (50–100ms), deterministic, and repetitive — users reuse a handful of voices across thousands of prompts. Most encoder work is recomputation of known answers, so: cache it.

arch  DAC-style codec, weights shared with vocoder
out  [T, 8] tokens → rearranged into delay pattern
key fact  deterministic per input → cacheable
LRU · LRU cache keyed by reference audio

Seen this voice before? Return the precomputed tokens — 50–100ms saved per hit, and the GPU stays free for AR decode.

Request 2 reuses voice A: the 80ms encode collapses to a cache lookup. Voice B misses and pays full price.
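The hit-and-miss behavior can be sketched with an ordered dictionary. The class name, the capacity, and the choice of a SHA-256 content hash as the cache key are assumptions; the source says only that the cache is keyed by reference audio.

```python
import hashlib
from collections import OrderedDict
from typing import Callable, List


class EncoderLRU:
    """Least-recently-used cache from reference-audio bytes to codec tokens."""

    def __init__(self, encode: Callable[[bytes], List[int]], capacity: int = 2):
        self._encode = encode
        self._capacity = capacity
        self._entries: "OrderedDict[str, List[int]]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, audio: bytes) -> List[int]:
        key = hashlib.sha256(audio).hexdigest()  # assumed key: content hash
        if key in self._entries:
            self.hits += 1
            self._entries.move_to_end(key)  # mark as most recently used
            return self._entries[key]
        self.misses += 1
        tokens = self._encode(audio)  # the expensive 50-100 ms path
        self._entries[key] = tokens
        if len(self._entries) > self._capacity:
            self._entries.popitem(last=False)  # evict least recently used
        return tokens


def fake_encode(audio: bytes) -> List[int]:
    return list(audio)  # stand-in for the real codec encoder


cache = EncoderLRU(fake_encode)
cache.get(b"voice-A")  # miss
cache.get(b"voice-A")  # hit
cache.get(b"voice-B")  # miss
print(cache.hits, cache.misses)  # 1 2
```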

COMPILE · torch.compile on the encoder

Static shapes, a convolutional stack, no branching — the textbook compile workload. Inductor fuses the small kernels into fewer, larger ones.

Same math, fewer launches, earlier finish. On a cache miss this stacks with the LRU above.

C

AR Engine — paged attention, radix cache, CUDA graphs

Qwen3-4B backbone · fused 8-codebook head · 400–800 steps per 10s of speech

Each step: backbone forward → fused 8-head projection [B, 8, V] → sample → advance the delay state machine → report to CPU.

At 400–800 steps per request the enemy is per-step fixed cost, not FLOPs: ~0.1ms of launch + sync overhead per step adds up to ~80ms wasted per request at 800 steps.

per-step  forward · 8-head projection · sample · D2H
state machine  delay → active → wind-down → finished
enemy  launch + sync overhead × ~600 steps
PAGED · Paged attention for free

KV lives in fixed-size pages: no external fragmentation, no over-reservation. That's what lets 16 AR loops share one H100 with the encoder and vocoder. Inherited for free by reusing OmniScheduler.

Figure panels: contiguous · paged

A finishes, D needs 10 blocks: contiguous rejects (free but fragmented); paged scatters D across whatever is free.
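The scenario in the caption can be reproduced with a toy block allocator. The pool size and the block layout of requests A, B and C are invented for illustration; a real paged KV cache also tracks a per-request page table.

```python
from typing import List, Optional

TOTAL_BLOCKS = 24


def contiguous_alloc(used: List[bool], need: int) -> Optional[List[int]]:
    """Return the first run of `need` adjacent free blocks, else None."""
    run: List[int] = []
    for i, busy in enumerate(used):
        run = [] if busy else run + [i]
        if len(run) == need:
            return run
    return None


def paged_alloc(used: List[bool], need: int) -> Optional[List[int]]:
    """Return any `need` free blocks, adjacent or not, else None."""
    free = [i for i, busy in enumerate(used) if not busy]
    return free[:need] if len(free) >= need else None


# Hypothetical layout: A = blocks 0-5, B = 6-11, C = 14-19; the rest is free.
used = [False] * TOTAL_BLOCKS
for i in list(range(0, 12)) + list(range(14, 20)):
    used[i] = True
for i in range(0, 6):  # request A finishes and frees its blocks
    used[i] = False

print(contiguous_alloc(used, 10))  # None: 12 blocks free, longest run is 6
print(paged_alloc(used, 10))       # [0, 1, 2, 3, 4, 5, 12, 13, 20, 21]
```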

RADIX · Radix cache, partitioned by voice

RadixAttention keeps prior KV in a prefix tree; in TTS the shared prefix is the reference voice, namespaced via extra_key. It stacks with the encoder LRU: a repeated voice skips both the encode and the prefill.

Same shape as the LRU figure — that's the point: the two caches stack.
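A toy prefix tree shows how namespacing by voice could work. This is not SGLang's RadixAttention, which stores KV pages on compressed edges and handles eviction; the PrefixCache class and the token values are hypothetical, and only the extra_key name comes from the text.

```python
from typing import Dict, List


class PrefixCache:
    """Toy prefix tree: one trie per namespace, nodes keyed by token id."""

    def __init__(self) -> None:
        self._roots: Dict[str, dict] = {}

    def insert(self, extra_key: str, tokens: List[int]) -> None:
        node = self._roots.setdefault(extra_key, {})
        for tok in tokens:
            node = node.setdefault(tok, {})

    def match_prefix(self, extra_key: str, tokens: List[int]) -> int:
        """Length of the longest cached prefix inside this namespace."""
        node = self._roots.get(extra_key, {})
        matched = 0
        for tok in tokens:
            if tok not in node:
                break
            node = node[tok]
            matched += 1
        return matched


prefix_cache = PrefixCache()
voice_a_prompt = [1, 2, 3, 4]  # stand-in for reference-voice tokens
prefix_cache.insert("voice-A", voice_a_prompt + [50, 51])

request = voice_a_prompt + [60, 61]  # same voice, new text
print(prefix_cache.match_prefix("voice-A", request))  # 4: prefill only the new text
print(prefix_cache.match_prefix("voice-B", request))  # 0: other namespace, no reuse
```

Keeping one tree per key means identical token prefixes under different voices never share KV entries.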

CONCEPT · What a CUDA graph actually is

Each kernel launch costs the CPU 5–10µs; a step of dozens of tiny kernels leaves the GPU starving between them. A CUDA graph records the sequence once and replays it as a single launch:

✕ EAGER — CPU launches every kernel; GPU starves between launches
✓ CUDA GRAPH — one launch, kernels packed back-to-back

The recording is frozen — so the delay-pattern state machine was rewritten as branchless, in-place tensor ops on fixed buffers: control flow turned into data flow. (#503)
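The "control flow turned into data flow" idea can be shown in a few lines. This is a pure-Python illustration rather than the PR's code: the real version would presumably use masked tensor ops on preallocated buffers, and the advance function, the PAD value, and a frame count known in advance are all assumptions.

```python
from typing import List

NUM_CODEBOOKS = 8
PAD = 0  # hypothetical fill token for codebooks that are not emitting


def advance(sampled: List[int], step: int, num_frames: int) -> List[int]:
    """One decode step for one request, written without if/else.

    Codebook k is live while k <= step < num_frames + k. Each comparison
    yields 0 or 1, and the output is selected by arithmetic, so the same
    fixed sequence of operations runs on every step.
    """
    out = []
    for k in range(NUM_CODEBOOKS):  # fixed trip count, no data-dependent branch
        live = int(step >= k) * int(step < num_frames + k)
        out.append(live * sampled[k] + (1 - live) * PAD)
    return out


sampled = [7] * NUM_CODEBOOKS
print(advance(sampled, step=0, num_frames=3))  # only codebook 0 is live
print(advance(sampled, step=9, num_frames=3))  # only codebook 7 is live
```

Because the phase is encoded in the mask rather than in a branch, a recorded sequence can be replayed unchanged while the request moves from delay to active to wind-down.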

ASYNC · Async decode: launch-first one-step lookahead

Every step ends with a heavy host-side collect: sampler-state scatter, delay-pattern bookkeeping, EOC/finish handling, and a D2H of the codes snapshot — in Python, every step. The GPU used to sit idle through all of it.

The fix splits execute() in two: launch (forward + on-GPU sample + async D2H into a pinned ping-pong buffer + CUDA event) and resolve (the host collect). The loop runs launch(N) then resolve(N−1) — step N−1's CPU collect hides under step N's forward:

✕ SYNC — GPU idles through every step's host collect
✓ ASYNC LOOKAHEAD — forwards run back-to-back; collect(N−1) overlaps forward(N)

Output is bit-identical with async decode on or off. At batch size 1 the loop takes a synchronous fast path, which is break-even by design because there is nothing to overlap. On the full Seed-TTS EN set at concurrency 16: throughput +12.7%, mean latency −16.1%, RTF p99 −39%. The win grows with batch size and tail length.
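The loop shape can be sketched with a single worker thread standing in for the GPU stream. Everything here is illustrative: forward_and_sample is a placeholder, the future plays the role of the CUDA event, and the step-to-step dependency that stays on the device in the real system is not modeled.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import List, Optional

gpu = ThreadPoolExecutor(max_workers=1)  # stand-in for one CUDA stream


def forward_and_sample(step: int) -> int:
    return step * step  # placeholder for forward + on-GPU sampling


def launch(step: int) -> Future:
    """Enqueue the device work and return immediately."""
    return gpu.submit(forward_and_sample, step)


def resolve(pending: Future, outputs: List[int]) -> None:
    """Host-side collect: wait for the step, then do the bookkeeping."""
    outputs.append(pending.result())


def run_sync(num_steps: int) -> List[int]:
    outputs: List[int] = []
    for step in range(num_steps):
        resolve(launch(step), outputs)  # device idles during each collect
    return outputs


def run_lookahead(num_steps: int) -> List[int]:
    outputs: List[int] = []
    previous: Optional[Future] = None
    for step in range(num_steps):
        current = launch(step)          # launch(N) first
        if previous is not None:
            resolve(previous, outputs)  # resolve(N-1) overlaps forward(N)
        previous = current
    if previous is not None:
        resolve(previous, outputs)      # drain the last step
    return outputs


sync_out = run_sync(5)
lookahead_out = run_lookahead(5)
print(sync_out == lookahead_out)  # True
```

Both loops produce the same outputs in the same order; only the point at which the host waits changes.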

D

Vocoder — streaming without the clicks

DAC decoder · ~10ms per call · batched decode + windowed streaming

Queueing: every AR loop converges here — 16 requests decoded serially leave the last waiting 240ms.

Streaming: audio can't be chunked like text — the codec carries state across time (chunks click at every seam), and mid-stream delay rows are incomplete (decoding them injects noise).

stride  75 frames — accumulate, then decode
overlap  8 frames — re-decode boundary + crossfade
hold-back  4 frames — incomplete tail rows wait
floor  ≥ N rows to invert the pattern → TTFB ≈ 300–400ms
BATCH · Batched vocoder decode

Collect requests in a 2ms window, zero-pad, decode as one decode_batch() call, trim to true length.

Serial: the last request queues behind five others. Batched: everyone leaves in one GPU call.
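The pad, decode and trim steps can be sketched as follows. The decode_batch body is a stand-in for the real vocoder call, SAMPLES_PER_FRAME is a made-up small value, and the 2ms collection window is not modeled.

```python
from typing import List

SAMPLES_PER_FRAME = 4  # hypothetical; the real codec emits far more


def decode_batch(padded: List[List[int]]) -> List[List[float]]:
    """Stand-in for one batched vocoder call over equal-length inputs."""
    return [[float(code) for code in row for _ in range(SAMPLES_PER_FRAME)]
            for row in padded]


def batched_decode(requests: List[List[int]], pad: int = 0) -> List[List[float]]:
    max_len = max(len(r) for r in requests)
    padded = [r + [pad] * (max_len - len(r)) for r in requests]  # zero-pad
    waves = decode_batch(padded)                                 # one call
    return [w[: len(r) * SAMPLES_PER_FRAME]                      # trim
            for w, r in zip(waves, requests)]


outs = batched_decode([[1, 2, 3], [4], [5, 6]])
print([len(w) for w in outs])  # [12, 4, 8]
```

Trimming to the true length matters because the padded frames decode to audio that does not belong to the request.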

RACE · Non-streaming vs streaming: same request

Measured against api.boson.ai: first audio at ~0.7s streaming vs ~1.5s non-streaming. Same generation cost — the difference is when sound starts.

STREAM · The window protocol, animated
Legend: codes arriving · decode window · emitted audio

Cyan leading edge = crossfade overlap; dashed cells = held-back tail. Real parameters 75 / 8 / 4, scaled for display.

Irreducible startup: the vocoder needs ≥ N rows to invert the pattern → streaming TTFB ≈ 300–400ms, under the 500ms conversational bar. Verify it in the demo below.
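One way to read the 75 / 8 / 4 parameters is sketched below. The exact window semantics are an assumption, since the source gives only the parameter names and roles, and any special handling of a shorter first chunk is ignored. The ready_windows and crossfade helpers are hypothetical.

```python
from typing import List, Tuple

STRIDE, OVERLAP, HOLD_BACK = 75, 8, 4  # frames, from the parameters above


def ready_windows(frames_arrived: int, final: bool = False) -> List[Tuple[int, int, int]]:
    """Return (decode_start, emit_start, end) for every window that can run.

    Assumed semantics: a window covers one stride of new frames, re-decodes
    OVERLAP frames before it for the crossfade, and waits until HOLD_BACK
    extra frames exist so incomplete tail rows are never decoded. On the
    final flush the hold-back is dropped and the remainder is emitted.
    """
    usable = frames_arrived if final else max(0, frames_arrived - HOLD_BACK)
    windows = []
    emit_start = 0
    while emit_start + STRIDE <= usable:
        end = emit_start + STRIDE
        windows.append((max(0, emit_start - OVERLAP), emit_start, end))
        emit_start = end
    if final and emit_start < usable:
        windows.append((max(0, emit_start - OVERLAP), emit_start, usable))
    return windows


def crossfade(tail: List[float], head: List[float]) -> List[float]:
    """Linear blend of the previous window's tail into the new window's head."""
    n = len(tail)
    return [tail[i] * (1 - (i + 1) / (n + 1)) + head[i] * ((i + 1) / (n + 1))
            for i in range(n)]


print(ready_windows(78))               # []
print(ready_windows(79))               # [(0, 0, 75)]
print(ready_windows(160, final=True))  # [(0, 0, 75), (67, 75, 150), (142, 150, 160)]
print(crossfade([1.0, 1.0, 1.0], [0.0, 0.0, 0.0]))  # [0.75, 0.5, 0.25]
```

Re-decoding the overlap gives the codec context across the seam, and the blend removes the discontinuity that would otherwise be heard as a click.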
05

Live Demo: Clone Your Voice

Record your voice as the reference audio, type a sentence, and race non-streaming against streaming. Calls go directly from your browser to api.boson.ai; see the API docs.

POST /v1/audio/speech · model: higgs-audio-v3-tts
STEP 1 · REFERENCE VOICE · 3–15s of speech works best · skip to use the default voice
STEP 2 · TEXT TO SPEAK
STEP 3 · GENERATE — RACE THE TWO PATHS
NON-STREAMING
STREAMING (24kHz PCM)
// audio is sent to the Boson AI API for synthesis and not stored by this page · mic access stays in your browser
06

Overall Performance: One H100

Full Seed-TTS EN set (N=1088, mean of 3 runs), bf16 + CUDA graph, max_running_requests=16. RTF < 1 means faster than real time.

Concurrency   Throughput (req/s)   Avg latency   RTF / req   Audio (s/s)
1             1.62                 617 ms        0.147       6.89
2             2.70                 742 ms        0.180       11.37
4             5.45                 733 ms        0.177       22.84
8             8.91                 898 ms        0.217       37.38
16            14.74                1079 ms       0.262       61.84

// 9× the audio throughput from 1→16 concurrency while per-request latency grows only 1.7× — the multi-stage async pipeline absorbing load.
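The scaling figures can be rechecked from the table rows. The sketch assumes RTF means generation time divided by audio duration. The implied RTF it prints is a ratio of means, so it only approximates the reported per-request average.

```python
# Rows copied from the table above: concurrency -> (req/s, latency ms, RTF, audio s/s)
RESULTS = {
    1: (1.62, 617, 0.147, 6.89),
    2: (2.70, 742, 0.180, 11.37),
    4: (5.45, 733, 0.177, 22.84),
    8: (8.91, 898, 0.217, 37.38),
    16: (14.74, 1079, 0.262, 61.84),
}

throughput_gain = RESULTS[16][3] / RESULTS[1][3]
latency_growth = RESULTS[16][1] / RESULTS[1][1]
print(f"audio throughput gain: {throughput_gain:.2f}x")  # 8.98x
print(f"latency growth:        {latency_growth:.2f}x")   # 1.75x

for conc, (req_s, latency_ms, rtf, audio_s) in RESULTS.items():
    mean_audio_s = audio_s / req_s  # seconds of audio per request
    implied_rtf = (latency_ms / 1000) / mean_audio_s
    print(conc, round(mean_audio_s, 2), round(implied_rtf, 3), rtf)
```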

AUDIO THROUGHPUT: SECONDS OF AUDIO PRODUCED PER SECOND
concurrency 1: 6.89
concurrency 2: 11.37
concurrency 4: 22.84
concurrency 8: 37.38
concurrency 16: 61.84
RTF 0.147 → 0.262 at 16× the load, still far below real time

AND QUALITY HELD — WER/CER ACROSS PUBLIC BENCHMARKS

1.11 · Seed-TTS · 2 languages
4.41 · CV3 · 9 languages
2.74 · MiniMax-Multilingual · 23 languages
3.61 · Higgs-Multilingual · 111 languages & dialects

// Plus inline control tags: 20+ emotions, styles (singing, whispering, shouting), prosody (speed, pitch, pauses), and sound events (laughter, sigh, cough) — all composable.

"The best system optimizations come not from blindly applying popular techniques,
but from deep theoretical understanding, the desire to build elegant systems,
and clear-eyed engineering trade-offs."