FSDP memory · estimator

Model

preset

hidden size

intermediate

layers

heads

kv heads

vocab

tie embeddings

Training

micro batch

seq len k

precision

optimizer

checkpointing

FlashAttention

checkpoint lm_head

Parallelism & hardware

total GPUs

DP = GPUs / (TP × CP) 8

FSDP

device

Per-GPU peak memory

— GB

—

Where the bytes go

Each component, sized to its share of the per-GPU footprint.

component	GB	%
Total	—	100%

Step-by-step formulas behind each memory component.

Per-GPU peak as you add more workers, holding everything else equal.