Continuous batching improves GPU utilization by grouping decode steps across sessions. But efficiency plateaus and then drops: small batches (1-3) leave compute underutilized; batches of 4-8 maximize throughput; beyond that, utilization flattens as memory bandwidth becomes the bottleneck (around batch 12), and head-of-line blocking sets in at larger sizes (around batch 16).
Batch size vs GPU utilization (empirical, 7B INT8, L40S):
Batch=1: GPU util 12% -- massively underutilized
Batch=4: GPU util 55% -- good efficiency
Batch=8: GPU util 72% -- near optimal (target operating point)
Batch=12: GPU util 71% -- plateau; memory BW now bottleneck
Batch=16: GPU util 68% -- degrading; HoL blocking starts
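
One way to turn these measurements into an operating point is sketched below. It assumes a simple rule: take the smallest batch size whose utilization is within a small tolerance of the measured peak. The function names (`pick_batch_size`, `marginal_gains`) and the 2-point tolerance are illustrative assumptions, not part of any serving framework; the numbers are copied from the list above.

```python
# GPU utilization (%) by batch size, copied from the measurements above
# (7B INT8 on an L40S). Other models or GPUs need their own measurements.
UTILIZATION_BY_BATCH = {1: 12, 4: 55, 8: 72, 12: 71, 16: 68}


def pick_batch_size(util_by_batch, tolerance_pct=2.0):
    """Return the smallest batch size within tolerance_pct of peak utilization.

    tolerance_pct is an assumed knob: it treats measurements this close to
    the peak as equivalent and prefers the smaller batch among them.
    """
    peak = max(util_by_batch.values())
    candidates = [b for b, u in util_by_batch.items() if peak - u <= tolerance_pct]
    return min(candidates)


def marginal_gains(util_by_batch):
    """Utilization points gained per extra batch slot between adjacent measurements."""
    sizes = sorted(util_by_batch)
    return {
        (lo, hi): (util_by_batch[hi] - util_by_batch[lo]) / (hi - lo)
        for lo, hi in zip(sizes, sizes[1:])
    }


if __name__ == "__main__":
    print("target batch size:", pick_batch_size(UTILIZATION_BY_BATCH))
    for (lo, hi), gain in marginal_gains(UTILIZATION_BY_BATCH).items():
        print(f"batch {lo} -> {hi}: {gain:+.2f} util points per added slot")
```

On this data the rule selects batch 8, matching the target operating point in the list. The per-slot gain turns negative past batch 8, which is the plateau and degradation described above.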