Cluster Overview
GPU Config
- 8x NVIDIA H100 SXM5 80GB
- NVLink 4.0 fully connected
- NVSwitch fabric 3.2 TB/s
- 2x NUMA nodes per server
VRAM Budget
Compute
VRAM Scaling
| Sessions | Model Weights | KV Cache/session | Total VRAM | Headroom |
|---|---|---|---|---|
| 10 | 70 GB shared | ~5.7 GB | ~127 GB | 513 GB |
| 15 | 70 GB shared | ~6.2 GB | ~163 GB | 477 GB |
| 20 | 70 GB shared | ~6.8 GB | ~206 GB | 434 GB |
| 25 | 70 GB shared | ~7.5 GB | ~258 GB | 382 GB |
Network: Spine-Leaf
Spine Layer
- 2x 400GbE spine switches ECMP
- Full mesh between spines
- BGP route reflector design
- Sub-5ms east-west latency
Leaf Layer
- 4x 100GbE ToR leaf switches
- 2x uplinks per leaf to each spine
- LACP bonding on server NICs
- Dedicated OOB management VLAN
GPU NICs
- 2x NVIDIA ConnectX-7 400GbE
- RDMA/RoCEv2 enabled
- GPUDirect RDMA P2P
- Jumbo frames MTU 9000
Uplinks
- 2x 100Gbps DIA circuits BGP
- Active/active failover
- CDN offload for statics
- DDoS scrubbing upstream
Power and Cooling
Power (2N)
Liquid Cooling
- DLC cold plates on H100s
- Rear-door heat exchanger
- Coolant 30pct glycol/water
- Inlet max 18C, delta-T 12C
- PUE target 1.25 or better
- N+1 CDU redundancy
Procurement Roadmap
Phase 1 - Q3 2026: Scale-10 Baseline
Procure 1x DGX H100 (8x H100 SXM5). Deploy spine-leaf PoC. Establish colocation (2 cabinets, 30A 3-phase). Install DLC. Target: 10 sessions, p99 250ms or less.
Phase 2 - Q4 2026: Scale-15 Expansion
Add 2nd DGX node. Full 4-leaf spine-leaf with redundant spines. Add 2nd 100Gbps circuit. Enable NVLink inter-node. Target: 15 sessions, 99.95pct SLA.
Phase 3 - Q1 2027: Scale-25 Full
Scale to 3x DGX nodes. Full 2N power with ATS. Multi-region session replication. GPU hot-spare pool. Target: 25 sessions, sub-200ms p99, 99.99pct SLA.
Latency Budget
| Stage | Target | P99 |
|---|---|---|
| Audio capture to VAD | 10ms | 15ms |
| ASR Whisper-large-v3 | 80ms | 120ms |
| LLM first token | 60ms | 90ms |
| TTS first chunk | 40ms | 60ms |
| WebSocket delivery | 10ms | 20ms |
| Total TTFA | 200ms | 305ms |
Architecture Decisions
NVLink vs PCIe
NVLink 4.0 delivers 900 GB/s GPU-to-GPU bandwidth vs 64 GB/s PCIe 5.0. Critical for KV-cache sharing at scale-25.
Session Isolation
Each session runs in a dedicated CUDA stream with cgroup memory limits. GPU MPS disabled to prevent cross-tenant jitter.
Fault Tolerance
Session migration on GPU fault via checkpoint/restore. Recovery under 500ms. Persistent state in Redis AOF.