Datacenter Architecture

PersonaPlex Scale-10-25

H100 NVLink cluster for 10-25 concurrent full-duplex AI voice sessions

Planned infrastructure. Scale-10-25 requires dedicated GPU cluster procurement and colocation agreements.

Cluster Overview

GPU Config

  • 8x NVIDIA H100 SXM5 80GB
  • NVLink 4.0 fully connected
  • NVSwitch fabric 3.2 TB/s
  • 2x NUMA nodes per server

VRAM Budget

Total VRAM: 640 GB
Per session (10): 64 GB
Per session (25): 25.6 GB
OS/driver reserve: ~8 GB

Compute

Peak FP8 throughput: 3,958 TFLOPS per GPU (with sparsity)
Active utilization: 60%
Burst headroom: 40%
Spare GPU: 1 hot standby

VRAM Scaling

Sessions  Model weights  KV cache/session  Total VRAM  Headroom
10        70 GB shared   ~5.7 GB           ~127 GB     513 GB
15        70 GB shared   ~6.2 GB           ~163 GB     477 GB
20        70 GB shared   ~6.8 GB           ~206 GB     434 GB
25        70 GB shared   ~7.5 GB           ~258 GB     382 GB
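
The rows above follow from total = shared weights + sessions x KV cache per session, with headroom measured against the 640 GB pool. A minimal Python sketch that reproduces the arithmetic (the constant and function names are illustrative, not part of any existing tooling):

```python
TOTAL_VRAM_GB = 8 * 80      # 8x H100 SXM5 80GB
MODEL_WEIGHTS_GB = 70       # loaded once, shared by all sessions

# KV cache per session (GB) at each tabulated session count.
KV_CACHE_GB = {10: 5.7, 15: 6.2, 20: 6.8, 25: 7.5}


def vram_usage(sessions: int) -> tuple[float, float]:
    """Return (VRAM used, headroom) in GB for a tabulated session count."""
    used = MODEL_WEIGHTS_GB + sessions * KV_CACHE_GB[sessions]
    return used, TOTAL_VRAM_GB - used


for n in sorted(KV_CACHE_GB):
    used, headroom = vram_usage(n)
    print(f"{n:>2} sessions: ~{used:.0f} GB used, ~{headroom:.0f} GB headroom")
```

The headroom column does not subtract the ~8 GB OS/driver reserve, so the sketch leaves it out as well.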

Network: Spine-Leaf

Spine Layer

  • 2x 400GbE spine switches with ECMP
  • Full mesh between spines
  • BGP route reflector design
  • Sub-5ms east-west latency

Leaf Layer

  • 4x 100GbE ToR leaf switches
  • 2x uplinks per leaf to each spine
  • LACP bonding on server NICs
  • Dedicated OOB management VLAN

GPU NICs

  • 2x NVIDIA ConnectX-7 400GbE
  • RDMA/RoCEv2 enabled
  • GPUDirect RDMA P2P
  • Jumbo frames MTU 9000

Uplinks

  • 2x 100 Gbps DIA circuits with BGP
  • Active/active failover
  • CDN offload for statics
  • DDoS scrubbing upstream

Power and Cooling

Power (2N)

H100 TDP x8: 5,600 W
CPU + RAM + storage: ~800 W
Network gear: ~400 W
Total per server: ~6,800 W
2N PDU capacity: 15 kW per side
UPS runtime: 10+ min
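
In a 2N design each side must be able to carry the full load alone, so the per-server total can be checked against a single 15 kW side. A rough Python sketch of that check follows. The `derating` parameter is a hypothetical knob added for illustration (for example 0.8 if an 80% continuous-load rule applies at the facility); it does not come from the figures above.

```python
import math

# Line items from the power table.
GPU_TDP_W = 700               # 5,600 W / 8 GPUs
GPUS_PER_SERVER = 8
CPU_RAM_STORAGE_W = 800
NETWORK_GEAR_W = 400
PDU_SIDE_CAPACITY_W = 15_000


def server_power_w() -> int:
    """Sum the per-server line items (GPUs + CPU/RAM/storage + network gear)."""
    return GPU_TDP_W * GPUS_PER_SERVER + CPU_RAM_STORAGE_W + NETWORK_GEAR_W


def servers_per_side(derating: float = 1.0) -> int:
    """Whole servers that one PDU side can carry on its own.

    `derating` is a hypothetical safety factor, not a value from the table.
    """
    return math.floor(PDU_SIDE_CAPACITY_W * derating / server_power_w())


print(server_power_w())       # per-server total in watts
print(servers_per_side())     # at full nameplate capacity
print(servers_per_side(0.8))  # with an assumed 80% derating
```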

Liquid Cooling

  • DLC cold plates on H100s
  • Rear-door heat exchanger
  • Coolant: 30% glycol/water
  • Inlet max 18 °C, delta-T 12 °C
  • PUE target 1.25 or better
  • N+1 CDU redundancy
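
Two of these targets lend themselves to a quick estimate: facility power implied by the PUE target, and the coolant mass flow needed to remove the GPU heat at the stated delta-T (from Q = m_dot x cp x delta-T). The sketch below is a back-of-envelope calculation only. The specific heat of 3,700 J/(kg·K) is an assumed approximate value for a 30% glycol mix, and it assumes all 5,600 W of GPU heat goes into the cold-plate loop; neither assumption is stated in this document.

```python
def facility_power_w(it_load_w: float, pue: float = 1.25) -> float:
    """Total facility power, using PUE = facility power / IT power."""
    return it_load_w * pue


def coolant_flow_kg_per_s(heat_w: float,
                          delta_t_k: float = 12.0,
                          cp_j_per_kg_k: float = 3700.0) -> float:
    """Mass flow that carries `heat_w` at the given temperature rise.

    cp_j_per_kg_k is an ASSUMED specific heat for ~30% glycol/water.
    """
    return heat_w / (cp_j_per_kg_k * delta_t_k)


it_load_w = 6800    # total per server, from the power table
gpu_heat_w = 5600   # H100 TDP x8

print(f"Facility power at PUE 1.25: {facility_power_w(it_load_w):.0f} W")
print(f"Coolant flow for GPU heat: {coolant_flow_kg_per_s(gpu_heat_w):.3f} kg/s")
```

Actual CDU sizing depends on the coolant data sheet and the vendor's cold-plate specifications, so treat the flow figure as an order-of-magnitude check.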

Procurement Roadmap

Phase 1 - Q3 2026: Scale-10 Baseline

Procure 1x DGX H100 (8x H100 SXM5). Deploy spine-leaf PoC. Establish colocation (2 cabinets, 30A 3-phase). Install DLC. Target: 10 sessions, p99 250ms or less.

Phase 2 - Q4 2026: Scale-15 Expansion

Add 2nd DGX node. Full 4-leaf spine-leaf with redundant spines. Add 2nd 100 Gbps circuit. Enable NVLink inter-node. Target: 15 sessions, 99.95% SLA.

Phase 3 - Q1 2027: Scale-25 Full

Scale to 3x DGX nodes. Full 2N power with ATS. Multi-region session replication. GPU hot-spare pool. Target: 25 sessions, sub-200 ms p99, 99.99% SLA.
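
To make the Phase 2 and Phase 3 availability targets concrete, an SLA percentage can be converted into a yearly downtime allowance. A small Python sketch, assuming a 365-day year and that the SLA is measured over a full year (the measurement window is not specified here):

```python
MINUTES_PER_YEAR = 365 * 24 * 60


def allowed_downtime_minutes(sla_percent: float) -> float:
    """Downtime allowance per year, in minutes, for a given SLA percentage."""
    return (1 - sla_percent / 100) * MINUTES_PER_YEAR


for sla in (99.95, 99.99):
    minutes = allowed_downtime_minutes(sla)
    print(f"{sla}% SLA: about {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, moving from 99.95% to 99.99% cuts the yearly allowance from roughly 263 minutes to roughly 53 minutes.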

Latency Budget

Stage                   Target   P99
Audio capture to VAD    10 ms    15 ms
ASR Whisper-large-v3    80 ms    120 ms
LLM first token         60 ms    90 ms
TTS first chunk         40 ms    60 ms
WebSocket delivery      10 ms    20 ms
Total TTFA              200 ms   305 ms
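
The totals are the simple sum of the per-stage figures. A short Python sketch that recomputes them and shows each stage's share of the target budget (the dictionary layout and function names are illustrative):

```python
# (target_ms, p99_ms) per pipeline stage, from the latency budget.
STAGES_MS = {
    "Audio capture to VAD": (10, 15),
    "ASR Whisper-large-v3": (80, 120),
    "LLM first token": (60, 90),
    "TTS first chunk": (40, 60),
    "WebSocket delivery": (10, 20),
}


def total_ttfa_ms(stages: dict[str, tuple[int, int]]) -> tuple[int, int]:
    """Return (target total, P99 total) by summing every stage."""
    target = sum(t for t, _ in stages.values())
    p99 = sum(p for _, p in stages.values())
    return target, p99


target_total, p99_total = total_ttfa_ms(STAGES_MS)
print(f"Total TTFA: target {target_total} ms, P99 {p99_total} ms")

for name, (target, _) in STAGES_MS.items():
    print(f"{name}: {target / target_total:.0%} of target budget")
```

Adding per-stage P99 values gives an additive budget, not a measured end-to-end percentile; a measured end-to-end P99 may differ from this sum.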

Architecture Decisions

NVLink vs PCIe

NVLink 4.0 delivers 900 GB/s of aggregate GPU-to-GPU bandwidth per GPU, versus roughly 64 GB/s per direction for a PCIe 5.0 x16 link. This gap is critical for KV-cache sharing at Scale-25.
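
As a rough illustration of the gap, the sketch below estimates how long it would take to move one session's KV cache (~7.5 GB at 25 sessions, per the VRAM Scaling table) between GPUs at each nominal rate. These are peak figures and are not quoted on the same basis (NVLink's 900 GB/s is normally a bidirectional aggregate, PCIe's 64 GB/s is per direction), so the result is indicative only; sustained throughput will be lower in both cases.

```python
NVLINK4_GB_PER_S = 900   # nominal figure quoted above
PCIE5_GB_PER_S = 64      # nominal figure quoted above


def transfer_ms(size_gb: float, bandwidth_gb_per_s: float) -> float:
    """Ideal transfer time in milliseconds, ignoring latency and protocol overhead."""
    return size_gb / bandwidth_gb_per_s * 1000


kv_cache_gb = 7.5  # per-session KV cache at 25 sessions

nvlink_ms = transfer_ms(kv_cache_gb, NVLINK4_GB_PER_S)
pcie_ms = transfer_ms(kv_cache_gb, PCIE5_GB_PER_S)

print(f"NVLink 4.0: {nvlink_ms:.1f} ms")
print(f"PCIe 5.0:   {pcie_ms:.1f} ms")
print(f"Ratio:      {pcie_ms / nvlink_ms:.1f}x")
```

Against a 200 ms total latency target, a transfer on the order of 100 ms over PCIe would consume a large share of the budget, while the NVLink estimate stays under 10 ms.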

Session Isolation

Each session runs in a dedicated CUDA stream with cgroup memory limits. NVIDIA MPS (Multi-Process Service) is disabled to prevent cross-tenant jitter.

Fault Tolerance

On a GPU fault, sessions migrate to a healthy GPU via checkpoint/restore, with a recovery target of under 500 ms. Persistent state is kept in Redis with AOF persistence enabled.
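
The migration flow can be sketched in a few lines of Python. Everything here is hypothetical: the `Gpu` class, the `migrate_sessions` function, and the plain dictionary standing in for the Redis-backed checkpoint store are illustrative names, not the project's actual implementation, and a real restore would also have to rebuild GPU-side state such as the KV cache.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Gpu:
    """Hypothetical stand-in for a GPU and the sessions it hosts."""
    name: str
    sessions: dict[str, dict] = field(default_factory=dict)  # session id -> state


def migrate_sessions(failed: Gpu, standby: Gpu, store: dict[str, dict]) -> float:
    """Restore every session of `failed` onto `standby` from `store`.

    `store` maps session id -> last checkpoint (a dict standing in for Redis).
    Returns the elapsed wall-clock time in milliseconds.
    """
    start = time.perf_counter()
    for session_id in list(failed.sessions):
        checkpoint = store[session_id]              # last persisted state
        standby.sessions[session_id] = dict(checkpoint)
        del failed.sessions[session_id]
    return (time.perf_counter() - start) * 1000


# Usage: gpu0 faults, its sessions resume on the hot standby.
store = {"s1": {"turn": 4}, "s2": {"turn": 9}}
gpu0 = Gpu("gpu0", {"s1": {"turn": 5}, "s2": {"turn": 9}})
spare = Gpu("gpu7")

elapsed_ms = migrate_sessions(gpu0, spare, store)
print(f"Migrated {len(spare.sessions)} sessions in {elapsed_ms:.3f} ms")
print("Within 500 ms target:", elapsed_ms < 500)
```

Note that session `s1` resumes from its last checkpoint (turn 4), not from its in-memory state at the time of the fault (turn 5). How much state is lost depends on how often checkpoints are written.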