PersonaPlex Scale-10-25

Planned infrastructure. Scale-10-25 requires dedicated GPU cluster procurement and colocation agreements.

Cluster Overview

GPU Config

8x NVIDIA H100 SXM5 80GB
NVLink 4.0 fully connected
NVSwitch fabric 3.2 TB/s
2x NUMA nodes per server

VRAM Budget

Total VRAM640 GB

Per session (10)64 GB

Per session (25)25.6 GB

OS/driver reserve~8 GB

Compute

Peak TFLOPS FP83,958

Active utilization60pct

Burst headroom40pct

Spare GPU1 hot standby

VRAM Scaling

Sessions	Model Weights	KV Cache/session	Total VRAM	Headroom
10	70 GB shared	~5.7 GB	~127 GB	513 GB
15	70 GB shared	~6.2 GB	~163 GB	477 GB
20	70 GB shared	~6.8 GB	~206 GB	434 GB
25	70 GB shared	~7.5 GB	~258 GB	382 GB

Network: Spine-Leaf

Spine Layer

2x 400GbE spine switches ECMP
Full mesh between spines
BGP route reflector design
Sub-5ms east-west latency

Leaf Layer

4x 100GbE ToR leaf switches
2x uplinks per leaf to each spine
LACP bonding on server NICs
Dedicated OOB management VLAN

GPU NICs

2x NVIDIA ConnectX-7 400GbE
RDMA/RoCEv2 enabled
GPUDirect RDMA P2P
Jumbo frames MTU 9000

Uplinks

2x 100Gbps DIA circuits BGP
Active/active failover
CDN offload for statics
DDoS scrubbing upstream

Power and Cooling

Power (2N)

H100 TDP x85,600 W

CPU+RAM+storage~800 W

Network gear~400 W

Total per server~6,800 W

2N PDU capacity15 kW/side

UPS runtime10+ min

Liquid Cooling

DLC cold plates on H100s
Rear-door heat exchanger
Coolant 30pct glycol/water
Inlet max 18C, delta-T 12C
PUE target 1.25 or better
N+1 CDU redundancy

Procurement Roadmap

Phase 1 - Q3 2026: Scale-10 Baseline

Procure 1x DGX H100 (8x H100 SXM5). Deploy spine-leaf PoC. Establish colocation (2 cabinets, 30A 3-phase). Install DLC. Target: 10 sessions, p99 250ms or less.

Phase 2 - Q4 2026: Scale-15 Expansion

Add 2nd DGX node. Full 4-leaf spine-leaf with redundant spines. Add 2nd 100Gbps circuit. Enable NVLink inter-node. Target: 15 sessions, 99.95pct SLA.

Phase 3 - Q1 2027: Scale-25 Full

Scale to 3x DGX nodes. Full 2N power with ATS. Multi-region session replication. GPU hot-spare pool. Target: 25 sessions, sub-200ms p99, 99.99pct SLA.

Latency Budget

Stage	Target	P99
Audio capture to VAD	10ms	15ms
ASR Whisper-large-v3	80ms	120ms
LLM first token	60ms	90ms
TTS first chunk	40ms	60ms
WebSocket delivery	10ms	20ms
Total TTFA	200ms	305ms

Architecture Decisions

NVLink vs PCIe

NVLink 4.0 delivers 900 GB/s GPU-to-GPU bandwidth vs 64 GB/s PCIe 5.0. Critical for KV-cache sharing at scale-25.

Session Isolation

Each session runs in a dedicated CUDA stream with cgroup memory limits. GPU MPS disabled to prevent cross-tenant jitter.

Fault Tolerance

Session migration on GPU fault via checkpoint/restore. Recovery under 500ms. Persistent state in Redis AOF.