PersonaPlex Scale-5 -- 5 Concurrent Users Capacity Plan

Executive Summary

The Scale-5 configuration delivers 5 simultaneous full-duplex voice AI sessions on a single-GPU edge node. It is suited to internal R&D, prototype validation, and small-team conversational AI agents.

Full-duplex AI requires ~42-46GB VRAM for 5 concurrent sessions -- near the ceiling of an L40S 48GB. The A100 80GB is recommended for latency-sensitive deployments.

Metric	Value	Notes
Concurrent Sessions	5	Full-duplex: voice in and voice out simultaneously
VRAM / Session	~8GB	INT8 quantized 7B model + KV cache + Mimi codec
p99 Latency Target	180ms	First audio chunk after user speech
Peak TDP	350W	L40S under full 5-session load

VRAM Per-Session Breakdown

Component	VRAM (per session)	Notes	Shareable?
Model weights (INT8)	7.0 GB	7B params x 1 byte; shared across sessions	Shared
KV Cache (user stream)	0.8 GB	Attention keys/values for user audio context	Per-session
KV Cache (agent stream)	0.8 GB	Dual-stream: separate KV cache for agent audio	Per-session
Mimi codec state	0.6 GB	ConvNet encoder + decoder for 24kHz audio	Per-session
Activation buffers	0.4 GB	Forward pass activations during token generation	Per-session
Audio I/O buffers	0.2 GB	WebSocket ring buffers mapped to GPU memory	Per-session
Total per session	~8.8 GB	Weighted average	--

5-Session Aggregate: 46 GB VRAM Required

L40S 48GB headroom warning: With 46GB required and 48GB available, only ~2GB remains. A 6th session would OOM. For production stability, A100 80GB (34GB headroom) is strongly recommended.
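The headroom figures above are simple subtraction, and a small sketch makes the 6th-session limit explicit. The constants are the plan's own figures (46 GB aggregate, ~8 GB per session); the function names are illustrative only, and the check covers memory fit, not latency.

```python
# Figures copied from this plan; helper names are hypothetical.
REQUIRED_GB = 46      # 5-session aggregate VRAM requirement
PER_SESSION_GB = 8    # approximate VRAM footprint of one more session


def headroom_gb(gpu_vram_gb: float, required_gb: float = REQUIRED_GB) -> float:
    """VRAM left once the 5-session aggregate is resident."""
    return gpu_vram_gb - required_gb


def fits_extra_session(gpu_vram_gb: float,
                       per_session_gb: float = PER_SESSION_GB) -> bool:
    """True if the remaining headroom could hold one more session (memory only)."""
    return headroom_gb(gpu_vram_gb) >= per_session_gb


if __name__ == "__main__":
    for name, vram in [("L40S 48GB", 48), ("A100 80GB", 80)]:
        print(f"{name}: headroom {headroom_gb(vram)} GB, "
              f"6th session fits: {fits_extra_session(vram)}")
```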

GPU Selection & Justification

Recommendation: Use L40S for budget-constrained R&D (up to 4 safe sessions). Use A100 80GB PCIe for production-grade 5-session deployments with stable headroom and lower latency.

Latency Model & Assumptions

Latency Component	L40S (ms)	A100 (ms)	Notes
WebSocket audio receive buffer	10-15	10-15	Network stack; hardware-independent
Mimi encoder (speech to tokens)	12-18	6-10	HBM advantage on A100
Transformer forward pass	15-22	8-14	INT8, 7B model, single decode step
KV cache read (dual-stream)	8-14	4-7	Memory BW critical here
Mimi decoder (tokens to audio)	12-18	6-10	ConvNet decode, 24kHz output
Audio buffering (client-side)	20-30	20-30	Browser AudioContext buffer
p99 total (5 sessions)	180-220ms	120-155ms	All 5 sessions generating simultaneously
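As a cross-check on the table, the sketch below sums the per-component ranges for each GPU. The p99 row is higher than these sums, which suggests it also reflects load-dependent tail effects that are not itemized; that reading is an interpretation, not something the table states. The dictionary keys and function name are illustrative.

```python
# (low, high) latency in ms per component, copied from the table above.
COMPONENTS = {
    "websocket_receive": {"L40S": (10, 15), "A100": (10, 15)},
    "mimi_encoder":      {"L40S": (12, 18), "A100": (6, 10)},
    "transformer_pass":  {"L40S": (15, 22), "A100": (8, 14)},
    "kv_cache_read":     {"L40S": (8, 14),  "A100": (4, 7)},
    "mimi_decoder":      {"L40S": (12, 18), "A100": (6, 10)},
    "client_buffering":  {"L40S": (20, 30), "A100": (20, 30)},
}


def component_sum(gpu: str) -> tuple:
    """Sum the low and high ends of every component range for one GPU."""
    low = sum(ranges[gpu][0] for ranges in COMPONENTS.values())
    high = sum(ranges[gpu][1] for ranges in COMPONENTS.values())
    return low, high


if __name__ == "__main__":
    for gpu in ("L40S", "A100"):
        low, high = component_sum(gpu)
        print(f"{gpu}: itemized components sum to {low}-{high} ms")
```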

System Architecture

CLIENT: WebRTC getUserMedia() --> WebSocket (wss://)
PCM Audio Stream (24kHz, Int16)
     |
EDGE GATEWAY: nginx (TLS termination, WebSocket proxy)
Rate limiting: 5 concurrent WS connections
     |
INFERENCE SERVER: PersonaPlex / Moshi Server (Python + PyTorch)
Session Manager: max_sessions=5
Continuous batching: dynamic batch size 1-5
KV Cache pool: pre-allocated 40GB
     |
GPU: NVIDIA L40S 48GB OR NVIDIA A100 80GB PCIe
Model weights: ~7GB (INT8 quantized)
KV cache pool: ~38GB pre-allocated
     |
STORAGE: NVMe SSD (model weights, voice prompt library)
Redis: session state, transcript cache
Prometheus + Grafana: metrics, VRAM, latency
  

Session Manager -- Why Not Kubernetes?

At 5 sessions, Kubernetes overhead consumes ~4-6GB RAM and adds 50-100ms scheduling latency. A direct Python session manager with asyncio is more efficient at this scale. K8s is justified at Scale-10+ where you need cross-node orchestration.

Continuous Batching vs Static Batching

Full-duplex audio is inherently streaming: audio chunks arrive at 100ms intervals. Continuous batching dynamically groups the pending decode steps of all active sessions into a single batched GPU call per step, reducing launch overhead by ~60% vs static batching at 5 sessions.
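The grouping idea can be sketched without a GPU. In the toy loop below, every iteration collects one decode step from each session that still has work and issues a single batched call for them. The `Session` class and `batched_decode` function are hypothetical stand-ins, not Moshi or PersonaPlex APIs, and the sketch omits sessions joining mid-stream.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    pending_steps: int                      # decode steps still to run
    tokens: list = field(default_factory=list)


def batched_decode(batch: list) -> list:
    """Stand-in for one batched model call covering every session in batch."""
    return [f"{s.session_id}-tok{len(s.tokens)}" for s in batch]


def continuous_batching(sessions: list, max_batch: int = 5) -> list:
    """Run all sessions to completion; return the batch size of each call."""
    batch_sizes = []
    while True:
        # Re-form the batch every step from whichever sessions have work.
        batch = [s for s in sessions if s.pending_steps > 0][:max_batch]
        if not batch:
            break
        for session, token in zip(batch, batched_decode(batch)):
            session.tokens.append(token)
            session.pending_steps -= 1
        batch_sizes.append(len(batch))
    return batch_sizes


if __name__ == "__main__":
    demo = [Session("a", 3), Session("b", 1), Session("c", 2)]
    sizes = continuous_batching(demo)
    print(f"{len(sizes)} batched calls with sizes {sizes}")
```

In this example six decode steps are served by three batched calls, and the batch shrinks as sessions finish instead of waiting for a fixed batch to fill.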

Network Design

WebSocket audio streams require sustained low-latency bandwidth rather than high peak throughput. At 24kHz Int16 mono, raw PCM runs at 384 Kbps (24,000 samples/s x 16 bits), so each session carries ~384 Kbps in each direction.

Parameter	Value	Notes
Audio bitrate per session	384 Kbps per direction	24kHz, Int16, mono; 768 Kbps with both directions combined
5-session total	1.92 Mbps sustained per direction	3.84 Mbps with both directions combined; well within gigabit uplink
WebSocket overhead	~40 Kbps per session	Framing + ping/pong + JSON transcripts
Recommended uplink	10 Mbps minimum	5x headroom for burst + control traffic
TLS overhead	~5-8%	AES-256-GCM; negligible on modern CPUs
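The bitrate figures follow directly from the PCM format. The sketch below reproduces them for a single audio stream in one direction; the function name is illustrative, and protocol overhead (WebSocket framing, TLS) is deliberately left out.

```python
def pcm_bitrate_bps(sample_rate_hz: int = 24_000,
                    bits_per_sample: int = 16,
                    channels: int = 1) -> int:
    """Raw PCM bitrate of one audio stream in one direction."""
    return sample_rate_hz * bits_per_sample * channels


SESSIONS = 5

per_stream_bps = pcm_bitrate_bps()           # one direction, one session
all_sessions_bps = per_stream_bps * SESSIONS  # one direction, five sessions

if __name__ == "__main__":
    print(f"per stream:   {per_stream_bps / 1_000:.0f} Kbps")
    print(f"5 sessions:   {all_sessions_bps / 1_000_000:.2f} Mbps per direction")
```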

Power & Cooling

Metric	Value	Notes
GPU TDP (L40S)	350W	Peak under 5-session full load
CPU TDP	65W	AMD EPYC / Intel Xeon host processor
Total System	450W	GPU + CPU + RAM + NVMe + networking
Heat Output	1500 BTU/h	1W = 3.41 BTU/h; requires ventilation

Standard server rack cooling (front-to-back airflow, 25 CFM minimum) is sufficient. No liquid cooling is required at this scale. Ensure ambient temperature stays below 35 °C for reliable L40S operation.
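The heat-output figure is the total system power converted at the stated 3.41 BTU/h per watt. A short sketch of that conversion, with an illustrative function name, shows that the 450W system lands slightly above the rounded 1500 BTU/h value.

```python
BTU_PER_HOUR_PER_WATT = 3.41  # conversion factor quoted in this plan


def watts_to_btu_per_hour(watts: float) -> float:
    """Convert electrical load in watts to heat output in BTU/h."""
    return watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    total_system_watts = 450
    print(f"{total_system_watts} W is about "
          f"{watts_to_btu_per_hour(total_system_watts):.0f} BTU/h")
```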

Hardware Procurement List

Component	Spec	Est. Price	Notes
GPU (Option A)	NVIDIA L40S 48GB PCIe	$8,000-$10,000	Budget choice; 4 safe sessions
GPU (Option B)	NVIDIA A100 80GB PCIe	$14,000-$18,000	Recommended for production 5-session
Server	1U/2U with PCIe 4.0 x16 slot	$3,000-$5,000	Dell R750, HP DL380, Supermicro
RAM	64GB DDR5 ECC minimum	$300-$500	128GB recommended for OS + inference server
NVMe SSD	1TB NVMe Gen4 (model + OS)	$150-$250	Samsung 990 Pro or WD Black SN850X
Networking	10GbE NIC + switch port	$200-$400	Mellanox ConnectX-5 or Intel X710
UPS	1500VA online UPS	$500-$800	APC Smart-UPS 1500; covers ~8 min at full load

Total Estimated CapEx (with A100): $18,150 -- $24,950
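The CapEx range is the sum of the low and high estimates of each line item for the A100 option. The sketch below recomputes it from the table so the total can be re-derived when a price changes; the dictionary and function names are illustrative.

```python
# (low, high) USD estimates copied from the procurement table, A100 option.
LINE_ITEMS = {
    "GPU (A100 80GB PCIe)": (14_000, 18_000),
    "Server":               (3_000, 5_000),
    "RAM":                  (300, 500),
    "NVMe SSD":             (150, 250),
    "Networking":           (200, 400),
    "UPS":                  (500, 800),
}


def capex_range(items: dict) -> tuple:
    """Sum the low and high price estimates across all line items."""
    low = sum(low_price for low_price, _ in items.values())
    high = sum(high_price for _, high_price in items.values())
    return low, high


if __name__ == "__main__":
    low, high = capex_range(LINE_ITEMS)
    print(f"Estimated CapEx: ${low:,} to ${high:,}")
```

Replacing the GPU entry with the L40S price range from the table gives the corresponding budget-option total.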