Executive Summary

The Scale-5 configuration delivers 5 simultaneous full-duplex voice AI sessions on a single-GPU edge node. It is suited to internal R&D, prototype validation, and small-team conversational AI agents.

Full-duplex AI requires ~42-46GB VRAM for 5 concurrent sessions -- near the ceiling of an L40S 48GB. The A100 80GB is recommended for latency-sensitive deployments.

| Metric | Value | Notes |
|---|---|---|
| Concurrent Sessions | 5 | Full-duplex voice-in / voice-out simultaneously |
| VRAM / Session | ~8GB | INT8 quantized 7B model + KV cache + Mimi codec |
| p99 Latency Target | 180ms | First audio chunk after user speech |
| Peak TDP | 350W | L40S under full 5-session load |

VRAM Per-Session Breakdown

| Component | VRAM (per session) | Notes | Shareable? |
|---|---|---|---|
| Model weights (INT8) | 7.0 GB | 7B params x 1 byte; shared across sessions | Shared |
| KV Cache (user stream) | 0.8 GB | Attention keys/values for user audio context | Per-session |
| KV Cache (agent stream) | 0.8 GB | Dual-stream: separate KV cache for agent audio | Per-session |
| Mimi codec state | 0.6 GB | ConvNet encoder + decoder for 24kHz audio | Per-session |
| Activation buffers | 0.4 GB | Forward pass activations during token generation | Per-session |
| Audio I/O buffers | 0.2 GB | WebSocket ring buffers mapped to GPU memory | Per-session |
| Total per session | ~8.8 GB | Weighted average | -- |

5-Session Aggregate: 46 GB VRAM Required

L40S 48GB headroom warning: With 46GB required and 48GB available, only ~2GB remains. A 6th session would OOM. For production stability, A100 80GB (34GB headroom) is strongly recommended.
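
The headroom figures in this warning follow from simple subtraction, and the cost of one more session can be read off the breakdown table. The sketch below is only a sanity check under stated assumptions: it takes the 46 GB aggregate as given and uses 2.8 GB as the sum of the Per-session rows. The function and constant names are illustrative and not part of any PersonaPlex tooling.

```python
# Figures taken from the plan; all names here are illustrative assumptions.
REQUIRED_GB = 46.0     # 5-session aggregate stated in the plan
PER_SESSION_GB = 2.8   # sum of the Per-session rows: 0.8 + 0.8 + 0.6 + 0.4 + 0.2
GPU_CAPACITY_GB = {"L40S 48GB": 48.0, "A100 80GB": 80.0}


def headroom_gb(required_gb: float, capacity_gb: float) -> float:
    """Free VRAM left once the aggregate requirement is allocated."""
    return capacity_gb - required_gb


def utilization_pct(required_gb: float, capacity_gb: float) -> float:
    """Share of total VRAM consumed, rounded to one decimal place."""
    return round(100.0 * required_gb / capacity_gb, 1)


def fits_extra_session(required_gb: float, capacity_gb: float,
                       per_session_gb: float = PER_SESSION_GB) -> bool:
    """True if one more session's per-session state fits in the headroom."""
    return headroom_gb(required_gb, capacity_gb) >= per_session_gb


if __name__ == "__main__":
    for name, capacity in GPU_CAPACITY_GB.items():
        print(
            name,
            f"headroom={headroom_gb(REQUIRED_GB, capacity):.1f} GB",
            f"utilization={utilization_pct(REQUIRED_GB, capacity)}%",
            f"6th session fits={fits_extra_session(REQUIRED_GB, capacity)}",
        )
```

With these inputs the L40S is left with 2 GB, less than the 2.8 GB of per-session state a sixth session would need, while the A100 80GB keeps 34 GB free.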

GPU Selection & Justification

Recommendation: Use L40S for budget-constrained R&D (up to 4 safe sessions). Use A100 80GB PCIe for production-grade 5-session deployments with stable headroom and lower latency.

Latency Model & Assumptions

| Latency Component | L40S (ms) | A100 (ms) | Notes |
|---|---|---|---|
| WebSocket audio receive buffer | 10-15 | 10-15 | Network stack; hardware-independent |
| Mimi encoder (speech to tokens) | 12-18 | 6-10 | HBM advantage on A100 |
| Transformer forward pass | 15-22 | 8-14 | INT8, 7B model, single decode step |
| KV cache read (dual-stream) | 8-14 | 4-7 | Memory BW critical here |
| Mimi decoder (tokens to audio) | 12-18 | 6-10 | ConvNet decode, 24kHz output |
| Audio buffering (client-side) | 20-30 | 20-30 | Browser AudioContext buffer |
| p99 total (5 sessions) | 180-220 | 120-155 | All 5 sessions generating simultaneously |
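
As a rough cross-check of the latency table, the component ranges can be summed per GPU and the p99 row compared against the 180 ms target from the executive summary. This is a sketch, not a measurement: the values are copied from the table, and the helper names are assumptions introduced for illustration.

```python
# (low, high) milliseconds per component, copied from the latency table.
COMPONENTS_MS = {
    "WebSocket audio receive buffer": {"L40S": (10, 15), "A100": (10, 15)},
    "Mimi encoder": {"L40S": (12, 18), "A100": (6, 10)},
    "Transformer forward pass": {"L40S": (15, 22), "A100": (8, 14)},
    "KV cache read (dual-stream)": {"L40S": (8, 14), "A100": (4, 7)},
    "Mimi decoder": {"L40S": (12, 18), "A100": (6, 10)},
    "Audio buffering (client-side)": {"L40S": (20, 30), "A100": (20, 30)},
}
P99_TOTAL_MS = {"L40S": (180, 220), "A100": (120, 155)}  # p99 row of the table
P99_TARGET_MS = 180  # p99 latency target from the executive summary


def component_sum_ms(gpu: str) -> tuple[int, int]:
    """Sum the low and high ends of every itemized component for one GPU."""
    low = sum(ranges[gpu][0] for ranges in COMPONENTS_MS.values())
    high = sum(ranges[gpu][1] for ranges in COMPONENTS_MS.values())
    return low, high


def meets_target(gpu: str, target_ms: int = P99_TARGET_MS) -> bool:
    """True only if the high end of the p99 range is within the target."""
    return P99_TOTAL_MS[gpu][1] <= target_ms


if __name__ == "__main__":
    for gpu in ("L40S", "A100"):
        print(gpu, component_sum_ms(gpu), P99_TOTAL_MS[gpu], meets_target(gpu))
```

Under these numbers the itemized ranges sum to 77-117 ms on the L40S and 54-86 ms on the A100, so the p99 rows include time the table does not itemize. Only the A100 range sits entirely within the 180 ms target.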

System Architecture

CLIENT
  WebRTC getUserMedia() --> WebSocket (wss://)
  PCM Audio Stream (24kHz, Int16)

EDGE GATEWAY
  nginx (TLS termination, WebSocket proxy)
  Rate limiting: 5 concurrent WS connections

INFERENCE SERVER
  PersonaPlex / Moshi Server (Python + PyTorch)
  Session Manager: max_sessions=5
  Continuous batching: dynamic batch size 1-5
  KV Cache pool: pre-allocated 40GB

GPU
  NVIDIA L40S 48GB OR NVIDIA A100 80GB PCIe
  Model weights: ~7GB (INT8 quantized)
  KV cache pool: ~38GB pre-allocated

STORAGE
  NVMe SSD (model weights, voice prompt library)
  Redis: session state, transcript cache
  Prometheus + Grafana: metrics, VRAM, latency

Session Manager -- Why Not Kubernetes?

At 5 sessions, Kubernetes overhead consumes ~4-6GB RAM and adds 50-100ms scheduling latency. A direct Python session manager with asyncio is more efficient at this scale. K8s is justified at Scale-10+ where you need cross-node orchestration.
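
A direct asyncio session manager of the kind described here can be quite small. The following is a minimal sketch under stated assumptions, not the PersonaPlex or Moshi server API: `SessionManager`, `handle_session`, and `demo` are hypothetical names, and the `asyncio.sleep` call stands in for streaming inference.

```python
import asyncio


class SessionManager:
    """Admission control for a fixed number of concurrent sessions (hypothetical)."""

    def __init__(self, max_sessions: int = 5) -> None:
        self.max_sessions = max_sessions
        self._active: set[str] = set()

    def try_open(self, session_id: str) -> bool:
        """Admit the session unless the node is full or the id is already active."""
        if len(self._active) >= self.max_sessions or session_id in self._active:
            return False
        self._active.add(session_id)
        return True

    def close(self, session_id: str) -> None:
        """Release the slot; safe to call for an unknown id."""
        self._active.discard(session_id)

    @property
    def active_count(self) -> int:
        return len(self._active)


async def handle_session(manager: SessionManager, session_id: str,
                         work_seconds: float = 0.01) -> str:
    """Serve one session if a slot is free, otherwise reject immediately."""
    if not manager.try_open(session_id):
        return "rejected"
    try:
        await asyncio.sleep(work_seconds)  # placeholder for streaming inference
        return "served"
    finally:
        manager.close(session_id)


async def demo() -> list[str]:
    """Six clients connect at once against a five-session limit."""
    manager = SessionManager(max_sessions=5)
    return await asyncio.gather(
        *(handle_session(manager, f"s{i}") for i in range(6))
    )


if __name__ == "__main__":
    print(asyncio.run(demo()))
```

Rejecting the sixth connection up front, rather than queueing it, matches the VRAM constraint: there is no headroom to admit it later without another session ending first.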

Continuous Batching vs Static Batching

Full-duplex audio is inherently streaming -- audio chunks arrive at 100ms intervals. Continuous batching dynamically groups decode steps of all active sessions into a single GPU kernel call, reducing launch overhead by ~60% vs static batching at 5 sessions.
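
The grouping step can be illustrated with a toy scheduler. This sketch only counts batched launches against an unbatched baseline of one launch per session per decode step; it does not model GPU kernels and does not reproduce the ~60% figure. The session table and function names are assumptions made for illustration.

```python
def continuous_batches(sessions: dict[str, tuple[int, int]],
                       max_batch: int = 5) -> list[list[str]]:
    """Group pending decode steps of all active sessions into one batch per tick.

    `sessions` maps a session id to (arrival_tick, decode_steps_needed).
    Sessions join the batch when they arrive and leave when they finish.
    """
    remaining = {sid: steps for sid, (_, steps) in sessions.items()}
    batches: list[list[str]] = []
    tick = 0
    while any(steps > 0 for steps in remaining.values()):
        batch = [
            sid for sid, (arrival, _) in sessions.items()
            if arrival <= tick and remaining[sid] > 0
        ][:max_batch]
        if batch:
            for sid in batch:
                remaining[sid] -= 1  # one decode step per session in this launch
            batches.append(batch)
        tick += 1
    return batches


def unbatched_launches(sessions: dict[str, tuple[int, int]]) -> int:
    """Baseline: one launch per session per decode step."""
    return sum(steps for _, steps in sessions.values())


if __name__ == "__main__":
    demo_sessions = {"a": (0, 3), "b": (1, 2), "c": (1, 1)}
    batches = continuous_batches(demo_sessions)
    print(batches)
    print(len(batches), "batched launches vs", unbatched_launches(demo_sessions))
```

In this example session `a` starts alone, `b` and `c` join on the next tick, and `c` drops out as soon as it finishes, so six decode steps are covered by three launches.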

Network Design

WebSocket audio streams require sustained low-latency bandwidth rather than high peak throughput. At 24kHz Int16 mono, each session consumes ~384 Kbps in each direction (24,000 samples/s x 16 bits), or ~768 Kbps across both directions.

| Parameter | Value | Notes |
|---|---|---|
| Audio bitrate per session | 384 Kbps per direction | 24kHz, Int16, mono; same rate in both directions |
| 5-session total | 1.92 Mbps sustained per direction | Well within gigabit uplink |
| WebSocket overhead | ~40 Kbps per session | Framing + ping/pong + JSON transcripts |
| Recommended uplink | 10 Mbps minimum | 5x headroom for burst + control traffic |
| TLS overhead | ~5-8% | AES-256-GCM; negligible on modern CPUs |
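
The 384 Kbps figure is the raw bitrate of one uncompressed PCM stream: sample rate times bit depth times channel count. The sketch below reproduces that arithmetic and the uplink headroom, assuming the 40 Kbps WebSocket overhead and the 10 Mbps uplink from the table. The function names are illustrative.

```python
def pcm_bitrate_bps(sample_rate_hz: int, bits_per_sample: int,
                    channels: int = 1) -> int:
    """Raw bitrate of one uncompressed PCM stream, in bits per second."""
    return sample_rate_hz * bits_per_sample * channels


def aggregate_bps(sessions: int, per_stream_bps: int,
                  overhead_bps: int = 0) -> int:
    """Sustained one-direction load for all sessions, with optional overhead."""
    return sessions * (per_stream_bps + overhead_bps)


if __name__ == "__main__":
    per_stream = pcm_bitrate_bps(24_000, 16)      # 24kHz, Int16, mono
    audio_only = aggregate_bps(5, per_stream)     # five sessions, audio only
    with_overhead = aggregate_bps(5, per_stream, overhead_bps=40_000)
    uplink_bps = 10_000_000                       # recommended minimum uplink
    print(f"per stream:     {per_stream / 1e3:.0f} Kbps")
    print(f"5 sessions:     {audio_only / 1e6:.2f} Mbps")
    print(f"with WS frames: {with_overhead / 1e6:.2f} Mbps")
    print(f"headroom:       {uplink_bps / audio_only:.1f}x")
```

Each direction of a session carries one such stream, so the five-session audio load is 1.92 Mbps per direction, and a 10 Mbps link leaves roughly 5x headroom over that.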

Power & Cooling

| Metric | Value | Notes |
|---|---|---|
| GPU TDP (L40S) | 350W | Peak under 5-session full load |
| CPU TDP | 65W | AMD EPYC / Intel Xeon host processor |
| Total System | 450W | GPU + CPU + RAM + NVMe + networking |
| Heat Output | 1500 BTU/h | 1W = 3.41 BTU/h; requires ventilation |

Standard server rack cooling (front-to-back airflow, 25 CFM minimum) is sufficient. No liquid cooling is required at this scale. Ensure ambient temperature stays below 35 °C for reliable L40S operation.
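
The heat output figure is the total system power converted with the 3.41 BTU/h-per-watt factor quoted above. A small sketch of that conversion, using the plan's 450 W total and the 350 W and 65 W TDP figures (the function name is an assumption for illustration):

```python
BTU_PER_HOUR_PER_WATT = 3.41  # conversion factor quoted in the plan


def watts_to_btu_per_hour(watts: float) -> float:
    """Convert electrical power draw to heat output in BTU/h."""
    return watts * BTU_PER_HOUR_PER_WATT


if __name__ == "__main__":
    total_system_w = 450
    gpu_w, cpu_w = 350, 65
    other_w = total_system_w - gpu_w - cpu_w  # RAM + NVMe + networking budget
    print(f"heat output: {watts_to_btu_per_hour(total_system_w):.1f} BTU/h")
    print(f"non-GPU/CPU budget: {other_w} W")
```

At 450 W this comes to about 1,535 BTU/h, which the plan rounds to 1500 BTU/h, and it leaves 35 W of the system total for RAM, NVMe, and networking.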

Hardware Procurement List

| Component | Spec | Est. Price | Notes |
|---|---|---|---|
| GPU (Option A) | NVIDIA L40S 48GB PCIe | $8,000-$10,000 | Budget choice; 4 safe sessions |
| GPU (Option B) | NVIDIA A100 80GB PCIe | $14,000-$18,000 | Recommended for production 5-session |
| Server | 1U/2U with PCIe 4.0 x16 slot | $3,000-$5,000 | Dell R750, HP DL380, Supermicro |
| RAM | 64GB DDR5 ECC minimum | $300-$500 | 128GB recommended for OS + inference server |
| NVMe SSD | 1TB NVMe Gen4 (model + OS) | $150-$250 | Samsung 990 Pro or WD Black SN850X |
| Networking | 10GbE NIC + switch port | $200-$400 | Mellanox ConnectX-5 or Intel X710 |
| UPS | 1500VA online UPS | $500-$800 | APC Smart-UPS 1500; covers ~8 min at full load |

Total Estimated CapEx (with A100): $18,150 -- $24,950
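
The CapEx range can be recomputed by summing the low and high ends of each line item. The sketch below does this for both GPU options using the price ranges from the procurement list; the dictionary layout and function name are assumptions for illustration, and the prices are the plan's estimates rather than quotes.

```python
# (low, high) USD estimates copied from the procurement list.
COMMON_ITEMS_USD = {
    "Server": (3_000, 5_000),
    "RAM": (300, 500),
    "NVMe SSD": (150, 250),
    "Networking": (200, 400),
    "UPS": (500, 800),
}
GPU_OPTIONS_USD = {
    "L40S 48GB PCIe": (8_000, 10_000),
    "A100 80GB PCIe": (14_000, 18_000),
}


def capex_range(gpu: str) -> tuple[int, int]:
    """Sum the low and high estimates for one GPU option plus common items."""
    gpu_low, gpu_high = GPU_OPTIONS_USD[gpu]
    low = gpu_low + sum(price[0] for price in COMMON_ITEMS_USD.values())
    high = gpu_high + sum(price[1] for price in COMMON_ITEMS_USD.values())
    return low, high


if __name__ == "__main__":
    for gpu in GPU_OPTIONS_USD:
        low, high = capex_range(gpu)
        print(f"{gpu}: ${low:,} to ${high:,}")
```

Summing the listed ranges gives $18,150 to $24,950 for the A100 build and $12,150 to $16,950 for the L40S build.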
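
GPU Specification Comparison

| Spec | NVIDIA L40S | NVIDIA A100 80GB | Recommendation |
|---|---|---|---|
| VRAM | 48 GB GDDR6 | 80 GB HBM2e | A100 for production |
| Memory Bandwidth | 864 GB/s | 2,000 GB/s | A100 -- 2.3x advantage |
| FP16 TFLOPS | 362 TFLOPS | 312 TFLOPS | L40S higher raw FLOPS |
| TDP | 350W | 400W | L40S more power efficient |
| Price (new) | ~$8,000-$10,000 | ~$14,000-$18,000 | L40S for budget |
| 5-session VRAM fit | 46GB / 48GB (95.8%) | 46GB / 80GB (57.5%) | A100 wins |
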
DOC-DC-001 · Rev 1.0 · May 2026

PersonaPlex Scale-5
Capacity Plan: 5 Concurrent Sessions

Engineering-grade GPU, VRAM, network, power, and cooling specifications for deploying full-duplex conversational AI at edge scale. Designed for the ManjuLAB datacenter in Columbus, Ohio.

Concurrent Users: 5
Primary GPU: L40S 48GB
VRAM Required: 46 GB
Est. Latency: <320ms TTFT
Power Draw: 350-450W
Facility: Columbus, Ohio