Executive Summary
The Scale-5 configuration delivers 5 simultaneous full-duplex voice AI sessions on a single-GPU edge node. It is suitable for internal R&D, prototype validation, and small-team conversational AI agents.
VRAM Per-Session Breakdown
| Component | VRAM (per session) | Notes | Shareable? |
|---|---|---|---|
| Model weights (INT8) | 7.0 GB | 7B params x 1 byte; shared across sessions | Shared |
| KV Cache (user stream) | 0.8 GB | Attention keys/values for user audio context | Per-session |
| KV Cache (agent stream) | 0.8 GB | Dual-stream: separate KV cache for agent audio | Per-session |
| Mimi codec state | 0.6 GB | ConvNet encoder + decoder for 24kHz audio | Per-session |
| Activation buffers | 0.4 GB | Forward pass activations during token generation | Per-session |
| Audio I/O buffers | 0.2 GB | WebSocket ring buffers mapped to GPU memory | Per-session |
| Total per session | ~8.8 GB | Weighted average | -- |
5-Session Aggregate: 46 GB VRAM Required
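
As a cross-check on the table, the following minimal sketch adds up the itemized rows, counting the model weights once (as the "Shareable?" column states) and every other row once per session. The dictionary keys and the function name `itemized_vram_gb` are illustrative, not part of any existing tool. These row sums differ from the ~8.8 GB and 46 GB figures quoted above; the sketch only adds up the listed rows and makes no claim about how those figures were derived.

```python
# Per-row VRAM figures (GB) copied from the breakdown table.
# Key names are hypothetical labels chosen for this sketch.
SHARED_GB = {"model_weights_int8": 7.0}
PER_SESSION_GB = {
    "kv_cache_user": 0.8,
    "kv_cache_agent": 0.8,
    "mimi_codec_state": 0.6,
    "activation_buffers": 0.4,
    "audio_io_buffers": 0.2,
}


def itemized_vram_gb(sessions: int) -> float:
    """Sum the itemized rows: shared rows once, per-session rows per session."""
    shared = sum(SHARED_GB.values())
    per_session = sum(PER_SESSION_GB.values())
    # Round to one decimal place to hide floating-point noise.
    return round(shared + sessions * per_session, 1)


if __name__ == "__main__":
    for n in (1, 4, 5):
        print(f"{n} session(s): {itemized_vram_gb(n)} GB from itemized rows")
```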
GPU Selection & Justification
| Spec | NVIDIA L40S | NVIDIA A100 80GB |
|---|---|---|

| Latency Component | L40S (ms) | A100 (ms) | Notes |
|---|---|---|---|
| WebSocket audio receive buffer | 10-15 | 10-15 | Network stack; hardware-independent |
| Mimi encoder (speech to tokens) | 12-18 | 6-10 | HBM advantage on A100 |
| Transformer forward pass | 15-22 | 8-14 | INT8, 7B model, single decode step |
| KV cache read (dual-stream) | 8-14 | 4-7 | Memory BW critical here |
| Mimi decoder (tokens to audio) | 12-18 | 6-10 | ConvNet decode, 24kHz output |
| Audio buffering (client-side) | 20-30 | 20-30 | Browser AudioContext buffer |
| p99 total (5 sessions) | 180-220 | 120-155 | All 5 sessions generating simultaneously |
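
To see how the component rows relate to the p99 row, the sketch below sums the low and high ends of each component range per GPU. The dictionary keys and the function name `component_sum_ms` are illustrative. The p99 totals in the table are higher than these sums, which is what one would expect of a tail figure under five concurrent sessions; the source does not itemize the difference, and neither does this sketch.

```python
# (low, high) latency ranges in ms, copied from the component rows of the table.
# Key names are hypothetical labels chosen for this sketch.
COMPONENTS_MS = {
    "websocket_receive": {"L40S": (10, 15), "A100": (10, 15)},
    "mimi_encoder": {"L40S": (12, 18), "A100": (6, 10)},
    "transformer_forward": {"L40S": (15, 22), "A100": (8, 14)},
    "kv_cache_read": {"L40S": (8, 14), "A100": (4, 7)},
    "mimi_decoder": {"L40S": (12, 18), "A100": (6, 10)},
    "client_audio_buffer": {"L40S": (20, 30), "A100": (20, 30)},
}


def component_sum_ms(gpu: str) -> tuple[int, int]:
    """Return (sum of lows, sum of highs) across all component rows for one GPU."""
    low = sum(row[gpu][0] for row in COMPONENTS_MS.values())
    high = sum(row[gpu][1] for row in COMPONENTS_MS.values())
    return low, high


if __name__ == "__main__":
    for gpu in ("L40S", "A100"):
        low, high = component_sum_ms(gpu)
        print(f"{gpu}: {low}-{high} ms summed over components")
```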
System Architecture
Session Manager -- Why Not Kubernetes?
At 5 sessions, Kubernetes adds ~4-6 GB of RAM overhead and 50-100 ms of scheduling latency. A direct Python session manager built on asyncio is more efficient at this scale. Kubernetes is justified at Scale-10+, where cross-node orchestration is needed.
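
A minimal sketch of what such a direct asyncio session manager could look like, assuming a hard cap of 5 sessions and rejecting anything beyond it. `SessionManager`, `run_session`, and `demo_handler` are hypothetical names for illustration; the real audio loop would replace the `asyncio.sleep` stand-in.

```python
import asyncio
from typing import Awaitable, Callable

MAX_SESSIONS = 5  # Scale-5 limit from this document


class SessionManager:
    """Hypothetical in-process session manager (illustrative only)."""

    def __init__(self, max_sessions: int = MAX_SESSIONS) -> None:
        self._max = max_sessions
        self._active: set[str] = set()

    def try_admit(self, session_id: str) -> bool:
        """Admit a session if there is capacity and the id is not already active."""
        if len(self._active) >= self._max or session_id in self._active:
            return False
        self._active.add(session_id)
        return True

    def release(self, session_id: str) -> None:
        self._active.discard(session_id)

    async def run_session(
        self, session_id: str, handler: Callable[[str], Awaitable[None]]
    ) -> bool:
        """Run handler for an admitted session; return False if rejected."""
        if not self.try_admit(session_id):
            return False
        try:
            await handler(session_id)
            return True
        finally:
            # Always free the slot, even if the handler raises.
            self.release(session_id)


async def demo_handler(session_id: str) -> None:
    await asyncio.sleep(0.01)  # stand-in for the full-duplex audio loop


async def main() -> list[bool]:
    manager = SessionManager()
    # Seven concurrent requests against a five-session cap.
    return await asyncio.gather(
        *(manager.run_session(f"s{i}", demo_handler) for i in range(7))
    )


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Admission is a plain set-size check rather than a queue, so over-capacity requests fail fast instead of waiting; whether to queue or reject is a design choice the source does not specify.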
Continuous Batching vs Static Batching
Full-duplex audio is inherently streaming -- audio chunks arrive at 100ms intervals. Continuous batching dynamically groups decode steps of all active sessions into a single GPU kernel call, reducing launch overhead by ~60% vs static batching at 5 sessions.
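
The grouping idea can be sketched as follows: on each scheduler tick, take at most one pending decode step from every session that has one and submit them together. This is a simplified illustration, not the inference server's actual scheduler; `Session`, `batched_decode`, and `continuous_batch_step` are hypothetical names, and `batched_decode` is a CPU stand-in for the single batched GPU call.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Session:
    session_id: str
    pending: deque = field(default_factory=deque)  # queued decode inputs


def batched_decode(inputs: list) -> list:
    """Stand-in for one batched GPU call; here it only tags each input."""
    return [f"out({x})" for x in inputs]


def continuous_batch_step(sessions: list) -> dict:
    """Run one decode step for every session that currently has work.

    Sessions with nothing pending are skipped, so the batch size follows
    the set of active sessions from tick to tick.
    """
    ready = [s for s in sessions if s.pending]
    if not ready:
        return {}
    inputs = [s.pending.popleft() for s in ready]
    outputs = batched_decode(inputs)  # one call for the whole batch
    return {s.session_id: out for s, out in zip(ready, outputs)}


if __name__ == "__main__":
    sessions = [
        Session("a", deque(["a0", "a1"])),
        Session("b", deque(["b0"])),
        Session("c"),  # idle session, no pending work
    ]
    for tick in range(3):
        print(tick, continuous_batch_step(sessions))
```

The sketch shows only the batching shape; it does not measure launch overhead, so it neither confirms nor refutes the ~60% figure.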
Network Design
WebSocket audio streams require sustained low-latency bandwidth rather than high peak throughput. At 24 kHz Int16 mono, raw PCM is 24,000 samples/s x 16 bits = 384 Kbps, so each session consumes ~384 Kbps in each direction.
| Parameter | Value | Notes |
|---|---|---|
| Audio bitrate per session | 384 Kbps (each direction) | 24 kHz, Int16, mono; same rate uplink and downlink |
| 5-session total | 1.92 Mbps sustained (each direction) | Well within gigabit uplink |
| WebSocket overhead | ~40 Kbps per session | Framing + ping/pong + JSON transcripts |
| Recommended uplink | 10 Mbps minimum | 5x headroom for burst + control traffic |
| TLS overhead | ~5-8% | AES-256-GCM; negligible on modern CPUs |
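
The bandwidth figures in the table follow from simple arithmetic, sketched below. 24 kHz x 16-bit mono works out to 384 Kbps of raw PCM per direction. The function names are illustrative, and applying the 40 Kbps WebSocket overhead once per session per direction is an assumption of this sketch rather than something the table states.

```python
SAMPLE_RATE_HZ = 24_000
BITS_PER_SAMPLE = 16
CHANNELS = 1
SESSIONS = 5
WS_OVERHEAD_KBPS = 40  # per session, from the table (direction is an assumption)


def pcm_kbps(rate_hz: int, bits: int, channels: int) -> float:
    """Raw PCM bitrate in Kbps for one direction of one stream."""
    return rate_hz * bits * channels / 1000


def aggregate_mbps(sessions: int, per_session_kbps: float) -> float:
    """Total sustained bitrate in Mbps for one direction across all sessions."""
    return sessions * per_session_kbps / 1000


if __name__ == "__main__":
    per_session = pcm_kbps(SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS)
    audio_only = aggregate_mbps(SESSIONS, per_session)
    with_overhead = aggregate_mbps(SESSIONS, per_session + WS_OVERHEAD_KBPS)
    print(f"per session: {per_session:.0f} Kbps per direction")
    print(f"{SESSIONS} sessions, audio only: {audio_only:.2f} Mbps")
    print(f"{SESSIONS} sessions, with WebSocket overhead: {with_overhead:.2f} Mbps")
    print(f"headroom at 10 Mbps uplink: {10 / audio_only:.1f}x")
```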
Power & Cooling
Standard server rack cooling (front-to-back airflow, 25 CFM minimum) is sufficient. No liquid cooling is required at this scale. Ensure ambient temperature stays below 35 °C for reliable L40S operation.
Hardware Procurement List
| Component | Spec | Est. Price | Notes |
|---|---|---|---|
| GPU (Option A) | NVIDIA L40S 48GB PCIe | $8,000-$10,000 | Budget choice; 4 safe sessions |
| GPU (Option B) | NVIDIA A100 80GB PCIe | $14,000-$18,000 | Recommended for production 5-session |
| Server | 1U/2U with PCIe 4.0 x16 slot | $3,000-$5,000 | Dell R750, HP DL380, Supermicro |
| RAM | 64GB DDR5 ECC minimum | $300-$500 | 128GB recommended for OS + inference server |
| NVMe SSD | 1TB NVMe Gen4 (model + OS) | $150-$250 | Samsung 990 Pro or WD Black SN850X |
| Networking | 10GbE NIC + switch port | $200-$400 | Mellanox ConnectX-5 or Intel X710 |
| UPS | 1500VA online UPS | $500-$800 | APC Smart-UPS 1500; covers ~8 min at full load |