K8s + Helm Architecture -- Lean AI Infrastructure

System Architecture

The full stack runs in Kubernetes (K3s for single-node), with one pod per service tier. GPU access is handled by the NVIDIA GPU Operator -- no manual driver configuration required.

           +------------------+
           |    Users         |
           +--------+---------+
                    | HTTPS / WebSocket (wss://)
                    v
        +---------------------+
        |  Ingress (NGINX)    |  <- SSL termination, WS proxy
        |  yogabrata.com      |    Rate limit: max sessions
        +----------+----------+
                   v
        +---------------------+
        |  API Gateway Pod    |  <- FastAPI / Node.js
        |  Session management |    Voice + text prompt routing
        |  Auth + rate limits |    Transcript streaming
        +----------+----------+
                   v
     +----------------------------+
     |  PersonaPlex Inference Pod |  <- vLLM / TensorRT-LLM
     |  GPU-enabled               |    Moshi 7B INT8 quantized
     |  Continuous batching       |    Mimi encoder + decoder
     |  KV cache pool             |    WebSocket audio streaming
     +--------+-------------------+
              v              v
   +--------------+  +---------------+
   | Qdrant       |  | Redis         |
   | Vector DB    |  | Session cache |
   | Voice embeds |  | Prompt config |
   +--------------+  +---------------+

   All pods run in: namespace=manjulab-ai
   GPU scheduling: nvidia.com/gpu resource requests
   Monitoring: Prometheus + Grafana sidecar

Single-cluster simplicity: All components run in one K3s cluster, one namespace. No service mesh required at this scale. Add Istio only if you need mTLS between pods or advanced traffic shaping at 50+ users.

K3s -- Lightweight Kubernetes

K3s is a CNCF-certified Kubernetes distribution in a single <70MB binary. It eliminates the overhead of full K8s while remaining 100% API-compatible. Ideal for single-node GPU deployments.

Installation

# Install K3s (runs as a systemd service)
curl -sfL https://get.k3s.io | sh -

# Verify cluster is running
kubectl get nodes
# NAME        STATUS   ROLES                  AGE   VERSION
# gpu-node-1  Ready    control-plane,master   1m    v1.29.0+k3s1

# Set KUBECONFIG for helm/kubectl access
export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

# Create namespace for all ManjuLAB AI services
kubectl create namespace manjulab-ai

Feature	K3s	Full K8s (kubeadm)
Install time	<2 minutes	30-60 minutes
RAM overhead	~512 MB	~4-6 GB
Datastore	SQLite by default (embedded etcd for HA)	etcd cluster (3+ nodes)
GPU support	Via GPU Operator	Via GPU Operator
API compatibility	100% K8s API	100% K8s API
Best for	1-3 node GPU clusters	5+ node clusters

Helm Chart Structure

personaplex-helm/
|
+-- Chart.yaml                  # Chart metadata, version, dependencies
+-- README.md
|
+-- charts/                     # Sub-charts (one per service)
|   +-- inference-engine/       # GPU inference pod (vLLM)
|   |   +-- Chart.yaml
|   |   +-- templates/          # Helm renders only files under templates/
|   |       +-- deployment.yaml
|   |       +-- service.yaml
|   |       +-- configmap.yaml
|   +-- api-gateway/            # FastAPI session manager
|   |   +-- Chart.yaml
|   |   +-- templates/
|   |       +-- deployment.yaml
|   |       +-- service.yaml
|   +-- vector-db/              # Qdrant vector database
|   |   +-- Chart.yaml
|   |   +-- templates/
|   |       +-- deployment.yaml
|   |       +-- service.yaml
|   |       +-- pvc.yaml
|   +-- redis-memory/           # Redis session cache
|       +-- Chart.yaml
|       +-- templates/
|           +-- deployment.yaml
|           +-- service.yaml
|
+-- values/
|   +-- dev.yaml                # 1 GPU, 3 sessions, local only
|   +-- prod.yaml               # 2-4 GPUs, 25 sessions, public
|   +-- edge.yaml               # 1 GPU, air-gapped, no internet
|
+-- templates/
    +-- ingress.yaml
    +-- namespace.yaml
    +-- NOTES.txt

Deploy Everything in One Command

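# Install the full stack (dev environment)
# Run from the directory that contains personaplex-helm/
helm install manjulab-ai ./personaplex-helm \
  --namespace manjulab-ai \
  --values ./personaplex-helm/values/dev.yaml \
  --set inference.modelName="moshi-7b-int8" \
  --set inference.maxSessions=5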
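
For orientation, a values file for the dev profile might look roughly like the sketch below. Only `inference.modelName`, `inference.maxSessions`, and `inference.replicas` appear in the `--set` flags used in this document; every other key here is hypothetical and must match whatever the chart's templates actually read.

```yaml
# values/dev.yaml -- hypothetical sketch, not the chart's real schema
inference:
  replicas: 1                  # one inference pod on the single GPU
  modelName: "moshi-7b-int8"
  maxSessions: 3               # dev profile: 3 sessions, per the chart layout comment
  gpuCount: 1                  # hypothetical key, mapped to the nvidia.com/gpu limit
ingress:
  enabled: false               # hypothetical key; the dev profile is local only
```

Helm normally nests a sub-chart's values under the sub-chart's name (or its alias in Chart.yaml), so a top-level `inference:` key assumes the `inference-engine` sub-chart is aliased to `inference` or that parent templates read these values directly.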

Inference Engine Pod (GPU Core)

# charts/inference-engine/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: personaplex-inference
  namespace: manjulab-ai
spec:
  replicas: 1  # 1 per GPU node; increase with more nodes
  selector:
    matchLabels:
      app: personaplex
  template:
    metadata:
      labels:
        app: personaplex
    spec:
      containers:
        - name: inference
          image: vllm/vllm-openai:latest
          ports:
            - containerPort: 8000
            - containerPort: 8998  # WebSocket audio port
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "56Gi"   # host RAM limit, not VRAM; 56Gi on L40S nodes, 80Gi on A100 nodes
            requests:
              memory: "24Gi"
              cpu: "8"
          env:
            - name: MODEL_NAME
              value: "1o1-ai/moshi-7b-int8"
            - name: MAX_CONCURRENT_SESSIONS
              value: "5"
            - name: SAMPLE_RATE
              value: "24000"
            - name: ENABLE_CONTINUOUS_BATCHING
              value: "true"
            - name: HF_TOKEN
              valueFrom:
                secretKeyRef:
                  name: hf-credentials
                  key: token
          volumeMounts:
            - name: model-weights
              mountPath: /models
      volumes:
        - name: model-weights
          hostPath:
            path: /data/models  # Pre-downloaded weights on host
      tolerations:
        - key: nvidia.com/gpu
          operator: Exists
          effect: NoSchedule
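
The Deployment above depends on two objects it does not define: the `hf-credentials` Secret and a Service that makes the pod reachable as `personaplex-inference`. A minimal sketch of both follows, assuming a plain ClusterIP Service; the port names and the token placeholder are illustrative, not taken from the chart.

```yaml
# Secret read by the HF_TOKEN env var (placeholder value; never commit a real token)
apiVersion: v1
kind: Secret
metadata:
  name: hf-credentials
  namespace: manjulab-ai
type: Opaque
stringData:
  token: "<your-hugging-face-token>"
---
# Service giving the inference pod a stable in-cluster DNS name
apiVersion: v1
kind: Service
metadata:
  name: personaplex-inference
  namespace: manjulab-ai
spec:
  type: ClusterIP
  selector:
    app: personaplex          # must match the Deployment's pod labels
  ports:
    - name: http              # illustrative port name
      port: 8000
      targetPort: 8000
    - name: ws-audio          # illustrative port name
      port: 8998
      targetPort: 8998
```

In practice the Secret is often created out of band (for example with `kubectl create secret generic`) so the token never lands in version control.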

NVIDIA GPU Operator

The GPU Operator automatically manages all NVIDIA software components on Kubernetes nodes -- drivers, container toolkit, device plugin, and MIG configuration. Zero manual GPU driver installation needed.

# Add NVIDIA Helm repo
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

# Install GPU Operator (handles drivers, toolkit, device plugin)
helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set toolkit.enabled=true \
  --set devicePlugin.enabled=true \
  --set mig.strategy=mixed

# Verify GPU is visible to Kubernetes
kubectl get nodes -o json | jq '.items[].status.capacity'
# {
#   "cpu": "32",
#   "memory": "262144000Ki",
#   "nvidia.com/gpu": "1"    <- GPU visible to scheduler
# }
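
Node capacity only shows that the device plugin registered the GPU. To confirm a container can actually use it, a throwaway pod that runs `nvidia-smi` is a common check. The sketch below assumes the `nvidia/cuda:12.4.1-base-ubuntu22.04` image tag is available to the cluster; substitute any CUDA base image you have access to. The pod name is hypothetical.

```yaml
# One-shot GPU smoke test (pod name and image tag are assumptions)
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  namespace: manjulab-ai
spec:
  restartPolicy: Never        # run once and exit
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1   # request one GPU from the scheduler
```

After applying it, `kubectl logs gpu-smoke-test -n manjulab-ai` should print the usual `nvidia-smi` table. On a single-GPU node the pod stays Pending while the inference pod holds the GPU, so run the check before deploying the stack and delete the pod afterwards.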

API Gateway Pod (FastAPI)

# charts/api-gateway/templates/deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: persona-api
  namespace: manjulab-ai
spec:
  replicas: 2  # API is CPU-only; scales horizontally without GPU constraints
  selector:
    matchLabels: {app: persona-api}
  template:
    metadata:
      labels: {app: persona-api}
    spec:
      containers:
        - name: api
          image: manjulab/1o1-api:latest
          command: ["uvicorn"]
          args: ["app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "4"]
          ports:
            - containerPort: 8000
          resources:
            limits: {cpu: "2", memory: "2Gi"}
            requests: {cpu: "500m", memory: "512Mi"}
          env:
            - name: INFERENCE_URL
              value: "http://personaplex-inference:8998"
            - name: REDIS_URL
              value: "redis://redis-memory:6379"
            - name: QDRANT_URL
              value: "http://qdrant:6333"
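
Because the API tier is CPU-only, its replica count can be driven by load instead of being fixed. The following is a minimal sketch using the standard `autoscaling/v2` API; the replica bounds and the 70% target are assumptions, not values from this design.

```yaml
# Hypothetical autoscaler for the persona-api Deployment
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: persona-api
  namespace: manjulab-ai
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: persona-api
  minReplicas: 2              # matches the static replica count above
  maxReplicas: 6              # assumption
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70   # percent of the CPU request (500m), assumption
```

Utilization is measured against the container's CPU request, and the autoscaler needs a metrics source; K3s bundles metrics-server by default.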

Vector DB -- Qdrant

Qdrant stores voice persona embeddings (audio fingerprints of reference voices). When a user selects a voice or provides a voice sample, Qdrant performs nearest-neighbor search to retrieve the closest matching voice embedding from the library.

apiVersion: apps/v1
kind: Deployment
metadata:
  name: qdrant
  namespace: manjulab-ai
spec:
  replicas: 1
  selector:
    matchLabels: {app: qdrant}
  template:
    metadata:
      labels: {app: qdrant}
    spec:
      containers:
        - name: qdrant
          image: qdrant/qdrant:latest
          ports:
            - containerPort: 6333  # REST API
            - containerPort: 6334  # gRPC
          resources:
            limits: {cpu: "2", memory: "8Gi"}
          volumeMounts:
            - name: qdrant-storage
              mountPath: /qdrant/storage
      volumes:
        - name: qdrant-storage
          persistentVolumeClaim:
            claimName: qdrant-pvc
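
The Deployment above mounts a claim named `qdrant-pvc` that is not shown. A minimal sketch of that claim, assuming the `local-path` storage class that K3s ships by default; the 20Gi size is an assumption and should be sized to the embedding library.

```yaml
# charts/vector-db pvc.yaml -- sketch; storage size is an assumption
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qdrant-pvc
  namespace: manjulab-ai
spec:
  accessModes:
    - ReadWriteOnce           # single-node volume, one writer
  storageClassName: local-path
  resources:
    requests:
      storage: 20Gi
```

With `local-path`, the data lives on the node's disk, so the Qdrant pod is effectively pinned to that node.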

Redis -- Session Memory Layer

apiVersion: apps/v1
kind: Deployment
metadata:
  name: redis-memory
  namespace: manjulab-ai
spec:
  replicas: 1
  selector:
    matchLabels: {app: redis}
  template:
    metadata:
      labels: {app: redis}
    spec:
      containers:
        - name: redis
          image: redis:7-alpine
          args: ["--maxmemory", "4gb", "--maxmemory-policy", "allkeys-lru"]
          ports:
            - containerPort: 6379
          resources:
            limits: {memory: "5Gi", cpu: "1"}

Ingress -- Expose to yogabrata.com

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: persona-ingress
  namespace: manjulab-ai
  annotations:
    nginx.ingress.kubernetes.io/proxy-read-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-send-timeout: "3600"
    nginx.ingress.kubernetes.io/proxy-http-version: "1.1"
    nginx.ingress.kubernetes.io/configuration-snippet: |
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    cert-manager.io/cluster-issuer: "letsencrypt-prod"
spec:
  ingressClassName: nginx  # assumes ingress-nginx is installed; K3s ships Traefik by default
  tls:
    - hosts: [yogabrata.com]
      secretName: yogabrata-tls
  rules:
    - host: yogabrata.com
      http:
        paths:
          - path: /api
            pathType: Prefix
            backend:
              service:
                name: persona-api
                port: {number: 8000}
          - path: /ws
            pathType: Prefix
            backend:
              service:
                name: personaplex-inference
                port: {number: 8998}  # Direct WS to inference

WebSocket timeout: Set proxy-read-timeout and proxy-send-timeout to 3600 seconds on the ingress. The NGINX default is 60 seconds, which can drop long-lived WebSocket connections and cut off active AI conversations. Full-duplex sessions can run for minutes or hours.
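
The `cert-manager.io/cluster-issuer` annotation on the Ingress refers to a ClusterIssuer named `letsencrypt-prod`, which must exist before a certificate can be issued. A minimal sketch, assuming cert-manager is installed and HTTP-01 validation goes through the NGINX ingress class; the email address and the account-key secret name are placeholders.

```yaml
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory
    email: admin@example.com              # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key  # placeholder secret name
    solvers:
      - http01:
          ingress:
            class: nginx                  # solve challenges via ingress-nginx
```

HTTP-01 validation requires yogabrata.com to resolve publicly to the ingress, so it does not apply to the local-only dev or air-gapped edge profiles.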

Why GPU Scaling Is NOT Linear for Conversational AI

This is one of the most important concepts for capacity planning. Standard web services scale linearly: 2x servers = 2x capacity. GPU-based full-duplex AI does not.

1. Memory Bandwidth Contention -- The Primary Bottleneck

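Each concurrent session reads its KV cache at every decode step, and decode steps recur continuously while audio streams (the Mimi codec encodes 24 kHz audio at 12.5 frames per second, one frame every 80 ms). With several sessions sharing a single GPU's memory bus, the bandwidth available to each session drops: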

L40S (864 GB/s total):
1 session: 864 GB/s available -> p99 latency: ~90ms
3 sessions: 288 GB/s each -> p99 latency: ~130ms
5 sessions: 173 GB/s each -> p99 latency: ~180ms (+100% vs 1 session)
6 sessions: 144 GB/s each -> p99 latency: ~240ms (degrades sharply)

# Latency does not grow in proportion to session count -- p99 climbs faster with each added session
# (+40 ms across sessions 1 to 3, +60 ms for the single step from 5 to 6)

2. Continuous Batching Efficiency Curve

Continuous batching improves GPU utilization by grouping decode steps across sessions. But efficiency plateaus and then drops: small batches (1-3) are compute-underutilized; optimal batches (4-8) maximize throughput; large batches (10+) introduce head-of-line blocking.

Batch size vs GPU utilization (empirical, 7B INT8, L40S):
Batch=1:  GPU util 12%  -- massively underutilized
Batch=4:  GPU util 55%  -- good efficiency
Batch=8:  GPU util 72%  -- near optimal (target operating point)
Batch=12: GPU util 71%  -- plateau; memory BW now bottleneck
Batch=16: GPU util 68%  -- degrading; HoL blocking starts

3. Multi-GPU Does Not Fix Single-Session Latency

Adding a second GPU does NOT make any single session faster -- it only increases the number of sessions you can run. Single-session latency is bounded by per-GPU memory bandwidth and compute speed. The only ways to reduce single-session latency are: (1) faster GPU memory (H100 HBM3), (2) a smaller model (3B vs 7B), or (3) more aggressive quantization (INT4 vs INT8).

4. Practical Capacity Rule of Thumb

# Conservative safe operating limits (p99 < 200ms):
RTX 4090 24GB:     3 sessions  (VRAM limited)
L40S 48GB:         4 sessions  (VRAM + BW limited)
A100 80GB:         6 sessions  (BW limited)
H100 80GB HBM3:    9 sessions  (BW ~3.9x L40S; VRAM limited)
2x H100 NVLink:   12 sessions  (pooled VRAM + NVLink BW)
4x H100 NVLink:   25 sessions  (MIG partitioned)

# Rule: sessions do not scale linearly with GPU count
# NVLink-connected GPUs scale better than PCIe-connected ones because GPU-to-GPU transfers are much faster

Minimum Hardware Requirements

Phased Rollout Roadmap

Phase 1 -- Now

MVP (1 GPU)

1x L40S or RTX 4090
K3s single-node
3 pods total
5 concurrent sessions
dev.yaml values
Local or RunPod

$0-$18K CapEx

Phase 2 -- Growth

Small Production (2 GPUs)

2x A100 80GB PCIe
K3s 2-node cluster
Replicated inference pods
10-12 concurrent sessions
HAProxy load balancer
Monitoring + alerting

$35K-$50K CapEx

Phase 3 -- Scale

Enterprise (4x H100)

4x H100 SXM5 NVLink
Spine-leaf networking
Liquid cooling + UPS 2N
25 concurrent sessions
Full K8s (not K3s)
Columbus, Ohio DC

$280K-$420K CapEx

Minimum hardware requirements by deployment tier:

Component	Dev / MVP	Small Prod	Full Scale-25
GPU	1x RTX 4090 / L40S	2x L40S or 1x A100	4x H100 SXM5
Sessions	3-5	8-12	20-25
CPU	8-16 cores	16-32 cores	64+ cores
RAM	64 GB	128 GB	512 GB
Storage	500 GB NVMe	2 TB NVMe RAID	10 TB NVMe cluster
Network	1 GbE	10 GbE	25 GbE + spine-leaf
Est. Cost	$8K-$18K	$35K-$70K	$280K-$420K

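Cloud alternative: RunPod, Vast.ai, and Lambda Labs rent L40S GPUs from ~$1.25/hr and H100s from ~$2.49/hr. For dev and testing, renting is cheaper than buying until usage exceeds roughly 800 GPU-hours per month. A single GPU running around the clock accrues about 730 hours a month, so that threshold implies more than one GPU in constant use.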

Redis Key Pattern	TTL	Content
session:{id}:config	30 min	Voice prompt embedding ID, text prompt, sample rate, chunk size
session:{id}:transcript	30 min	Running transcript buffer (user + AI turns)
session:{id}:metrics	5 min	Per-session latency p50/p99, bytes sent/recv
gpu:{id}:health	10 sec	VRAM %, active sessions, p99 latency -- updated by Prometheus exporter

# Upgrade to production (add GPUs, increase sessions)
# Run from the directory that contains personaplex-helm/
helm upgrade manjulab-ai ./personaplex-helm \
  --namespace manjulab-ai \
  --values ./personaplex-helm/values/prod.yaml \
  --set inference.replicas=2 \
  --set inference.maxSessions=25

PersonaPlex on Kubernetes
Lean, Practical AI Infrastructure Design

Core Design Philosophy
