Voice and Role Control for Full-Duplex Conversational AI
Overview
Conversational AI has historically forced an impossible trade-off. Traditional cascaded systems (ASR to LLM to TTS) offer voice and role customization but produce robotic conversations. Full-duplex models like Moshi finally made AI conversations feel natural, but locked users into a single fixed voice and role.
1o1 by ManjuLAB breaks this trade-off. Select from a diverse range of voices and define any role through a plain text prompt -- no retraining required. 1o1 delivers genuinely natural conversations, handles interruptions and backchannels, and maintains your chosen persona throughout.
Capabilities
Full-Duplex Interaction
1o1 listens and speaks simultaneously. This removes the turn-by-turn delays of cascaded systems and lets the model handle natural conversational behaviors: knowing when to pause, when to interrupt, and when to backchannel.
Hybrid Prompting
A voice prompt (audio embedding) captures vocal characteristics. A text prompt (natural language) describes the role and context. These are processed jointly. Any role is definable at inference time -- no fine-tuning needed.
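To make the two prompt channels concrete, here is a minimal sketch of how a hybrid prompt might be bundled before inference. The names `HybridPrompt` and `build_conditioning` are hypothetical illustrations, not the 1o1 API, and the embedding is shown as a plain list of floats purely for simplicity.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class HybridPrompt:
    """Hypothetical container pairing a voice prompt with a text prompt."""
    voice_embedding: List[float]  # assumption: audio embedding as a float vector
    role_text: str                # natural-language role and context

    def validate(self) -> None:
        # Both channels are required: voice sets timbre, text sets the role.
        if not self.voice_embedding:
            raise ValueError("voice_embedding must not be empty")
        if not self.role_text.strip():
            raise ValueError("role_text must not be empty")


def build_conditioning(prompt: HybridPrompt) -> Dict[str, Union[List[float], str]]:
    """Bundle both prompts into one structure so they can be processed jointly."""
    prompt.validate()
    return {
        "voice": list(prompt.voice_embedding),
        "text": prompt.role_text.strip(),
    }


if __name__ == "__main__":
    prompt = HybridPrompt([0.1, 0.2], " You are a patient math tutor. ")
    print(build_conditioning(prompt))
```

Because the role lives in `role_text` rather than in model weights, swapping personas in this sketch only means constructing a new `HybridPrompt`, which mirrors the "no fine-tuning needed" claim above.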
Natural Backchanneling
1o1 produces contextual backchannels -- "okay", "yeah", "I see" -- that signal active listening without interrupting the speaker's flow.
Demonstration Examples
Architecture
[Architecture diagram: Audio Embedding + Role & Context -> 7B · Dual-Stream · 24kHz -> Full-Duplex Output]
- 7 Billion Parameters -- sufficient scale for broad conversational competence and generalization
- Mimi Speech Codec -- ConvNet + Transformer encoder/decoder at 24kHz
- Temporal + Depth Transformers -- dual-stream allows listening and speaking concurrently
- Helium Language Model -- semantic understanding enabling generalization to novel scenarios
- Single-Stage Training -- real and synthetic conversations blended in one pass
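The dual-stream idea in the list above can be sketched as a lockstep loop: on every codec frame the model reads one frame of user audio tokens and writes one frame of its own, so neither stream waits for the other. This is a toy illustration, not the 1o1 implementation. `run_dual_stream` and `toy_step` are hypothetical names, and the 12.5 Hz frame rate is an assumption carried over from Mimi's published configuration in Moshi; it is not stated on this page.

```python
from dataclasses import dataclass
from typing import Callable, List

SAMPLE_RATE_HZ = 24_000  # stated above: Mimi operates at 24kHz
FRAME_RATE_HZ = 12.5     # assumption: Mimi's frame rate as published for Moshi
SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # samples per codec frame
FRAME_MS = 1000 / FRAME_RATE_HZ                          # frame duration in ms


@dataclass
class Frame:
    user_tokens: List[int]   # what the model heard during this frame
    agent_tokens: List[int]  # what the model said during this frame


StepFn = Callable[[List[int], List[int]], List[int]]


def run_dual_stream(user_frames: List[List[int]], step: StepFn) -> List[Frame]:
    """Advance both streams one frame at a time, in lockstep."""
    history: List[Frame] = []
    prev_agent: List[int] = []
    for user_tokens in user_frames:
        # The step sees the incoming user frame and the agent's previous output.
        agent_tokens = step(user_tokens, prev_agent)
        history.append(Frame(user_tokens, agent_tokens))
        prev_agent = agent_tokens
    return history


def toy_step(user_tokens: List[int], prev_agent: List[int]) -> List[int]:
    """Placeholder policy: emit a silence token (0) while the user is active,
    otherwise emit a speech token (1). A real model would predict audio tokens."""
    user_active = any(t != 0 for t in user_tokens)
    return [0] if user_active else [1]


if __name__ == "__main__":
    print(SAMPLES_PER_FRAME, FRAME_MS)
    for frame in run_dual_stream([[5], [7], [0]], toy_step):
        print(frame)
```

Note that the agent stream produces a token on every frame, even when that token represents silence. That is what distinguishes this loop from a turn-based pipeline, where output only starts once input has ended.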
Training Data
Evaluation Results
| System | Smooth Turn-Taking | User Interruption | Pause Handling | Average |
|---|---|---|---|---|
| 🥇 1o1 AI (ManjuLAB) | 90.8 | 95.0 | 100.0 | 95.3 |
| Moshi (Kyutai) | 60.6 | 82.1 | 94.1 | 78.9 |
| Gemini Live | 65.5<
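As a quick sanity check on the table, the Average column for the complete rows appears to be the unweighted mean of the three category scores, rounded to one decimal place. The snippet below is a sketch under that assumption; it uses only the two rows whose values are fully present above.

```python
from typing import Dict, Tuple

# (Smooth Turn-Taking, User Interruption, Pause Handling) from the complete rows.
SCORES: Dict[str, Tuple[float, float, float]] = {
    "1o1 AI (ManjuLAB)": (90.8, 95.0, 100.0),
    "Moshi (Kyutai)": (60.6, 82.1, 94.1),
}


def average(row: Tuple[float, ...]) -> float:
    """Unweighted mean of the category scores, rounded to one decimal."""
    return round(sum(row) / len(row), 1)


if __name__ == "__main__":
    for system, row in SCORES.items():
        print(f"{system}: {average(row)}")
```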
Citation
@article{whizyoga2026onetoone,
title = {Voice and Role Control for Full-Duplex Conversational AI},
author = {whizyoga},
year = {2026},
url = {https://yogabrata.com/research.html},
note = {ManjuLAB -- yogabrata.com}
}
Acknowledgments
1o1 is built on the Moshi architecture from Kyutai (CC-BY-4.0) and was developed by whizyoga at ManjuLAB. The work was inspired by the PersonaPlex project from NVIDIA ADLR. Code and model weights are released under the MIT License.