📄 ManjuLAB Research · May 2026

Voice and Role Control for Full-Duplex Conversational AI

ManjuLAB · yogabrata.com · whizyoga@manjulab.com · +1 425-502-1519

Overview

Conversational AI has historically forced a hard trade-off. Traditional cascaded systems (ASR to LLM to TTS) offer voice and role customization, but their turn-by-turn pipeline produces robotic conversations. Full-duplex models like Moshi made AI conversations feel natural but locked users into a single fixed voice and role.

1o1 by ManjuLAB breaks this trade-off. Select from a diverse range of voices and define any role through a plain text prompt -- no retraining required. 1o1 delivers genuinely natural conversations, handles interruptions and backchannels, and maintains your chosen persona throughout.

Key insight: By combining voice prompting (audio embedding) with text prompting (natural language role definition) in a single hybrid system prompt, 1o1 disentangles speech naturalness from task-following behavior -- enabling both without compromise.

Capabilities

Full-Duplex Interaction

1o1 listens and speaks simultaneously. This avoids the delays that cascaded systems introduce and lets the model exhibit natural conversational behaviors: knowing when to pause, interrupt, or backchannel.

Hybrid Prompting

A voice prompt (audio embedding) captures vocal characteristics. A text prompt (natural language) describes the role and context. These are processed jointly. Any role is definable at inference time -- no fine-tuning needed.
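As a rough illustration of the hybrid prompt described above, the sketch below pairs a voice prompt with a text prompt in one object and swaps the role while keeping the voice. The `HybridPrompt` class, the `with_role` helper, and the three-number embedding are hypothetical and exist only for this example; they are not the 1o1 API, and a real audio embedding would come from the model's own encoder.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class HybridPrompt:
    """Hypothetical container pairing a voice prompt with a text prompt."""

    voice_embedding: List[float]  # stand-in for the audio embedding of a reference voice
    role_text: str                # natural-language description of role and context

    def __post_init__(self) -> None:
        # Both halves are required: the voice half and the role half.
        if not self.voice_embedding:
            raise ValueError("voice_embedding must not be empty")
        if not self.role_text.strip():
            raise ValueError("role_text must not be empty")


def with_role(prompt: HybridPrompt, role_text: str) -> HybridPrompt:
    """Return a new prompt that keeps the voice and changes only the role."""
    return HybridPrompt(prompt.voice_embedding, role_text)


teacher = HybridPrompt(
    voice_embedding=[0.12, -0.40, 0.33],  # placeholder values, not a real embedding
    role_text=(
        "You are a wise and friendly teacher. Answer questions or provide "
        "advice in a clear and engaging way."
    ),
)

# Same voice, different role, with no retraining step involved.
casual = with_role(teacher, "You enjoy having a good conversation.")
print(casual.role_text)
```

Keeping the two fields separate mirrors the idea that voice and role can be varied independently at inference time.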

Natural Backchanneling

1o1 produces contextual backchannels -- "okay", "yeah", "I see" -- that signal active listening without interrupting the speaker's flow.

Demonstration Examples

Assistant

Wise Teacher

"You are a wise and friendly teacher. Answer questions or provide advice in a clear and engaging way."

Customer Service

Banking Support

"You work for First 1o1 Bank. Your name is Maya. The customer's transaction of $1,200 was flagged in Miami, FL."

Medical Office

Patient Registration

"You work for Dr. Kumar's office. Record full name, DOB, allergies, and prior conditions. Assure confidentiality."

Natural Backchanneling

Casual Conversation

"You enjoy having a good conversation."

Architecture

🎤 Voice Prompt (Audio Embedding) + 📝 Text Prompt (Role & Context) → 1o1 AI Core (7B · Dual-Stream · 24 kHz) → 🔊 Natural Speech (Full-Duplex Output)

7 Billion Parameters -- sufficient scale for broad conversational competence and generalization
Mimi Speech Codec -- ConvNet + Transformer encoder/decoder at 24 kHz
Temporal + Depth Transformers -- dual-stream allows listening and speaking concurrently
Helium Language Model -- semantic understanding enabling generalization to novel scenarios
Single-Stage Training -- real and synthetic conversations blended in one pass
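To make the dual-stream idea concrete, the following sketch advances a user stream and an agent stream in lockstep, one codec frame at a time. It assumes a 12.5 Hz frame rate, which is the figure published for Mimi and is not stated on this page. The `run_duplex` and `silent_step` functions are hypothetical placeholders; `silent_step` stands in for the model and always returns silence.

```python
from typing import Callable, List

Frame = List[float]

SAMPLE_RATE_HZ = 24_000  # from the architecture summary above
FRAME_RATE_HZ = 12.5     # assumption: Mimi's published frame rate, not stated here

SAMPLES_PER_FRAME = int(SAMPLE_RATE_HZ / FRAME_RATE_HZ)  # 1920 samples per frame
FRAME_MS = 1000 / FRAME_RATE_HZ                          # 80.0 ms per frame


def split_frames(samples: List[float], size: int) -> List[Frame]:
    """Cut audio into fixed-size frames, dropping a trailing partial frame."""
    full = len(samples) // size
    return [samples[i * size:(i + 1) * size] for i in range(full)]


def run_duplex(user_audio: List[float], step: Callable[[Frame], Frame]) -> List[Frame]:
    """Advance both streams together: one user frame in, one agent frame out."""
    return [step(frame) for frame in split_frames(user_audio, SAMPLES_PER_FRAME)]


def silent_step(user_frame: Frame) -> Frame:
    """Hypothetical stand-in for the model: always answers with silence."""
    return [0.0] * len(user_frame)


two_seconds = [0.0] * (2 * SAMPLE_RATE_HZ)
agent_frames = run_duplex(two_seconds, silent_step)
print(SAMPLES_PER_FRAME, FRAME_MS, len(agent_frames))
```

Because the agent emits a frame for every incoming user frame, it never has to wait for the user's turn to end, which is what separates this loop from a cascaded ASR to LLM to TTS pipeline.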

Training Data

Real Conversations (7,303) -- 1,217 hours from the Fisher English corpus. Source of natural backchanneling and authentic interaction patterns.
Synthetic Assistant (39,322) -- 410 hours of question-answering dialogues. Transcripts generated by LLMs, synthesized via neural TTS.
Customer Service (105,410) -- 1,840 hours across banking, medical, and restaurant scenarios with rich contextual text prompts.
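The arithmetic behind the three sets above can be checked with a short script. It sums the hours and reports each set's share of the training mix. The headline numbers (7,303, 39,322, 105,410) are stored under the neutral key `items` because the page gives them without a unit; treating them as conversation counts is an assumption.

```python
# Figures copied from the Training Data section; "items" has no unit on the page.
DATASETS = {
    "Real Conversations": {"items": 7_303, "hours": 1_217},
    "Synthetic Assistant": {"items": 39_322, "hours": 410},
    "Customer Service": {"items": 105_410, "hours": 1_840},
}

total_hours = sum(d["hours"] for d in DATASETS.values())  # 3467

# Fraction of total training hours contributed by each set.
shares = {name: d["hours"] / total_hours for name, d in DATASETS.items()}

# Mean length per item in minutes, assuming each item is one conversation.
mean_minutes = {name: d["hours"] * 60 / d["items"] for name, d in DATASETS.items()}

for name in DATASETS:
    print(f"{name}: {shares[name]:.1%} of hours, {mean_minutes[name]:.1f} min per item")
print(f"Total: {total_hours} hours")
```

Under that assumption the real conversations average roughly ten minutes each, while the synthetic dialogues are much shorter, and the combined total of 3,467 hours sits below the 5,000-hour figure quoted under Key Findings.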

Evaluation Results

System	Smooth Turn-Taking	User Interruption	Pause Handling	Average
🥇 1o1 AI (ManjuLAB)	90.8	95.0	100.0	95.3
Moshi (Kyutai)	60.6	82.1	94.1	78.9
Gemini Live	65.5	89.1	71.8	75.5
Qwen 2.5 Omni	86.7	43.9	54.7	61.8
Freeze Omni	1.8	65.3	33.6	33.6

Citation

@article{whizyoga2026onetoone,
  title  = {Voice and Role Control for Full-Duplex Conversational AI},
  author = {whizyoga},
  year   = {2026},
  url    = {https://yogabrata.com/research.html},
  note   = {ManjuLAB -- yogabrata.com}
}

Acknowledgments

1o1 is built on the Moshi architecture from Kyutai (CC-BY-4.0). This work was developed by whizyoga at ManjuLAB. The original research that inspired this work is the PersonaPlex project from NVIDIA ADLR. Code and model weights are released under the MIT License.

© 2026 ManjuLAB · whizyoga@manjulab.com · +1 425-502-1519 · 1o1.ai · yogabrata.com · github.com/1o1-ai

Key Findings

Efficient Specialization

Fewer than 5,000 hours of targeted data (about 3,467 hours across the three sets listed under Training Data) are enough to enable full task-following starting from pretrained weights.

Disentangled Naturalness

Blending synthetic and real data lets the model exhibit natural speech patterns alongside strong task-adherence.

Emergent Generalization

1o1 handles scenarios far outside training by leveraging broad pretraining from its language model foundation.

