Moshi
Publisher review
Moshi is an experimental, end-to-end speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab, and released in July 2024 after six months of work by a team of eight. It is designed for researchers, developers, and enthusiasts who need a real-time voice AI that can listen and speak continuously without explicit turn-taking. Rather than a finished assistant, Moshi is a prototype for advancing natural, expressive spoken interaction. It can be tested online from the Kyutai website, and its models and code are shared under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). It targets those exploring low-latency conversational AI, local deployment on unconnected devices, or open alternatives to proprietary real-time speech APIs.
Moshi's architecture combines the Mimi encoder-decoder with an RQ-Transformer built on the Helium LLM. Mimi, which was trained on 7 million hours of English speech using a discriminator loss and knowledge distillation from WavLM, encodes audio into eight tokens per 80-millisecond timestep. Helium, a 7B-parameter transformer, was trained on 2.1 trillion text tokens (12.5% from Wikipedia) to predict text tokens, which then guide an additional transformer in predicting the next audio token. This end-to-end design achieves very low latency, which makes conversation feel natural, although the online demo limits sessions to 5 minutes. It is designed to handle overlapping speech (up to 20% of a conversation) without being disrupted by interjections like “uh-huh.” The system also supports multimodal instruction-tuning, allowing expressive roleplay and emotional text-to-speech.
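To make the tokenization figures above concrete, here is a small arithmetic sketch. It uses only the numbers quoted in this review (eight tokens per 80-millisecond timestep, a 5-minute demo limit); the function names are illustrative and are not part of any Kyutai API.

```python
from fractions import Fraction

# Figures quoted in the review text.
FRAME_MS = 80            # one Mimi timestep covers 80 milliseconds of audio
TOKENS_PER_FRAME = 8     # Mimi emits eight tokens per timestep
DEMO_LIMIT_S = 5 * 60    # the online demo caps sessions at 5 minutes


def frames_per_second(frame_ms: int = FRAME_MS) -> Fraction:
    """Number of timesteps per second of audio (exact, as a fraction)."""
    return Fraction(1000, frame_ms)


def audio_tokens(seconds: int, frame_ms: int = FRAME_MS,
                 tokens_per_frame: int = TOKENS_PER_FRAME) -> int:
    """Audio tokens needed to represent `seconds` of a single audio stream."""
    return int(seconds * frames_per_second(frame_ms) * tokens_per_frame)


if __name__ == "__main__":
    print(float(frames_per_second()))   # timesteps per second
    print(audio_tokens(1))              # tokens per second of audio
    print(audio_tokens(DEMO_LIMIT_S))   # tokens for one full demo session
```

Under these figures, one stream of audio works out to 12.5 timesteps and 100 tokens per second, or 30,000 tokens for a full 5-minute session.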
As the first openly accessible real-time voice AI, Moshi positions itself as an open alternative to proprietary systems like OpenAI's Realtime API and ChatGPT's Advanced Voice Mode. While OpenAI's offerings are closed-source and require cloud access, Moshi's weights and code are freely available, and it can run locally on unconnected devices. However, it lags behind commercial systems in practical robustness: it struggles with loops and interjections, and its conversational intelligence is not yet at the level of more mature models like Sesame AI's CSM. Kyutai's nonprofit status and open release aim to democratize voice AI research, but Moshi remains a prototype rather than a production-ready tool.
Honest trade-offs: Moshi's low latency and open licensing come at the cost of limited practical functionality. It is not yet fully functional for real-world applications, as it struggles with conversational loops and interjections that can derail coherence. The 5-minute session limit in the demo restricts extended use, and its English-only support narrows accessibility. Local installation requires significant compute resources, and the model's intelligence is notably lower than that of closed-source competitors. For researchers, these trade-offs are acceptable for studying real-time speech architectures; for developers seeking a drop-in voice assistant, they are prohibitive.
How it works
- Low latency: Achieves very low latency for real-time conversation, enabling natural back-and-forth with minimal delay in the online demo.
- End-to-end speech-to-speech: Uses a single integrated model to process audio input and generate audio output without separate ASR or TTS stages.
- Continuous listening and responding: Always listens and generates sound, including silence, handling overlapping speech like interjections without explicit turn-taking.
- Expressive and spontaneous voice: Supports emotional text-to-speech and roleplay, with the ability to convey hesitation, cut-offs, and other spoken nuances.
- Multimodal instruction-tuning: Trained to align audio and text modalities, allowing the LLM's text predictions to inform audio generation for coherent responses.
- Local installation capability: Can be installed and run on an unconnected device, enabling safe offline operation without cloud dependency.
- Openly shared weights and code: Model weights and code are released under permissive licenses (CC-BY 4.0, Apache 2.0, MIT) for non-commercial and commercial use.
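The "continuous listening and responding" behaviour in the list above can be pictured as a loop that emits one output frame for every input frame, with silence being just another output. The sketch below is a toy illustration of that idea only: `Frame`, `full_duplex`, and `toy_policy` are hypothetical names, and the hand-written policy stands in for the model, which this review does not describe at that level of detail.

```python
from dataclasses import dataclass
from typing import Callable, List

FRAME_MS = 80  # timestep length quoted in the review


@dataclass(frozen=True)
class Frame:
    """One 80 ms slice of audio, reduced to a label for illustration."""
    label: str  # "speech", "backchannel", or "silence"


SILENCE = Frame("silence")


def full_duplex(incoming: List[Frame],
                step: Callable[[Frame, List[Frame]], Frame]) -> List[Frame]:
    """Emit exactly one output frame per input frame; there are no turns."""
    history: List[Frame] = []
    outgoing: List[Frame] = []
    for heard in incoming:
        said = step(heard, history)   # output is chosen while input arrives
        history.append(heard)
        outgoing.append(said)
    return outgoing


def toy_policy(heard: Frame, history: List[Frame]) -> Frame:
    """Hypothetical stand-in for the model (ignores history for simplicity).

    Yields to real user speech, but keeps talking over a backchannel
    such as "uh-huh" and over silence.
    """
    if heard.label == "speech":
        return SILENCE
    return Frame("speech")


if __name__ == "__main__":
    user = [Frame("speech"), Frame("speech"), Frame("silence"),
            Frame("backchannel"), Frame("silence")]
    for heard, said in zip(user, full_duplex(user, toy_policy)):
        print(f"user={heard.label:<12} system={said.label}")
```

The point of the design is that both streams always advance together, so an interjection is handled as ordinary input rather than as a signal to switch turns.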
Strengths and trade-offs
Strengths
- First real-time voice AI openly accessible to all, with weights and code freely shared under permissive licenses.
- World-first technology for smooth, natural, and expressive AI communication, demonstrated in live roleplay and coaching scenarios.
- Exceptional text-to-speech capabilities with emotion and interaction between multiple voices, as shown in the public demo.
- Can be installed locally for safe operation on unconnected devices, ensuring privacy and no cloud dependency.
Trade-offs
- Struggles with loops and interjections, leading to conversational incoherence in extended interactions.
- Not yet fully functional for practical applications, with a 5-minute session limit in the online demo.
- Conversational intelligence is notably lower than closed-source competitors like OpenAI's Advanced Voice Mode.
- English-only support and limited robustness for real-world deployment outside research contexts.
Pricing context
Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.
Getting started with Moshi
- Visit the Moshi demo page: Open your web browser and navigate to the Kyutai website's Moshi demo page. Click the start button to begin a real-time speech-to-speech conversation. The demo limits sessions to 5 minutes, so plan your test accordingly.
- Download model weights and code: Go to the Kyutai GitHub repository or official release page. Download the Moshi model weights and source code, which are available under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). Ensure you have sufficient storage and compute resources.
- Install dependencies and set up environment: Set up a Python environment with the required libraries listed in the repository's documentation. Install dependencies such as PyTorch and any audio processing tools. Follow the provided setup script to configure the environment for local execution.
- Run the local inference script: Execute the provided inference script to load the Moshi model on your local machine. Use a microphone and speakers for audio input and output. Test the system by speaking naturally and observing its real-time responses.
- Experiment with multimodal instructions: Try different conversational scenarios, such as roleplay or emotional speech, by providing textual prompts that guide the model's tone and style. Adjust parameters in the script to explore expressive capabilities and note any limitations in coherence.
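Before working through the local steps above, it can help to check that the machine is roughly ready. The following is a minimal pre-flight sketch using only the Python standard library. The module list (`torch`) and the minimum Python version are assumptions made for illustration; the repository's own documentation is the authority on actual requirements.

```python
import importlib.util
import shutil
import sys
from typing import Dict, Iterable, Tuple


def preflight(modules: Iterable[str] = ("torch",),
              path: str = ".",
              min_python: Tuple[int, int] = (3, 10)) -> Dict[str, object]:
    """Report interpreter version, free disk space, and missing modules.

    `modules` should contain top-level module names only. The defaults
    are assumptions for illustration, not documented requirements.
    """
    free_gb = shutil.disk_usage(path).free / 1e9
    missing = [m for m in modules if importlib.util.find_spec(m) is None]
    return {
        "python_ok": sys.version_info >= min_python,
        "free_disk_gb": round(free_gb, 1),
        "missing_modules": missing,
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

A non-empty `missing_modules` list or a low `free_disk_gb` value suggests returning to the download and setup steps before running inference.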
Frequently Asked Questions
What is Moshi AI and who created it?
Moshi is an experimental speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab. Released in July 2024, it enables real-time, natural voice conversations without explicit turn-taking and is openly accessible for testing online.
How does Moshi's speech-to-speech architecture work?
Moshi uses the Mimi encoder-decoder and an RQ-Transformer built on the Helium LLM. Mimi, trained on 7 million hours of English speech, encodes audio into eight tokens per 80-millisecond timestep. Helium predicts text tokens that guide audio token generation for coherent responses.
Is Moshi AI free to use and open source?
Yes, Moshi is free for both non-commercial and commercial use. Its model weights and code are released under permissive licenses including CC-BY 4.0, Apache 2.0, and MIT, making it openly accessible for researchers and developers.
Can Moshi run locally on my own device?
Yes, Moshi can be installed and run on an unconnected device, enabling safe offline operation without cloud dependency. However, local installation requires significant compute resources, and the model's intelligence is lower than closed-source competitors.
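As a rough guide to what "significant compute resources" means, the sketch below estimates the memory needed just to hold the weights of a 7B-parameter model (the Helium size quoted in this review) at common numeric precisions. It is a back-of-envelope calculation, not a measured requirement: it ignores Mimi, the additional audio transformer, activations, and caches, and the precision table is an assumption about what a given build might use.

```python
# Bytes needed to store one parameter at each precision.
BYTES_PER_PARAM = {"float32": 4.0, "bfloat16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_memory_gb(n_params: float, dtype: str) -> float:
    """Approximate memory for model weights alone, in gigabytes (1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[dtype] / 1e9


if __name__ == "__main__":
    helium_params = 7e9  # 7B parameters, as quoted in the review
    for dtype in BYTES_PER_PARAM:
        print(f"{dtype:>8}: {weight_memory_gb(helium_params, dtype):.1f} GB")
```

At 16-bit precision this comes to about 14 GB for the weights alone, which is why quantized variants matter for consumer hardware.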
How does Moshi compare to OpenAI's Advanced Voice Mode?
Moshi is an open alternative to proprietary systems like OpenAI's Realtime API and Advanced Voice Mode. While Moshi's weights and code are freely available and can run locally, it lags in conversational intelligence and robustness, struggling with loops and interjections.
What are the main limitations of Moshi for practical use?
Moshi struggles with conversational loops and interjections, leading to incoherence in extended interactions. The online demo has a 5-minute session limit, it supports English only, and its intelligence is notably lower than mature commercial models, making it a prototype for research.
Alternatives in this category
How Moshi compares
Direct head-to-head against 2 competitors. Picked by 7wData.
Moshi
- Pricing: Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.
- Target: Researchers, developers, and enthusiasts who need a real-time voice AI
- Strength: First real-time voice AI openly accessible to all, with weights and code freely shared under permissive licenses.
- Watch for: Struggles with loops and interjections, leading to conversational incoherence in extended interactions.
Calm Kids
- Pricing: $69.99/year
- Target: Children's sleep and mindfulness
- Deployment: Mobile, Web
- Strength: Broad mindfulness content
- Watch for: Limited clinical validation
Headspace for Kids
- Pricing: $69.99/year
- Target: Children's mindfulness and sleep
- Deployment: Mobile, Web
- Strength: Engaging animations
- Watch for: Less focus on sleep-specific content