Moshi

Moshi is an experimental, end-to-end speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab, and released in July 2024 after six months of work by a team of eight.

Updated 15 days ago Reviewed by 7wData

Publisher review

Moshi is designed for researchers, developers, and enthusiasts who need a real-time voice AI that can listen and speak continuously without explicit turn-taking. Rather than a finished assistant, it is a prototype for advancing natural, expressive spoken interaction. It can be tested online through the Kyutai website, and its models and code are shared under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). It targets those exploring low-latency conversational AI, local deployment on devices without a network connection, or open alternatives to proprietary real-time speech APIs.

Moshi's architecture combines the Mimi encoder-decoder, a neural audio codec, with an RQ-Transformer built on the Helium LLM. Mimi encodes audio into eight tokens per 80-millisecond timestep; it was trained on 7 million hours of English speech using a discriminator loss and knowledge distillation from WavLM. Helium, a 7B-parameter transformer, was trained on 2.1 trillion text tokens (12.5% from Wikipedia) to predict text tokens, which then guide an additional transformer in predicting the next audio token. This end-to-end design achieves very low latency, which is what makes natural conversation possible; sessions in the online demo are capped at 5 minutes. Moshi is designed to handle overlapping speech, which can account for up to 20% of a conversation, without being disrupted by interjections like “uh-huh.” The system also supports multimodal instruction-tuning, allowing expressive roleplay and emotional text-to-speech.
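
To make these figures concrete, the short calculation below derives the token rates they imply. It is a back-of-the-envelope sketch built only from the numbers quoted in this review; the constant and function names are illustrative and do not come from Kyutai's code.

```python
# Back-of-the-envelope arithmetic from the figures quoted in this review.
# All names are illustrative; none come from Kyutai's codebase.

TOKENS_PER_STEP = 8           # Mimi audio tokens per timestep
STEP_MS = 80                  # duration of one timestep, in milliseconds
DEMO_LIMIT_MIN = 5            # session cap in the online demo, in minutes
HELIUM_TEXT_TOKENS = 2.1e12   # size of Helium's text training corpus
WIKIPEDIA_SHARE = 0.125       # share of that corpus attributed to Wikipedia


def steps_per_second(step_ms: int = STEP_MS) -> float:
    """Number of timesteps that fit into one second of audio."""
    return 1000 / step_ms


def audio_tokens_per_second() -> float:
    """Audio tokens produced per second for a single audio stream."""
    return TOKENS_PER_STEP * steps_per_second()


def audio_tokens_per_session(minutes: int = DEMO_LIMIT_MIN) -> int:
    """Audio tokens for one stream over a session of the given length."""
    return round(audio_tokens_per_second() * minutes * 60)


def wikipedia_tokens() -> float:
    """Text tokens in Helium's corpus attributed to Wikipedia."""
    return HELIUM_TEXT_TOKENS * WIKIPEDIA_SHARE


if __name__ == "__main__":
    print(f"{steps_per_second():.1f} timesteps per second")
    print(f"{audio_tokens_per_second():.0f} audio tokens per second per stream")
    print(f"{audio_tokens_per_session():,} audio tokens in a 5-minute session")
    print(f"{wikipedia_tokens():.4g} Wikipedia text tokens")
```

In other words, an 80 ms timestep corresponds to 12.5 timesteps per second, so a single audio stream amounts to 100 tokens per second.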

Billed as the first openly accessible real-time voice AI, Moshi positions itself as an open alternative to proprietary systems like OpenAI's Realtime API and ChatGPT's Advanced Voice Mode. While OpenAI's offerings are closed-source and require cloud access, Moshi's weights and code are freely available, and it can run locally on devices without a network connection. However, it lags behind commercial systems in practical robustness: it can get stuck in conversational loops or be thrown off by interjections, and its conversational intelligence is not yet at the level of more mature models like Sesame AI's CSM. Kyutai's nonprofit status and open release aim to democratize voice AI research, but Moshi remains a prototype rather than a production-ready tool.

Honest trade-offs: Moshi's low latency and open licensing come at the cost of limited practical functionality. It is not yet fully functional for real-world applications, as it struggles with conversational loops and interjections that can derail coherence. The 5-minute session limit in the demo restricts extended use, and its English-only support narrows accessibility. Local installation requires significant compute resources, and the model's intelligence is notably lower than that of closed-source competitors. For researchers, these trade-offs are acceptable for studying real-time speech architectures; for developers seeking a drop-in voice assistant, they are prohibitive.
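
As a rough guide to what "significant compute resources" can mean, the sketch below estimates the memory needed just to hold the weights of a 7B-parameter model at a few numeric precisions. The parameter count is Helium's size as quoted in this review; the precision table and the function name are illustrative assumptions, not a list of formats Kyutai necessarily ships. The estimate ignores activations, caches, the Mimi codec, and every other part of the system, so real requirements will be higher.

```python
# Rough lower bound on memory for model weights alone.
# Assumption: 7e9 parameters (Helium's size as quoted in this review).
# The precision table is illustrative, not a list of released formats.

BYTES_PER_PARAM = {
    "float32": 4.0,
    "bfloat16": 2.0,
    "int8": 1.0,
    "int4": 0.5,
}


def weight_memory_gb(n_params: float, precision: str) -> float:
    """Return gigabytes (10**9 bytes) needed to store the weights."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9


if __name__ == "__main__":
    for precision in BYTES_PER_PARAM:
        size = weight_memory_gb(7e9, precision)
        print(f"{precision:>8}: {size:5.1f} GB")
```

At 16-bit precision the weights alone come to about 14 GB, which is why a consumer laptop without a capable GPU is unlikely to be a comfortable host.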

How it works

Low latency

Achieves very low latency for real-time conversation, enabling natural back-and-forth with minimal delay in the online demo. Kyutai reports a theoretical latency of 160 ms, or about 200 ms in practice.

End-to-end speech-to-speech

Uses a single integrated model to process audio input and generate audio output without separate ASR or TTS stages.

Continuous listening and responding

Always listens and always generates sound, including silence, so it can handle overlapping speech such as interjections without explicit turn-taking.

Expressive and spontaneous voice

Supports emotional text-to-speech and roleplay, with the ability to convey hesitation, cut-offs, and other spoken nuances.

Multimodal instruction-tuning

Trained to align audio and text modalities, allowing the LLM's text predictions to inform audio generation for coherent responses.

Local installation capability

Can be installed and run on a device with no network connection, enabling offline operation without cloud dependency.

Openly shared weights and code

Model weights are released under CC-BY 4.0 and the code under MIT and Apache 2.0, all of which permit both non-commercial and commercial use.
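
The behavior described above can be pictured as a fixed-rate loop: at every 80 ms timestep the system consumes one frame of incoming audio tokens and emits one frame of its own, even when that frame is silence, with a text token predicted first to guide the audio tokens. The toy sketch below illustrates only that control flow. The two predictor functions, the token ids, and the `<pad>` marker are hypothetical stand-ins that return placeholder values; they are not Moshi's models or API.

```python
from dataclasses import dataclass
from typing import List, Sequence

TOKENS_PER_FRAME = 8   # audio tokens per 80 ms timestep, as described above
SILENCE_TOKEN = 0      # hypothetical id standing in for "silence"
PAD_TEXT = "<pad>"     # hypothetical text token meaning "nothing to say"


@dataclass
class OutputFrame:
    text_token: str
    audio_tokens: List[int]


def predict_text_token(heard: Sequence[int]) -> str:
    """Hypothetical stand-in for the LLM's text prediction."""
    # Toy rule: say something only when the incoming frame is not silent.
    if any(token != SILENCE_TOKEN for token in heard):
        return "hello"
    return PAD_TEXT


def predict_audio_tokens(text_token: str) -> List[int]:
    """Hypothetical stand-in for the audio-token predictor."""
    if text_token == PAD_TEXT:
        return [SILENCE_TOKEN] * TOKENS_PER_FRAME
    # Derive placeholder ids from the text so the dependency is visible.
    return [len(text_token) + i for i in range(TOKENS_PER_FRAME)]


def run_full_duplex(incoming: Sequence[Sequence[int]]) -> List[OutputFrame]:
    """Emit exactly one output frame per input frame, with no turn-taking."""
    outputs = []
    for heard in incoming:
        text = predict_text_token(heard)      # text token first
        audio = predict_audio_tokens(text)    # audio conditioned on the text
        outputs.append(OutputFrame(text, audio))
    return outputs


if __name__ == "__main__":
    silent = [SILENCE_TOKEN] * TOKENS_PER_FRAME
    speech = [3, 1, 4, 1, 5, 9, 2, 6]
    for frame in run_full_duplex([silent, speech, silent]):
        print(frame.text_token, frame.audio_tokens)
```

The point of the sketch is structural: there is no "wait for the user to finish" branch, so input and output advance in lockstep and silence is just another output.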

Strengths and trade-offs

Strengths

Billed by Kyutai as the first voice-enabled AI openly accessible to all, with weights and code freely shared under permissive licenses.
Smooth, natural, and expressive spoken interaction, demonstrated in live roleplay and coaching scenarios.
Expressive text-to-speech with emotion and interaction between multiple voices, as shown in the public demo.
Can be installed locally for safe operation on unconnected devices, ensuring privacy and no cloud dependency.

Trade-offs

Struggles with loops and interjections, leading to conversational incoherence in extended interactions.
Not yet fully functional for practical applications, with a 5-minute session limit in the online demo.
Conversational intelligence is notably lower than that of closed-source competitors such as OpenAI's Advanced Voice Mode.
English-only support and limited robustness for real-world deployment outside research contexts.

Pricing context

Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.

Getting started with Moshi

Visit the Moshi demo page

Open your web browser and navigate to the Kyutai website's Moshi demo page. Click the start button to begin a real-time speech-to-speech conversation. The demo limits sessions to 5 minutes, so plan your test accordingly.

Download model weights and code

Go to the Kyutai GitHub repository or official release page. Download the Moshi model weights and source code, which are available under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). Ensure you have sufficient storage and compute resources.

Install dependencies and set up environment

Set up a Python environment with the required libraries listed in the repository's documentation. Install dependencies such as PyTorch and any audio processing tools. Follow the provided setup script to configure the environment for local execution.

Run the local inference script

Execute the provided inference script to load the Moshi model on your local machine. Use a microphone and speakers for audio input and output. Test the system by speaking naturally and observing its real-time responses.

Experiment with multimodal instructions

Try different conversational scenarios, such as roleplay or emotional speech, by providing textual prompts that guide the model's tone and style. Adjust parameters in the script to explore expressive capabilities and note any limitations in coherence.
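
Before attempting the local steps above, a quick preflight check can catch a missing dependency or a nearly full disk. The sketch below uses only the Python standard library. The package list and the 20 GB free-space threshold are assumptions chosen for illustration, not requirements published by Kyutai; the repository's documentation is the place to find the actual ones.

```python
import importlib.util
import shutil
from typing import Dict, Iterable, List

# Assumed for illustration; the repository's docs list the real requirements.
ASSUMED_PACKAGES = ("torch",)
ASSUMED_MIN_FREE_GB = 20.0


def missing_packages(names: Iterable[str]) -> List[str]:
    """Return the top-level packages not found in the current environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def free_disk_gb(path: str = ".") -> float:
    """Return free space, in gigabytes, on the filesystem containing path."""
    return shutil.disk_usage(path).free / 1e9


def preflight(path: str = ".") -> Dict[str, object]:
    """Collect the checks into one report without installing anything."""
    free = free_disk_gb(path)
    return {
        "missing_packages": missing_packages(ASSUMED_PACKAGES),
        "free_disk_gb": round(free, 1),
        "enough_disk": free >= ASSUMED_MIN_FREE_GB,
    }


if __name__ == "__main__":
    for key, value in preflight().items():
        print(f"{key}: {value}")
```

The check is read-only: it reports what is missing and leaves installation to the setup instructions in the repository.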

Frequently Asked Questions

What is Moshi AI and who created it?

Moshi is an experimental speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab. Released in July 2024, it enables real-time, natural voice conversations without explicit turn-taking and is openly accessible for testing online.

How does Moshi's speech-to-speech architecture work?

Moshi uses the Mimi encoder-decoder and an RQ-Transformer built on the Helium LLM. Mimi encodes audio into eight tokens per 80-millisecond timestep, trained on 7 million hours of English speech. Helium predicts text tokens that guide audio token generation for coherent responses.

Is Moshi AI free to use and open source?

Yes, Moshi is free for both non-commercial and commercial use. Its model weights and code are released under permissive licenses including CC-BY 4.0, Apache 2.0, and MIT, making it openly accessible for researchers and developers.

Can Moshi run locally on my own device?

Yes, Moshi can be installed and run on a device with no network connection, enabling offline operation without cloud dependency. However, local installation requires significant compute resources, and the model's intelligence is lower than that of closed-source competitors.

How does Moshi compare to OpenAI's Advanced Voice Mode?

Moshi is an open alternative to proprietary systems like OpenAI's Realtime API and Advanced Voice Mode. While Moshi's weights and code are freely available and can run locally, it lags in conversational intelligence and robustness, struggling with loops and interjections.

What are the main limitations of Moshi for practical use?

Moshi struggles with conversational loops and interjections, leading to incoherence in extended interactions. The online demo has a 5-minute session limit, it supports English only, and its intelligence is notably lower than mature commercial models, making it a prototype for research.

Alternatives in this category

How Moshi compares

Direct head-to-head against 2 competitors. Picked by 7wData.

Moshi
Pricing: Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.
Target: Researchers, developers, and enthusiasts exploring real-time voice AI
Strength: Billed as the first voice-enabled AI openly accessible to all, with weights and code freely shared under permissive licenses.
Watch for: Struggles with loops and interjections, leading to conversational incoherence in extended interactions.

Calm Kids
Pricing: $69.99/year
Target: Children's sleep and mindfulness
Deployment: Mobile, Web
Strength: Broad mindfulness content
Watch for: Limited clinical validation

Headspace for Kids
Pricing: $69.99/year
Target: Children's mindfulness and sleep
Deployment: Mobile, Web
Strength: Engaging animations
Watch for: Less focus on sleep-specific content

Sources

Reporting on this tool draws on these publicly available sources.

Moshi
