Moshi

Moshi is an experimental, end-to-end speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab, and released in July 2024 after six months of work by a team of eight.

Reviewed by 7wData

Publisher review

Moshi is designed for researchers, developers, and enthusiasts who need a real-time voice AI that can listen and speak continuously without explicit turn-taking. Unlike typical assistants, it is a prototype for advancing natural, expressive spoken interaction rather than a finished product. It can be tested online from the Kyutai website, and its models and code are shared under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). It targets those exploring low-latency conversational AI, local deployment on unconnected devices, or open alternatives to proprietary real-time speech APIs.

Moshi's architecture combines the Mimi encoder-decoder with an RQ-Transformer built on the Helium LLM. Mimi, trained on 7 million hours of English speech with a discriminator loss and knowledge distillation from WavLM, encodes audio into eight tokens per 80-millisecond timestep. Helium, a 7B-parameter transformer, was trained on 2.1 trillion text tokens (12.5% from Wikipedia) to predict text tokens, which then guide an additional transformer in predicting the next audio token. This end-to-end design achieves very low latency, which makes conversation feel natural, although the online demo caps sessions at 5 minutes. It is also designed to handle overlapping speech (up to 20% of a conversation) without being disrupted by interjections like “uh-huh.” The system also supports multimodal instruction-tuning, allowing expressive roleplay and emotional text-to-speech.
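To put those figures in perspective, the short calculation below works out the token rates they imply. It is only arithmetic on the numbers quoted in this review, counted for a single audio stream; the variable names are illustrative and do not come from Kyutai's code.

```python
# Back-of-the-envelope rates derived from the figures quoted in this review.
TOKENS_PER_FRAME = 8       # Mimi audio tokens per timestep
FRAME_MS = 80              # duration of one timestep, in milliseconds
DEMO_LIMIT_S = 5 * 60      # online demo session cap, in seconds

frames_per_second = 1000 / FRAME_MS                          # timesteps per second
tokens_per_second = TOKENS_PER_FRAME * frames_per_second     # audio tokens per second
frames_per_session = int(DEMO_LIMIT_S * frames_per_second)   # timesteps in a full demo session
tokens_per_session = TOKENS_PER_FRAME * frames_per_session   # audio tokens in a full demo session

HELIUM_TRAIN_TOKENS = 2.1e12   # text tokens used to train Helium
WIKIPEDIA_SHARE = 0.125        # share of that data attributed to Wikipedia
wikipedia_tokens = HELIUM_TRAIN_TOKENS * WIKIPEDIA_SHARE

print(f"{frames_per_second} frames/s, {tokens_per_second} audio tokens/s")
print(f"{frames_per_session} frames and {tokens_per_session} audio tokens per 5-minute session")
print(f"{wikipedia_tokens:.4g} Wikipedia text tokens")
```

In other words, the codec runs at 12.5 timesteps per second, so a full demo session is a few thousand model steps rather than millions of raw audio samples.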

As the first openly accessible real-time voice AI, Moshi positions itself as an open alternative to proprietary systems like OpenAI's Realtime API and ChatGPT's Advanced Voice Mode. While OpenAI's offerings are closed-source and require cloud access, Moshi's weights and code are freely available, and it can run locally on unconnected devices. However, it lags behind commercial systems in practical robustness: in practice it can get stuck in conversational loops or be thrown off by interjections, and its conversational intelligence is not yet at the level of more mature models like Sesame AI's CSM. Kyutai's nonprofit status and open release aim to democratize voice AI research, but Moshi remains a prototype rather than a production-ready tool.

Honest trade-offs: Moshi's low latency and open licensing come at the cost of limited practical functionality. It is not yet fully functional for real-world applications, as it struggles with conversational loops and interjections that can derail coherence. The 5-minute session limit in the demo restricts extended use, and its English-only support narrows accessibility. Local installation requires significant compute resources, and the model's intelligence is notably lower than that of closed-source competitors. For researchers, these trade-offs are acceptable for studying real-time speech architectures; for developers seeking a drop-in voice assistant, they are prohibitive.

How it works

  1. Low latency

    Achieves very low latency for real-time conversation, enabling natural back-and-forth with minimal delay in the online demo.

  2. End-to-end speech-to-speech

    Uses a single integrated model to process audio input and generate audio output without separate ASR or TTS stages.

  3. Continuous listening and responding

    Always listens and generates sound, including silence, handling overlapping speech like interjections without explicit turn-taking.

  4. Expressive and spontaneous voice

    Supports emotional text-to-speech and roleplay, with the ability to convey hesitation, cut-offs, and other spoken nuances.

  5. Multimodal instruction-tuning

    Trained to align audio and text modalities, allowing the LLM's text predictions to inform audio generation for coherent responses.

  6. Local installation capability

    Can be installed and run on an unconnected device, enabling safe offline operation without cloud dependency.

  7. Openly shared weights and code

    Model weights and code are released under permissive licenses (CC-BY 4.0, Apache 2.0, MIT) for non-commercial and commercial use.
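The continuous listen-and-respond behaviour described above can be pictured as a fixed-rate loop: on every 80-millisecond step the model takes in one frame of user audio and emits one frame of its own, which may simply be silence. The sketch below is a toy illustration of that control flow only. `ToyDuplexModel`, `StepOutput`, the `SILENCE` pattern, and the "speak only when the user is quiet" policy are all hypothetical stand-ins invented for this example; the real model learns when to speak and does not use any of these names.

```python
from dataclasses import dataclass
from typing import List, Optional

TOKENS_PER_FRAME = 8                 # audio tokens per timestep, per this review
SILENCE = [0] * TOKENS_PER_FRAME     # hypothetical token pattern standing in for silence


@dataclass
class StepOutput:
    text_token: Optional[str]   # text is predicted first ...
    audio_tokens: List[int]     # ... and the audio tokens are derived from it


class ToyDuplexModel:
    """Hypothetical stand-in for the real model: it replays a canned reply."""

    def __init__(self, reply_words: List[str]):
        self._pending = list(reply_words)

    def step(self, user_frame: List[int]) -> StepOutput:
        # Every step consumes one user frame and always returns one output frame.
        if user_frame == SILENCE and self._pending:
            word = self._pending.pop(0)
            # Placeholder audio tokens computed from the text token (not real codec output).
            audio = [(len(word) + i) % 1024 for i in range(TOKENS_PER_FRAME)]
            return StepOutput(word, audio)
        # Nothing to say on this step: emit an explicit silence frame.
        return StepOutput(None, list(SILENCE))


def run_session(model: ToyDuplexModel, user_frames: List[List[int]]) -> List[StepOutput]:
    # No turn-taking logic here: one output frame per input frame, in lockstep.
    return [model.step(frame) for frame in user_frames]


user_speech = [7] * TOKENS_PER_FRAME
frames = [user_speech, user_speech, SILENCE, SILENCE, SILENCE]
outputs = run_session(ToyDuplexModel(["hello", "there"]), frames)
spoken = [o.text_token for o in outputs]
print(spoken)
```

The point of the sketch is the shape of the loop, not the policy: because input and output advance together frame by frame, there is no separate "listening" or "speaking" state to switch between.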

Strengths and trade-offs

Strengths

  • First voice-enabled AI openly accessible to all, with weights and code freely shared under permissive licenses.
  • World-first technology for smooth, natural, and expressive AI communication, demonstrated in live roleplay and coaching scenarios.
  • Exceptional text-to-speech capabilities with emotion and interaction between multiple voices, as shown in the public demo.
  • Can be installed locally for safe operation on unconnected devices, ensuring privacy and no cloud dependency.

Trade-offs

  • Struggles with loops and interjections, leading to conversational incoherence in extended interactions.
  • Not yet fully functional for practical applications, with a 5-minute session limit in the online demo.
  • Conversational intelligence is notably lower than closed-source competitors like OpenAI's Advanced Voice Mode.
  • English-only support and limited robustness for real-world deployment outside research contexts.

Pricing context

Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.

Getting started with Moshi

  1. Visit the Moshi demo page

    Open your web browser and navigate to the Kyutai website's Moshi demo page. Click the start button to begin a real-time speech-to-speech conversation. The demo limits sessions to 5 minutes, so plan your test accordingly.

  2. Download model weights and code

    Go to the Kyutai GitHub repository or official release page. Download the Moshi model weights and source code, which are available under permissive licenses (CC-BY 4.0, Apache 2.0, MIT). Ensure you have sufficient storage and compute resources.

  3. Install dependencies and set up environment

    Set up a Python environment with the required libraries listed in the repository's documentation. Install dependencies such as PyTorch and any audio processing tools. Follow the provided setup script to configure the environment for local execution.

  4. Run the local inference script

    Execute the provided inference script to load the Moshi model on your local machine. Use a microphone and speakers for audio input and output. Test the system by speaking naturally and observing its real-time responses.

  5. Experiment with multimodal instructions

    Try different conversational scenarios, such as roleplay or emotional speech, by providing textual prompts that guide the model's tone and style. Adjust parameters in the script to explore expressive capabilities and note any limitations in coherence.
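Before working through the local steps above, it can help to confirm that the machine meets some basic prerequisites. The script below is a minimal sketch of such a check using only the Python standard library. The thresholds (`MIN_PYTHON`, `MIN_FREE_GB`) and the module list are assumptions chosen for illustration, not requirements published by Kyutai, so replace them with the values given in the repository's documentation.

```python
import importlib.util
import shutil
import sys

MIN_PYTHON = (3, 10)           # assumption: check the repository's stated requirement
MIN_FREE_GB = 20               # assumption: rough headroom for weights and caches
REQUIRED_MODULES = ["torch"]   # PyTorch is named in step 3; extend from the repo docs


def preflight(path: str = ".") -> dict:
    """Return a dict of simple pass/fail checks for a local install."""
    free_gb = shutil.disk_usage(path).free / 1e9
    report = {
        "python_ok": sys.version_info[:2] >= MIN_PYTHON,
        "disk_ok": free_gb >= MIN_FREE_GB,
    }
    for name in REQUIRED_MODULES:
        # find_spec returns None when a top-level module is not installed.
        report[f"{name}_installed"] = importlib.util.find_spec(name) is not None
    return report


for check, passed in preflight().items():
    print(f"{check}: {'ok' if passed else 'MISSING'}")
```

A failed check does not necessarily mean Moshi cannot run; it only flags something to verify against the official setup instructions.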

Frequently Asked Questions

What is Moshi AI and who created it?

Moshi is an experimental speech-to-speech AI system developed by Kyutai, a Paris-based nonprofit research lab. Released in July 2024, it enables real-time, natural voice conversations without explicit turn-taking and is openly accessible for testing online.

How does Moshi's speech-to-speech architecture work?

Moshi uses the Mimi encoder-decoder and an RQ-Transformer built on the Helium LLM. Mimi encodes audio into eight tokens per 80-millisecond timestep, trained on 7 million hours of English speech. Helium predicts text tokens that guide audio token generation for coherent responses.

Is Moshi AI free to use and open source?

Yes, Moshi is free for both non-commercial and commercial use. Its model weights and code are released under permissive licenses including CC-BY 4.0, Apache 2.0, and MIT, making it openly accessible for researchers and developers.

Can Moshi run locally on my own device?

Yes, Moshi can be installed and run on an unconnected device, enabling safe offline operation without cloud dependency. However, local installation requires significant compute resources, and the model's intelligence is lower than closed-source competitors.

How does Moshi compare to OpenAI's Advanced Voice Mode?

Moshi is an open alternative to proprietary systems like OpenAI's Realtime API and Advanced Voice Mode. While Moshi's weights and code are freely available and can run locally, it lags in conversational intelligence and robustness, struggling with loops and interjections.

What are the main limitations of Moshi for practical use?

Moshi struggles with conversational loops and interjections, leading to incoherence in extended interactions. The online demo has a 5-minute session limit, it supports English only, and its intelligence is notably lower than mature commercial models, making it a prototype for research.

Alternatives in this category

How Moshi compares

Direct head-to-head against 2 competitors. Picked by 7wData.

This tool

Moshi

Pricing
Free for non-commercial and commercial uses under CC-BY 4.0, Apache 2.0, and MIT licenses; no paid tiers.
Target
Researchers, developers, and enthusiasts exploring real-time, low-latency voice AI
Strength
First voice-enabled AI openly accessible to all, with weights and code freely shared under permissive licenses.
Watch for
Struggles with loops and interjections, leading to conversational incoherence in extended interactions.

Calm Kids

Pricing
$69.99/year
Target
Children's sleep and mindfulness
Deployment
Mobile, Web
Strength
Broad mindfulness content
Watch for
Limited clinical validation

Headspace for Kids

Pricing
$69.99/year
Target
Children's mindfulness and sleep
Deployment
Mobile, Web
Strength
Engaging animations
Watch for
Less focus on sleep-specific content
