
Moshi
About
Moshi is a family of speech-to-speech generation models developed by Kyutai, a French AI research laboratory. This multimodal foundation model is designed for real-time, full-duplex conversation: it can listen and speak simultaneously. Unlike conventional pipelines that chain separate speech-recognition, text-processing, and speech-synthesis components, Moshi integrates these functions into a single model, yielding natural, expressive conversation with low latency: 160 ms in theory and around 200 ms in practice. The model builds on Helium, a 7-billion-parameter language model, and uses the Mimi neural audio codec for efficient audio tokenization. It also employs an "Inner Monologue" method that improves the linguistic quality of generated speech by predicting text tokens before the corresponding audio tokens. Moshiko and Moshika are fine-tuned variants of the model.
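The "Inner Monologue" ordering can be illustrated with a minimal sketch: at each audio frame, a text token is predicted first, and the frame's audio codec tokens are then generated conditioned on it. The predictor functions below are random stand-ins, not the real Helium language model or Mimi codec, and the frame rate and codebook counts are assumptions made for illustration.

```python
import random

FRAME_MS = 80    # assumed frame duration: Mimi emits frames at 12.5 Hz
CODEBOOKS = 8    # assumed number of Mimi codebooks used per frame

def predict_text_token(history):
    # Stand-in for the Helium language-model head (hypothetical).
    return random.randrange(32000)

def predict_audio_tokens(history, text_token):
    # Stand-in for the audio heads, conditioned on the text token
    # predicted for this frame (hypothetical).
    return [random.randrange(2048) for _ in range(CODEBOOKS)]

def generate(n_frames):
    """Generate frames in Inner-Monologue order: text token first,
    then the audio tokens for the same frame."""
    history, frames = [], []
    for _ in range(n_frames):
        t = predict_text_token(history)       # 1. predict text token
        a = predict_audio_tokens(history, t)  # 2. predict audio tokens
        history.append((t, a))
        frames.append(a)
    return frames

frames = generate(5)
print(len(frames), len(frames[0]))  # 5 frames, CODEBOOKS tokens each
```

The key design point this sketch captures is the ordering constraint: because the text token for a frame is sampled before that frame's audio tokens, the text stream acts as a scaffold that the audio generation follows, which is what the paper credits for the improved linguistic quality of the speech.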