Releasing the MisoTTS
Achieving state-of-the-art emotive speech and dialogue generation with a hierarchical RVQ transformer. The Miso TTS is an 8-billion-parameter transformer model with open-source weights available on Hugging Face, and API access coming soon.
Voice is the most natural interface for AI. It is intuitive, and carries emotional meaning that text alone cannot. But today’s voice models lack the expression and responsiveness that make human conversation feel natural. Current text-to-speech models fall short for two main reasons: they cannot capture the full range of human speech with a practical token vocabulary, and they usually condition only on text, ignoring the user’s tone.
We introduce MisoTTS, an 8B-parameter model that generates speech from both text and audio context. MisoTTS uses residual vector quantization (RVQ), representing each audio token as $N$ codebook indices, one for each of $N$ codebooks with $K$ entries each, giving $K^N$ possible audio tokens. A 7.7B-parameter backbone models the text-audio sequence and predicts the first codebook index, while a 300M-parameter decoder predicts the remaining indices across RVQ depth. This design lets the model cover a much wider range of speech sounds without scaling a single flat vocabulary, and it lets the model use the user’s speech when generating a response. The result is speech that is more natural, expressive, and context-aware. The current system models individual turns and half-duplex audio; turn-taking and full-duplex conversation remain future work.
Background
Despite their strengths, transformers pose a basic challenge for voice generation: they generate from a fixed vocabulary of discrete tokens. This works well when the target space can be covered by a manageable vocabulary, but human speech is far more varied. It differs across pitch, rhythm, emphasis, emotion, accent, and many other factors. Capturing this range requires access to an enormous space of possible sounds.
The direct way to expand this space is to increase the audio token vocabulary. But in a standard transformer, larger vocabularies require more parameters, since each token needs to be represented and predicted by the model. Scaling the vocabulary far enough to cover the diversity of human speech is therefore not practical. We call this the vocabulary size problem.
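To make the vocabulary size problem concrete, here is a rough sketch of how the parameters spent on a flat vocabulary grow with its size. It assumes one input embedding row and one output projection row per token, and a hypothetical model width of 4096; neither figure comes from the MisoTTS release.

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters used by the input embedding table and the output projection.

    Assumption: each table has one d_model-sized row per token. If the two
    tables share weights (tied=True), only one table is counted.
    """
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


if __name__ == "__main__":
    d_model = 4096  # hypothetical model width, chosen only for illustration
    for vocab_size in (1_000, 1_000_000, 1_000_000_000):
        params = vocab_params(vocab_size, d_model)
        print(f"{vocab_size:>13,} tokens -> {params:,} parameters")
```

The cost is linear in the vocabulary size, so a flat vocabulary can only grow as far as the parameter budget allows.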
A second challenge is that most text-to-speech models condition only on text. Human speech is heavily conditioned on the interlocutor's tone. For instance, one responds differently to somebody yelling than to somebody whispering. When models ignore this signal, their speech can feel emotionally detached, leading to the 'uncanny valley' effect.[1]
The Miso TTS model addresses both of these issues: it uses Residual Vector Quantization (RVQ) to reach an addressable vocabulary that grows exponentially with codebook depth, and it processes both audio and text so that its generations are conditioned on the user's tone. As shown in the samples below, this produces speech that is substantially more realistic and expressive.
- Basketball commentary: fast, excited delivery with live-event pacing.
- Casual conversation: conversational timing, asides, and relaxed intonation.
- Math explanation: calm instructional speech with clear phrasing.
- Therapeutic register: soft, emotionally aware delivery with longer pauses.
Residual Vector Quantization
Residual Vector Quantization (RVQ) was proposed as a solution to the vocabulary size problem in image generation by Lee et al. in Autoregressive Image Generation using Residual Quantization,[2] and was later applied to conversational speech generation by Iribe et al. in their Sesame CSM model.[1]
The central idea of RVQ is to generate a vector token. Typically, a transformer generates a single output token index $i$. The model then looks up the $i$-th sound in its vocabulary, say $v_i$, and this is the output.
This has the severe drawback that we can only generate as many sounds as exist in our limited vocabulary. RVQ solves this by designing the model to generate a vector of token indices $(i_1, \dots, i_N)$. The model maintains $N$ separate vocabularies, one for each position in the vector. To obtain the output sound from a vector token, we first look up the corresponding sound for each position in the vector, and then combine them by vector addition.
More precisely, let the vocabulary corresponding to the $k$-th position of the vector be denoted by $C_k$, and let the $j$-th vector in $C_k$ be denoted by $C_k[j]$. Then the output sound $v$ is
$$v = \sum_{k=1}^{N} C_k[i_k].$$
We call each of the distinct vocabularies a codebook, as is standard in the literature, and the vocabulary corresponding to the $k$-th position in the vector (i.e. $C_k$) is the $k$-th codebook.
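The lookup-and-sum step can be sketched in a few lines of Python. This is a minimal illustration, not the MisoTTS implementation: the function names, the random codebook contents, and the tiny sizes are all assumptions made for the example.

```python
import random
from typing import List

Vector = List[float]
Codebook = List[Vector]


def make_codebooks(depth: int, codebook_size: int, dim: int, seed: int = 0) -> List[Codebook]:
    """Build `depth` random codebooks, each holding `codebook_size` vectors.

    Random values stand in for learned codebook entries (illustration only).
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(codebook_size)]
        for _ in range(depth)
    ]


def decode_vector_token(codebooks: List[Codebook], indices: List[int]) -> Vector:
    """Look up one entry per codebook and add the entries together."""
    if len(indices) != len(codebooks):
        raise ValueError("expected exactly one index per codebook")
    output = [0.0] * len(codebooks[0][0])
    for codebook, index in zip(codebooks, indices):
        output = [a + b for a, b in zip(output, codebook[index])]
    return output


if __name__ == "__main__":
    books = make_codebooks(depth=3, codebook_size=4, dim=2)  # hypothetical sizes
    print(decode_vector_token(books, [0, 3, 1]))
```

Each position contributes one vector, so changing any single index yields a different sum, which is where the large addressable space comes from.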
Notice that we can scale the addressable vocabulary not just through the size of the codebooks, but also through the depth of the vector. If $K$ is the codebook size and $N$ is the depth of the vector token, then the addressable vocabulary is $K^N$. To increase $K$, we must accept a linear increase in the parameter count of the model. However, as we will see later, increasing $N$ in an RVQ transformer does not require additional parameters. Since the addressable vocabulary depends exponentially on $N$, this allows us to massively scale the sonic range of the model while keeping parameter count fixed.
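The difference between scaling the codebook size and scaling the depth can be checked with a little arithmetic. The sketch below uses hypothetical values (a codebook size of 1024 and a depth of 8) purely for illustration; they are not the MisoTTS configuration.

```python
import math


def addressable_vocab(codebook_size: int, depth: int) -> int:
    """Number of distinct vector tokens: codebook_size raised to the depth."""
    return codebook_size ** depth


def orders_of_magnitude(codebook_size: int, depth: int) -> float:
    """log10 of the addressable vocabulary, computed without building the big integer."""
    return depth * math.log10(codebook_size)


if __name__ == "__main__":
    codebook_size, depth = 1024, 8  # hypothetical values, not the released config
    print("baseline:        10^%.1f" % orders_of_magnitude(codebook_size, depth))
    print("double the size: 10^%.1f" % orders_of_magnitude(2 * codebook_size, depth))
    print("double the depth: 10^%.1f" % orders_of_magnitude(codebook_size, 2 * depth))
```

Doubling the codebook size multiplies the vocabulary by a constant factor of $2^N$, while doubling the depth squares it.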
The Miso TTS 8B model combines its codebook size $K$ with a depth $N$ large enough that the $K^N$ addressable tokens outnumber the atoms in the observable universe. To achieve this vocabulary size by naive scaling would require a model 93 orders of magnitude larger than the largest models ever trained.
Architecture
Since standard transformers only generate and process scalar tokens, we need a modified architecture to handle vector inputs and outputs. At each token position, a token may either be a text token or a vector. We will maintain an embedding table for the text tokens, as well as a separate embedding table for each vector position of the audio token.
This means we maintain $N + 1$ embedding tables, where $N$ is the depth of the audio vector. To obtain the embedding of a text token, we simply look it up. To obtain the embedding $e$ of an audio token with indices $(i_1, \dots, i_N)$, we sum the embeddings of each position in the vector:
$$e = \sum_{k=1}^{N} E_k[i_k],$$
where $E_k$ refers to the embedding table corresponding to the $k$-th position in the vector. We use a separate text embedding table for text tokens.
In practice, the implementation is vectorized so that we consider only a single embedding table and every token is a fixed-size vector, but these details are not relevant to the architecture itself.
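One common way to fold several per-position tables into a single table is to stack them and offset each index by its position. The sketch below assumes that layout; the source does not say how the MisoTTS table is arranged, so `flat_index` and the row ordering are hypothetical.

```python
from typing import List

Vector = List[float]


def flat_index(position: int, index: int, codebook_size: int) -> int:
    """Row of the stacked table that holds entry `index` of table `position`.

    Assumed layout: table 0 occupies rows [0, K), table 1 rows [K, 2K), and so on.
    """
    return position * codebook_size + index


def embed_audio_flat(flat_table: List[Vector], indices: List[int], codebook_size: int) -> Vector:
    """Sum one row per vector position out of the single stacked table."""
    embedding = [0.0] * len(flat_table[0])
    for position, index in enumerate(indices):
        row = flat_table[flat_index(position, index, codebook_size)]
        embedding = [a + b for a, b in zip(embedding, row)]
    return embedding


if __name__ == "__main__":
    # Two toy per-position tables with two 1-dimensional rows each.
    per_position = [[[1.0], [2.0]], [[10.0], [20.0]]]
    stacked = [row for table in per_position for row in table]
    print(embed_audio_flat(stacked, [1, 0], codebook_size=2))
```

With this layout the sum over positions is unchanged; only the bookkeeping moves from a list of tables to an index offset.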
In order to output vector tokens, our architecture splits into two models, a backbone and a decoder. The backbone is a 7.7B-parameter transformer that processes the input embeddings and outputs $i_1$, the token index within the first codebook. We also extract the final hidden state $h$ of the backbone.
To recover the remaining codebook indices $i_2, \dots, i_N$, we use a 300M-parameter decoder transformer that runs autoregressively over depth. Given the backbone state $h$ and the embedding of $i_1$, it predicts each subsequent codebook index conditioned on the ones before it. Note that these embeddings are the same ones used in the backbone.
This allows us to reuse the same parameters to sample each position in the vector token, which scales our addressable vocabulary exponentially without increasing parameter count.
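The control flow of the backbone and decoder can be sketched as follows. This is a toy illustration under stated assumptions: `toy_backbone` and `toy_decoder` are hypothetical stand-ins that return pseudo-random logits, and greedy argmax selection is used for simplicity, whereas a real system may sample.

```python
import random
from typing import Callable, List, Sequence, Tuple

Logits = List[float]
Hidden = List[float]


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=lambda j: values[j])


def generate_vector_token(
    backbone: Callable[[Sequence[str]], Tuple[Logits, Hidden]],
    decoder: Callable[[Hidden, Sequence[int]], Logits],
    context: Sequence[str],
    depth: int,
) -> List[int]:
    """Produce one vector token: the backbone picks the first index,
    then the same decoder is called repeatedly for the remaining ones."""
    if depth < 1:
        raise ValueError("depth must be at least 1")
    first_logits, hidden = backbone(context)
    indices = [argmax(first_logits)]
    while len(indices) < depth:
        # Each step is conditioned on the backbone state and all earlier indices.
        indices.append(argmax(decoder(hidden, indices)))
    return indices


def toy_backbone(context: Sequence[str], codebook_size: int = 4) -> Tuple[Logits, Hidden]:
    """Hypothetical stand-in for the backbone: deterministic pseudo-random outputs."""
    rng = random.Random(len(context))
    hidden = [rng.random() for _ in range(3)]
    logits = [rng.random() for _ in range(codebook_size)]
    return logits, hidden


def toy_decoder(hidden: Hidden, previous: Sequence[int], codebook_size: int = 4) -> Logits:
    """Hypothetical stand-in for the depth decoder."""
    rng = random.Random(sum(previous) * 31 + len(previous))
    return [rng.random() + hidden[0] for _ in range(codebook_size)]


if __name__ == "__main__":
    print(generate_vector_token(toy_backbone, toy_decoder, ["hello"], depth=5))
```

The loop calls one decoder for every remaining position, so a larger depth means more decoding steps per token, not a new set of decoder weights.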
Note that this also allows the model to process interleaved text and audio tokens, which enables conditioning generations on the conversation history.
Limitations
While the current model can handle individual conversation turns, it cannot model the turn-taking of a conversation itself. Furthermore, this model generates half-duplex audio: it cannot speak while the interlocutor is speaking. We believe that solving these remaining problems is crucial for passing the audio Turing test.
Try the model
The model is open source under a modified MIT license, and API access is coming soon on our website.
Footnotes
1. Brendan Iribe, Ankit Kumar, and the Sesame team. Crossing the uncanny valley of conversational voice. Sesame, February 27, 2025. sesame.com/research.
2. Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. CVPR 2022. arXiv:2203.01941.
@misc{miso-miso-tts-8b,
author = {Aoden Teo and Cassidy Dalva},
title = {Releasing the MisoTTS},
year = {2026},
url = {misolabs.ai/blog/miso-tts-8b},
}