announcements · june 3, 2026

Releasing the MisoTTS

Achieving state-of-the-art emotive speech and dialogue generation with a hierarchical RVQ transformer. MisoTTS is an 8-billion-parameter transformer model with open-source weights available on Hugging Face; API access is coming soon.

Aoden Teo & Cassidy Dalva · 14 min

Voice is the most natural interface for AI. It is intuitive, and carries emotional meaning that text alone cannot. But today’s voice models lack the expression and responsiveness that make human conversation feel natural. Current text-to-speech models fall short for two main reasons: they cannot capture the full range of human speech with a practical token vocabulary, and they usually condition only on text, ignoring the user’s tone.

We introduce MisoTTS, an 8B-parameter model that generates speech from both text and audio context. MisoTTS uses residual vector quantization (RVQ), representing each audio token as $32$ codebook indices over $2048$-way codebooks, giving approximately $2048^{32}$ possible audio tokens. A 7.7B-parameter backbone models the text-audio sequence and predicts the first codebook index, while a 300M-parameter decoder predicts the remaining indices across RVQ depth. This design lets the model cover a much wider range of speech sounds without scaling a single flat vocabulary, and it lets the model use the user’s speech when generating a response. The result is speech that is more natural, expressive, and context-aware. The current system models individual turns and half-duplex audio; turn-taking and full-duplex conversation remain future work.

Background

Despite their strengths, transformers pose a basic challenge for voice generation: they generate from a fixed vocabulary of discrete tokens. This works well when the target space can be covered by a manageable vocabulary, but human speech is far more varied. It differs across pitch, rhythm, emphasis, emotion, accent, and many other factors. Capturing this range requires access to an enormous space of possible sounds.

The direct way to expand this space is to increase the audio token vocabulary. But in a standard transformer, larger vocabularies require more parameters, since each token needs to be represented and predicted by the model. Scaling the vocabulary far enough to cover the diversity of human speech is therefore not practical. We call this the vocabulary size problem.

[Figure 1, interactive: a continuous waveform (the smooth signal we want to model) quantized with one codebook, shown at V = 256 with 16 piecewise-constant segments, next to a toy linear LM head projecting to V logits (2.1M parameters in the 4096 × V head).]

Figure 1. Notice that as you scale up the vocabulary, the approximation fits the curve better, but the model becomes extremely large. In practice, the signal we are modeling (human speech) is far more diverse than this short waveform, so expressing it properly would require an even larger vocabulary.
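The linear growth described above can be checked with a few lines of arithmetic. This is only a sketch: the hidden width of 4096 is taken from the figure's "4096 x V" label, and the function name and the choice to count both an input embedding table and an output head are assumptions made for illustration.

```python
HIDDEN_SIZE = 4096  # assumed model width, taken from the figure's "4096 x V" label


def flat_vocab_params(vocab_size: int, hidden_size: int = HIDDEN_SIZE) -> int:
    """Parameters spent on a flat vocabulary of `vocab_size` tokens.

    Assumption: each token costs one input-embedding row and one
    output-head row, both of width `hidden_size`.
    """
    embedding = vocab_size * hidden_size
    lm_head = hidden_size * vocab_size
    return embedding + lm_head


# Doubling the vocabulary doubles the cost; nothing is amortized.
for v in (256, 2048, 65_536, 1_048_576):
    print(f"V = {v:>9,}: {flat_vocab_params(v):>14,} parameters")
```

Under these assumptions, V = 256 costs about 2.1M parameters, and every doubling of V doubles that figure.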

A second challenge is that most text-to-speech models condition only on text. Human speech is heavily conditioned on the interlocutor's tone. For instance, one responds differently to somebody yelling than to somebody whispering. When models ignore this signal, their speech can feel emotionally detached, leading to the 'uncanny valley' effect.[1]

MisoTTS addresses both of these issues. It uses Residual Vector Quantization (RVQ) to reach $2048^{32} \approx 9.2 \times 10^{105}$ addressable tokens, and it processes both audio and text so that its generations are conditioned on the user's tone. As shown in the samples below, this produces speech that is substantially more realistic and expressive.

Audio samples (player controls omitted):

Basketball commentary (1:04). Fast, excited delivery with live-event pacing.

Casual conversation (0:09). Conversational timing, asides, and relaxed intonation.

Math explanation (0:28). Calm instructional speech with clear phrasing.

Therapeutic register (0:05). Soft, emotionally aware delivery with longer pauses.

Residual Vector Quantization

Residual Vector Quantization (RVQ) is a long-standing vector quantization technique. Lee et al. applied it to the vocabulary size problem for image generation in Autoregressive Image Generation using Residual Quantization,[2] and Iribe et al. used it for conversational speech generation in their Sesame CSM model.[1]

The central idea with RVQ is to generate a vector token. Typically, a transformer will generate a single output token index $k$. The model will then look up the $k$th sound in its vocabulary, say $v_k$, and this will be the output.

This has the severe drawback that we can only generate as many sounds as exist in our limited vocabulary. RVQ solves this by designing the model to generate a vector of tokens $[k_1 \cdots k_d]$. The model maintains $d$ separate vocabularies, one for each position in the vector. To obtain the corresponding output sound from a vector token, we will first look up the corresponding sound for each position in the vector, and then add them together through vector addition.

More precisely, let the vocabulary corresponding to the $j$th position of the vector be denoted by $C_j$. Let the $i$th vector in the vocabulary $C_j$ be denoted by $C_j(i)$. Then, we have that

$$\text{Sound}\left(\begin{bmatrix} k_1 \\ \vdots \\ k_d \end{bmatrix}\right) = \sum_{j = 1}^d C_j(k_j).$$

We will call each of the distinct vocabularies codebooks, as is standard in the literature, and the vocabulary corresponding to the $j$th position in the vector (i.e. $C_j$) will be called the $j$th codebook.
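The lookup-and-sum rule above can be written out directly. The following is a minimal sketch in plain Python, not the model's implementation: the function name `rvq_decode` and the toy codebook values are made up for illustration, and indices are 0-based in code where the text counts from 1.

```python
def rvq_decode(indices, codebooks):
    """Sum one codebook vector per depth position: Sound = sum_j C_j(k_j)."""
    if len(indices) != len(codebooks):
        raise ValueError("need exactly one index per codebook")
    dim = len(codebooks[0][0])
    out = [0.0] * dim
    for k_j, codebook in zip(indices, codebooks):
        for axis, value in enumerate(codebook[k_j]):
            out[axis] += value  # plain vector addition across depth
    return out


# Toy setup (illustrative values only): d = 2 codebooks, V = 3 entries each,
# every entry a 2-dimensional vector.
codebooks = [
    [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]],      # first codebook: coarse entries
    [[-0.25, 0.0], [0.0, 0.0], [0.25, 0.25]],  # second codebook: finer corrections
]

print(rvq_decode([1, 2], codebooks))  # entry 1 of the first book plus entry 2 of the second
```

With these two codebooks, 3 × 3 = 9 distinct outputs are reachable even though only 6 vectors are stored.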

Notice that we can scale our addressable vocabulary not only through the size of the codebooks, but also through the depth $d$ of the vector. If $V$ is the codebook size and $d$ the depth of the vector token, then the addressable vocabulary is $V^d$. Increasing $V$ costs a linear increase in the model's parameter count. However, as we will see later, increasing $d$ in an RVQ transformer does not require a larger transformer: the same decoder weights are reused at every depth, and only the per-position embedding tables grow, linearly in $d$. Since the addressable vocabulary depends exponentially on $d$, this lets us massively scale the sonic range of the model while keeping the parameter count essentially fixed.

[Figure 2, interactive: sliders set the codebook size V and depth D (shown at V = 3, D = 2, giving V^D = 9 segments). A continuous waveform (the target signal before quantization) is shown with its residual reconstruction, where each codebook adds another piecewise refinement, next to a toy linear LM head projecting to V logits (25K parameters in the 4096 × V head).]

Figure 2. Notice that by increasing the depth d, we can get extremely good approximations with far simpler networks. This is because increasing d imposes no cost on network complexity, but scales addressable vocabulary exponentially.
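The effect of depth can also be reproduced numerically on a scalar toy signal. The sketch below is a hypothetical construction, not the model's codec: it builds hand-made scalar codebooks for values in [0, 1), where the codebook at depth j refines at scale 1 / V^j, and encodes greedily by quantizing the residual left over from the previous depth. All function names are illustrative.

```python
def make_codebooks(vocab_size, depth):
    """Scalar codebooks for values in [0, 1): depth j refines at scale 1 / V**j."""
    centre = (vocab_size - 1) / 2
    books = []
    for j in range(1, depth + 1):
        # The first codebook covers [0, 1); later ones are centred on zero
        # because they only correct the remaining residual.
        offset = 0.5 if j == 1 else 0.0
        books.append([(i - centre) / vocab_size**j + offset for i in range(vocab_size)])
    return books


def rvq_encode(x, codebooks):
    """Greedy residual quantization: pick the nearest entry, subtract it, repeat."""
    indices, residual = [], x
    for book in codebooks:
        k = min(range(len(book)), key=lambda i: abs(residual - book[i]))
        indices.append(k)
        residual -= book[k]
    return indices, residual


def worst_error(vocab_size, depth, samples=1000):
    """Largest reconstruction error over evenly spaced test values in [0, 1)."""
    books = make_codebooks(vocab_size, depth)
    return max(abs(rvq_encode(n / samples, books)[1]) for n in range(samples))


for d in (1, 2, 3, 4):
    print(f"V = 3, d = {d}: {3**d:>3} levels, worst error {worst_error(3, d):.4f}")
```

Each extra depth level stores only V more scalars but divides the worst-case error by V, which is the exponential-for-linear trade that the figure illustrates.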

The Miso TTS 8B model uses a codebook size of $2048$ with a depth of $32$. This gives us $2048^{32}$ addressable tokens, which is more tokens than there are atoms in the observable universe. Achieving this vocabulary size by naive scaling would require a model roughly 93 orders of magnitude larger than the largest models ever trained.
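The size of this number is easy to verify with Python's exact integer arithmetic. In the sketch below, the figure of 10^80 atoms is a commonly quoted rough estimate that we assume for the comparison; it does not come from the model itself.

```python
import math

V, d = 2048, 32
addressable = V**d                      # exact big-integer arithmetic
assert addressable == 2**352            # 2048 = 2**11, and 11 * 32 = 352

print(f"log10(V^d)     = {d * math.log10(V):.2f}")
print(f"decimal digits = {len(str(addressable))}")

# Assumption: a commonly quoted rough estimate, used only for scale.
ATOMS_IN_OBSERVABLE_UNIVERSE = 10**80
print("more tokens than atoms:", addressable > ATOMS_IN_OBSERVABLE_UNIVERSE)
```

The logarithm comes out just under 106, so $2048^{32}$ is a 106-digit number, about $9.2 \times 10^{105}$.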

Architecture

Since standard transformers only generate and process scalar tokens, we need a modified architecture to handle vector inputs and outputs. At each token position, a token may either be a text token or a vector. We will maintain an embedding table for the text tokens, as well as a separate embedding table for each vector position of the audio token.

This means we maintain $d + 1$ embedding tables. To obtain the embedding of a text token, we simply look it up. To obtain the embedding of an audio token, we sum the embeddings of each position in the vector.

$$\text{Embedding}\left( \begin{bmatrix} k_1 \\ \vdots \\ k_d \end{bmatrix}\right) = \sum_{j = 1}^d E_j(k_j),$$

where $E_j$ refers to the embedding table corresponding to the $j$th position in the vector. We use a separate text embedding table for text tokens.
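As a rough illustration of this embedding scheme, the sketch below builds one text table and $d$ audio tables with toy sizes and random values, then embeds a short interleaved sequence. The sizes, the `("text", ...)` / `("audio", ...)` token representation, and the helper names are all assumptions made for the example, not details of the released model.

```python
import random

random.seed(0)
HIDDEN, V, D, TEXT_VOCAB = 4, 8, 3, 10  # toy sizes, not the model's real dimensions


def make_table(rows, width):
    """A randomly initialised embedding table with `rows` vectors of size `width`."""
    return [[random.gauss(0.0, 1.0) for _ in range(width)] for _ in range(rows)]


text_table = make_table(TEXT_VOCAB, HIDDEN)               # one table for text tokens
audio_tables = [make_table(V, HIDDEN) for _ in range(D)]  # E_1 ... E_d, one per position


def embed(token):
    """token is ("text", index) or ("audio", [k_1, ..., k_d])."""
    kind, value = token
    if kind == "text":
        return list(text_table[value])
    rows = [audio_tables[j][k] for j, k in enumerate(value)]
    return [sum(column) for column in zip(*rows)]  # element-wise sum over depth


sequence = [("text", 3), ("audio", [1, 7, 2]), ("audio", [0, 0, 5])]
embedded = [embed(tok) for tok in sequence]
print(len(embedded), "positions, each of width", len(embedded[0]))
```

Every position ends up with a single vector of the same width, so the backbone does not need to know whether a position held a text token or an audio vector token.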

[Figure 3: an audio vector token (k_1, k_2, ..., k_d) is embedded as the sum of lookups E_1(k_1) + E_2(k_2) + ... + E_d(k_d), which takes its place in the embedded sequence s.]

Figure 3. Each sequence position is embedded before entering the backbone. Text tokens use a text embedding table; audio vector tokens sum one shared codebook embedding lookup per vector position.

In practice, the implementation is vectorized so that we consider only a single embedding table and all tokens are vectors of size $d+1$, but these details are not relevant to the architecture itself.
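For readers curious what such a vectorization might look like, here is one plausible layout. It is purely an assumption and not a description of the released code: the $d$ audio tables and the text table are stacked into a single table, every position carries $d + 1$ indices (the last slot being the text slot), and a mask marks which slots are active. All names and sizes below are hypothetical.

```python
V, D, TEXT_VOCAB, HIDDEN = 8, 3, 10, 4   # toy sizes (assumed)
ROWS = D * V + TEXT_VOCAB                # d audio blocks followed by one text block

# Deterministic toy table so the example is easy to inspect.
table = [[float(r * HIDDEN + c) for c in range(HIDDEN)] for r in range(ROWS)]


def flat_row(slot, index):
    """Row in the single table for `index` in slot 0..D (slot D is the text slot)."""
    return slot * V + index


def embed_frame(frame, mask):
    """frame: d + 1 indices; mask: d + 1 booleans marking the active slots."""
    out = [0.0] * HIDDEN
    for slot, (index, active) in enumerate(zip(frame, mask)):
        if active:
            row = table[flat_row(slot, index)]
            out = [a + b for a, b in zip(out, row)]
    return out


text_frame = ([0, 0, 0, 3], [False, False, False, True])   # only the text slot is used
audio_frame = ([1, 7, 2, 0], [True, True, True, False])    # only the audio slots are used
print(embed_frame(*text_frame))
print(embed_frame(*audio_frame))
```

With this layout, text and audio positions share one tensor shape, which is convenient for batching; the result is the same as the per-table sums described above.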

In order to output vector tokens, our architecture splits into two models, a backbone and a decoder. The backbone is a 7.7B parameter transformer model that processes the input embeddings, and outputs the value of $k_1$, the token index within the first codebook. We will also extract the final hidden state of the backbone model.

$$\text{Backbone}(s) = (h_0, k_1).$$

To recover the remaining codebook indices $k_2, \dots, k_d$, we use a 300M-parameter decoder transformer that runs autoregressively over depth. Given the backbone state $h_0$ and the embedding $E_1(k_1)$, it predicts each subsequent codebook index conditioned on the ones before it. Note here that the embeddings are the same as used in the backbone.

$$\begin{aligned} k_2 &= \text{Decoder}\!\left(h_0, E_1(k_1)\right), \\ k_3 &= \text{Decoder}\!\left(h_0, E_1(k_1), E_2(k_2)\right), \\ &\ \vdots \\ k_j &= \text{Decoder}\!\left(h_0, E_1(k_1), \ldots, E_{j-1}(k_{j-1})\right), \\ &\ \vdots \\ k_d &= \text{Decoder}\!\left(h_0, E_1(k_1), \ldots, E_{d-1}(k_{d-1})\right). \end{aligned}$$

This allows us to reuse the same 300M parameters to sample every position in the vector token, so the addressable vocabulary scales exponentially in $d$ without increasing the decoder's parameter count.
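The control flow of the two-stage generation can be sketched as a short loop. The `toy_backbone` and `toy_decoder` functions below are arithmetic stand-ins invented for this example, not the real networks; for brevity the stand-in decoder receives the previous indices directly instead of their embeddings. Only the calling pattern, one backbone call followed by $d - 1$ decoder calls that share the same weights, is meant to mirror the equations above.

```python
V, D = 2048, 32  # codebook size and depth from the text


def toy_backbone(sequence):
    """Stand-in for the backbone: returns (h_0, k_1). Not a real model."""
    h0 = float(len(sequence))
    k1 = (7 * len(sequence)) % V
    return h0, k1


def toy_decoder(h0, previous_indices):
    """Stand-in for the depth decoder: maps (h_0, k_1..k_{j-1}) to k_j.

    The real decoder would consume the embeddings E_1(k_1), ..., E_{j-1}(k_{j-1});
    indices are passed here only to keep the sketch short.
    """
    return (int(h0) + 31 * sum(previous_indices) + len(previous_indices)) % V


def generate_frame(sequence, depth=D):
    h0, k1 = toy_backbone(sequence)   # one backbone call per frame
    frame = [k1]
    while len(frame) < depth:         # d - 1 decoder calls, same weights each time
        frame.append(toy_decoder(h0, frame))
    return frame


frame = generate_frame(["hello", "world"])
print(len(frame), "codebook indices:", frame[:4], "...")
```

Because the same decoder is called at every depth, going from a shallow to a deep vector token changes the number of decoder calls per frame rather than the size of the decoder.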

[Figure 4: Miso TTS 8B, two transformers, one vector token. A sequence of tokens (labelled A, T, T, A) enters the backbone (7.7B, runs over time), which outputs (h_0, k_1); the depth decoder (300M, runs over depth) then produces one frame token k_1, k_2, k_3, ..., k_32.]

Figure 4. A 7.7B-parameter temporal backbone predicts the first codebook index and hidden state. A smaller depth transformer then reuses the same 300M parameters to autoregressively recover the remaining codebook indices inside the frame.

Note that this also allows the model to process interleaved text and audio tokens, which enables conditioning generations on the conversation history.

Limitations

While the current model can model individual conversation turns, it cannot model the turn-taking of a conversation itself. Furthermore, this model generates half-duplex audio: it cannot speak while the interlocutor is speaking. We believe that solving these remaining problems is crucial for passing the audio Turing test.

Try the model

The model is open source under a modified MIT license, and API access is coming soon on our website.

Footnotes

  1. Brendan Iribe, Ankit Kumar, and the Sesame team. Crossing the uncanny valley of conversational voice. Sesame, February 27, 2025. sesame.com/research.

  2. Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. CVPR 2022. arXiv:2203.01941.

written by
Aoden Teo
Co-Founder and CEO
Cassidy Dalva
Co-Founder and President
cite
@misc{miso-miso-tts-8b,
  author = {Aoden Teo and Cassidy Dalva},
  title  = {Releasing the MisoTTS},
  year   = {2026},
  url    = {misolabs.ai/blog/miso-tts-8b},
}