announcements · june 3, 2026

Releasing the MisoTTS

Achieving state-of-the-art emotive speech and dialogue generation with a hierarchical RVQ transformer. The Miso TTS is an 8-billion-parameter transformer model with open-source weights available on Hugging Face, and API access coming soon.

Aoden Teo & Cassidy Dalva · 14 min

Voice is the most natural interface for AI. It is intuitive, and carries emotional meaning that text alone cannot. But today’s voice models lack the expression and responsiveness that make human conversation feel natural. Current text-to-speech models fall short for two main reasons: they cannot capture the full range of human speech with a practical token vocabulary, and they usually condition only on text, ignoring the user’s tone.

We introduce MisoTTS, an 8B-parameter model that generates speech from both text and audio context. MisoTTS uses residual vector quantization (RVQ), representing each audio token as $32$ codebook indices over $2048$-way codebooks, giving approximately $2048^{32}$ possible audio tokens. A 7.7B-parameter backbone models the text-audio sequence and predicts the first codebook index, while a 300M-parameter decoder predicts the remaining indices across RVQ depth. This design lets the model cover a much wider range of speech sounds without scaling a single flat vocabulary, and it lets the model use the user’s speech when generating a response. The result is speech that is more natural, expressive, and context-aware. The current system models individual turns and half-duplex audio; turn-taking and full-duplex conversation remain future work.

Background

Despite their strengths, transformers pose a basic challenge for voice generation: they generate from a fixed vocabulary of discrete tokens. This works well when the target space can be covered by a manageable vocabulary, but human speech is far more varied. It differs across pitch, rhythm, emphasis, emotion, accent, and many other factors. Capturing this range requires access to an enormous space of possible sounds.

The direct way to expand this space is to increase the audio token vocabulary. But in a standard transformer, larger vocabularies require more parameters, since each token needs to be represented and predicted by the model. Scaling the vocabulary far enough to cover the diversity of human speech is therefore not practical. We call this the vocabulary size problem.
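To make the scaling concrete, here is a minimal sketch that counts the parameters a flat vocabulary adds to a transformer. The hidden width of 4096 and the choice of an untied embedding table and output head are illustrative assumptions, and `flat_vocab_params` is a hypothetical helper, not part of MisoTTS.

```python
def flat_vocab_params(vocab_size: int, hidden_dim: int = 4096, tied: bool = False) -> int:
    """Count parameters in the token embedding table and the output (LM) head."""
    embedding = vocab_size * hidden_dim  # one hidden_dim vector per token
    head = 0 if tied else vocab_size * hidden_dim  # one logit row per token
    return embedding + head


if __name__ == "__main__":
    # Assumed hidden width of 4096; the cost grows linearly with V.
    for v in (256, 2048, 65_536, 1_048_576):
        print(f"V = {v:>9,}: {flat_vocab_params(v):>14,} parameters")
```

The count is linear in the vocabulary size, so every doubling of the vocabulary doubles the cost of these two layers.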

[Interactive figure. Panels: continuous waveform (the smooth signal we want to model); quantized with one codebook (16 piecewise-constant segments from V = 256); linear LM head (256 logits, 2.1M params in 4096 x V).]

Figure 1. Notice that as you scale up the vocabulary, the approximation better fits the curve, but the model becomes extremely large. In practice, the signal that we are modeling (human speech) is way more diverse than this short waveform, and so to properly express it, we would need an even larger vocabulary.

A second challenge is that most text-to-speech models condition only on text. Human speech is heavily conditioned on the interlocutor's tone. For instance, one responds differently to somebody yelling than to somebody whispering. When models ignore this signal, their speech can feel emotionally detached, leading to the 'uncanny valley' effect.¹

The Miso TTS model addresses both of these issues, using Residual Vector Quantization (RVQ) to achieve a vocabulary of $2048^{32} \approx 9 \times 10^{105}$ addressable tokens, and processing both audio and text to condition its generations on the user's tone. As shown in the samples below, this produces speech that is substantially more realistic and expressive.

[Audio samples]

Basketball commentary (1:04). Fast, excited delivery with live-event pacing.

Casual conversation (0:09). Conversational timing, asides, and relaxed intonation.

Math explanation (0:28). Calm instructional speech with clear phrasing.

Therapeutic register (0:05). Soft, emotionally aware delivery with longer pauses.

Residual Vector Quantization

Residual Vector Quantization (RVQ) was applied to the vocabulary size problem in the context of image generation by Lee et al. in Autoregressive Image Generation using Residual Quantization,² and to conversational speech generation by Iribe et al. in their Sesame CSM model.¹

The central idea with RVQ is to generate a vector token. Typically, a transformer will generate a single output token index $k$. The model will then look up the $k$th sound in its vocabulary, say $v_k$, and this will be the output.

This has the severe drawback that we can only generate as many sounds as exist in our limited vocabulary. RVQ solves this by designing the model to generate a vector of tokens $[k_1 \cdots k_d]$. The model maintains $d$ separate vocabularies, one for each position in the vector. To obtain the corresponding output sound from a vector token, we will first look up the corresponding sound for each position in the vector, and then add them together through vector addition.

More precisely, let the vocabulary corresponding to the $j$th position of the vector be denoted by $C_j$. Let the $i$th vector in the vocabulary $C_j$ be denoted by $C_j(i)$. Then, we have that

$$\text{Sound}\left(\begin{bmatrix} k_1 \\ \vdots \\ k_d \end{bmatrix}\right) = \sum_{j = 1}^d C_j(k_j).$$

We will call each of the distinct vocabularies codebooks, as is standard in the literature, and the vocabulary corresponding to the $j$th position in the vector (i.e. $C_j$) will be called the $j$th codebook.
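The lookup-and-sum rule can be sketched in a few lines. The decoder below follows the formula above directly. The encoder is the standard greedy residual scheme, in which each codebook quantizes whatever the earlier codebooks missed; we include it as an assumption about how indices are produced, since this post does not describe the audio codec itself. The toy codebooks and function names are hypothetical.

```python
def rvq_decode(codebooks, indices):
    """Sum the selected vector from each codebook: sum_j C_j(k_j)."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, k in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[k])]
    return out


def rvq_encode(codebooks, target):
    """Greedy residual encoding (assumed scheme): quantize, subtract, repeat."""
    residual = list(target)
    indices = []
    for codebook in codebooks:
        # Pick the entry nearest to the current residual (squared distance).
        k = min(
            range(len(codebook)),
            key=lambda i: sum((r - c) ** 2 for r, c in zip(residual, codebook[i])),
        )
        indices.append(k)
        residual = [r - c for r, c in zip(residual, codebook[k])]
    return indices


# Toy setup: d = 3 codebooks of V = 3 one-dimensional entries, coarse to fine.
TOY_CODEBOOKS = [
    [[-1.0], [0.0], [1.0]],
    [[-0.3], [0.0], [0.3]],
    [[-0.1], [0.0], [0.1]],
]

if __name__ == "__main__":
    codes = rvq_encode(TOY_CODEBOOKS, [0.72])
    print(codes, rvq_decode(TOY_CODEBOOKS, codes))
```

Each later codebook only has to correct the error left by the earlier ones, which is why adding depth refines the approximation.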

Notice that we can scale the addressable vocabulary not just through the size of the codebooks, but also through the depth $d$ of the vector. If $V$ is the codebook size and $d$ is the depth of the vector token, then the addressable vocabulary is $V^d$. Increasing $V$ costs a linear increase in the model's parameter count. However, as we will see later, increasing $d$ in an RVQ transformer requires no additional transformer parameters, only one more embedding table per codebook. Since the addressable vocabulary grows exponentially in $d$, this allows us to massively scale the sonic range of the model while keeping the transformer's parameter count fixed.
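As a rough illustration of this asymmetry, the sketch below compares the number of addressable vector tokens, $V^d$, with the number of codebook entries that actually have to be stored, $d \cdot V$. The helper names are hypothetical, and the row count ignores the width of each entry.

```python
def addressable_vocab(codebook_size: int, depth: int) -> int:
    """Number of distinct vector tokens: V ** d."""
    return codebook_size ** depth


def codebook_rows(codebook_size: int, depth: int) -> int:
    """Total codebook entries that must be stored: d * V."""
    return codebook_size * depth


if __name__ == "__main__":
    V = 2048
    for d in (1, 2, 4, 8, 32):
        digits = len(str(addressable_vocab(V, d)))
        print(f"d = {d:>2}: {codebook_rows(V, d):>6} stored rows, "
              f"addressable vocabulary has {digits} decimal digits")
```

Stored rows grow linearly with depth while the addressable vocabulary grows exponentially.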

[Interactive figure, shown at V = 3 and D = 2, giving V^D = 9 segments. Panels: continuous waveform (the target signal before quantization); residual reconstruction (each codebook adds another piecewise refinement); linear LM head (3 logits, 25K params in 4096 x V).]

Figure 2. Notice that by increasing the depth d, we can get extremely good approximations with far simpler networks. This is because increasing d imposes no cost on network complexity, but scales addressable vocabulary exponentially.

The Miso TTS 8B model uses a codebook size of $2048$ with a depth of $32$. This gives us $2048^{32}$ addressable tokens, which is more than the number of atoms in the observable universe. Achieving this vocabulary size by naive scaling would require a model 93 orders of magnitude larger than the largest models ever trained.
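As a quick check on the size of this number, the arithmetic can be worked out by hand. The figure of roughly $10^{80}$ atoms in the observable universe is the commonly quoted estimate and is an outside assumption, not something derived in this post.

$$
2048^{32} = \left(2^{11}\right)^{32} = 2^{352} = 10^{352 \log_{10} 2} \approx 10^{105.96} \approx 9.2 \times 10^{105} \gg 10^{80}.
$$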

Architecture

Since standard transformers only generate and process scalar tokens, we need a modified architecture to handle vector inputs and outputs. At each token position, a token may either be a text token or a vector. We will maintain an embedding table for the text tokens, as well as a separate embedding table for each vector position of the audio token.

This means we maintain $d + 1$ embedding tables. To obtain the embedding of a text token, we simply look it up. To obtain the embedding of an audio token, we sum the embeddings of each position in the vector.

$$\text{Embedding}\left( \begin{bmatrix} k_1 \\ \vdots \\ k_d \end{bmatrix}\right) = \sum_{j = 1}^d E_j(k_j),$$

where $E_j$ refers to the embedding table corresponding to the $j$th position in the vector. We use a separate text embedding table for text tokens.

Figure 3. Each sequence position is embedded before entering the backbone. Text tokens use a text embedding table; audio vector tokens sum one shared codebook embedding lookup per vector position.

In practice, the implementation is vectorized so that we consider only a single embedding table and all tokens are vectors of size $d+1$, but these details are not relevant to the architecture itself.
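For readers curious what such a vectorized layout might look like, here is one possible sketch. It is an assumption on our part, not the released implementation: the shared table stacks $d$ blocks of $V$ audio rows followed by the text rows, each token is a frame of $d+1$ slots, and a mask marks which slots are active. The names `build_table` and `embed_frame` are hypothetical.

```python
import random


def build_table(depth, codebook_size, text_vocab, dim, seed=0):
    """One shared table: d blocks of V audio rows, then the text rows."""
    rng = random.Random(seed)
    rows = depth * codebook_size + text_vocab
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(rows)]


def embed_frame(table, frame, mask, codebook_size):
    """Embed one (d + 1)-slot token: slots 0..d-1 are audio, slot d is text.

    Active slots are looked up at an offset of slot * codebook_size and summed,
    so an audio token sums d rows and a text token uses a single row.
    """
    out = [0.0] * len(table[0])
    for slot, (token, active) in enumerate(zip(frame, mask)):
        if active:
            row = table[slot * codebook_size + token]
            out = [o + r for o, r in zip(out, row)]
    return out


if __name__ == "__main__":
    d, V, T, dim = 2, 3, 5, 4  # toy sizes, not the model's
    table = build_table(d, V, T, dim)
    audio = embed_frame(table, [1, 2, 0], [True, True, False], V)
    text = embed_frame(table, [0, 0, 4], [False, False, True], V)
    print(audio)
    print(text)
```

With this layout, text and audio positions share one code path, and the mask decides whether a position contributes a single text row or a sum of $d$ audio rows.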

In order to output vector tokens, our architecture splits into two models, a backbone and a decoder. The backbone is a 7.7B-parameter transformer that processes the input embeddings and outputs $k_1$, the token index within the first codebook. We also extract the final hidden state of the backbone. Writing $s$ for the input sequence and $h_0$ for that hidden state, we have

$$\text{Backbone}(s) = (h_0, k_1).$$

To recover the remaining codebook indices $k_2, \dots, k_d$, we use a 300M-parameter decoder transformer that runs autoregressively over depth. Given the backbone state $h_0$ and the embedding $E_1(k_1)$, it predicts each subsequent codebook index conditioned on the ones before it. Note that these embeddings are the same ones used in the backbone.

$$\begin{aligned} k_2 &= \text{Decoder}\!\left(h_0, E_1(k_1)\right), \\ k_3 &= \text{Decoder}\!\left(h_0, E_1(k_1), E_2(k_2)\right), \\ &\ \vdots \\ k_j &= \text{Decoder}\!\left(h_0, E_1(k_1), \ldots, E_{j-1}(k_{j-1})\right), \\ &\ \vdots \\ k_d &= \text{Decoder}\!\left(h_0, E_1(k_1), \ldots, E_{d-1}(k_{d-1})\right). \end{aligned}$$

This allows us to reuse the same 300M parameters to sample each position in the vector token, which scales our addressable vocabulary exponentially without increasing parameter count.
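The control flow for producing one frame can be sketched as follows. The two model functions are hypothetical random stand-ins that only mimic the interfaces described above (the real backbone and decoder are transformers), so the sketch shows the loop structure and not the models themselves.

```python
import random

DEPTH = 32            # codebooks per frame, from the post
CODEBOOK_SIZE = 2048  # entries per codebook, from the post


def toy_backbone(sequence, rng):
    """Hypothetical stand-in for the backbone: returns (h_0, k_1)."""
    h0 = [rng.random() for _ in range(8)]  # placeholder hidden state
    k1 = rng.randrange(CODEBOOK_SIZE)
    return h0, k1


def toy_decoder(h0, indices, rng):
    """Hypothetical stand-in for the depth decoder: returns the next index.

    A real decoder would embed the earlier indices with E_1..E_{j-1} and
    attend over them together with h_0; here we only mimic the interface.
    """
    return rng.randrange(CODEBOOK_SIZE)


def generate_frame(sequence, backbone, decoder, rng, depth=DEPTH):
    """Produce one vector token [k_1, ..., k_d] for the next audio frame."""
    h0, k1 = backbone(sequence, rng)  # one backbone pass per frame
    indices = [k1]
    for _ in range(depth - 1):        # the same decoder is reused at every depth
        indices.append(decoder(h0, indices, rng))
    return indices


if __name__ == "__main__":
    frame = generate_frame([], toy_backbone, toy_decoder, random.Random(0))
    print(len(frame), frame[:4])
```

Each frame costs one pass through the large backbone and $d - 1$ passes through the much smaller decoder.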

[Figure: Miso TTS 8B, two transformers and one vector token. A sequence (labeled ATTA) enters the backbone (7.7B parameters, over time), which outputs (h_0, k_1). The depth decoder (300M parameters, over depth) then completes one frame token k_1, k_2, k_3, ..., k_32. The backbone handles codebook 1 and the depth transformer handles codebooks 2-32. The per-frame vocabulary is 32 x 2048, giving 2048^32 addressable tokens.]

Figure 4. A 7.7B-parameter temporal backbone predicts the first codebook index and hidden state. A smaller depth transformer then reuses the same 300M parameters to autoregressively recover the remaining codebook indices inside the frame.

Note that this also allows the model to process interleaved text and audio tokens, which enables conditioning generations on the conversation history.

Limitations

While the current system can model individual conversation turns, it cannot model the turn-taking of a conversation itself. Furthermore, the model generates half-duplex audio: it cannot speak while the interlocutor is speaking. We believe that solving these remaining problems is crucial for passing the audio Turing test.

Try the model

The model is open source under a modified MIT license, and API access is coming soon on our website.

Footnotes

¹ Brendan Iribe, Ankit Kumar, and the Sesame team. Crossing the uncanny valley of conversational voice. Sesame, February 27, 2025. sesame.com/research.
² Doyup Lee, Chiheon Kim, Saehoon Kim, Minsu Cho, and Wook-Shin Han. Autoregressive Image Generation using Residual Quantization. CVPR 2022. arXiv:2203.01941.

written by

Aoden Teo

Co-Founder and CEO

Cassidy Dalva

Co-Founder and President

cite

@misc{miso-miso-tts-8b,
  author = {Aoden Teo and Cassidy Dalva},
  title  = {Releasing the MisoTTS},
  year   = {2026},
  url    = {misolabs.ai/blog/miso-tts-8b},
}
