Text-to-Speech Synthesis Using Diffusion Bridge Model
A model that outperforms autoregressive and diffusion models, delivering high-quality output from a structured, noise-free prior with fast inference.
Text-to-speech (TTS) synthesis aims to convert written text into natural, human-like speech. Traditional approaches, such as autoregressive models [25, 49], have limitations, including robotic-sounding output, while diffusion models [8, 22, 28, 48, 38, 46, 60] are hard to use in real-time applications because of their slow, multi-step sampling. In contrast, the diffusion bridge model offers a powerful alternative: it produces high-quality output from a structured, noise-free prior, improving upon existing diffusion-based methods.
Better Generation Quality: Compared to autoregressive models, diffusion-based methods offer superior control over speech synthesis because they generate in parallel and avoid cumulative errors.
Real-Time Performance: The Bridge Model offers significant advantages by working with a clean prior distribution and a data-to-data process, enhancing real-time TTS applications without degradation in quality.
Challenges in TTS: Existing models like SV2TTS [25] and FastSpeech [49], which use transformer-based architectures, often suffer from artifacts in generated speech, leading to less natural results. Traditional diffusion models employ a two-step process: a forward process that converts data into Gaussian noise and a reverse process that reconstructs the data from noise. In TTS, however, the pre-defined noisy prior provides little structural information about the target output. This gap highlights the need for more innovative approaches such as diffusion bridge models.
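For contrast, here is a hedged sketch in standard variance-preserving diffusion notation (the noise schedule and symbols are illustrative, not taken from the cited papers): a conventional diffusion model perturbs the Mel toward an uninformative Gaussian,

$$ q(x_t \mid x_0) = \mathcal{N}\!\big(x_t;\ \sqrt{\bar{\alpha}_t}\, x_0,\ (1 - \bar{\alpha}_t) I\big), \qquad p(x_T) \approx \mathcal{N}(0, I), $$

so the prior carries essentially no information about the target utterance; the bridge model described next replaces this noisy prior with one determined by the text.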
Diffusion Bridge Model
The proposed model, Bridge-TTS, replaces the noisy Gaussian prior with a clean, deterministic prior derived from the text latent representation. This prior provides structural information about the target, improving both quality and efficiency. Instead of a data-to-noise process, the model follows a data-to-data path.
Here, we present theoretical analyses through the Schrödinger bridge. Bridge-TTS uses a text encoder to transform the input text into a latent representation. The Schrödinger bridge model then generates the mel-spectrogram (Mel) from this latent representation using stochastic or deterministic samplers.
Let the latent representation of the text be denoted by z, and let the target Mel be denoted by x_0. The prior distribution is deterministic, conditioned on the text input:
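Because the prior is deterministic given the text, one natural way to write it (a sketch; the Dirac-delta form and the encoder symbol are assumed notation, not necessarily the paper's) is as a point mass at the text latent:

$$ p(x_T \mid y) = \delta\big(x_T - z\big), \qquad z = \mathcal{E}(y), $$

that is, the terminal state of the bridge is pinned to x_T = z, the output of the text encoder, instead of being drawn from Gaussian noise.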
The Bridge model forms a tractable Schrödinger bridge between x_T and x_0, where the forward stochastic differential equation (SDE) is described by:
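A hedged sketch in the usual Schrödinger-bridge notation (the drift coefficient f, diffusion coefficient g, and potential Ψ_t are generic symbols and may not match the paper's exact parameterization):

$$ \mathrm{d}x_t = \Big[ f(t)\, x_t + g^2(t)\, \nabla_{x_t} \log \Psi_t(x_t) \Big]\, \mathrm{d}t + g(t)\, \mathrm{d}w_t, \qquad x_0 \sim p_{\text{data}}(x_0 \mid y), $$

where w_t is a standard Wiener process and the additional drift term steers the trajectory so that it terminates exactly at the prior x_T = z.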
and the reverse process follows:
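Correspondingly, a sketch of the reverse-time SDE in the same generic notation, with the backward potential written with a hat and a reverse-time Wiener process:

$$ \mathrm{d}x_t = \Big[ f(t)\, x_t - g^2(t)\, \nabla_{x_t} \log \widehat{\Psi}_t(x_t) \Big]\, \mathrm{d}t + g(t)\, \mathrm{d}\bar{w}_t, \qquad x_T = z, $$

where the two potentials satisfy Nelson's identity, i.e. the sum of their log-gradients equals the score of the marginal density p_t. Sampling runs this process from the clean prior x_T = z back to the Mel x_0.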
The clean deterministic prior z enhances the model’s ability to map the text input to the Mel efficiently.
Sampling
ODE samplers are a recent family of fast sampling methods in image generation, achieving one-step generation when combined with the distillation technique of consistency models [52]. Our model uses a bridge SDE or bridge ODE for sampling. The bridge SDE describes how the latent prior z evolves toward the target Mel:
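As a simplified, hedged illustration (assuming a Brownian-bridge special case with constant diffusion coefficient g and horizon T; the paper's actual noise schedule may differ), the sampling SDE run from t = T down to t = 0 can be written with the score expressed through the data-prediction network x_θ:

$$ \mathrm{d}x_t = \left[ \frac{z - x_t}{T - t} - g^2\, \nabla_{x_t} \log p_t(x_t) \right] \mathrm{d}t + g\, \mathrm{d}\bar{w}_t, \qquad \nabla_{x_t} \log p_t(x_t) \approx -\, \frac{x_t - \big(1 - \tfrac{t}{T}\big)\, x_\theta(x_t, t) - \tfrac{t}{T}\, z}{g^2\, \tfrac{t (T - t)}{T}}, $$

where the approximation replaces the unknown clean Mel x_0 with the network prediction; in practice the integration starts slightly below t = T to avoid the singularity at the endpoint.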
The deterministic form (ODE) is given by:
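Under the same simplified assumptions, the corresponding probability-flow ODE drops the noise term and halves the score contribution:

$$ \frac{\mathrm{d}x_t}{\mathrm{d}t} = \frac{z - x_t}{T - t} - \frac{1}{2}\, g^2\, \nabla_{x_t} \log p_t(x_t), $$

which can be integrated deterministically in very few steps, giving the fast, reproducible sampler discussed above.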
Here, x_θ is a prediction network trained to recover x_0 from x_t.
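Below is a minimal Python sketch of an Euler-style sampler for the bridge ODE above, under the same Brownian-bridge assumptions; the network x_theta, its call signature, and all hyperparameters (n_steps, g, T, eps) are illustrative placeholders rather than the paper's implementation.

```python
import torch

@torch.no_grad()
def bridge_ode_sample(x_theta, z, n_steps=4, g=1.0, T=1.0, eps=1e-3):
    """Euler sampler for a simplified bridge ODE (illustrative sketch).

    x_theta: network predicting the clean Mel x_0 from (x_t, t)
    z:       deterministic text-latent prior, shape (batch, mel_bins, frames)
    """
    # Integrate from t = T - eps down to t = eps to avoid the endpoint singularity.
    ts = torch.linspace(T - eps, eps, n_steps + 1)
    x = z.clone()  # the bridge starts exactly at the clean prior, not at noise
    for i in range(n_steps):
        t, t_next = ts[i], ts[i + 1]
        x0_pred = x_theta(x, t)  # predicted clean Mel at the current step
        # Score of the Brownian-bridge marginal, with x_0 replaced by the prediction.
        mean = (1 - t / T) * x0_pred + (t / T) * z
        var = g**2 * t * (T - t) / T
        score = -(x - mean) / var
        # Probability-flow ODE drift: bridge drift minus half the squared diffusion times score.
        drift = (z - x) / (T - t) - 0.5 * g**2 * score
        x = x + drift * (t_next - t)  # Euler step (t decreases, so t_next - t < 0)
    return x
```

In practice one would plug in the paper's actual noise schedule and possibly a stochastic or higher-order sampler, but the loop structure, which starts from the clean text latent z rather than from Gaussian noise, is the key difference from standard diffusion sampling.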
Through our proposed methodology, we primarily accomplish three things.
Superior Generation Quality: Outperforms diffusion models like Grad-TTS and transformer-based methods like FastSpeech.
Efficient Sampling: Achieves higher quality and faster inference through the data-to-data process.
Real-Time Potential: Bridge-TTS can deliver high-quality TTS output in a few steps, enabling real-time applications.
Comparison with Diffusion-Based TTS
The Bridge-TTS system significantly outperforms traditional diffusion models such as Grad-TTS. For example, in 1000-step generation, Bridge-TTS achieves a higher MOS (Mean Opinion Score) and a lower real-time factor, and it maintains superior synthesis quality even in reduced-step settings. In 2-step generation, it also outperforms state-of-the-art models like FastSpeech 2 [48].
The use of Diffusion Bridge Models in TTS presents a novel, more efficient method for generating natural and high-quality speech. The Bridge-TTS system leverages the strength of clean, deterministic priors from text representations to achieve superior performance compared to both diffusion and autoregressive models. This approach not only improves the speech synthesis quality but also accelerates the sampling process, making real-time, high-fidelity TTS feasible.