Audio Samples from "TriniTTS: Pitch-controllable End-to-end TTS without External Aligner"

Abstract:

"Three research directions that have recently advanced the text-to-speech (TTS) field are end-to-end architecture, prosody control modeling, and on-the-fly duration alignment of non-auto-regressive models. However, these three agendas have yet to be tackled at once in a single solution. Current studies are limited either by a lack of control over prosody modeling or by the inefficient training inherent in building a two-stage TTS pipeline. We propose TriniTTS, a pitch-controllable end-to-end TTS without an external aligner that generates natural speech by addressing the issues mentioned above at once. It eliminates the training inefficiency in the two-stage TTS pipeline by the end-to-end architecture. Moreover, it manages to learn the latent vector representing the data distribution of the speeches through performing tasks (alignment search, pitch estimation, waveform generation) simultaneously. Experimental results demonstrate that TriniTTS enables prosody modeling with user input parameters to generate deterministic speech, while synthesizing comparable speech to the state-of-the-art VITS. Furthermore, eliminating normalizing flow modules used in VITS increases the inference speed by 28.84% in CPU environment and by 29.16% in GPU environment."

Diagram

[Image: diagram]

Contents

Single speaker (LJSpeech Dataset)

Without pitch control


Text

The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.
Oswald demonstrated his thinking in connection with his return to the United States by preparing two sets of identical questions of the type which he might have thought.
has confidence in the dedicated Secret Service men who are ready to lay down their lives for him.
Yet the public opinion of the whole body seems to have checked dissipation.

Ground Truth
FastPitch + HiFi-GAN
Glow-TTS + HiFi-GAN
VITS
TriniTTS

Pitch shift


The Secret Service believed that it was very doubtful that any President would ride regularly in a vehicle with a fixed top, even though transparent.

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)

the Presidential limousine arrived at the emergency entrance of the Parkland Hospital at about twelve:thirty-five p.m.

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)

that not more than one bottle of wine or one quart of beer could be issued at one time. No account was taken of the amount of liquors admitted in one day,

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)
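
One way to read the pitch-shift settings listed above (-40, 0, +40 Hz) is as a constant offset added to the fundamental-frequency (F0) contour of the voiced frames. The sketch below illustrates only that arithmetic. It is not taken from the TriniTTS code: the function name `shift_f0`, the convention that unvoiced frames are stored as 0.0, and the `min_f0_hz` floor are all assumptions made for illustration, and this page does not say how the model applies the shift internally.

```python
from typing import List


def shift_f0(f0_hz: List[float], shift_hz: float, min_f0_hz: float = 1.0) -> List[float]:
    """Add a constant offset in Hz to the voiced frames of an F0 contour.

    Assumptions (hypothetical, for illustration only):
    - unvoiced frames are encoded as 0.0 and are left untouched;
    - shifted voiced values are kept at or above `min_f0_hz` so that a
      large negative shift does not turn a voiced frame into an unvoiced one.
    """
    shifted = []
    for value in f0_hz:
        if value > 0.0:
            # Voiced frame: apply the offset, then enforce the floor.
            shifted.append(max(value + shift_hz, min_f0_hz))
        else:
            # Unvoiced frame: keep the 0.0 marker unchanged.
            shifted.append(0.0)
    return shifted


if __name__ == "__main__":
    contour = [0.0, 180.0, 200.0, 220.0, 0.0]  # toy contour in Hz
    for shift in (-40.0, 0.0, 40.0):
        print(shift, shift_f0(contour, shift))
```

With the toy contour, a shift of 0 Hz returns the input unchanged, while -40 Hz and +40 Hz move only the three voiced frames.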

Multi-speaker (VCTK Dataset)

Without pitch control


Text

That was something else.
It is dangerous and it is a lie.
It is quite simple.
His signature is his handwriting.
For starters, many of the Scotland team didn't turn up.

Ground Truth
FastPitch + HiFi-GAN ep.500
FastPitch + HiFi-GAN ep.1000
VITS pretrained (800K steps)
TriniTTS ep.50
TriniTTS ep.500
TriniTTS ep.1000

Pitch shift


Harry Potter has lost his magic.

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)

We will pay their bills.

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)

I was left-handed, but it was just a matter of practice.

Model | Ground truth | FastPitch + HiFi-GAN | TriniTTS
Pitch shift = -40 (Hz)
Pitch shift = 0 (Hz)
Pitch shift = +40 (Hz)