Demo page of "Exploiting Emotion Information in Speaker Embeddings for Expressive Text-to-Speech"

Abstract

Text-to-Speech (TTS) systems have recently seen great progress in synthesizing high-quality speech. However, the prosody of generated utterances is often not as diverse as that of natural speech. In multi-speaker and voice cloning systems this problem becomes even worse, as information about prosody may be present both in the input text and in the speaker embedding. In this paper, we study the presence of emotional information in speaker embeddings, a phenomenon recently revealed for i-vectors and x-vectors. We show that the produced embeddings may include dedicated components encoding prosodic information. We further propose a technique for finding such components and generating emotional speaker embeddings by manipulating them. We then demonstrate that an emotional TTS system based on the proposed method shows good performance and has fewer trained parameters than solutions based on fine-tuning.
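As a rough illustration of the idea (not the exact procedure from the paper), one could estimate an emotion-related direction in speaker-embedding space from labelled x-vectors and shift a neutral speaker embedding along it. The function names, the 512-dimensional toy data, and the scaling factor alpha below are assumptions made for this sketch.

import numpy as np

def emotion_direction(embeddings: np.ndarray, labels: np.ndarray,
                      emotion: str, neutral: str = "neutral") -> np.ndarray:
    """Unit vector pointing from the neutral centroid to the emotion centroid."""
    mu_emo = embeddings[labels == emotion].mean(axis=0)
    mu_neu = embeddings[labels == neutral].mean(axis=0)
    d = mu_emo - mu_neu
    return d / np.linalg.norm(d)

def emotional_embedding(speaker_emb: np.ndarray, direction: np.ndarray,
                        alpha: float = 1.0) -> np.ndarray:
    """Shift a speaker embedding along the emotion direction by a factor alpha."""
    return speaker_emb + alpha * direction

# Usage with random stand-in data (real x-vectors would come from a speaker encoder).
rng = np.random.default_rng(0)
xvectors = rng.normal(size=(200, 512))              # 200 embeddings, 512-dim (assumed)
labels = np.array(["neutral", "happy"] * 100)       # toy emotion labels
d_happy = emotion_direction(xvectors, labels, "happy")
neutral_speaker = xvectors[labels == "neutral"].mean(axis=0)
happy_speaker = emotional_embedding(neutral_speaker, d_happy, alpha=0.8)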

Samples

Here we provide the audio samples used in the MOS test for comparing systems across different emotions.

Models included in the comparison:

Text 1: In which fox loses a tail and its elder sister finds one.

Text 2: All smiles were real and the happier the more sincere.

Additional speakers




March 2023