Microsoft recently released VALL-E, a new language model for text-to-speech (TTS) synthesis that uses discrete audio codec codes as its intermediate representation. After being trained on 60,000 hours of English speech data, it demonstrated in-context learning abilities in zero-shot scenarios.
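VALL-E's actual codec is a neural audio codec with residual vector quantization; the details are in the paper. As an illustrative sketch of the core idea only (not Microsoft's implementation), the snippet below shows how a single-codebook vector quantizer turns continuous audio frames into the kind of discrete token ids a codec language model operates on. All arrays and sizes here are toy values chosen for the example.

```python
import numpy as np

def quantize_frames(frames, codebook):
    """Map each feature frame to the index of its nearest codebook vector.
    The resulting integer ids play the role of discrete "codec codes"."""
    # frames: (T, D) per-frame features; codebook: (K, D) learned entries
    dists = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) integer codes, one per frame

rng = np.random.default_rng(0)
codebook = rng.normal(size=(8, 4))   # toy codebook with 8 entries
frames = rng.normal(size=(10, 4))    # 10 toy feature frames
codes = quantize_frames(frames, codebook)
print(codes)  # a short sequence of discrete token ids
```

A language model can then be trained to predict such token sequences, which a codec decoder turns back into a waveform.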
VALL-E can create high-quality, personalized speech from just a 3-second recording of an unseen speaker used as an acoustic prompt, enabling prompt-based, zero-shot, in-context TTS.
There is no need for additional structure engineering or pre-designed acoustic features. Microsoft used a large amount of semi-supervised data to build a TTS system that generalizes across speakers, which suggests that scaling up semi-supervised data has been underexploited in TTS.
VALL-E can produce multiple outputs from the same input text while preserving the speaker's emotion and the acoustic environment of the prompt, and it can synthesize natural speech via prompting in the zero-shot scenario. Evaluation results show that VALL-E outperforms prior zero-shot TTS systems on LibriSpeech and VCTK, setting new state-of-the-art zero-shot results on both benchmarks. You can also read the research paper here.
Interestingly, people who have lost their voices could speak again using this text-to-speech method, provided they have earlier recordings of their voice.
What Are The Features Of VALL-E?
Diversity of synthesis: VALL-E's output can vary for the same input text because it generates discrete tokens with a sampling-based decoding algorithm. Using different random seeds, it can synthesize different samples of personalized speech.
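The mechanism behind this diversity is ordinary categorical sampling: at each decoding step the model defines a distribution over tokens, and a different random seed draws a different, equally valid sequence. A minimal sketch (toy logits, not VALL-E's actual decoder):

```python
import numpy as np

def sample_tokens(logits, seed, temperature=1.0):
    """Draw one token per step from the softmax distribution over logits.
    A different seed generally yields a different token sequence."""
    rng = np.random.default_rng(seed)
    probs = np.exp(logits / temperature)
    probs /= probs.sum(axis=-1, keepdims=True)
    return np.array([rng.choice(len(p), p=p) for p in probs])

# Toy logits: 5 decoding steps over a 6-token vocabulary (made-up numbers).
logits = np.random.default_rng(42).normal(size=(5, 6))
print(sample_tokens(logits, seed=1))
print(sample_tokens(logits, seed=2))  # usually differs from seed=1
```

With greedy (argmax) decoding every run would be identical; sampling is what lets one text prompt map to many distinct speech renderings.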
Acoustic environment maintenance: VALL-E can generate personalized speech while preserving the acoustic environment of the speaker prompt. It is trained on large-scale data with more acoustic variability than the baseline; Microsoft used samples from the Fisher dataset for the audio and transcriptions.
Speaker's emotion maintenance: VALL-E can build personalized speech from audio prompts drawn from the Emotional Voices Database while preserving the prompt's emotional tone. In traditional emotional TTS, a model is trained on a supervised dataset in which each utterance is paired with a transcription and an emotion label; VALL-E, by contrast, can retain the prompt's emotion in a zero-shot setting.
VALL-E still has weaknesses to overcome, such as synthesis robustness and data coverage.