Audio samples from "MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis"

Abstract: Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system in terms of audio quality. More importantly, MLP Singer achieves real-time factors of 200 and 3400 on CPUs and GPUs, respectively, enabling orders of magnitude faster generation than baseline systems.
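
To make the real-time factor figures in the abstract concrete, here is a minimal sketch of how such a number could be computed. It assumes the convention in which the real-time factor is the duration of generated audio divided by the wall-clock time spent generating it, so that higher means faster. That reading fits the abstract's claim of faster generation, but the function names and the sample rate in the usage note are illustrative assumptions, not taken from the paper.

```python
def audio_duration_seconds(num_samples, sample_rate):
    """Duration of a waveform given its length in samples and its sample rate in Hz."""
    if sample_rate <= 0:
        raise ValueError("sample_rate must be positive")
    return num_samples / sample_rate


def real_time_factor(audio_seconds, synthesis_seconds):
    """Seconds of audio produced per second of compute (assumed convention: higher is faster)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds
```

Under this convention, a system that produces 10 seconds of audio in 0.05 seconds of compute has a real-time factor of about 200; for example, `real_time_factor(audio_duration_seconds(220500, 22050), 0.05)`, where the 22050 Hz sample rate is a hypothetical value chosen for illustration.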

Samples used for MOS listening tests

Ground Truth	Vocoder Reconstruction	BEGANSing	MLP Singer	MLP Singer + Overlapped Batching
1a: Little Star

1b: Little Star

2a: Bingo

2b: Bingo

3a: Head, Shoulders, Knees and Toes

3b: Head, Shoulders, Knees and Toes

4a: Ten Little Indians

4b: Ten Little Indians

(Update 06/29) Samples produced using an enhanced vocoder (improved fine-tuning on ground-truth mel-spectrograms)

Ground Truth	Vocoder Reconstruction	BEGANSing	MLP Singer	MLP Singer + Overlapped Batching
1a: Little Star

1b: Little Star

2a: Bingo

2b: Bingo

3a: Head, Shoulders, Knees and Toes

3b: Head, Shoulders, Knees and Toes

4a: Ten Little Indians

4b: Ten Little Indians