Audio samples from "MLP Singer: Towards Rapid Parallel Korean Singing Voice Synthesis"

Abstract: Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system in terms of audio quality. More importantly, MLP Singer achieves real-time factors of 200 and 3400 on CPUs and GPUs, respectively, enabling orders of magnitude faster generation than baseline systems.
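To make the real-time factor figures concrete, here is a minimal sketch of the arithmetic. It assumes the convention in which the real-time factor is seconds of audio produced per second of compute, so that larger values mean faster synthesis; this reading is inferred from the abstract reporting 200 and 3400 as speedups, and the function names are hypothetical rather than taken from the authors' code.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio generated per second of compute (assumed convention:
    higher is faster)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Compute time needed to generate `audio_seconds` of audio at a given
    real-time factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    # Figures quoted in the abstract: 200 on CPU, 3400 on GPU.
    for device, rtf in (("CPU", 200.0), ("GPU", 3400.0)):
        seconds = synthesis_time(60.0, rtf)
        print(f"{device}: one minute of audio in about {seconds:.3f} s")
```

Under this convention, a system with a real-time factor of 1 generates audio exactly as fast as it plays back, and anything above 1 is faster than real time.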

Samples used for MOS listening tests

Ground Truth | Vocoder Reconstruction | BEGANSing | MLP Singer | MLP Singer + Overlapped Batching
1a: Little Star
1b: Little Star
2a: Bingo
2b: Bingo
3a: Head, Shoulders, Knees and Toes
3b: Head, Shoulders, Knees and Toes
4a: Ten Little Indians
4b: Ten Little Indians

(Update 06/29) Samples produced using an enhanced vocoder (improved fine-tuning on ground-truth mel-spectrograms)

Ground Truth | Vocoder Reconstruction | BEGANSing | MLP Singer | MLP Singer + Overlapped Batching
1a: Little Star
1b: Little Star
2a: Bingo
2b: Bingo
3a: Head, Shoulders, Knees and Toes
3b: Head, Shoulders, Knees and Toes
4a: Ten Little Indians
4b: Ten Little Indians