Abstract: Recent developments in deep learning have significantly improved the quality of synthesized singing voice audio. However, prominent neural singing voice synthesis systems suffer from slow inference speed due to their autoregressive design. Inspired by MLP-Mixer, a novel architecture introduced in the vision literature for attention-free image classification, we propose MLP Singer, a parallel Korean singing voice synthesis system. To the best of our knowledge, this is the first work that uses an entirely MLP-based architecture for voice synthesis. Listening tests demonstrate that MLP Singer outperforms a larger autoregressive GAN-based system in terms of audio quality. More importantly, MLP Singer achieves real-time factors of 200 and 3400 on CPUs and GPUs, respectively, enabling orders of magnitude faster generation than baseline systems.
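To make the speed figures concrete, here is a minimal sketch of how a real-time factor can be computed and interpreted. It assumes the factor is defined as seconds of audio generated per second of wall-clock synthesis time (higher is faster), which is what values of 200 and 3400 suggest; the function names and the 180-second song length are illustrative and do not come from the MLP Singer code.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute (assumed definition)."""
    if synthesis_seconds <= 0:
        raise ValueError("synthesis_seconds must be positive")
    return audio_seconds / synthesis_seconds


def synthesis_time(audio_seconds: float, rtf: float) -> float:
    """Wall-clock time needed to generate audio_seconds of audio at a given factor."""
    if rtf <= 0:
        raise ValueError("rtf must be positive")
    return audio_seconds / rtf


if __name__ == "__main__":
    song_seconds = 180.0  # hypothetical three-minute song
    for device, rtf in (("CPU", 200.0), ("GPU", 3400.0)):
        seconds = synthesis_time(song_seconds, rtf)
        print(f"{device}: about {seconds:.3f} s to synthesize {song_seconds:.0f} s of audio")
```

Under this definition, a factor above 1 means generation is faster than playback. Some speech literature uses the inverse ratio (compute time divided by audio duration), in which case lower is better, so it is worth checking which convention a reported number follows.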
Samples used for MOS listening tests
Sample | Ground Truth | Vocoder Reconstruction | BEGANSing | MLP Singer | MLP Singer + Overlapped Batching |
---|---|---|---|---|---|
1a: Little Star | |||||
1b: Little Star | |||||
2a: Bingo | |||||
2b: Bingo | |||||
3a: Head, Shoulders, Knees and Toes | |||||
3b: Head, Shoulders, Knees and Toes | |||||
4a: Ten Little Indians | |||||
4b: Ten Little Indians | |||||
(Update 06/29) Samples produced using enhanced vocoder (improved fine-tuning on ground truth mel)
Sample | Ground Truth | Vocoder Reconstruction | BEGANSing | MLP Singer | MLP Singer + Overlapped Batching |
---|---|---|---|---|---|
1a: Little Star | |||||
1b: Little Star | |||||
2a: Bingo | |||||
2b: Bingo | |||||
3a: Head, Shoulders, Knees and Toes | |||||
3b: Head, Shoulders, Knees and Toes | |||||
4a: Ten Little Indians | |||||
4b: Ten Little Indians | |||||