Neural codec language models are zero-shot text to speech synthesizers C Wang, S Chen, Y Wu, Z Zhang, L Zhou, S Liu, Z Chen, Y Liu, H Wang, ... arXiv preprint arXiv:2301.02111, 2023 | 646 | 2023 |
Learning latent representations for style control and transfer in end-to-end speech synthesis YJ Zhang, S Pan, L He, ZH Ling ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and …, 2019 | 309 | 2019 |
NaturalSpeech: End-to-End Text-to-Speech Synthesis With Human-Level Quality X Tan, J Chen, H Liu, J Cong, C Zhang, Y Liu, X Wang, Y Leng, Y Yi, L He, ... IEEE Transactions on Pattern Analysis and Machine Intelligence 46 (6), 4234-4245, 2024 | 229 | 2024 |
NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers K Shen, Z Ju, X Tan, Y Liu, Y Leng, L He, T Qin, S Zhao, J Bian arXiv preprint arXiv:2304.09116, 2023 | 229 | 2023 |
Part-of-speech tagging with bidirectional long short-term memory recurrent neural network P Wang, Y Qian, FK Soong, L He, H Zhao arXiv preprint arXiv:1510.06168, 2015 | 169 | 2015 |
Speak foreign languages with your own voice: Cross-lingual neural codec language modeling Z Zhang, L Zhou, C Wang, S Chen, Y Wu, S Liu, Z Chen, Y Liu, H Wang, ... arXiv preprint arXiv:2303.03926, 2023 | 166 | 2023 |
Multi-speaker modeling and speaker adaptation for DNN-based TTS synthesis Y Fan, Y Qian, FK Soong, L He 2015 IEEE International Conference on Acoustics, Speech and Signal …, 2015 | 164 | 2015 |
NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models Z Ju, Y Wang, K Shen, X Tan, D Xin, D Yang, Y Liu, Y Leng, K Song, ... arXiv preprint arXiv:2403.03100, 2024 | 145 | 2024 |
A unified tagging solution: Bidirectional LSTM recurrent neural network with word embedding P Wang, Y Qian, FK Soong, L He, H Zhao arXiv preprint arXiv:1511.00215, 2015 | 124 | 2015 |
Developing RNN-T models surpassing high-performance hybrid models with customization capability J Li, R Zhao, Z Meng, Y Liu, W Wei, S Parthasarathy, V Mazalov, Z Wang, ... arXiv preprint arXiv:2007.15188, 2020 | 119 | 2020 |
Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS M He, Y Deng, L He arXiv preprint arXiv:1906.00672, 2019 | 101 | 2019 |
Conversational end-to-end TTS for voice agents H Guo, S Zhang, FK Soong, L He, L Xie 2021 IEEE Spoken Language Technology Workshop (SLT), 403-409, 2021 | 79 | 2021 |
Word embedding for recurrent neural network based TTS synthesis P Wang, Y Qian, FK Soong, L He, H Zhao 2015 IEEE International Conference on Acoustics, Speech and Signal …, 2015 | 75 | 2015 |
AdaSpeech 4: Adaptive text to speech in zero-shot scenarios Y Wu, X Tan, B Li, L He, S Zhao, R Song, T Qin, TY Liu arXiv preprint arXiv:2204.00436, 2022 | 73 | 2022 |
DelightfulTTS: The Microsoft speech synthesis system for Blizzard Challenge 2021 Y Liu, Z Xu, G Wang, K Chen, B Li, X Tan, J Li, L He, S Zhao arXiv preprint arXiv:2110.12612, 2021 | 66 | 2021 |
Improving prosody with linguistic and BERT derived features in multi-speaker based Mandarin Chinese neural TTS Y Xiao, L He, H Ming, FK Soong ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and …, 2020 | 62 | 2020 |
A new GAN-based end-to-end TTS training algorithm H Guo, FK Soong, L He, L Xie arXiv preprint arXiv:1904.04775, 2019 | 61 | 2019 |
AUDIT: Audio editing by following instructions with latent diffusion models Y Wang, Z Ju, X Tan, L He, Z Wu, J Bian Advances in Neural Information Processing Systems 36, 71340-71357, 2023 | 57 | 2023 |
BinauralGrad: A two-stage conditional diffusion probabilistic model for binaural audio synthesis Y Leng, Z Chen, J Guo, H Liu, J Chen, X Tan, D Mandic, L He, X Li, T Qin, ... Advances in Neural Information Processing Systems 35, 23689-23700, 2022 | 53 | 2022 |
Speaker and language factorization in DNN-based TTS synthesis Y Fan, Y Qian, FK Soong, L He 2016 IEEE International Conference on Acoustics, Speech and Signal …, 2016 | 48 | 2016 |