MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis

W Guan, Y Li, T Li, H Huang, F Wang, J Lin… - Proceedings of the …, 2024 - ojs.aaai.org
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style
information into text content to generate corresponding speech with a specific style …

Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation

K Zhang, Z Hua, Y Zhang, Y Guo… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
AI-synthesized speech, also known as deepfake speech, has recently raised significant
concerns due to the rapid advancement of speech synthesis and speech conversion …

FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis

M Zheng, P Bai, X Shi, X Zhou, Y Yan - Proceedings of the AAAI …, 2024 - ojs.aaai.org
Although singing voice synthesis (SVS) has made significant progress recently, with its
unique styles and various genres, Chinese opera synthesis requires greater attention but is …

An end-to-end approach for chord-conditioned song generation

S Gao, S Lei, F Zhuo, H Liu, F Liu, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Song Generation task aims to synthesize music composed of vocals and
accompaniment from given lyrics. While the existing method, Jukebox, has explored this …

Hybrid Learning Module-Based Transformer for Multitrack Music Generation With Music Theory

Y Tie, X Guo, D Zhang, J Tie, L Qi… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In recent years, multitrack music generation has garnered significant attention in both
academic and industrial spheres for its versatile utilization of various instruments in …

Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder

T Okamoto, Y Ohtani, S Shimizu, T Toda… - Proc. Interspeech …, 2024 - isca-archive.org
Singing voice synthesis (SVS) corpora are more costly to collect than TTS corpora. SVS
using only a TTS corpus is challenging because the ranges of fundamental frequency (fo) …

LNACont: Language-Normalized Affine Coupling Layer with Contrastive Learning for Cross-Lingual Multi-Speaker Text-to-Speech

S Hwang, C Kim - 2024 32nd European Signal Processing …, 2024 - ieeexplore.ieee.org
The current advancement in text-to-speech (TTS) has achieved a commendable level of
reproducing human-like voices, including diverse speaking style such as multiple speaker …

LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling

Y Huang, X Lai, M Ye, A Zhu, Z Wang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion
(VC), enabling the transformation of one singer's voice into another while preserving musical …

A Dual-branch Multi-Band Neural Vocoder with Harmonic Discriminator for High-Fidelity Speech Synthesis

N Xu, H Liu - openreview.net
Recent developments in vocoders are primarily dominated by GAN-based networks
targeting to high-quality waveform generation from mel-spectrogram representations …