MM-TTS: Multi-Modal Prompt Based Style Transfer for Expressive Text-to-Speech Synthesis
The style transfer task in Text-to-Speech (TTS) refers to the process of transferring style
information into text content to generate corresponding speech with a specific style …
Robust AI-Synthesized Speech Detection Using Feature Decomposition Learning and Synthesizer Feature Augmentation
AI-synthesized speech, also known as deepfake speech, has recently raised significant
concerns due to the rapid advancement of speech synthesis and speech conversion …
FT-GAN: Fine-Grained Tune Modeling for Chinese Opera Synthesis
Although singing voice synthesis (SVS) has made significant progress recently, with its
unique styles and various genres, Chinese opera synthesis requires greater attention but is …
An end-to-end approach for chord-conditioned song generation
S Gao, S Lei, F Zhuo, H Liu, F Liu, B Tang… - arXiv preprint arXiv …, 2024 - arxiv.org
The Song Generation task aims to synthesize music composed of vocals and
accompaniment from given lyrics. While the existing method, Jukebox, has explored this …
Hybrid Learning Module-Based Transformer for Multitrack Music Generation With Music Theory
Y Tie, X Guo, D Zhang, J Tie, L Qi… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In recent years, multitrack music generation has garnered significant attention in both
academic and industrial spheres for its versatile utilization of various instruments in …
Challenge of Singing Voice Synthesis Using Only Text-To-Speech Corpus With FIRNet Source-Filter Neural Vocoder
Singing voice synthesis (SVS) corpora are more costly to collect than TTS corpora. SVS
using only a TTS corpus is challenging because the ranges of fundamental frequency (fo) …
LNACont: Language-Normalized Affine Coupling Layer with Contrastive Learning for Cross-Lingual Multi-Speaker Text-to-Speech
The current advancement in text-to-speech (TTS) has achieved a commendable level of
reproducing human-like voices, including diverse speaking style such as multiple speaker …
LHQ-SVC: Lightweight and High Quality Singing Voice Conversion Modeling
Y Huang, X Lai, M Ye, A Zhu, Z Wang, J Xu… - arXiv preprint arXiv …, 2024 - arxiv.org
Singing Voice Conversion (SVC) has emerged as a significant subfield of Voice Conversion
(VC), enabling the transformation of one singer's voice into another while preserving musical …
A Dual-branch Multi-Band Neural Vocoder with Harmonic Discriminator for High-Fidelity Speech Synthesis
N Xu, H Liu - openreview.net
Recent developments in vocoders are primarily dominated by GAN-based networks
targeting high-quality waveform generation from mel-spectrogram representations …