Making flow-matching-based zero-shot text-to-speech laugh as you like
Laughter is one of the most expressive and natural aspects of human speech, conveying
emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the …
Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-To-Speech
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs)
such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) …
How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?
Simultaneous speech-to-text translation (SimulST) translates source-language speech into
target-language text concurrently with the speaker's speech, ensuring low latency for better …
Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation
Streaming multi-talker speech translation is a task that involves not only generating accurate
and fluent translations with low latency but also recognizing when a speaker change occurs …
Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation
Language-agnostic many-to-one end-to-end speech translation models can convert audio
signals from different source languages into text in a target language. These models do not …
LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction
D Liang, X Li - arXiv preprint arXiv:2410.06670, 2024 - arxiv.org
This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND)
method, which detects speaker activities in a frame-in-frame-out fashion. The proposed …