Making flow-matching-based zero-shot text-to-speech laugh as you like

N Kanda, X Wang, SE Eskimez, M Thakker… - arXiv preprint arXiv …, 2024 - arxiv.org
Laughter is one of the most expressive and natural aspects of human speech, conveying
emotions, social cues, and humor. However, most text-to-speech (TTS) systems lack the …

Laugh Now Cry Later: Controlling Time-Varying Emotional States of Flow-Matching-Based Zero-Shot Text-To-Speech

H Wu, X Wang, SE Eskimez, M Thakker… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
People change their tones of voice, often accompanied by nonverbal vocalizations (NVs)
such as laughter and cries, to convey rich emotions. However, most text-to-speech (TTS) …

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

S Papi, P Polak, O Bojar, D Macháček - arXiv preprint arXiv:2412.18495, 2024 - arxiv.org
Simultaneous speech-to-text translation (SimulST) translates source-language speech into
target-language text concurrently with the speaker's speech, ensuring low latency for better …

Streaming Speaker Change Detection and Gender Classification for Transducer-Based Multi-Talker Speech Translation

P Wang, N Kanda, J Xue, J Li, X Wang… - arXiv preprint arXiv …, 2025 - arxiv.org
Streaming multi-talker speech translation is a task that involves not only generating accurate
and fluent translations with low latency but also recognizing when a speaker change occurs …

Soft Language Identification for Language-Agnostic Many-to-One End-to-End Speech Translation

P Wang, J Xue, J Li, J Chen… - arXiv preprint arXiv …, 2024 - arxiv.org
Language-agnostic many-to-one end-to-end speech translation models can convert audio
signals from different source languages into text in a target language. These models do not …

LS-EEND: Long-Form Streaming End-to-End Neural Diarization with Online Attractor Extraction

D Liang, X Li - arXiv preprint arXiv:2410.06670, 2024 - arxiv.org
This work proposes a frame-wise online/streaming end-to-end neural diarization (EEND)
method, which detects speaker activities in a frame-in-frame-out fashion. The proposed …