CTC alignments improve autoregressive translation

B Yan, S Dalmia, Y Higuchi, G Neubig, F Metze… - arXiv preprint arXiv …, 2022 - arxiv.org
Connectionist Temporal Classification (CTC) is a widely used approach for automatic
speech recognition (ASR) that performs conditionally independent monotonic alignment …

Deep speech synthesis from MRI-based articulatory representations

P Wu, T Li, Y Lu, Y Zhang, J Lian, AW Black… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we study articulatory synthesis, a speech synthesis method using human vocal
tract information that offers a way to develop efficient, generalizable and interpretable …

Recent advances in end-to-end simultaneous speech translation

X Liu, G Hu, Y Du, E He, YF Luo, C Xu, T Xiao… - arXiv preprint arXiv …, 2024 - arxiv.org
Simultaneous speech translation (SimulST) is a demanding task that involves generating
translations in real-time while continuously processing speech input. This paper offers a …

ESPnet-ST-v2: Multipurpose spoken language translation toolkit

B Yan, J Shi, Y Tang, H Inaguma, Y Peng… - arXiv preprint arXiv …, 2023 - arxiv.org
ESPnet-ST-v2 is a revamp of the open-source ESPnet-ST toolkit necessitated by the
broadening interests of the spoken language translation community. ESPnet-ST-v2 supports …

BASS: Block-wise adaptation for speech summarization

R Sharma, K Zheng, S Arora, S Watanabe… - arXiv preprint arXiv …, 2023 - arxiv.org
End-to-end speech summarization has been shown to improve performance over cascade
baselines. However, such models are difficult to train on very large inputs (dozens of …

Decoupled structure for improved adaptability of end-to-end models

K Deng, PC Woodland - Speech Communication, 2024 - Elsevier
Although end-to-end (E2E) trainable automatic speech recognition (ASR) has shown great
success by jointly learning acoustic and linguistic information, it still suffers from the effect of …

How "Real" is Your Real-Time Simultaneous Speech-to-Text Translation System?

S Papi, P Polák, O Bojar, D Macháček - arXiv preprint arXiv:2412.18495, 2024 - arxiv.org
Simultaneous speech-to-text translation (SimulST) translates source-language speech into
target-language text concurrently with the speaker's speech, ensuring low latency for better …

Long-form end-to-end speech translation via latent alignment segmentation

P Polák, O Bojar - 2024 IEEE Spoken Language Technology …, 2024 - ieeexplore.ieee.org
Contemporary datasets provide an oracle segmentation into sentences based on human-
annotated transcripts and translations. However, the segmentation into sentences is not …

End-to-end single-channel speaker-turn aware conversational speech translation

J Zuluaga-Gomez, Z Huang, X Niu, R Paturi… - arXiv preprint arXiv …, 2023 - arxiv.org
Conventional speech-to-text translation (ST) systems are trained on single-speaker
utterances, and they may not generalize to real-life scenarios where the audio contains …