Direct speech-to-speech translation with a sequence-to-sequence model

Y Jia, RJ Weiss, F Biadsy, W Macherey… - arXiv preprint arXiv …, 2019 - arxiv.org
We present an attention-based sequence-to-sequence neural network which can directly
translate speech from one language into speech in another language, without relying on an …

A generative model for raw audio using transformer architectures

P Verma, C Chafe - … Conference on Digital Audio Effects (DAFx …, 2021 - ieeexplore.ieee.org
This paper proposes a novel way of doing audio synthesis at the waveform level using
Transformer architectures. We propose a deep neural network for generating waveforms …

Tibetan–Chinese speech-to-speech translation based on discrete units

Z Gong, X Xu, Y Zhao - Scientific Reports, 2025 - nature.com
Speech-to-speech translation (S2ST) has evolved from cascade systems which integrate
Automatic Speech Recognition (ASR), Machine Translation (MT), and Text-to-Speech (TTS) …

Neural Architectures Learning Fourier Transforms, Signal Processing and Much More....

P Verma - arXiv preprint arXiv:2308.10388, 2023 - arxiv.org
This report will explore and answer fundamental questions about taking Fourier Transforms
and tying it with recent advances in AI and neural architecture. One interpretation of the …

Kazakh-Uzbek Speech Cascade Machine Translation on Complete Set of Endings

T Balabekova, B Kairatuly, U Tukeyev - International Conference on …, 2023 - Springer
Studies of speech-to-speech machine translation for Turkic languages are practically absent
due to the difficulties of creating parallel speech corpora for training neural models …

Multi-Task Self-Supervised Learning Based Tibetan-Chinese Speech-to-Speech Translation

R Liu, Y Zhao, X Xu - 2023 International Conference on Asian …, 2023 - ieeexplore.ieee.org
Speech-to-speech translation tasks are commonly tackled by using a three-level cascade
system which comprises speech recognition, machine translation, and speech synthesis …

Learning to model aspects of hearing perception using neural loss functions

P Verma, J Berger - arXiv preprint arXiv:1912.05683, 2019 - arxiv.org
We present a framework to model the perceived quality of audio signals by combining
convolutional architectures, with ideas from classical signal processing, and describe an …