Speech technology for healthcare: Opportunities, challenges, and state of the art

S Latif, J Qadir, A Qayyum, M Usama… - IEEE Reviews in …, 2020 - ieeexplore.ieee.org
Speech technology is not appropriately explored even though modern advances in speech
technology—especially those driven by deep learning (DL) technology—offer …

A survey on voice assistant security: Attacks and countermeasures

C Yan, X Ji, K Wang, Q Jiang, Z **, W Xu - ACM Computing Surveys, 2022 - dl.acm.org
Voice assistants (VA) have become prevalent on a wide range of personal devices such as
smartphones and smart speakers. As companies build voice assistants with extra …

Streaming end-to-end speech recognition for mobile devices

Y He, TN Sainath, R Prabhavalkar… - ICASSP 2019-2019 …, 2019 - ieeexplore.ieee.org
End-to-end (E2E) models, which directly predict output character sequences given input
speech, are good candidates for on-device speech recognition. E2E models, however …

Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

J Shen, R Pang, RJ Weiss, M Schuster… - … on acoustics, speech …, 2018 - ieeexplore.ieee.org
This paper describes Tacotron 2, a neural network architecture for speech synthesis directly
from text. The system is composed of a recurrent sequence-to-sequence feature prediction …

Tacotron: Towards end-to-end speech synthesis

Y Wang, RJ Skerry-Ryan, D Stanton, Y Wu… - arxiv preprint arxiv …, 2017 - arxiv.org
A text-to-speech synthesis system typically consists of multiple stages, such as a text
analysis frontend, an acoustic model and an audio synthesis module. Building these …

[PDF][PDF] Wavenet: A generative model for raw audio

A Van Den Oord, S Dieleman, H Zen… - arxiv preprint arxiv …, 2016 - academia.edu
This paper introduces WaveNet, a deep neural network for generating raw audio waveforms.
The model is fully probabilistic and autoregressive, with the predictive distribution for each …

[PDF][PDF] Speaker-dependent wavenet vocoder.

A Tamamori, T Hayashi, K Kobayashi, K Takeda… - Interspeech, 2017 - isca-archive.org
In this study, we propose a speaker-dependent WaveNet vocoder, a method of synthesizing
speech waveforms with WaveNet, by utilizing acoustic features from existing vocoder as …

[PDF][PDF] Tacotron: A fully end-to-end text-to-speech synthesis model

Y Wang, RJ Skerry-Ryan… - arxiv preprint …, 2017 - bengio.abracadoudou.com
ABSTRACT A text-to-speech synthesis system typically consists of multiple stages, such as a
text analysis frontend, an acoustic model and an audio synthesis module. Building these …

[PDF][PDF] Shallow-Fusion End-to-End Contextual Biasing.

D Zhao, TN Sainath, D Rybach, P Rondon, D Bhatia… - Interspeech, 2019 - isca-archive.org
Contextual biasing to a specific domain, including a user's song names, app names and
contact names, is an important component of any production-level automatic speech …

Two-pass end-to-end speech recognition

TN Sainath, R Pang, D Rybach, Y He… - arxiv preprint arxiv …, 2019 - arxiv.org
The requirements for many applications of state-of-the-art speech recognition systems
include not only low word error rate (WER) but also low latency. Specifically, for many use …