Active Speaker Detection using Audio, Visual and Depth Modalities: A Survey

SNAM Robi, MAZM Ariffin, MAM Izhar, N Ahmad… - IEEE …, 2024 - ieeexplore.ieee.org
The rapid progress of multimodal signal processing in recent years has cleared the way for
novel applications in human-computer interaction, surveillance, and telecommunication …

The PartialSpoof database and countermeasures for the detection of short fake speech segments embedded in an utterance

L Zhang, X Wang, E Cooper, N Evans… - … /ACM Transactions on …, 2022 - ieeexplore.ieee.org
Automatic speaker verification is susceptible to various manipulations and spoofing, such as
text-to-speech synthesis, voice conversion, replay, tampering, adversarial attacks, and so …

Voice activity detection in the wild: A data-driven approach using teacher-student training

H Dinkel, S Wang, X Xu, M Wu… - IEEE/ACM Transactions …, 2021 - ieeexplore.ieee.org
Voice activity detection is an essential pre-processing component for speech-related tasks
such as automatic speech recognition (ASR). Traditional supervised VAD systems obtain …

MarbleNet: Deep 1D time-channel separable convolutional neural network for voice activity detection

F Jia, S Majumdar, B Ginsburg - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
We present MarbleNet, an end-to-end neural network for Voice Activity Detection (VAD).
MarbleNet is a deep residual network composed from blocks of 1D time-channel separable …

A theory-driven deep learning method for voice chat–based customer response prediction

G Chen, S Xiao, C Zhang… - Information Systems …, 2023 - pubsonline.informs.org
As artificial intelligence and digitalization technologies are flourishing real-time, online
interaction–based commercial modes, exploiting customers' purchase intention implied in …

The Hitachi-JHU DIHARD III system: Competitive end-to-end neural diarization and x-vector clustering systems combined by DOVER-Lap

S Horiguchi, N Yalta, P Garcia, Y Takashima… - arXiv preprint arXiv …, 2021 - arxiv.org
This paper provides a detailed description of the Hitachi-JHU system that was submitted to
the Third DIHARD Speech Diarization Challenge. The system outputs the ensemble results …

Nanowatt acoustic inference sensing exploiting nonlinear analog feature extraction

M Yang, H Liu, W Shan, J Zhang… - IEEE Journal of Solid …, 2021 - ieeexplore.ieee.org
Ultralow-power sensing with inference functionality embedded in sensor nodes is essential
for enabling the emerging pervasive intelligence. For acoustic inference sensing, the feature …

An initial investigation for detecting partially spoofed audio

L Zhang, X Wang, E Cooper, J Yamagishi… - arXiv preprint arXiv …, 2021 - arxiv.org
All existing databases of spoofed speech contain attack data that is spoofed in its entirety. In
practice, it is entirely plausible that successful attacks can be mounted with utterances that …

CNN-based speech segments endpoints detection framework using short-time signal energy features

G Ahmed, AA Lawaye - International Journal of Information Technology, 2023 - Springer
The quality of Speech Recognition systems has improved, with a shift focus from
short utterance scenarios like Voice Assistants and Voice Search to extended utterance …

Speaker detection in the wild: Lessons learned from JSALT 2019

P García, J Villalba, H Bredin, J Du, D Castan… - arXiv preprint arXiv …, 2019 - arxiv.org
This paper presents the problems and solutions addressed at the JSALT workshop when
using a single microphone for speaker detection in adverse scenarios. The main focus was …