A multimodal approach to device-directed speech detection with large language models

D Wagner, A Churchill, S Sigtia… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
Interactions with virtual assistants typically start with a predefined trigger phrase followed by
the user command. To make interactions with the assistant more intuitive, we explore …

Leveraging Contrastive Language–Image Pre-Training and Bidirectional Cross-attention for Multimodal Keyword Spotting

D Liu, Q Mao, L Gao, G Wang - Engineering Applications of Artificial …, 2024 - Elsevier
In resource-limited keyword spotting scenarios, the scarcity of annotated corpora hinders
deep learning's ability to develop robust models for representing acoustic features. Recent …

Small footprint multi-channel network for keyword spotting with centroid based awareness

D Ng, Y Xiao, JQ Yip, Z Yang, B Tian, Q Fu… - Proc …, 2023 - isca-archive.org
Spoken Keyword Spotting (KWS) in noisy far-field environments is challenging for
small-footprint models, given the restrictions on computational resources (e.g., model size …

Self-supervised learning-for underwater acoustic signal classification with mixup

Q Xu, J Jiang, K Xu, Y Dou, C Gao… - IEEE Journal of …, 2023 - ieeexplore.ieee.org
Underwater acoustic signal classification is a critical task that involves identifying different
types of signals in a complex and dynamic underwater environment, which is often …

Multimodal data and resource efficient device-directed speech detection with large foundation models

D Wagner, A Churchill, S Sigtia, P Georgiou… - arXiv preprint arXiv …, 2023 - arxiv.org
Interactions with virtual assistants typically start with a trigger phrase followed by a
command. In this work, we explore the possibility of making these interactions more natural …

Aca-net: Towards lightweight speaker verification using asymmetric cross attention

JQ Yip, T Truong, D Ng, C Zhang, Y Ma… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose ACA-Net, a lightweight, global context-aware speaker embedding
extractor for Speaker Verification (SV) that improves upon existing work by using Asymmetric …

Dual-memory multimodal learning for continual spoken keyword spotting with confidence selection and diversity enhancement

Z Yang, D Ng, X Li, C Zhang, R Jiang, W Xi, Y Ma… - Proc …, 2023 - isca-archive.org
Enabling continual learning (CL) from an ever-changing environment is highly valuable, but
it poses significant challenges for spoken keyword spotting (KWS), which simultaneously …

Efficient time and energy optimization in NOMA-enabled mobile edge computing through partial offloading

D Liu, Y Liu, L Khoukhi, A Hafid… - Tsinghua Science …, 2024 - ieeexplore.ieee.org
Customized keyword spotting needs to adapt quickly to small user samples. Current
methods primarily solve the problem under moderate noise conditions. Recent work …

Machine Learning Analysis of Radio Data to Uncover Community Perceptions on the Ebola Outbreak in Uganda

J Nakatumba-Nabende, J Mukiibi, TS Bateesa… - ACM Journal on …, 2024 - dl.acm.org
Radio is vital for people, especially in rural areas, to share their concerns through interactive
talk shows. Understanding public perceptions of pandemics is crucial because they …

SELMA: A Speech-Enabled Language Model for Virtual Assistant Interactions

D Wagner, A Churchill, S Sigtia, E Marchi - arXiv preprint arXiv …, 2025 - arxiv.org
In this work, we present and evaluate SELMA, a Speech-Enabled Language Model for
virtual Assistant interactions that integrates audio and text as inputs to a Large Language …