FunAudioLLM: Voice understanding and generation foundation models for natural interaction between humans and LLMs

K An, Q Chen, C Deng, Z Du, C Gao, Z Gao… - arXiv preprint arXiv …, 2024 - arxiv.org
This report introduces FunAudioLLM, a model family designed to enhance natural voice
interactions between humans and large language models (LLMs). At its core are two …

GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot

A Zeng, Z Du, M Liu, K Wang, S Jiang, L Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
We introduce GLM-4-Voice, an intelligent and human-like end-to-end spoken chatbot. It
supports both Chinese and English, engages in real-time voice conversations, and varies …

MinMo: A multimodal large language model for seamless voice interaction

Q Chen, Y Chen, Y Chen, M Chen, Y Chen… - arXiv preprint arXiv …, 2025 - arxiv.org
Recent advancements in large language models (LLMs) and multimodal speech-text
models have laid the groundwork for seamless voice interactions, enabling real-time …

A multitask training approach to enhance Whisper with open-vocabulary keyword spotting

Y Li, M Zhang, C Su, Y Li, X Qiao, M Ren, M Ma… - Interspeech, 2024 - isca-archive.org
The recognition of rare named entities, such as personal names and terminologies, is
challenging for automatic speech recognition (ASR) systems, especially when they are not …

CTC-Assisted LLM-Based Contextual ASR

G Yang, Z Ma, Z Gao, S Zhang… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Contextual ASR or hotword customization holds substantial practical value. Despite the
impressive performance of current end-to-end (E2E) automatic speech recognition (ASR) …

A multitask training approach to enhance Whisper with contextual biasing and open-vocabulary keyword spotting

Y Li, M Zhang, C Su, Y Li, X Qiao, M Ren, M Ma… - arXiv preprint arXiv …, 2023 - arxiv.org
The recognition of rare named entities, such as personal names and terminologies, is
challenging for automatic speech recognition (ASR) systems, especially when they are not …

VHASR: A Multimodal Speech Recognition System With Vision Hotwords

J Hu, Z Li, P Wang, H Ai, L Zhang, H Zhao - arXiv preprint arXiv …, 2024 - arxiv.org
The image-based multimodal automatic speech recognition (ASR) model enhances speech
recognition performance by incorporating audio-related image. However, some works …

An efficient text augmentation approach for contextualized Mandarin speech recognition

N Zheng, X Wan, K Liu, Z Du, Z Huan - arXiv preprint arXiv:2406.09950, 2024 - arxiv.org
Although contextualized automatic speech recognition (ASR) systems are commonly used to
improve the recognition of uncommon words, their effectiveness is hindered by the inherent …

CB-Whisper: Contextual biasing Whisper using open-vocabulary keyword-spotting

Y Li, Y Li, M Zhang, C Su, J Yu, M Piao… - Proceedings of the …, 2024 - aclanthology.org
End-to-end automatic speech recognition (ASR) systems often struggle to recognize rare
name entities, such as personal names, organizations and terminologies that are not …

Contextual Biasing with Confidence-based Homophone Detector for Mandarin End-to-End Speech Recognition

C Yang, L Zheng, S Tian, G Cheng, S Xiao… - Proc. Interspeech …, 2024 - isca-archive.org
Deep biasing methods and shallow fusion methods have been demonstrated to improve the
performance of end-to-end ASR effectively. However, accurate recognition often becomes …