Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models

Y Chu, J Xu, X Zhou, Q Yang, S Zhang, Z Yan… - arXiv preprint arXiv …, 2023 - arxiv.org
Recently, instruction-following audio-language models have received broad attention for
audio interaction with humans. However, the absence of pre-trained audio models capable …

AIR-Bench: Benchmarking large audio-language models via generative comprehension

Q Yang, J Xu, W Liu, Y Chu, Z Jiang, X Zhou… - arXiv preprint arXiv …, 2024 - arxiv.org
Recently, instruction-following audio-language models have received broad attention for
human-audio interaction. However, the absence of benchmarks capable of evaluating audio …

OWSM v3.1: Better and faster open Whisper-style speech models based on E-Branchformer

Y Peng, J Tian, W Chen, S Arora, B Yan, Y Sudo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent studies have highlighted the importance of fully open foundation models. The Open
Whisper-style Speech Model (OWSM) is an initial step towards reproducing OpenAI Whisper …

SpeechVerse: A large-scale generalizable audio language model

N Das, S Dingliwal, S Ronanki, R Paturi… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language models (LLMs) have shown incredible proficiency in performing tasks that
require semantic understanding of natural language instructions. Recently, many works …

VioLA: Conditional language models for speech recognition, synthesis, and translation

T Wang, L Zhou, Z Zhang, Y Wu, S Liu… - … on Audio, Speech …, 2024 - ieeexplore.ieee.org
Recent research shows a big convergence in model architecture, training objectives, and
inference methods across various tasks for different modalities. In this paper, we propose …

COSMIC: Data efficient instruction-tuning for speech in-context learning

J Pan, J Wu, Y Gaur, S Sivasankaran, Z Chen… - arXiv preprint arXiv …, 2023 - arxiv.org
We present a cost-effective method to integrate speech into a large language model (LLM),
resulting in a Contextual Speech Model with Instruction-following/in-context-learning …

SSDM: Scalable speech dysfluency modeling

J Lian, X Zhou, Z Ezzes, J Vonk… - Advances in neural …, 2025 - proceedings.neurips.cc
Speech dysfluency modeling is the core module for spoken language learning, and speech
therapy. However, there are three challenges. First, current state-of-the-art solutions …

BESTOW: Efficient and streamable speech language model with the best of two worlds in GPT and T5

Z Chen, H Huang, O Hrinchuk… - 2024 IEEE Spoken …, 2024 - ieeexplore.ieee.org
Incorporating speech understanding capabilities into pretrained large-language models has
become a vital research direction (SpeechLLM). The previous architectures can be …

Retrieval augmented end-to-end spoken dialog models

M Wang, I Shafran, H Soltau, W Han… - ICASSP 2024-2024 …, 2024 - ieeexplore.ieee.org
We recently developed a joint speech and language model (SLM [1]) which fuses a
pretrained foundational speech model and a large language model (LLM), while preserving …

DeSTA: Enhancing speech language models through descriptive speech-text alignment

KH Lu, Z Chen, SW Fu, H Huang, B Ginsburg… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent speech language models (SLMs) typically incorporate pre-trained speech models to
extend the capabilities from large language models (LLMs). In this paper, we propose a …