GIVT: Generative infinite-vocabulary transformers

M Tschannen, C Eastwood, F Mentzer - European Conference on …, 2024 - Springer
We introduce Generative Infinite-Vocabulary Transformers (GIVT) which generate
vector sequences with real-valued entries, instead of discrete tokens from a finite …

Recent advances in speech language models: A survey

W Cui, D Yu, X Jiao, Z Meng, G Zhang, Q Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Large Language Models (LLMs) have recently garnered significant attention, primarily for
their capabilities in text-based interactions. However, natural human interaction often relies …

Next Token Prediction Towards Multimodal Intelligence: A Comprehensive Survey

L Chen, Z Wang, S Ren, L Li, H Zhao, Y Li… - arXiv preprint arXiv …, 2024 - arxiv.org
Building on the foundations of language modeling in natural language processing, Next
Token Prediction (NTP) has evolved into a versatile training objective for machine learning …

WavChat: A survey of spoken dialogue models

S Ji, Y Chen, M Fang, J Zuo, J Lu, H Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advancements in spoken dialogue models, exemplified by systems like GPT-4o,
have captured significant attention in the speech domain. Compared to traditional three-tier …

VALL-T: Decoder-Only Generative Transducer for Robust and Decoding-Controllable Text-to-Speech

C Du, Y Guo, H Wang, Y Yang, Z Niu, S Wang… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent TTS models with decoder-only Transformer architecture, such as SPEAR-TTS and
VALL-E, achieve impressive naturalness and demonstrate the ability for zero-shot …

BAT: Learning to Reason about Spatial Sounds with Large Language Models

Z Zheng, P Peng, Z Ma, X Chen, E Choi… - arXiv preprint arXiv …, 2024 - arxiv.org
Spatial sound reasoning is a fundamental human skill, enabling us to navigate and interpret
our surroundings based on sound. In this paper we present BAT, which combines the spatial …

Streaming decoder-only automatic speech recognition with discrete speech units: A pilot study

P Chen, S Sun, C Shan, Q Yang, L Xie - arXiv preprint arXiv:2406.18862, 2024 - arxiv.org
Unified speech-text models like SpeechGPT, VioLA, and AudioPaLM have shown
impressive performance across various speech-related tasks, especially in Automatic …

3D-Speaker-Toolkit: An Open Source Toolkit for Multi-modal Speaker Verification and Diarization

Y Chen, S Zheng, H Wang, L Cheng, T Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper introduces 3D-Speaker-Toolkit, an open source toolkit for multi-modal speaker
verification and diarization. It is designed for the needs of academic researchers and …

Task Arithmetic for Language Expansion in Speech Translation

YF Cheng, H Futami, Y Kashiwagi, E Tsunoo… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in large language models (LLMs) have gained interest in speech-text
multimodal foundation models, achieving strong performance on instruction-based speech …

Improving Audio Explanations using Audio Language Models

A Akman, Q Sun, BW Schuller - IEEE Signal Processing Letters, 2025 - ieeexplore.ieee.org
Foundation models are widely utilised for their strong representational capabilities, driven by
training on extensive datasets with self-supervised learning. The increasing complexity of …