Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Listen, think, and understand

Y Gong, H Luo, AH Liu, L Karlinsky, J Glass - arXiv preprint arXiv …, 2023 - arxiv.org
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …

Valor: Vision-audio-language omni-perception pretraining model and dataset

S Chen, X He, L Guo, X Zhu, W Wang, J Tang… - arXiv preprint arXiv …, 2023 - arxiv.org
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …

Beyond the status quo: A contemporary survey of advances and challenges in audio captioning

X Xu, Z Xie, M Wu, K Yu - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Automated audio captioning (AAC), a task that mimics human perception as well as
innovatively links audio processing and natural language processing, has overseen much …

Audio captioning transformer

X Mei, X Liu, Q Huang, MD Plumbley… - arXiv preprint arXiv …, 2021 - arxiv.org
Audio captioning aims to automatically generate a natural language description of an audio
clip. Most captioning models follow an encoder-decoder architecture, where the decoder …

Prefix tuning for automated audio captioning

M Kim, K Sung-Bin, TH Oh - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Audio captioning aims to generate text descriptions from environmental sounds. One
challenge of audio captioning is the difficulty of the generalization due to the lack of audio …

Valor: Vision-audio-language omni-perception pretraining model and dataset

J Liu, S Chen, X He, L Guo, X Zhu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multimodal understanding and generation. Unlike widely-studied vision …

Investigating local and global information for automated audio captioning with transfer learning

X Xu, H Dinkel, M Wu, Z Xie, K Yu - ICASSP 2021-2021 IEEE …, 2021 - ieeexplore.ieee.org
Automated audio captioning (AAC) aims at generating summarizing descriptions for audio
clips. Multitudinous concepts are described in an audio caption, ranging from local …

An encoder-decoder based audio captioning system with transfer and reinforcement learning

X Mei, Q Huang, X Liu, G Chen, J Wu, Y Wu… - arXiv preprint arXiv …, 2021 - arxiv.org
Automated audio captioning aims to use natural language to describe the content of audio
data. This paper presents an audio captioning system with an encoder-decoder architecture …

Unified model for image, video, audio and language tasks

M Shukor, C Dancette, A Rame, M Cord - arXiv preprint arXiv:2307.16184, 2023 - arxiv.org
Large Language Models (LLMs) have made the ambitious quest for generalist agents
significantly far from being a fantasy. A key hurdle for building such general models is the …