Automated audio captioning: An overview of recent progress and new challenges
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …
Listen, think, and understand
The ability of artificial intelligence (AI) systems to perceive and comprehend audio signals is
crucial for many applications. Although significant progress has been made in this area …
Valor: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose a Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multi-modal understanding and generation. Different from widely-studied vision …
Beyond the status quo: A contemporary survey of advances and challenges in audio captioning
Automated audio captioning (AAC), a task that mimics human perception as well as
innovatively links audio processing and natural language processing, has overseen much …
Audio captioning transformer
Audio captioning aims to automatically generate a natural language description of an audio
clip. Most captioning models follow an encoder-decoder architecture, where the decoder …
Prefix tuning for automated audio captioning
Audio captioning aims to generate text descriptions from environmental sounds. One
challenge of audio captioning is the difficulty of the generalization due to the lack of audio …
Valor: Vision-audio-language omni-perception pretraining model and dataset
In this paper, we propose the Vision-Audio-Language Omni-peRception pretraining model
(VALOR) for multimodal understanding and generation. Unlike widely-studied vision …
Investigating local and global information for automated audio captioning with transfer learning
Automated audio captioning (AAC) aims at generating summarizing descriptions for audio
clips. Multitudinous concepts are described in an audio caption, ranging from local …
An encoder-decoder based audio captioning system with transfer and reinforcement learning
Automated audio captioning aims to use natural language to describe the content of audio
data. This paper presents an audio captioning system with an encoder-decoder architecture …
Unified model for image, video, audio and language tasks
Large Language Models (LLMs) have made the ambitious quest for generalist agents
significantly far from being a fantasy. A key hurdle for building such general models is the …