Automated audio captioning: An overview of recent progress and new challenges

X Mei, X Liu, MD Plumbley, W Wang - … journal on audio, speech, and music …, 2022 - Springer
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …

Masked spectrogram prediction for self-supervised audio pre-training

D Chong, H Wang, P Zhou… - ICASSP 2023-2023 IEEE …, 2023 - ieeexplore.ieee.org
Transformer-based models attain excellent results and generalize well when trained on
sufficient amounts of data. However, constrained by the limited data available in the audio …

SpecAugment++: A hidden space data augmentation method for acoustic scene classification

H Wang, Y Zou, W Wang - arXiv preprint arXiv:2103.16858, 2021 - arxiv.org
In this paper, we present SpecAugment++, a novel data augmentation method for deep
neural networks based acoustic scene classification (ASC). Different from other popular data …

Improving the performance of automated audio captioning via integrating the acoustic and semantic information

Z Ye, H Wang, D Yang, Y Zou - arXiv preprint arXiv:2110.06100, 2021 - arxiv.org
Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic
signal processing and natural language processing to generate human-readable sentences …

Attentional graph convolutional network for structure-aware audiovisual scene classification

L Zhou, Y Zhou, X Qi, J Hu, TL Lam… - IEEE Transactions on …, 2023 - ieeexplore.ieee.org
Audiovisual scene understanding is a challenging problem due to the unstructured spatial–
temporal relations that exist in the audio signals and spatial layouts of different objects in the …

A multimodal aggregation network with serial self-attention mechanism for micro-video multi-label classification

W Lu, J Lin, P Jing, Y Su - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org
Currently, micro-videos have attracted increasing attention due to their unique properties
and great commercial value. Considering that micro-videos naturally incorporate multimodal …

Self-supervised graphs for audio representation learning with limited labeled data

A Shirian, K Somandepalli… - IEEE Journal of Selected …, 2022 - ieeexplore.ieee.org
Large-scale databases with high-quality manual labels are scarce in audio domain. We thus
explore a self-supervised graph approach to learning audio representations from highly …

Knowledge-integrated Multi-modal Movie Turning Point Identification

D Wang, R Xu, L Cheng, Z Wang - ACM Transactions on Multimedia …, 2024 - dl.acm.org
The rapid development of artificial intelligence provides rich technologies and tools for the
automated understanding of literary works. As a comprehensive carrier of storylines, movies …

Sound event detection in traffic scenes based on graph convolutional network to obtain multi-modal information

Y Jiang, D Guo, L Wang, H Zhang, H Dong… - Complex & Intelligent …, 2024 - Springer
Sound event detection involves identifying sound categories in audio and determining when
they start and end. However, in real-life situations, sound events are usually not isolated …

Graph Node Embeddings for ontology-aware Sound Event Classification: an evaluation study

C Aironi, S Cornell, E Principi… - 2022 30th European …, 2022 - ieeexplore.ieee.org
Multi-label Sound Event Classification (SEC) is a challenging task which requires to handle
multiple co-occurring sound event classes. Recent works proposed an ontology-aware …