Automated audio captioning: An overview of recent progress and new challenges
Automated audio captioning is a cross-modal translation task that aims to generate natural
language descriptions for given audio clips. This task has received increasing attention with …
Masked spectrogram prediction for self-supervised audio pre-training
Transformer-based models attain excellent results and generalize well when trained on
sufficient amounts of data. However, constrained by the limited data available in the audio …
SpecAugment++: A hidden space data augmentation method for acoustic scene classification
In this paper, we present SpecAugment++, a novel data augmentation method for deep
neural networks based acoustic scene classification (ASC). Different from other popular data …
Improving the performance of automated audio captioning via integrating the acoustic and semantic information
Automated audio captioning (AAC) has developed rapidly in recent years, involving acoustic
signal processing and natural language processing to generate human-readable sentences …
Attentional graph convolutional network for structure-aware audiovisual scene classification
Audiovisual scene understanding is a challenging problem due to the unstructured
spatial–temporal relations that exist in the audio signals and spatial layouts of different objects in the …
A multimodal aggregation network with serial self-attention mechanism for micro-video multi-label classification
W Lu, J Lin, P Jing, Y Su - IEEE Signal Processing Letters, 2023 - ieeexplore.ieee.org
Currently, micro-videos have attracted increasing attention due to their unique properties
and great commercial value. Considering that micro-videos naturally incorporate multimodal …
Self-supervised graphs for audio representation learning with limited labeled data
Large-scale databases with high-quality manual labels are scarce in audio domain. We thus
explore a self-supervised graph approach to learning audio representations from highly …
Knowledge-integrated Multi-modal Movie Turning Point Identification
D Wang, R Xu, L Cheng, Z Wang - ACM Transactions on Multimedia …, 2024 - dl.acm.org
The rapid development of artificial intelligence provides rich technologies and tools for the
automated understanding of literary works. As a comprehensive carrier of storylines, movies …
Sound event detection in traffic scenes based on graph convolutional network to obtain multi-modal information
Y Jiang, D Guo, L Wang, H Zhang, H Dong… - Complex & Intelligent …, 2024 - Springer
Sound event detection involves identifying sound categories in audio and determining when
they start and end. However, in real-life situations, sound events are usually not isolated …
Graph Node Embeddings for ontology-aware Sound Event Classification: an evaluation study
Multi-label Sound Event Classification (SEC) is a challenging task which requires to handle
multiple co-occurring sound event classes. Recent works proposed an ontology-aware …