Human action recognition from various data modalities: A review
Human Action Recognition (HAR) aims to understand human behavior and assign a label to
each action. It has a wide range of applications, and therefore has been attracting increasing …
Computational bioacoustics with deep learning: a review and roadmap
D Stowell - PeerJ, 2022 - peerj.com
Animal vocalisations and natural soundscapes are fascinating objects of study, and contain
valuable evidence about animal behaviours, populations and ecosystems. They are studied …
AudioLDM: Text-to-audio generation with latent diffusion models
Text-to-audio (TTA) system has recently gained attention for its ability to synthesize general
audio based on text descriptions. However, previous studies in TTA have limited generation …
VideoPoet: A large language model for zero-shot video generation
We present VideoPoet, a language model capable of synthesizing high-quality video, with
matching audio, from a large variety of conditioning signals. VideoPoet employs a decoder …
Masked autoencoders that listen
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-
supervised representation learning from audio spectrograms. Following the Transformer …
Attention bottlenecks for multimodal fusion
Humans perceive the world by concurrently processing and fusing high-dimensional inputs
from multiple modalities such as vision and audio. Machine perception models, in stark …
Noise2Music: Text-conditioned music generation with diffusion models
We introduce Noise2Music, where a series of diffusion models is trained to generate high-
quality 30-second music clips from text prompts. Two types of diffusion models, a generator …
Contrastive learning for cold-start recommendation
Recommending purely cold-start items is a long-standing and fundamental challenge in the
recommender systems. Without any historical interaction on cold-start items, the …
FSD50K: an open dataset of human-labeled sound events
Most existing datasets for sound event recognition (SER) are relatively small and/or domain-
specific, with the exception of AudioSet, based on over 2 M tracks from YouTube videos and …
Multi-modal transformer for video retrieval
The task of retrieving video content relevant to natural language queries plays a critical role
in effectively handling internet-scale datasets. Most of the existing methods for this caption-to …