Deep learning-based multimodal emotion recognition from audio, visual, and text modalities: A systematic review of recent advancements and future prospects

S Zhang, Y Yang, C Chen, X Zhang, Q Leng… - Expert Systems with …, 2024 - Elsevier
Emotion recognition has recently attracted extensive interest due to its significant
applications to human–computer interaction. The expression of human emotion depends on …

Battery safety: Machine learning-based prognostics

J Zhao, X Feng, Q Pang, M Fowler, Y Lian… - Progress in Energy and …, 2024 - Elsevier
Lithium-ion batteries play a pivotal role in a wide range of applications, from electronic
devices to large-scale electrified transportation systems and grid-scale energy storage …

Make-An-Audio: Text-to-audio generation with prompt-enhanced diffusion models

R Huang, J Huang, D Yang, Y Ren… - International …, 2023 - proceedings.mlr.press
Large-scale multimodal generative modeling has created milestones in text-to-image and
text-to-video generation. Its application to audio still lags behind for two main reasons: the …

UniRepLKNet: A universal perception large-kernel ConvNet for audio, video, point cloud, time-series and image recognition

X Ding, Y Zhang, Y Ge, S Zhao… - Proceedings of the …, 2024 - openaccess.thecvf.com
Large-kernel convolutional neural networks (ConvNets) have recently received extensive
research attention but two unresolved and critical issues demand further investigation. 1) …

BEATs: Audio pre-training with acoustic tokenizers

S Chen, Y Wu, C Wang, S Liu, D Tompkins… - arXiv preprint arXiv …, 2022 - arxiv.org
The massive growth of self-supervised learning (SSL) has been witnessed in language,
vision, speech, and audio domains over the past few years. While discrete label prediction is …

Masked autoencoders that listen

PY Huang, H Xu, J Li, A Baevski… - Advances in …, 2022 - proceedings.neurips.cc
This paper studies a simple extension of image-based Masked Autoencoders (MAE) to self-
supervised representation learning from audio spectrograms. Following the Transformer …

WavLM: Large-scale self-supervised pre-training for full stack speech processing

S Chen, C Wang, Z Chen, Y Wu, S Liu… - IEEE Journal of …, 2022 - ieeexplore.ieee.org
Self-supervised learning (SSL) achieves great success in speech recognition, while limited
exploration has been attempted for other speech processing tasks. As speech signal …

Automatic speech recognition using advanced deep learning approaches: A survey

H Kheddar, M Hemis, Y Himeur - Information Fusion, 2024 - Elsevier
Recent advancements in deep learning (DL) have posed a significant challenge for
automatic speech recognition (ASR). ASR relies on extensive training datasets, including …

MuLan: A joint embedding of music audio and natural language

Q Huang, A Jansen, J Lee, R Ganti, JY Li… - arXiv preprint arXiv …, 2022 - arxiv.org
Music tagging and content-based retrieval systems have traditionally been constructed
using pre-defined ontologies covering a rigid set of music attributes or text queries. This …

Contrastive audio-visual masked autoencoder

Y Gong, A Rouditchenko, AH Liu, D Harwath… - arXiv preprint arXiv …, 2022 - arxiv.org
In this paper, we first extend the recent Masked Auto-Encoder (MAE) model from a single
modality to audio-visual multi-modalities. Subsequently, we propose the Contrastive Audio …