Audio-visual cross-attention network for robotic speaker tracking

X Qian, Z Wang, J Wang, G Guan… - IEEE/ACM Transactions …, 2022 - ieeexplore.ieee.org
Audio-visual signals can be used jointly for robotic perception as they complement each
other. Such multi-modal sensory fusion has a clear advantage, especially under noisy …

Transfer learning of wav2vec 2.0 for automatic lyric transcription

L Ou, X Gu, Y Wang - arXiv preprint arXiv:2207.09747, 2022 - arxiv.org
Automatic speech recognition (ASR) has progressed significantly in recent years due to the
emergence of large-scale datasets and the self-supervised learning (SSL) paradigm …

LyricWhiz: Robust multilingual zero-shot lyrics transcription by whispering to ChatGPT

L Zhuo, R Yuan, J Pan, Y Ma, Y Li, G Zhang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce LyricWhiz, a robust, multilingual, and zero-shot automatic lyrics transcription
method achieving state-of-the-art performance on various lyrics transcription datasets, even …

Generate, discriminate and contrast: A semi-supervised sentence representation learning framework

Y Chen, Y Zhang, B Wang, Z Liu, H Li - arXiv preprint arXiv:2210.16798, 2022 - arxiv.org
Most sentence embedding techniques heavily rely on expensive human-annotated
sentence pairs as the supervised signals. Despite the use of large-scale unlabeled data, the …

Few-shot class-incremental audio classification via discriminative prototype learning

W Xie, Y Li, Q He, W Cao - Expert Systems with Applications, 2023 - Elsevier
In real-world scenarios, new audio classes with insufficient samples usually emerge
continually, which motivates the study of few-shot class-incremental audio classification …

Predict-and-update network: Audio-visual speech recognition inspired by human speech perception

J Wang, X Qian, H Li - IEEE/ACM Transactions on Audio …, 2024 - ieeexplore.ieee.org
Audio and visual signals complement each other in human speech perception, and the
same applies to automatic speech recognition. The visual signal is less evident than the …

Dynamic transformers provide a false sense of efficiency

Y Chen, S Chen, Z Li, W Yang, C Liu, RT Tan… - arXiv preprint arXiv …, 2023 - arxiv.org
Despite much success in natural language processing (NLP), pre-trained language models
typically lead to a high computational cost during inference. Multi-exit is a mainstream …

Wagner Ring Dataset: A complex opera scenario for music processing and computational musicology

C Weiß, V Arifi-Müller, M Krause… - Transactions of the …, 2023 - transactions.ismir.net
This paper introduces the Wagner Ring Dataset (WRD), a multi-modal and multi-version
resource on the large-scale opera cycle Der Ring des Nibelungen by Richard Wagner. The …

Polyscriber: Integrated fine-tuning of extractor and lyrics transcriber for polyphonic music

X Gao, C Gupta, H Li - IEEE/ACM Transactions on Audio …, 2023 - ieeexplore.ieee.org
Lyrics transcription of polyphonic music is challenging as the background music affects lyrics
intelligibility. Typically, lyrics transcription can be performed by a two-step pipeline, i.e., a …

Elucidate gender fairness in singing voice transcription

X Gu, W Zeng, Y Wang - Proceedings of the 31st ACM International …, 2023 - dl.acm.org
It is widely known that males and females typically possess different sound characteristics
when singing, such as timbre and pitch, but it has never been explored whether these …