Audio-visual speech and gesture recognition by sensors of mobile devices
Audio-visual speech recognition (AVSR) is one of the most promising solutions for reliable
speech recognition, particularly when audio is corrupted by noise. Additional visual …
Formalizing multimedia recommendation through multimodal deep learning
Recommender systems (RSs) provide customers with a personalized navigation experience
within the vast catalogs of products and services offered on popular online platforms …
Training strategies to handle missing modalities for audio-visual expression recognition
Automatic audio-visual expression recognition can play an important role in communication
services such as tele-health, VOIP calls and human-machine interaction. Accuracy of audio …
LocalTrans: A multiscale local transformer network for cross-resolution homography estimation
Cross-resolution image alignment is a key problem in multiscale gigapixel photography,
which requires to estimate homography matrix using images with large resolution gap …
MMLatch: Bottom-up top-down fusion for multimodal sentiment analysis
Current deep learning approaches for multimodal fusion rely on bottom-up fusion of high
and mid-level latent modality representations (late/mid fusion) or low level sensory inputs …
Robust audiovisual emotion recognition: Aligning modalities, capturing temporal information, and handling missing features
Emotion recognition using audiovisual features is a challenging task for human-machine
interaction systems. Under ideal conditions (perfect illumination, clean speech signals, and …
AVFormer: Injecting vision into frozen speech models for zero-shot AV-ASR
Audiovisual automatic speech recognition (AV-ASR) aims to improve the robustness of a
speech recognition system by incorporating visual information. Training fully supervised …
Self-supervised learning with cross-modal transformers for emotion recognition
Emotion recognition is a challenging task due to limited availability of in-the-wild labeled
datasets. Self-supervised learning has shown improvements on tasks with limited labeled …
Auditory attention detection via cross-modal attention
S Cai, P Li, E Su, L Xie - Frontiers in Neuroscience, 2021 - frontiersin.org
Humans show a remarkable perceptual ability to select the speech stream of interest among
multiple competing speakers. Previous studies demonstrated that auditory attention …
ASR-aware end-to-end neural diarization
We present a Conformer-based end-to-end neural diarization (EEND) model that uses both
acoustic input and features derived from an automatic speech recognition (ASR) model. Two …