Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …
A review on explainability in multimodal deep neural nets
Artificial Intelligence techniques powered by deep neural nets have achieved much success
in several application domains, most significantly and notably in the Computer Vision …
Language-driven artistic style transfer
Despite having promising results, style transfer, which requires preparing style images in
advance, may result in lack of creativity and accessibility. Following human instruction, on …
Multimodal research in vision and language: A review of current and emerging trends
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …
P4contrast: Contrastive learning with pairs of point-pixel pairs for rgb-d scene understanding
Self-supervised representation learning is a critical problem in computer vision, as it
provides a way to pretrain feature extractors on large unlabeled datasets that can be used …
Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection
Weakly-supervised audio-visual violence detection aims to distinguish snippets containing
multimodal violence events with video-level labels. Many prior works perform audio-visual …
A simple long-tailed recognition baseline via vision-language model
The visual world naturally exhibits a long-tailed distribution of open classes, which poses
great challenges to modern visual systems. Existing approaches either perform class re …
A closer look at the robustness of vision-and-language pre-trained models
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have
propelled the state of the art in vision-and-language (V+L) research to a new level. Although …
Multilingual molecular representation learning via contrastive pre-training
Molecular representation learning plays an essential role in cheminformatics. Recently,
language model-based approaches have gained popularity as an alternative to traditional …
Dynamic graph representation learning for video dialog via multi-modal shuffled transformers
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware
dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human …