Multimodal co-learning: Challenges, applications with datasets, recent advances and future directions

A Rahate, R Walambe, S Ramanna, K Kotecha - Information Fusion, 2022 - Elsevier
Multimodal deep learning systems that employ multiple modalities like text, image, audio,
video, etc., are showing better performance than individual modalities (i.e., unimodal) …

A review on explainability in multimodal deep neural nets

G Joshi, R Walambe, K Kotecha - IEEE Access, 2021 - ieeexplore.ieee.org
Artificial Intelligence techniques powered by deep neural nets have achieved much success
in several application domains, most significantly and notably in the Computer Vision …

Language-driven artistic style transfer

TJ Fu, XE Wang, WY Wang - European Conference on Computer Vision, 2022 - Springer
Despite having promising results, style transfer, which requires preparing style images in
advance, may result in a lack of creativity and accessibility. Following human instruction, on …

Multimodal research in vision and language: A review of current and emerging trends

S Uppal, S Bhagat, D Hazarika, N Majumder, S Poria… - Information …, 2022 - Elsevier
Deep Learning and its applications have cascaded impactful research and development
with a diverse range of modalities present in the real-world data. More recently, this has …

P4contrast: Contrastive learning with pairs of point-pixel pairs for rgb-d scene understanding

Y Liu, L Yi, S Zhang, Q Fan, T Funkhouser… - arXiv preprint arXiv …, 2020 - arxiv.org
Self-supervised representation learning is a critical problem in computer vision, as it
provides a way to pretrain feature extractors on large unlabeled datasets that can be used …

Modality-aware contrastive instance learning with self-distillation for weakly-supervised audio-visual violence detection

J Yu, J Liu, Y Cheng, R Feng, Y Zhang - Proceedings of the 30th ACM …, 2022 - dl.acm.org
Weakly-supervised audio-visual violence detection aims to distinguish snippets containing
multimodal violence events with video-level labels. Many prior works perform audio-visual …

A simple long-tailed recognition baseline via vision-language model

T Ma, S Geng, M Wang, J Shao, J Lu, H Li… - arXiv preprint arXiv …, 2021 - arxiv.org
The visual world naturally exhibits a long-tailed distribution of open classes, which poses
great challenges to modern visual systems. Existing approaches either perform class re …

A closer look at the robustness of vision-and-language pre-trained models

L Li, Z Gan, J Liu - arXiv preprint arXiv:2012.08673, 2020 - arxiv.org
Large-scale pre-trained multimodal transformers, such as ViLBERT and UNITER, have
propelled the state of the art in vision-and-language (V+L) research to a new level. Although …

Multilingual molecular representation learning via contrastive pre-training

Z Guo, P Sharma, A Martinez, L Du… - arXiv preprint arXiv …, 2021 - arxiv.org
Molecular representation learning plays an essential role in cheminformatics. Recently,
language model-based approaches have gained popularity as an alternative to traditional …

Dynamic graph representation learning for video dialog via multi-modal shuffled transformers

S Geng, P Gao, M Chatterjee, C Hori… - Proceedings of the …, 2021 - ojs.aaai.org
Given an input video, its associated audio, and a brief caption, the audio-visual scene aware
dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human …