Multimodal research in vision and language: A review of current and emerging trends
Deep learning and its applications have catalyzed impactful research and development
across the diverse range of modalities present in real-world data. More recently, this has …
Video pivoting unsupervised multi-modal machine translation
The main challenge in unsupervised machine translation (UMT) is to align source and
target sentences in the latent space. As people who speak different languages share …
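To make "aligning source and target sentences in the latent space" concrete, below is a minimal sketch of two language-specific embedding tables feeding one shared encoder, so sentences from both languages land in a common space where an alignment loss can pull parallel pairs together. The GRU architecture, names, and dimensions are illustrative assumptions, not the paper's method (the paper itself pivots through video, which is not modeled here).

    import torch
    import torch.nn as nn

    class SharedLatentEncoder(nn.Module):
        # Two language-specific embedders feed one shared encoder, so source
        # and target sentences are mapped into a common latent space.
        def __init__(self, src_vocab, tgt_vocab, dim=512):
            super().__init__()
            self.embed = nn.ModuleDict({
                "src": nn.Embedding(src_vocab, dim),
                "tgt": nn.Embedding(tgt_vocab, dim),
            })
            self.encoder = nn.GRU(dim, dim, batch_first=True)

        def forward(self, token_ids, lang):
            # token_ids: (batch, seq_len) integer tensor; lang: "src" or "tgt".
            _, hidden = self.encoder(self.embed[lang](token_ids))
            return hidden[-1]  # (batch, dim) sentence representation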
Support-set bottlenecks for video-text representation learning
The dominant paradigm for learning video-text representations, noise contrastive learning,
increases the similarity of the representations of pairs of samples that are known to be …
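The snippet names noise contrastive learning as the dominant objective; below is a minimal sketch of the symmetric InfoNCE loss that this family of methods builds on, where matched video-text pairs in a batch are positives and all other pairings are negatives. The function name, temperature value, and batch-internal negatives are illustrative assumptions, not this paper's exact formulation.

    import torch
    import torch.nn.functional as F

    def info_nce(video_emb, text_emb, temperature=0.07):
        # video_emb, text_emb: (batch, dim) embeddings of paired clips/captions.
        # Normalize so the dot product is cosine similarity.
        video_emb = F.normalize(video_emb, dim=-1)
        text_emb = F.normalize(text_emb, dim=-1)
        logits = video_emb @ text_emb.t() / temperature  # (batch, batch) similarities
        targets = torch.arange(video_emb.size(0), device=video_emb.device)
        # Diagonal entries are the known positive pairs; every other entry
        # in the same row or column serves as a negative.
        return 0.5 * (F.cross_entropy(logits, targets)
                      + F.cross_entropy(logits.t(), targets))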
Experience grounds language
Language understanding research is held back by a failure to relate language to the
physical world it describes and to the social interactions it facilitates. Despite the incredible …
Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis
Multimodal sentiment analysis aims to extract and integrate semantic information collected
from multiple modalities to recognize the expressed emotions and sentiment in multimodal …
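To illustrate what integrating information across modalities can look like in the pairwise spirit of the title, here is a minimal sketch in which the two text-centered bimodal pairs (text-acoustic and text-visual) are fused separately and then combined for a sentiment score. Layer choices, dimensions, and names are illustrative assumptions, not the paper's architecture.

    import torch
    import torch.nn as nn

    class BiBimodalFusion(nn.Module):
        # Fuse the two text-centered bimodal pairs separately, then combine
        # them to predict a sentiment intensity score.
        def __init__(self, dim=128):
            super().__init__()
            self.fuse_ta = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.fuse_tv = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU())
            self.head = nn.Linear(2 * dim, 1)

        def forward(self, text, audio, video):
            # text/audio/video: (batch, dim) utterance-level features.
            ta = self.fuse_ta(torch.cat([text, audio], dim=-1))
            tv = self.fuse_tv(torch.cat([text, video], dim=-1))
            return self.head(torch.cat([ta, tv], dim=-1))  # (batch, 1)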
Deep vision multimodal learning: Methodology, benchmark, and trend
Deep vision multimodal learning aims to combine deep visual representation learning with
other modalities, such as text, sound, and data collected from other sensors. With the fast …
Scene graph as pivoting: Inference-time image-free unsupervised multimodal machine translation with visual scene hallucination
In this work, we investigate a more realistic unsupervised multimodal machine translation
(UMMT) setup, inference-time image-free UMMT, where the model is trained with source-text …
IGLUE: A benchmark for transfer learning across modalities, tasks, and languages
Reliable evaluation benchmarks designed for replicability and comprehensiveness have
driven progress in machine learning. Due to the lack of a multilingual benchmark, however …
Cross2StrA: Unpaired cross-lingual image captioning with cross-lingual cross-modal structure-pivoted alignment
Unpaired cross-lingual image captioning has long suffered from irrelevancy and disfluency
issues, due to the inconsistencies of the semantic scene and syntax attributes during …
UC2: Universal cross-lingual cross-modal vision-and-language pre-training
Vision-and-language pre-training has achieved impressive success in learning multimodal
representations between vision and language. To generalize this success to non-English …