Foundations & Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
VLP: A survey on vision-language pre-training
In the past few years, the emergence of pre-training models has brought uni-modal fields
such as computer vision (CV) and natural language processing (NLP) to a new era …
Large language models as commonsense knowledge for large-scale task planning
Large-scale task planning is a major challenge. Recent work exploits large language
models (LLMs) directly as a policy and shows surprisingly interesting results. This paper …
Language models as zero-shot planners: Extracting actionable knowledge for embodied agents
Can world knowledge learned by large language models (LLMs) be used to act in
interactive environments? In this paper, we investigate the possibility of grounding high-level …
Interactive language: Talking to robots in real time
We present a framework for building interactive, real-time, natural language-instructable
robots in the real world, and we open source related assets (dataset, environment …
LAVT: Language-aware vision transformer for referring image segmentation
Referring image segmentation is a fundamental vision-language task that aims to segment
out an object referred to by a natural language expression from an image. One of the key …
How much can CLIP benefit vision-and-language tasks?
Most existing Vision-and-Language (V&L) models rely on pre-trained visual encoders, using
a relatively small set of manually-annotated data (as compared to web-crawled data), to …
History aware multimodal transformer for vision-and-language navigation
Vision-and-language navigation (VLN) aims to build autonomous visual agents that follow
instructions and navigate in real scenes. To remember previously visited locations and …
A survey on deep multimodal learning for computer vision: advances, trends, applications, and datasets
K. Bayoudh, R. Knani, F. Hamdaoui, A. Mtibaa - The Visual Computer, 2022 - Springer
The research progress in multimodal learning has grown rapidly over the last decade in
several areas, especially in computer vision. The growing potential of multimodal data …