A survey on multimodal bidirectional machine learning translation of image and natural language processing
W Nam, B Jang - Expert Systems with Applications, 2024 - Elsevier
Advances in multimodal machine learning help artificial intelligence more closely resemble human
intellect, which perceives the world through multiple modalities. We surveyed state …
PaLI: A jointly-scaled multilingual language-image model
Effective scaling and a flexible task interface enable large language models to excel at many
tasks. We present PaLI (Pathways Language and Image model), a model that extends this …
TIFA: Accurate and interpretable text-to-image faithfulness evaluation with question answering
Despite thousands of researchers, engineers, and artists actively working on improving text-
to-image generation models, systems often fail to produce images that accurately align with …
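As the title indicates, TIFA measures faithfulness by generating questions from the input text and checking whether a VQA model answers them correctly on the generated image. A minimal sketch of that scoring loop; generate_questions and vqa_answer are hypothetical stand-ins for the question-generation and VQA components, not the paper's actual interfaces:

```python
# Hypothetical sketch of a TIFA-style score: the fraction of text-derived
# questions that a VQA model answers correctly on the generated image.
def tifa_style_score(prompt, image, generate_questions, vqa_answer):
    qa_pairs = generate_questions(prompt)  # [(question, expected_answer), ...]
    if not qa_pairs:
        return 0.0
    correct = sum(
        vqa_answer(image, question).strip().lower() == answer.strip().lower()
        for question, answer in qa_pairs
    )
    return correct / len(qa_pairs)
```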
PaliGemma: A versatile 3B VLM for transfer
PaliGemma is an open Vision-Language Model (VLM) that is based on the SigLIP-So400m
vision encoder and the Gemma-2B language model. It is trained to be a versatile and …
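Because PaliGemma is released openly, a minimal inference sketch is possible with the Hugging Face transformers integration; the checkpoint name below is an assumption (any released PaliGemma variant should work), and running it requires accepting the model license:

```python
# Minimal PaliGemma captioning sketch via Hugging Face transformers.
# The checkpoint name is an assumption; see the PaliGemma model cards.
import requests
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

model_id = "google/paligemma-3b-mix-224"  # assumed checkpoint
processor = AutoProcessor.from_pretrained(model_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(model_id)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# PaliGemma is steered with short task prefixes such as "caption en".
inputs = processor(text="caption en", images=image, return_tensors="pt")
output = model.generate(**inputs, max_new_tokens=30)
prompt_len = inputs["input_ids"].shape[1]
print(processor.decode(output[0][prompt_len:], skip_special_tokens=True))
```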
PaLI-X: On scaling up a multilingual vision and language model
We present the training recipe and results of scaling up PaLI-X, a multilingual vision and
language model, both in terms of size of the components and the breadth of its training task …
From images to textual prompts: Zero-shot visual question answering with frozen large language models
Large language models (LLMs) have demonstrated excellent zero-shot generalization to
new language tasks. However, effective utilization of LLMs for zero-shot visual question …
What you see is what you read? Improving text-image alignment evaluation
Automatically determining whether a text and a corresponding image are semantically
aligned is a significant challenge for vision-language models, with applications in generative …
BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
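The snippet describes the standard VLM composition that BRAVE builds on: a vision encoder produces features that a language model consumes. A schematic sketch of that generic layout follows; all module names are placeholders, not BRAVE's actual code:

```python
# Schematic prefix-style VLM: vision features are projected into the
# language model's embedding space and prepended to the text embeddings.
import torch
import torch.nn as nn

class TinyVLM(nn.Module):
    def __init__(self, vision_encoder, lm, vision_dim, lm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder       # e.g. a CLIP image tower
        self.proj = nn.Linear(vision_dim, lm_dim)  # bridges the two spaces
        self.lm = lm                               # decoder-only LM (HF-style)

    def forward(self, pixel_values, text_embeds):
        patches = self.vision_encoder(pixel_values)  # (B, N, vision_dim)
        visual_tokens = self.proj(patches)           # (B, N, lm_dim)
        lm_input = torch.cat([visual_tokens, text_embeds], dim=1)
        return self.lm(inputs_embeds=lm_input)       # HF-style call, assumed
```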
From image to language: A critical analysis of visual question answering (VQA) approaches, challenges, and opportunities
The multimodal task of Visual Question Answering (VQA), encompassing elements of
Computer Vision (CV) and Natural Language Processing (NLP), aims to generate answers …
Plug-and-Play VQA: Zero-shot VQA by conjoining large pretrained models with zero training
Visual question answering (VQA) is a hallmark of vision and language reasoning and a
challenging task under the zero-shot setting. We propose Plug-and-Play VQA (PNP-VQA), a …