Explainable and interpretable multimodal large language models: A comprehensive survey
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …
Surveying the MLLM landscape: A meta-review of current surveys
The rise of Multimodal Large Language Models (MLLMs) has become a transformative force
in the field of artificial intelligence, enabling machines to process and generate content …
Safe-CLIP: Removing NSFW concepts from vision-and-language models
Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale
data, which can introduce inappropriate content and lead to the development of unsafe and …
BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …
Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs
Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to
work beyond the pure textual modality. As research is being carried out to design novel …
The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …
Computer audition: From task-specific machine learning to foundation models
Foundation models (FMs) are increasingly spearheading recent advances on a variety of
tasks that fall under the purview of computer audition--the use of machines to understand …
Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization
The conventional training approach for image captioning involves pre-training a network
using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to …
Personalizing multimodal large language models for image captioning: an Experimental analysis
The task of image captioning demands an algorithm to generate natural language
descriptions of visual inputs. Recent advancements have seen a convergence between …
A survey on the memory mechanism of large language model based agents
Large language model (LLM) based agents have recently attracted much attention from the
research and industry communities. Compared with original LLMs, LLM-based agents are …