Explainable and interpretable multimodal large language models: A comprehensive survey

Y Dang, K Huang, J Huo, Y Yan, S Huang, D Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of Artificial Intelligence (AI) has revolutionized numerous fields, with
large language models (LLMs) and computer vision (CV) systems driving advancements in …

Surveying the MLLM landscape: A meta-review of current surveys

M Li, K Chen, Z Bi, M Liu, B Peng, Q Niu, J Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
The rise of Multimodal Large Language Models (MLLMs) has become a transformative force
in the field of artificial intelligence, enabling machines to process and generate content …

Safe-CLIP: Removing NSFW concepts from vision-and-language models

S Poppi, T Poppi, F Cocchi, M Cornia, L Baraldi… - … on Computer Vision, 2024 - Springer
Large-scale vision-and-language models, such as CLIP, are typically trained on web-scale
data, which can introduce inappropriate content and lead to the development of unsafe and …

BRIDGE: Bridging Gaps in Image Captioning Evaluation with Stronger Visual Cues

S Sarto, M Cornia, L Baraldi, R Cucchiara - European Conference on …, 2024 - Springer
Effectively aligning with human judgment when evaluating machine-generated image
captions represents a complex yet intriguing challenge. Existing evaluation metrics like …

Wiki-LLaVA: Hierarchical Retrieval-Augmented Generation for Multimodal LLMs

D Caffagni, F Cocchi, N Moratelli… - Proceedings of the …, 2024 - openaccess.thecvf.com
Multimodal LLMs are the natural evolution of LLMs and enlarge their capabilities so as to
work beyond the pure textual modality. As research is being carried out to design novel …

The Synergy between Data and Multi-Modal Large Language Models: A Survey from Co-Development Perspective

Z Qin, D Chen, W Zhang, L Yao, Y Huang… - arXiv preprint arXiv …, 2024 - arxiv.org
The rapid development of large language models (LLMs) has been witnessed in recent
years. Based on the powerful LLMs, multi-modal LLMs (MLLMs) extend the modality from …

Computer audition: From task-specific machine learning to foundation models

A Triantafyllopoulos, I Tsangko, A Gebhard… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models (FMs) are increasingly spearheading recent advances on a variety of
tasks that fall under the purview of computer audition--the use of machines to understand …

Revisiting Image Captioning Training Paradigm via Direct CLIP-based Optimization

N Moratelli, D Caffagni, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
The conventional training approach for image captioning involves pre-training a network
using teacher forcing and subsequent fine-tuning with Self-Critical Sequence Training to …

Personalizing multimodal large language models for image captioning: An experimental analysis

D Bucciarelli, N Moratelli, M Cornia, L Baraldi… - arXiv preprint arXiv …, 2024 - arxiv.org
The task of image captioning demands an algorithm to generate natural language
descriptions of visual inputs. Recent advancements have seen a convergence between …

A survey on the memory mechanism of large language model based agents

Z Zhang, X Bo, C Ma, R Li, X Chen, Q Dai, J Zhu… - arXiv preprint arXiv …, 2024 - arxiv.org
Large language model (LLM) based agents have recently attracted much attention from the
research and industry communities. Compared with original LLMs, LLM-based agents are …