Parameter-efficient fine-tuning for large models: A comprehensive survey

Z Han, C Gao, J Liu, J Zhang, SQ Zhang - arXiv preprint arXiv:2403.14608, 2024 - arxiv.org
Large models represent a groundbreaking advancement in multiple application fields,
enabling remarkable achievements across various tasks. However, their unprecedented …

Foundations & trends in multimodal machine learning: Principles, challenges, and open questions

PP Liang, A Zadeh, LP Morency - ACM Computing Surveys, 2024 - dl.acm.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …

LLaMA-Adapter V2: Parameter-efficient visual instruction model

P Gao, J Han, R Zhang, Z Lin, S Geng, A Zhou… - arXiv preprint arXiv …, 2023 - arxiv.org
How to efficiently transform large language models (LLMs) into instruction followers is
recently a popular research direction, while training LLM for multi-modal reasoning remains …

Multimodal foundation models: From specialists to general-purpose assistants

C Li, Z Gan, Z Yang, J Yang, L Li… - … and Trends® in …, 2024 - nowpublishers.com
This paper presents a comprehensive survey of the taxonomy and evolution of multimodal
foundation models that demonstrate vision and vision-language capabilities, focusing on the …

One small step for generative AI, one giant leap for AGI: A complete survey on ChatGPT in AIGC era

C Zhang, C Zhang, C Li, Y Qiao, S Zheng… - arXiv preprint arXiv …, 2023 - arxiv.org
OpenAI has recently released GPT-4 (aka ChatGPT plus), which is demonstrated to be one
small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI) …

Frozen in time: A joint video and image encoder for end-to-end retrieval

M Bain, A Nagrani, G Varol… - Proceedings of the …, 2021 - openaccess.thecvf.com
Our objective in this work is video-text retrieval, in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …

HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models

T Guan, F Liu, X Wu, R Xian, Z Li… - Proceedings of the …, 2024 - openaccess.thecvf.com
We introduce "HallusionBench," a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …

Large language models are visual reasoning coordinators

L Chen, B Li, S Shen, J Yang, C Li… - Advances in …, 2024 - proceedings.neurips.cc
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …

HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models

T Guan, F Liu, X Wu, R Xian, Z Li, X Liu… - arXiv preprint arXiv …, 2023 - researchgate.net
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …

Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions

PP Liang, A Zadeh, LP Morency - arXiv preprint arXiv:2209.03430, 2022 - arxiv.org
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …