Parameter-efficient fine-tuning for large models: A comprehensive survey
Large models represent a groundbreaking advancement in multiple application fields,
enabling remarkable achievements across various tasks. However, their unprecedented …
enabling remarkable achievements across various tasks. However, their unprecedented …
Foundations & trends in multimodal machine learning: Principles, challenges, and open questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
Llama-adapter v2: Parameter-efficient visual instruction model
How to efficiently transform large language models (LLMs) into instruction followers is
recently a popular research direction, while training LLM for multi-modal reasoning remains …
recently a popular research direction, while training LLM for multi-modal reasoning remains …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
methods to data compression. Recent advances in statistical machine learning have opened …
One small step for generative ai, one giant leap for agi: A complete survey on chatgpt in aigc era
OpenAI has recently released GPT-4 (aka ChatGPT plus), which is demonstrated to be one
small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI) …
small step for generative AI (GAI), but one giant leap for artificial general intelligence (AGI) …
Frozen in time: A joint video and image encoder for end-to-end retrieval
Our objective in this work is video-text retrieval-in particular a joint embedding that enables
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
efficient text-to-video retrieval. The challenges in this area include the design of the visual …
HallusionBench: an advanced diagnostic suite for entangled language hallucination and visual illusion in large vision-language models
We introduce" HallusionBench" a comprehensive benchmark designed for the evaluation of
image-context reasoning. This benchmark presents significant challenges to advanced large …
image-context reasoning. This benchmark presents significant challenges to advanced large …
Large language models are visual reasoning coordinators
Visual reasoning requires multimodal perception and commonsense cognition of the world.
Recently, multiple vision-language models (VLMs) have been proposed with excellent …
Recently, multiple vision-language models (VLMs) have been proposed with excellent …
[PDF][PDF] HallusionBench: An Advanced Diagnostic Suite for Entangled Language Hallucination and Visual Illusion in Large Vision-Language Models
Large language models (LLMs), after being aligned with vision models and integrated into
vision-language models (VLMs), can bring impressive improvement in image reasoning …
vision-language models (VLMs), can bring impressive improvement in image reasoning …
Foundations and Trends in Multimodal Machine Learning: Principles, Challenges, and Open Questions
Multimodal machine learning is a vibrant multi-disciplinary research field that aims to design
computer agents with intelligent capabilities such as understanding, reasoning, and learning …
computer agents with intelligent capabilities such as understanding, reasoning, and learning …