Retrieval-augmented generation for AI-generated content: A survey
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …
Fact: Teaching MLLMs with faithful, concise and transferable rationales
The remarkable performance of Multimodal Large Language Models (MLLMs) has
demonstrated their proficient understanding capabilities in handling various visual tasks …
Generalist virtual agents: A survey on autonomous agents across digital platforms
In this paper, we introduce the Generalist Virtual Agent (GVA), an autonomous entity
engineered to function across diverse digital platforms and environments, assisting users by …
Unified Generative and Discriminative Training for Multi-modal Large Language Models
In recent times, Vision-Language Models (VLMs) have been trained under two predominant
paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …
Can We Generate Visual Programs Without Prompting LLMs?
Visual programming prompts LLMs (large language models) to generate executable code
for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to …
Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to
handle real-world tasks. However, the rapid expansion of visual instruction datasets …
VDebugger: Harnessing Execution Feedback for Debugging Visual Programs
Visual programs are executable code generated by large language models to address
visual reasoning problems. They decompose complex questions into multiple reasoning …
PropTest: Automatic Property Testing for Improved Visual Programming
Visual Programming has emerged as an alternative to end-to-end black-box visual
reasoning models. This type of method leverages Large Language Models (LLMs) to …
Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining
Digital agents are increasingly employed to automate tasks in interactive digital
environments such as web pages, software applications, and operating systems. While text …