Retrieval-augmented generation for AI-generated content: A survey

P Zhao, H Zhang, Q Yu, Z Wang, Y Geng, F Fu… - arxiv preprint arxiv …, 2024 - arxiv.org
The development of Artificial Intelligence Generated Content (AIGC) has been facilitated by
advancements in model algorithms, scalable foundation model architectures, and the …

Fact: Teaching MLLMs with faithful, concise and transferable rationales

M Gao, S Chen, L Pang, Y Yao, J Dang… - Proceedings of the …, 2024 - dl.acm.org
The remarkable performance of Multimodal Large Language Models (MLLMs) has
demonstrated their proficient understanding capabilities in handling various visual tasks …

Generalist virtual agents: A survey on autonomous agents across digital platforms

M Gao, W Bu, B Miao, Y Wu, Y Li, J Li, S Tang… - arxiv preprint arxiv …, 2024 - arxiv.org
In this paper, we introduce the Generalist Virtual Agent (GVA), an autonomous entity
engineered to function across diverse digital platforms and environments, assisting users by …

Unified Generative and Discriminative Training for Multi-modal Large Language Models

W Chow, J Li, Q Yu, K Pan, H Fei, Z Ge, S Yang… - arxiv preprint arxiv …, 2024 - arxiv.org
In recent times, Vision-Language Models (VLMs) have been trained under two predominant
paradigms. Generative training has enabled Multimodal Large Language Models (MLLMs) …

Can We Generate Visual Programs Without Prompting LLMs?

M Shlapentokh-Rothman, YX Wang… - arxiv preprint arxiv …, 2024 - arxiv.org
Visual programming prompts LLMs (large language models) to generate executable code
for visual tasks like visual question answering (VQA). Prompt-based methods are difficult to …

Mastering Collaborative Multi-modal Data Selection: A Focus on Informativeness, Uniqueness, and Representativeness

Q Yu, Z Shen, Z Yue, Y Wu, W Zhang, Y Li, J Li… - arxiv preprint arxiv …, 2024 - arxiv.org
Instruction tuning fine-tunes pre-trained Multi-modal Large Language Models (MLLMs) to
handle real-world tasks. However, the rapid expansion of visual instruction datasets …

VDebugger: Harnessing Execution Feedback for Debugging Visual Programs

X Wu, Z Lin, S Zhao, TL Wu, P Lu, N Peng… - arxiv preprint arxiv …, 2024 - arxiv.org
Visual programs are executable code generated by large language models to address
visual reasoning problems. They decompose complex questions into multiple reasoning …

PropTest: Automatic Property Testing for Improved Visual Programming

J Koo, Z Yang, P Cascante-Bonilla, B Ray… - arxiv preprint arxiv …, 2024 - arxiv.org
Visual Programming has emerged as an alternative to end-to-end black-box visual
reasoning models. This type of methods leverage Large Language Models (LLMs) to …

Iris: Breaking GUI Complexity with Adaptive Focus and Self-Refining

Z Ge, J Li, X Pang, M Gao, K Pan, W Lin, H Fei… - arxiv preprint arxiv …, 2024 - arxiv.org
Digital agents are increasingly employed to automate tasks in interactive digital
environments such as web pages, software applications, and operating systems. While text …