Foundation Models Defining a New Era in Vision: a Survey and Outlook

M Awais, M Naseer, S Khan, RM Anwer… - … on Pattern Analysis …, 2025 - ieeexplore.ieee.org
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …

When large language models meet personalization: Perspectives of challenges and opportunities

J Chen, Z Liu, X Huang, C Wu, Q Liu, G Jiang, Y Pu… - World Wide Web, 2024 - Springer
The advent of large language models marks a revolutionary breakthrough in artificial
intelligence. With the unprecedented scale of training and model parameters, the capability …

PaLM-E: An embodied multimodal language model

D Driess, F Xia, MSM Sajjadi, C Lynch… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, e.g., for robotics problems, raises the challenge of grounding. We …

Language is not all you need: Aligning perception with language models

S Huang, L Dong, W Wang, Y Hao… - Advances in …, 2023 - proceedings.neurips.cc
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …

RT-2: Vision-language-action models transfer web knowledge to robotic control

A Brohan, N Brown, J Carbajal, Y Chebotar… - arXiv preprint arXiv …, 2023 - arxiv.org
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …

Augmented language models: a survey

G Mialon, R Dessì, M Lomeli, C Nalmpantis… - arXiv preprint arXiv …, 2023 - arxiv.org
This survey reviews works in which language models (LMs) are augmented with reasoning
skills and the ability to use tools. The former is defined as decomposing a potentially …

Kosmos-2: Grounding multimodal large language models to the world

Z Peng, W Wang, L Dong, Y Hao, S Huang… - arXiv preprint arXiv …, 2023 - arxiv.org
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (e.g., bounding boxes) and grounding text to the …

OpenAGI: When LLM meets domain experts

Y Ge, W Hua, K Mei, J Tan, S Xu… - Advances in Neural …, 2023 - proceedings.neurips.cc
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This
capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI …

RT-2: Vision-language-action models transfer web knowledge to robotic control

B Zitkovich, T Yu, S Xu, P Xu, T Xiao… - … on Robot Learning, 2023 - proceedings.mlr.press
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …

Multimodal chain-of-thought reasoning in language models

Z Zhang, A Zhang, M Li, H Zhao, G Karypis… - arXiv preprint arXiv …, 2023 - arxiv.org
Large language models (LLMs) have shown impressive performance on complex reasoning
by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains …