Foundation Models Defining a New Era in Vision: a Survey and Outlook
Vision systems that see and reason about the compositional nature of visual scenes are
fundamental to understanding our world. The complex relations between objects and their …
fundamental to understanding our world. The complex relations between objects and their …
When large language models meet personalization: Perspectives of challenges and opportunities
The advent of large language models marks a revolutionary breakthrough in artificial
intelligence. With the unprecedented scale of training and model parameters, the capability …
intelligence. With the unprecedented scale of training and model parameters, the capability …
Palm-e: An embodied multimodal language model
Large language models excel at a wide range of complex tasks. However, enabling general
inference in the real world, eg, for robotics problems, raises the challenge of grounding. We …
inference in the real world, eg, for robotics problems, raises the challenge of grounding. We …
Language is not all you need: Aligning perception with language models
A big convergence of language, multimodal perception, action, and world modeling is a key
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
step toward artificial general intelligence. In this work, we introduce KOSMOS-1, a …
Rt-2: Vision-language-action models transfer web knowledge to robotic control
A Brohan, N Brown, J Carbajal, Y Chebotar… - arxiv preprint arxiv …, 2023 - arxiv.org
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …
directly into end-to-end robotic control to boost generalization and enable emergent …
Augmented language models: a survey
This survey reviews works in which language models (LMs) are augmented with reasoning
skills and the ability to use tools. The former is defined as decomposing a potentially …
skills and the ability to use tools. The former is defined as decomposing a potentially …
Kosmos-2: Grounding multimodal large language models to the world
We introduce Kosmos-2, a Multimodal Large Language Model (MLLM), enabling new
capabilities of perceiving object descriptions (eg, bounding boxes) and grounding text to the …
capabilities of perceiving object descriptions (eg, bounding boxes) and grounding text to the …
Openagi: When llm meets domain experts
Human Intelligence (HI) excels at combining basic skills to solve complex tasks. This
capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI …
capability is vital for Artificial Intelligence (AI) and should be embedded in comprehensive AI …
[HTML][HTML] Rt-2: Vision-language-action models transfer web knowledge to robotic control
We study how vision-language models trained on Internet-scale data can be incorporated
directly into end-to-end robotic control to boost generalization and enable emergent …
directly into end-to-end robotic control to boost generalization and enable emergent …
Multimodal chain-of-thought reasoning in language models
Large language models (LLMs) have shown impressive performance on complex reasoning
by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains …
by leveraging chain-of-thought (CoT) prompting to generate intermediate reasoning chains …