Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …
Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning
In this study, we are interested in imbuing robots with the capability of physically-grounded
task planning. Recent advancements have shown that large language models (LLMs) …
Aligning text-to-image diffusion models with reward backpropagation
M Prabhudesai, A Goyal, D Pathak… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image diffusion models have recently emerged at the forefront of image generation,
powered by very large-scale unsupervised or weakly supervised text-to-image training …
Real-time anomaly detection and reactive planning with large language models
Foundation models, e.g., large language models (LLMs), trained on internet-scale data
possess zero-shot generalization capabilities that make them a promising technology …
CLIP as RNN: Segment countless visual concepts without training endeavor
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask
labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of …
LLM multimodal traffic accident forecasting
With the rise in traffic congestion in urban centers, predicting accidents has become
paramount for city planning and public safety. This work comprehensively studied the …
How to prompt your robot: A promptbook for manipulation skills with code as policies
Large Language Models (LLMs) have demonstrated the ability to perform semantic
reasoning and planning, and to write code for robotics tasks. However, most methods rely on pre …
Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics
Recent advances in text-to-image synthesis have been enabled by exploiting a combination
of language and vision through foundation models. These models are pre-trained on …
Context-aware entity grounding with open-vocabulary 3d scene graphs
We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for
grounding a variety of entities, such as object instances, agents, and regions, with free-form …
Language-driven visual consensus for zero-shot semantic segmentation
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …