Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning

Y Wang, W Chen, X Han, X Lin, H Zhao, Y Liu… - arXiv preprint arXiv …, 2024 - arxiv.org
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …

Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning

Y Hu, F Lin, T Zhang, L Yi, Y Gao - arXiv preprint arXiv:2311.17842, 2023 - arxiv.org
In this study, we are interested in imbuing robots with the capability of physically grounded
task planning. Recent advancements have shown that large language models (LLMs) …

Aligning text-to-image diffusion models with reward backpropagation

M Prabhudesai, A Goyal, D Pathak… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image diffusion models have recently emerged at the forefront of image generation,
powered by very large-scale unsupervised or weakly supervised text-to-image training …

Real-time anomaly detection and reactive planning with large language models

R Sinha, A Elhafsi, C Agia, M Foutter… - arXiv preprint arXiv …, 2024 - arxiv.org
Foundation models, e.g., large language models (LLMs), trained on internet-scale data
possess zero-shot generalization capabilities that make them a promising technology …

CLIP as RNN: Segment countless visual concepts without training endeavor

S Sun, R Li, P Torr, X Gu, S Li - Proceedings of the IEEE …, 2024 - openaccess.thecvf.com
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask
labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of …

LLM multimodal traffic accident forecasting

I de Zarzà, J de Curtò, G Roig, CT Calafate - Sensors, 2023 - mdpi.com
With the rise in traffic congestion in urban centers, predicting accidents has become
paramount for city planning and public safety. This work comprehensively studied the …

How to prompt your robot: A promptbook for manipulation skills with code as policies

MG Arenas, T Xiao, S Singh, V Jain… - … on Robotics and …, 2024 - ieeexplore.ieee.org
Large Language Models (LLMs) have demonstrated the ability to perform semantic
reasoning and planning and to write code for robotics tasks. However, most methods rely on pre …

Evaluating Text-to-Image Synthesis: Survey and Taxonomy of Image Quality Metrics

S Hartwig, D Engel, L Sick, H Kniesel, T Payer… - arXiv preprint arXiv …, 2024 - arxiv.org
Recent advances in text-to-image synthesis have been enabled by exploiting a combination
of language and vision through foundation models. These models are pre-trained on …

Context-aware entity grounding with open-vocabulary 3d scene graphs

H Chang, K Boyalakuntla, S Lu, S Cai, E Jing… - arXiv preprint arXiv …, 2023 - arxiv.org
We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for
grounding a variety of entities, such as object instances, agents, and regions, with free-form …

Language-driven visual consensus for zero-shot semantic segmentation

Z Zhang, W Ke, Y Zhu, X Liang, J Liu… - IEEE Transactions on …, 2024 - ieeexplore.ieee.org
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …
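
The snippet above refers to the standard CLIP mechanism of aligning visual features with class (text) embeddings. Purely as a hedged illustration of that general mechanism, not of the paper's specific method, the sketch below scores an image against free-form class prompts with the Hugging Face transformers CLIP API; the checkpoint name, image path, and class prompts are assumptions chosen for the example.

```python
# Minimal illustrative sketch of CLIP-style zero-shot scoring: visual features are
# compared against text ("class") embeddings by scaled cosine similarity.
# Checkpoint name, image path, and prompts are placeholder assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # hypothetical input image
class_prompts = ["a photo of a road", "a photo of a car", "a photo of a pedestrian"]

inputs = processor(text=class_prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# logits_per_image holds the similarity between the image embedding and each
# class-prompt embedding; softmax turns it into per-class scores.
scores = outputs.logits_per_image.softmax(dim=-1)[0]
for prompt, score in zip(class_prompts, scores.tolist()):
    print(f"{prompt}: {score:.3f}")
```

Dense zero-shot segmentation methods apply this same image-text similarity at the patch or pixel level rather than once per whole image.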