Exploring the reasoning abilities of multimodal large language models (MLLMs): A comprehensive survey on emerging trends in multimodal reasoning
Strong Artificial Intelligence (Strong AI) or Artificial General Intelligence (AGI) with abstract
reasoning ability is the goal of next-generation AI. Recent advancements in Large Language …
Look before you leap: Unveiling the power of GPT-4V in robotic vision-language planning
In this study, we are interested in imbuing robots with the capability of physically-grounded
task planning. Recent advancements have shown that large language models (LLMs) …
Aligning text-to-image diffusion models with reward backpropagation
M Prabhudesai, A Goyal, D Pathak… - arXiv preprint arXiv …, 2023 - arxiv.org
Text-to-image diffusion models have recently emerged at the forefront of image generation,
powered by very large-scale unsupervised or weakly supervised text-to-image training …
Real-time anomaly detection and reactive planning with large language models
Foundation models, e.g., large language models (LLMs), trained on internet-scale data
possess zero-shot generalization capabilities that make them a promising technology …
CLIP as RNN: Segment countless visual concepts without training endeavor
Existing open-vocabulary image segmentation methods require a fine-tuning step on mask
labels and/or image-text datasets. Mask labels are labor-intensive, which limits the number of …
LLM multimodal traffic accident forecasting
With the rise in traffic congestion in urban centers, predicting accidents has become
paramount for city planning and public safety. This work comprehensively studied the …
How to prompt your robot: A promptbook for manipulation skills with code as policies
Large Language Models (LLMs) have demonstrated the ability to perform semantic
reasoning and planning, and to write code for robotics tasks. However, most methods rely on pre …
Evaluating Text to Image Synthesis: Survey and Taxonomy of Image Quality Metrics
Recent advances in text-to-image synthesis have been enabled by exploiting a combination
of language and vision through foundation models. These models are pre-trained on …
Context-aware entity grounding with open-vocabulary 3d scene graphs
We present an Open-Vocabulary 3D Scene Graph (OVSG), a formal framework for
grounding a variety of entities, such as object instances, agents, and regions, with free-form …
Language-driven visual consensus for zero-shot semantic segmentation
The pre-trained vision-language model, exemplified by CLIP [1], advances zero-shot
semantic segmentation by aligning visual features with class embeddings through a …