BRAVE: Broadening the visual encoding of vision-language models

OF Kar, A Tonioni, P Poklukar, A Kulshrestha… - … on Computer Vision, 2024 - Springer
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …

SlowFast-LLaVA: A strong training-free baseline for video large language models

M Xu, M Gao, Z Gan, HY Chen, Z Lai, H Gang… - arXiv preprint arXiv …, 2024 - arxiv.org
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …

OmniGen: Unified image generation

S Xiao, Y Wang, J Zhou, H Yuan, X Xing, R Yan… - arXiv preprint arXiv …, 2024 - arxiv.org
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …

Multi-modal LLMs in agriculture: A comprehensive review

R Sapkota, R Qureshi, SZ Hassan, J Shutske… - Authorea …, 2024 - techrxiv.org
Given the rapid emergence and applications of Large Language Models (LLMs) across
various scientific fields, insights regarding their applicability in agriculture are still only …

PixWizard: Versatile image-to-image visual assistant with open-language instructions

W Lin, X Wei, R Zhang, L Zhuo, S Zhao… - arXiv preprint arXiv …, 2024 - arxiv.org
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for
image generation, manipulation, and translation based on free-form language instructions …

Towards a science exocortex

KG Yager - Digital Discovery, 2024 - pubs.rsc.org
Artificial intelligence (AI) methods are poised to revolutionize intellectual work, with
generative AI enabling automation of text analysis, text generation, and simple decision …

Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers

E Karypidis, I Kakogeorgiou, S Gidaris… - arXiv preprint arXiv …, 2025 - arxiv.org
Semantic future prediction is important for autonomous systems navigating dynamic
environments. This paper introduces FUTURIST, a method for multimodal future semantic …

Enhancing foundation models for scientific discovery via multimodal knowledge graph representations

V Lopez, L Hoang, M Martinez-Galindo… - Journal of Web …, 2025 - Elsevier
Foundation Models (FMs) hold transformative potential to accelerate scientific
discovery, yet reaching their full capacity in complex, highly multimodal domains such as …

BiFold: Bimanual Cloth Folding with Language Guidance

O Barbany, A Colomé, C Torras - arXiv preprint arXiv:2501.16458, 2025 - arxiv.org
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their
complicated dynamics, and the disparate materials, geometries, and textures that garments …

Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions

J Zhang, O Liu, T Yu, J Hu, W Neiswanger - arXiv preprint arXiv …, 2024 - arxiv.org
Multimodal large language models (MLLMs) have made rapid progress in recent years, yet
continue to struggle with low-level visual perception (LLVP)--particularly the ability to …