BRAVE: Broadening the visual encoding of vision-language models
Vision-language models (VLMs) are typically composed of a vision encoder, e.g., CLIP, and a
language model (LM) that interprets the encoded features to solve downstream tasks …
SlowFast-LLaVA: A strong training-free baseline for video large language models
We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language
model (LLM) that can jointly capture detailed spatial semantics and long-range temporal …
OmniGen: Unified image generation
In this work, we introduce OmniGen, a new diffusion model for unified image generation.
Unlike popular diffusion models (e.g., Stable Diffusion), OmniGen no longer requires …
Multi-modal LLMs in agriculture: A comprehensive review
Given the rapid emergence and applications of Large Language Models (LLMs) across
various scientific fields, insights regarding their applicability in agriculture are still only …
PixWizard: Versatile image-to-image visual assistant with open-language instructions
This paper presents a versatile image-to-image visual assistant, PixWizard, designed for
image generation, manipulation, and translation based on free-form language instructions …
Towards a science exocortex
KG Yager - Digital Discovery, 2024 - pubs.rsc.org
Artificial intelligence (AI) methods are poised to revolutionize intellectual work, with
generative AI enabling automation of text analysis, text generation, and simple decision …
Advancing Semantic Future Prediction through Multimodal Visual Sequence Transformers
Semantic future prediction is important for autonomous systems navigating dynamic
environments. This paper introduces FUTURIST, a method for multimodal future semantic …
Enhancing foundation models for scientific discovery via multimodal knowledge graph representations
Abstract Foundation Models (FMs) hold transformative potential to accelerate scientific
discovery, yet reaching their full capacity in complex, highly multimodal domains such as …
BiFold: Bimanual Cloth Folding with Language Guidance
Cloth folding is a complex task due to the inevitable self-occlusions of clothes, their
complicated dynamics, and the disparate materials, geometries, and textures that garments …
Euclid: Supercharging Multimodal LLMs with Synthetic High-Fidelity Visual Descriptions
Multimodal large language models (MLLMs) have made rapid progress in recent years, yet
continue to struggle with low-level visual perception (LLVP)--particularly the ability to …