Sharegpt4v: Improving large multi-modal models with better captions
Modality alignment serves as the cornerstone for large multi-modal models (LMMs).
However, the impact of different attributes (eg, data type, quality, and scale) of training data …
However, the impact of different attributes (eg, data type, quality, and scale) of training data …
Llava-onevision: Easy visual task transfer
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed
by consolidating our insights into data, models, and visual representations in the LLaVA …
by consolidating our insights into data, models, and visual representations in the LLaVA …
Internvideo2: Scaling foundation models for multimodal video understanding
We introduce InternVideo2, a new family of video foundation models (ViFM) that achieve the
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
state-of-the-art results in video recognition, video-text tasks, and video-centric dialogue. Our …
Internlm-xcomposer-2.5: A versatile large vision language model supporting long-contextual input and output
We present InternLM-XComposer-2.5 (IXC-2.5), a versatile large-vision language model that
supports long-contextual input and output. IXC-2.5 excels in various text-image …
supports long-contextual input and output. IXC-2.5 excels in various text-image …
Vlmevalkit: An open-source toolkit for evaluating large multi-modality models
We present VLMEvalKit: an open-source toolkit for evaluating large multi-modality models
based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework …
based on PyTorch. The toolkit aims to provide a user-friendly and comprehensive framework …
Video instruction tuning with synthetic data
The development of video large multimodal models (LMMs) has been hindered by the
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …
difficulty of curating large amounts of high-quality raw data from the web. To address this, we …
Video understanding with large language models: A survey
With the burgeoning growth of online video platforms and the escalating volume of video
content, the demand for proficient video understanding tools has intensified markedly. Given …
content, the demand for proficient video understanding tools has intensified markedly. Given …
Kangaroo: A powerful video-language model supporting long-context video input
Rapid advancements have been made in extending Large Language Models (LLMs) to
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …
Large Multi-modal Models (LMMs). However, extending input modality of LLMs to video data …
Longvila: Scaling long-context visual language models for long videos
Long-context capability is critical for multi-modal foundation models, especially for long
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
video understanding. We introduce LongVILA, a full-stack solution for long-context visual …
Comat: Aligning text-to-image diffusion model with image-to-text concept matching
Diffusion models have demonstrated great success in the field of text-to-image generation.
However, alleviating the misalignment between the text prompts and images is still …
However, alleviating the misalignment between the text prompts and images is still …