LVLM-eHub: A comprehensive evaluation benchmark for large vision-language models
Large Vision-Language Models (LVLMs) have recently played a dominant role in
multimodal vision-language learning. Despite the great success, it lacks a holistic evaluation …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …
On the hidden mystery of OCR in large multimodal models
Large models have recently played a dominant role in natural language processing and
multimodal vision-language learning. However, their effectiveness in text-related visual …
Revisiting scene text recognition: A data perspective
This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective.
We begin by revisiting the six commonly used benchmarks in STR and observe a trend of …
NVLM: Open frontier-class multimodal LLMs
We introduce NVLM 1.0, a family of frontier-class multimodal large language models (LLMs)
that achieve state-of-the-art results on vision-language tasks, rivaling the leading proprietary …
Exploring OCR capabilities of GPT-4V(ision): A quantitative and in-depth evaluation
This paper presents a comprehensive evaluation of the Optical Character Recognition
(OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model …
One-DM: One-shot diffusion mimicker for handwritten text generation
Existing handwritten text generation methods often require more than ten handwriting
samples as style references. However, in practical applications, users tend to prefer a …
CDistNet: Perceiving multi-domain character distance for robust text recognition
The transformer-based encoder-decoder framework is becoming popular in scene text
recognition, largely because it naturally integrates recognition clues from both visual and …
Choose What You Need: Disentangled Representation Learning for Scene Text Recognition, Removal and Editing
Scene text images contain not only style information (font, background) but also content
information (character, texture). Different scene text tasks need different information but …
NVILA: Efficient frontier visual language models
Visual language models (VLMs) have made significant advances in accuracy in recent
years. However, their efficiency has received much less attention. This paper introduces …