GIT: A generative image-to-text transformer for vision and language
In this paper, we design and train a Generative Image-to-text Transformer, GIT, to unify
vision-language tasks such as image/video captioning and question answering. While …
TrOCR: Transformer-based optical character recognition with pre-trained models
Text recognition is a long-standing research problem for document digitalization. Existing
approaches are usually built based on CNN for image understanding and RNN for char …
Read like humans: Autonomous, bidirectional and iterative language modeling for scene text recognition
Linguistic knowledge is of great benefit to scene text recognition. However, how to effectively
model linguistic rules in end-to-end deep networks remains a research challenge. In this …
Scene text recognition with permuted autoregressive sequence models
Context-aware STR methods typically use internal autoregressive (AR) language models
(LM). Inherent limitations of AR models motivated two-stage methods which employ an …
From two to one: A new scene text recognizer with visual language modeling network
In this paper, we abandon the dominant complex language model and rethink the linguistic
learning process in the scene text recognition. Different from previous methods considering …
ABCNet v2: Adaptive Bezier-curve network for real-time end-to-end text spotting
End-to-end text-spotting, which aims to integrate detection and recognition in a unified
framework, has attracted increasing attention due to its simplicity of the two complementary …
Revisiting scene text recognition: A data perspective
This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective.
We begin by revisiting the six commonly used benchmarks in STR and observe a trend of …
Vision transformer with progressive sampling
Transformers with powerful global relation modeling abilities have been introduced to
fundamental computer vision tasks recently. As a typical example, the Vision Transformer …
Multi-granularity prediction for scene text recognition
Scene text recognition (STR) has been an active research topic in computer vision for years.
To tackle this challenging problem, numerous innovative methods have been successively …
ABINet++: Autonomous, bidirectional and iterative language modeling for scene text spotting
Scene text spotting is of great importance to the computer vision community due to its wide
variety of applications. Recent methods attempt to introduce linguistic knowledge for …