[HTML][HTML] A state-of-the-art survey on deep learning theory and architectures
In recent years, deep learning has garnered tremendous success in a variety of application
domains. This new field of machine learning has been growing rapidly and has been …
domains. This new field of machine learning has been growing rapidly and has been …
The history began from alexnet: A comprehensive survey on deep learning approaches
Deep learning has demonstrated tremendous success in variety of application domains in
the past few years. This new field of machine learning has been growing rapidly and applied …
the past few years. This new field of machine learning has been growing rapidly and applied …
Clip4clip: An empirical study of clip for end to end video clip retrieval and captioning
Video clip retrieval and captioning tasks play an essential role in multimodal research and
are the fundamental research problem for multimodal understanding and generation. The …
are the fundamental research problem for multimodal understanding and generation. The …
Ai choreographer: Music conditioned 3d dance generation with aist++
We present AIST++, a new multi-modal dataset of 3D dance motion and music, along with
FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …
FACT, a Full-Attention Cross-modal Transformer network for generating 3D dance motion …
Exploring and distilling posterior and prior knowledge for radiology report generation
Automatically generating radiology reports can improve current clinical practice in diagnostic
radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; …
radiology. On one hand, it can relieve radiologists from the heavy burden of report writing; …
Clip4clip: An empirical study of clip for end to end video clip retrieval
Video-text retrieval plays an essential role in multi-modal research and has been widely
used in many real-world web applications. The CLIP (Contrastive Language-Image Pre …
used in many real-world web applications. The CLIP (Contrastive Language-Image Pre …
End-to-end dense video captioning with parallel decoding
Dense video captioning aims to generate multiple associated captions with their temporal
locations from the video. Previous methods follow a sophisticated" localize-then-describe" …
locations from the video. Previous methods follow a sophisticated" localize-then-describe" …
Howto100m: Learning a text-video embedding by watching hundred million narrated video clips
Learning text-video embeddings usually requires a dataset of video clips with manually
provided captions. However, such datasets are expensive and time consuming to create and …
provided captions. However, such datasets are expensive and time consuming to create and …
Centerclip: Token clustering for efficient text-video retrieval
Recently, large-scale pre-training methods like CLIP have made great progress in multi-
modal research such as text-video retrieval. In CLIP, transformers are vital for modeling …
modal research such as text-video retrieval. In CLIP, transformers are vital for modeling …
Attention, please! A survey of neural attention models in deep learning
In humans, Attention is a core property of all perceptual and cognitive operations. Given our
limited ability to process competing sources, attention mechanisms select, modulate, and …
limited ability to process competing sources, attention mechanisms select, modulate, and …