Multimodal image synthesis and editing: A survey and taxonomy
As information exists in various modalities in the real world, effective interaction and fusion
among multimodal information plays a key role for the creation and perception of multimodal …
On the opportunities and challenges of foundation models for geospatial artificial intelligence
Large pre-trained models, also known as foundation models (FMs), are trained in a task-
agnostic manner on large-scale data and can be adapted to a wide range of downstream …
Adversarial diffusion distillation
We introduce Adversarial Diffusion Distillation (ADD), a novel training approach that
efficiently samples large-scale foundational image diffusion models in just 1–4 steps while …
Stylegan-t: Unlocking the power of gans for fast large-scale text-to-image synthesis
Text-to-image synthesis has recently seen significant progress thanks to large pretrained
language models, large-scale training data, and the introduction of scalable model families …
Instantbooth: Personalized text-to-image generation without test-time finetuning
Recent advances in personalized image generation have enabled pre-trained text-to-image
models to learn new concepts from specific image sets. However these methods often …
PixArt-Σ: Weak-to-Strong Training of Diffusion Transformer for 4K Text-to-Image Generation
In this paper, we introduce PixArt-Σ, a Diffusion Transformer model (DiT) capable of directly
generating images at 4K resolution. PixArt-Σ represents a significant advancement over its …
Grm: Large gaussian reconstruction model for efficient 3d reconstruction and generation
We introduce GRM, a large-scale reconstructor capable of recovering a 3D asset from
sparse-view images in around 0.1 s. GRM is a feed-forward transformer-based model that …
Fastcomposer: Tuning-free multi-subject image generation with localized attention
Diffusion models excel at text-to-image generation, especially in subject-driven generation
for personalized images. However, existing methods are inefficient due to the subject …
Ablating concepts in text-to-image diffusion models
Large-scale text-to-image diffusion models can generate high-fidelity images with powerful
compositional ability. However, these models are typically trained on an enormous amount …
Multimodal foundation models: From specialists to general-purpose assistants
Neural compression is the application of neural networks and other machine learning
methods to data compression. Recent advances in statistical machine learning have opened …