UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics

The University of Hong Kong   Adobe

Demonstrations of UniReal's versatile capabilities. As a universal framework, UniReal supports a broad spectrum of image generation and editing tasks within a single model, accommodating diverse input-output configurations and producing highly realistic results that handle challenging scenarios such as shadows, reflections, lighting effects, and object pose changes.

Free-form Instructive Editing


Subject-driven Image Customization


Human Image Personalization


Object/Part Insertion


Image Understanding


More Applications


Abstract

We introduce UniReal, a unified framework designed to address a variety of image generation and editing tasks. Existing solutions often differ across tasks, yet they share fundamental principles: preserving consistency between inputs and outputs while capturing visual variations. Inspired by recent video generation models that effectively balance consistency and variation across frames, we propose a unifying approach that treats image-level tasks as discontinuous video generation. Specifically, we treat varying numbers of input and output images as frames, enabling seamless support for tasks such as image generation, editing, customization, and composition. Although UniReal is designed for image-level tasks, we leverage videos as a scalable source of universal supervision. UniReal learns world dynamics from large-scale videos, demonstrating advanced capabilities in handling shadows, reflections, pose variations, and object interactions, while also exhibiting emergent capabilities for novel applications.
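
To make the discontinuous-video formulation concrete, the mapping below sketches how a few tasks can be expressed as sets of input and output frames. The task names and frame counts are illustrative assumptions for demonstration, not the paper's exact configuration.

# Illustrative mapping of image tasks to (input frames, output frames).
# Frame counts are assumptions, not UniReal's official setup.
task_as_frames = {
    "text-to-image generation": {"input_frames": 0, "output_frames": 1},
    "instructive editing":      {"input_frames": 1, "output_frames": 1},
    "subject customization":    {"input_frames": 2, "output_frames": 1},  # e.g., two reference images
    "image understanding":      {"input_frames": 1, "output_frames": 1},  # output rendered as an image (e.g., a mask)
}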

Overall Pipeline


We formulate image generation and editing tasks as discontinuous frame generation. First, the input images are encoded into the latent space by a VAE encoder. Then, we patchify the image latents and the noise latent into visual tokens. Afterward, we add index embeddings and image prompt embeddings (asset/canvas/control) to the visual tokens. In parallel, the context prompt and the base prompt are processed by a T5 encoder. We concatenate all the latent patches and text embeddings into a single long 1D sequence and feed it to the transformer. Finally, we decode the denoised result to obtain the desired output images.
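
The tokenization above can be summarized in a short sketch. The following is a minimal, illustrative PyTorch snippet: the latent shapes, patch size, embedding dimension, and module names are assumptions for demonstration rather than the actual UniReal implementation, and a random tensor stands in for the T5 text embeddings.

import torch
import torch.nn as nn

# Assumed latent shape after a VAE encoder and assumed model sizes.
B, C_lat, H_lat, W_lat = 1, 4, 32, 32
d_model, patch = 1024, 2

patchify = nn.Conv2d(C_lat, d_model, kernel_size=patch, stride=patch)  # latent -> visual tokens
index_emb = nn.Embedding(8, d_model)  # one index embedding per frame (assumed max 8 frames)

def to_tokens(latent: torch.Tensor, frame_idx: int) -> torch.Tensor:
    """Patchify one latent 'frame' and add its index embedding."""
    tok = patchify(latent).flatten(2).transpose(1, 2)   # (B, N, d_model)
    return tok + index_emb(torch.tensor([frame_idx]))   # broadcast over tokens

# Two input-image latents (from the VAE) plus one noise latent for the output frame.
img_latents = [torch.randn(B, C_lat, H_lat, W_lat) for _ in range(2)]
noise_latent = torch.randn(B, C_lat, H_lat, W_lat)
visual = [to_tokens(z, i) for i, z in enumerate(img_latents + [noise_latent])]

# Stand-in for T5 embeddings of the base/context prompts (assumed length 77).
text_tokens = torch.randn(B, 77, d_model)

# Concatenate everything into one long 1D sequence for the transformer.
sequence = torch.cat(visual + [text_tokens], dim=1)
print(sequence.shape)  # (1, 3 * 256 + 77, 1024) with the assumed sizes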

Video Introduction

BibTeX

@article{chen2024UniReal,
      title={UniReal: Universal Image Generation and Editing via Learning Real-world Dynamics},
      author={Chen, Xi and Zhang, Zhifei and Zhang, He and Zhou, Yuqian and Kim, Soo Ye and Liu, Qing and Li, Yijun and Zhang, Jianming and Zhao, Nanxuan and Wang, Yilin and Ding, Hui and Lin, Zhe and Zhao, Hengshuang},
      journal={arXiv preprint arXiv:2412.07774},
      year={2024}
    }