LivePhoto: Real Image Animation with Text-guided Motion Control

The University of Hong Kong   Alibaba Group   Ant Group

Teaser gallery: each reference image is animated into a video following one of the text prompts below.

"*hair flying in the wind."
"The panda is eating bamboo."
"The minion is jumping."
"The candles burn fast."
"Pouring water into the glass."
"The fire is burning."
"* camera from right to left."
"* camera turns around."
"* camera zooms in."
"Snowflakes falling *."
"Wind blows the sunflowers."
"Fireworks bloom in the sky."

Abstract

Despite the recent progress in text-to-video generation, existing studies usually overlook the issue that only spatial contents but not temporal motions in synthesized videos are under the control of text. Towards such a challenge, this work presents a practical system, named LivePhoto, which allows users to animate an image of their interest with text descriptions. We first establish a strong baseline that helps a well-learned text-to-image generator (i.e., Stable Diffusion) take an image as a further input. We then equip the improved generator with a motion module for temporal modeling and propose a carefully designed training pipeline to better link texts and motions. In particular, considering the facts that (1) text can only describe motions roughly (e.g., regardless of the moving speed) and (2) text may include both content and motion descriptions, we introduce a motion intensity estimation module as well as a text re-weighting module to reduce the ambiguity of text-to-motion mapping. Empirical evidence suggests that our approach is capable of well decoding motion-related textual instructions into videos, such as actions, camera movements, or even conjuring new contents from thin air (e.g., pouring water into an empty glass). Interestingly, thanks to the proposed intensity learning mechanism, our system offers users an additional control signal (i.e., the motion intensity) besides text for video customization.

Motion Control with Text Instructions

A unique feature of LivePhoto is precise motion control through text instructions. In addition, users can customize these motions by setting different "motion intensities".
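To make these two control signals concrete, here is a purely illustrative sketch in Python. The AnimationRequest structure, its field names, and the default values are hypothetical and are not part of any released LivePhoto interface; they only mirror the controls described above (a text instruction plus an intensity level from 1 to 10).

# Purely illustrative sketch of the two user-facing control signals
# (text instruction + motion intensity); this structure is hypothetical
# and not part of any released LivePhoto API.
from dataclasses import dataclass

@dataclass
class AnimationRequest:
    reference_image: str       # path to the still image to animate
    text: str                  # motion instruction, e.g. "The man smiles."
    motion_intensity: int = 5  # level 1 (subtle) to 10 (large, fast motion)
    num_frames: int = 16

requests = [
    AnimationRequest("man.jpg", "The man smiles."),
    AnimationRequest("man.jpg", "The man gives a thumbs-up."),
    AnimationRequest("man.jpg", "The man is drinking beer."),
    # Same instruction, different intensity levels.
    AnimationRequest("man.jpg", "The man smiles.", motion_intensity=2),
    AnimationRequest("man.jpg", "The man smiles.", motion_intensity=7),
]

for request in requests:
    print(f"{request.text!r} at intensity {request.motion_intensity}")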

Gallery: a single reference image animated with three different text instructions.

"The man smiles."
"The man gives a thumbs-up."
"The man is drinking beer."

Gallery: results generated with motion intensities 2, 5, 3, and 7.

Comparisons with Existing Alternatives

We compare LivePhoto with GEN-2 and Pikalabs, using their versions as of November 2023. For each prompt, the generated videos are shown from left to right: GEN-2, Pikalabs, and LivePhoto. LivePhoto demonstrates superior text-guided motion control.

Comparison gallery (reference image followed by results from GEN-2, Pikalabs, and LivePhoto) for the following prompts:

"Pikachu is dancing happily."
"Kung Fu Panda is practicing Tai Chi."
"The little yellow baby dinosaur is waving its hand."
"The volcano emits thick smoke from its crater."
"Lightning and thunder in the night sky."
"Fire burns on the grass stack."
"Dew dripping from the leaves."

Pipeline

The overall pipeline of LivePhoto is shown below. Besides taking the reference image and text as inputs, LivePhoto leverages motion intensity as a supplementary condition. The image and the motion intensity (from level 1 to 10) are obtained from the ground-truth video during training and customized by users during inference. The reference latent is first extracted as local content guidance and concatenated with the noise latent, a frame embedding, and the intensity embedding; this 10-channel tensor is fed into the UNet for denoising. During inference, we use the inversion of the reference latent instead of pure Gaussian noise to provide content priors. At the top, a content encoder extracts visual tokens to provide global content guidance. At the bottom, we introduce text re-weighting, which learns to emphasize the motion-related parts of the text embedding for better text-to-motion mapping. The visual and textual tokens are injected into the UNet via cross-attention. For the UNet, we freeze the pre-trained Stable Diffusion and insert motion modules to capture inter-frame relations. Flame and snowflake symbols denote trainable and frozen parameters, respectively.

Pipeline overview diagram.
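The sketch below illustrates how the conditioning described above could be assembled in PyTorch. It is a minimal sketch, not the official implementation: the 4+4+1+1 channel split (noise latent, reference latent, frame embedding, intensity embedding), the module names, and the sigmoid-based per-token weighting are assumptions made for illustration.

import torch
import torch.nn as nn

class ConditionAssembler(nn.Module):
    # Builds the 10-channel per-frame UNet input: 4-channel noise latent,
    # 4-channel reference latent, 1-channel frame embedding, and
    # 1-channel motion-intensity embedding (assumed split).
    def __init__(self, num_frames=16, num_levels=10, latent_hw=64):
        super().__init__()
        self.latent_hw = latent_hw
        self.frame_embed = nn.Embedding(num_frames, latent_hw * latent_hw)
        self.intensity_embed = nn.Embedding(num_levels, latent_hw * latent_hw)

    def forward(self, noise_latent, reference_latent, intensity_level):
        # noise_latent:     (B, F, 4, H, W) noisy video latents
        # reference_latent: (B, 4, H, W)    VAE latent of the reference image
        # intensity_level:  (B,)            integer level in [0, num_levels)
        B, F, _, H, W = noise_latent.shape
        assert H == self.latent_hw and W == self.latent_hw
        ref = reference_latent.unsqueeze(1).expand(B, F, 4, H, W)
        frame_idx = torch.arange(F, device=noise_latent.device)
        frame = self.frame_embed(frame_idx).view(1, F, 1, H, W).expand(B, -1, -1, -1, -1)
        intensity = self.intensity_embed(intensity_level).view(B, 1, 1, H, W).expand(-1, F, -1, -1, -1)
        # 4 (noise) + 4 (reference) + 1 (frame) + 1 (intensity) = 10 channels.
        return torch.cat([noise_latent, ref, frame, intensity], dim=2)  # (B, F, 10, H, W)

class TextReweighting(nn.Module):
    # Learns a per-token weight that emphasizes motion-related words in the
    # text embedding before it enters cross-attention (illustrative form).
    def __init__(self, dim=768):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, text_tokens):
        # text_tokens: (B, L, dim) CLIP text embeddings
        weights = torch.sigmoid(self.score(text_tokens))  # (B, L, 1), in (0, 1)
        return text_tokens * weights

# Shape check with dummy tensors.
assembler = ConditionAssembler()
x = assembler(torch.randn(2, 16, 4, 64, 64), torch.randn(2, 4, 64, 64),
              torch.randint(0, 10, (2,)))
print(x.shape)  # torch.Size([2, 16, 10, 64, 64])

In LivePhoto itself, this 10-channel tensor is denoised by the frozen Stable Diffusion UNet augmented with the inserted motion modules, while the re-weighted text tokens and the content encoder's visual tokens are injected through cross-attention.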

Video Introduction

BibTeX

@article{chen2023livephoto,
  title={LivePhoto: Real Image Animation with Text-guided Motion Control},
  author={Chen, Xi and Liu, Zhiheng and Chen, Mengting and Feng, Yutong and Liu, Yu and Shen, Yujun and Zhao, Hengshuang},
  journal={arXiv preprint arXiv:2312.02928},
  year={2023}
}