MoLA: Motion Generation and Editing
with Latent Diffusion Enhanced
by Adversarial Training

Kengo Uchida1, Takashi Shibuya1, Yuhta Takida1, Naoki Murata1, Julian Tanke1, Shusuke Takahashi2, Yuki Mitsufuji1,2
1Sony AI, 2Sony Group Corporation

Abstract

In text-to-motion generation, controllability has become increasingly critical, alongside generation quality and speed. The controllability challenges include generating a motion whose length matches the given textual description and editing generated motions according to control signals, such as start-end positions and the pelvis trajectory. In this paper, we propose MoLA, which provides fast, high-quality, variable-length motion generation and can also handle multiple editing tasks in a single framework. Our approach revisits the motion representation used as the model's inputs and outputs, incorporating an activation variable to enable variable-length motion generation. Additionally, we integrate a variational autoencoder and a latent diffusion model, further enhanced through adversarial training, to achieve high-quality and fast generation. Moreover, we apply a training-free guided generation framework to achieve various editing tasks with motion control inputs. We quantitatively show the effectiveness of adversarial learning in text-to-motion generation, and demonstrate the applicability of our editing framework to multiple editing tasks in the motion domain.
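A minimal sketch of the activation-variable idea mentioned above, not the authors' released code: each frame of a fixed-length motion tensor gets an extra channel marking whether the frame is valid, so the model can produce variable-length motions and the length can be recovered at decoding time by thresholding that channel. The names MAX_LEN, FEAT_DIM, and the 0.5 threshold are illustrative assumptions.

```python
# Sketch of a variable-length motion representation with an activation channel.
# MAX_LEN, FEAT_DIM, and the threshold are assumptions for illustration only.
import torch

MAX_LEN = 196    # maximum number of frames handled by the model (assumed)
FEAT_DIM = 263   # per-frame motion feature dimension (assumed, HumanML3D-style)

def to_padded_with_activation(motion: torch.Tensor) -> torch.Tensor:
    """Pad a (T, FEAT_DIM) motion to (MAX_LEN, FEAT_DIM + 1).

    The extra channel is 1 for valid frames and 0 for padding, letting the
    generator decide the motion length on its own.
    """
    T = motion.shape[0]
    padded = torch.zeros(MAX_LEN, FEAT_DIM + 1)
    padded[:T, :FEAT_DIM] = motion
    padded[:T, FEAT_DIM] = 1.0  # activation variable marks valid frames
    return padded

def from_padded_with_activation(padded: torch.Tensor, thr: float = 0.5):
    """Recover the motion and its length by thresholding the activation channel."""
    active = padded[:, FEAT_DIM] > thr
    length = int(active.sum().item())
    return padded[:length, :FEAT_DIM], length
```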

Motivation


MoLA achieves 1) fast generation, 2) high generation quality, and 3) multiple motion editing tasks in a training-free manner, and it significantly extends the performance boundaries (in terms of generation quality and speed) of methods that enable training-free editing.

Pipeline


The overall framework of MoLA. Stage 1: A motion VAE enhanced by adversarial training learns a low-dimensional latent representation of diverse motion sequences. Stage 2: A text-conditioned latent diffusion model leverages this representation for fast and high-quality text-to-motion generation. Guided generation: During inference, a gradient-based method minimizes a loss function for each desired editing task, enabling multiple motion editing tasks within a unified framework.
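The guided-generation step can be illustrated with a short sketch, again not the released implementation: at each denoising step of the latent diffusion model, the predicted clean latent is decoded, a task-specific editing loss (e.g., distance of the pelvis trajectory from a target path, or start-end position constraints) is evaluated, and its gradient nudges the latent before the next reverse-diffusion update. The interfaces denoiser, vae_decoder, edit_loss, scheduler.predict_clean, scheduler.step, and guidance_scale are assumed placeholders, not the paper's exact API.

```python
# Sketch of one training-free guided denoising step in latent space.
# All module interfaces below are assumptions used for illustration.
import torch

def guided_denoise_step(z_t, t, text_emb, denoiser, vae_decoder,
                        edit_loss, scheduler, guidance_scale=1.0):
    z_t = z_t.detach().requires_grad_(True)

    # 1) Standard text-conditioned noise prediction in latent space.
    eps = denoiser(z_t, t, text_emb)
    z0_hat = scheduler.predict_clean(z_t, eps, t)  # estimate of the clean latent

    # 2) Decode the estimate and evaluate the editing loss for the desired task
    #    (in-betweening, trajectory following, etc.).
    motion_hat = vae_decoder(z0_hat)
    loss = edit_loss(motion_hat)

    # 3) Gradient-based correction of the noisy latent (training-free guidance).
    grad = torch.autograd.grad(loss, z_t)[0]
    z_t_guided = z_t - guidance_scale * grad

    # 4) Proceed with the usual reverse-diffusion update from the guided latent.
    return scheduler.step(eps, t, z_t_guided.detach())
```

Because the guidance only touches the sampling loop, switching editing tasks amounts to swapping the edit_loss function, which is what allows multiple editing tasks to share a single framework.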

Results


Motion Generation Demo

Motion Editing Demo

BibTeX

@article{uchida2024mola,
  title={MoLA: Motion Generation and Editing with Latent Diffusion Enhanced by Adversarial Training},
  author={Uchida, Kengo and Shibuya, Takashi and Takida, Yuhta and Murata, Naoki and Tanke, Julian and Takahashi, Shusuke and Mitsufuji, Yuki},
  journal={arXiv preprint arXiv:2406.01867},
  year={2024}
}