LARP: Tokenizing Videos 🎬 with a Learned Autoregressive Generative Prior 🚀

University of Maryland, College Park
Teaser figure:

(a) LARP is a video tokenizer for two-stage video generative models. In the first stage, the LARP tokenizer is trained with a lightweight AR prior model to learn an AR-friendly latent space. In the second stage, an AR generative model is trained on LARP's discrete tokens to synthesize high-fidelity videos. (b) Incorporating the AR prior model significantly improves the generation FVD (gFVD) across various token count configurations. (c) LARP shows a much smaller gap between its reconstruction FVD (rFVD) and generation FVD (gFVD), indicating the effectiveness of its optimized latent space.
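To make the two-stage pipeline concrete, here is a minimal sketch of how the stages fit together at inference time. It is a sketch only, not the released API: `ar_generator`, `larp_tokenizer`, and their `sample`/`decode` methods are illustrative placeholders.

```python
import torch

# Hypothetical inference pipeline for the two-stage setup. `ar_generator`
# (stage 2) and `larp_tokenizer` (stage 1), along with their `sample` and
# `decode` methods, are placeholder names, not the official interface.
@torch.no_grad()
def generate_video(ar_generator, larp_tokenizer, class_label, num_tokens=1024):
    # Stage 2: autoregressively sample a discrete token sequence conditioned
    # on the class label.
    tokens = ar_generator.sample(class_label, max_new_tokens=num_tokens)  # (1, num_tokens)
    # Stage 1: the LARP decoder maps the holistic tokens back to pixel space.
    return larp_tokenizer.decode(tokens)  # video tensor, e.g. (1, C, T, H, W)
```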

Abstract

We present LARP, a novel video tokenizer designed to overcome limitations in current video tokenization methods for autoregressive (AR) generative models. Unlike traditional patchwise tokenizers that directly encode local visual patches into discrete tokens, LARP introduces a holistic tokenization scheme that gathers information from the visual content using a set of learned holistic queries. This design allows LARP to capture more global and semantic representations, rather than being limited to local patch-level information. Furthermore, it offers flexibility by supporting an arbitrary number of discrete tokens, enabling adaptive and efficient tokenization based on the specific requirements of the task. To align the discrete token space with downstream AR generation tasks, LARP integrates a lightweight AR transformer as a training-time prior model that predicts the next token in its discrete latent space. By incorporating the prior model during training, LARP learns a latent space that is not only optimized for video reconstruction but is also structured in a way that is more conducive to autoregressive generation. Moreover, this process defines a sequential order for the discrete tokens, progressively pushing them toward an optimal configuration during training, ensuring smoother and more accurate AR generation at inference time. Comprehensive experiments demonstrate LARP's strong performance, achieving state-of-the-art FVD on the UCF101 class-conditional video generation benchmark. LARP enhances the compatibility of AR models with videos and opens up the potential to build unified high-fidelity multimodal large language models (MLLMs).
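To make the holistic tokenization scheme concrete, below is a minimal PyTorch-style sketch in which a set of learned queries is encoded jointly with video patch embeddings and the query outputs are vector-quantized into discrete tokens. The plain `nn.TransformerEncoder` backbone, the straight-through quantizer, and all hyperparameters are simplifying assumptions rather than the paper's exact architecture; varying `num_queries` illustrates how the number of discrete tokens can be chosen freely.

```python
import torch
import torch.nn as nn


class HolisticTokenizerSketch(nn.Module):
    """Sketch: learned queries gather global video information and are
    vector-quantized into discrete tokens (straight-through estimator)."""

    def __init__(self, dim=512, num_queries=256, codebook_size=1024):
        super().__init__()
        # Learned holistic queries; their number sets the token count.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=6)
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, patch_emb):
        # patch_emb: (B, N_patches, dim) from a standard video patchify stem.
        b = patch_emb.size(0)
        q = self.queries.unsqueeze(0).expand(b, -1, -1)
        # Queries are encoded jointly with the patches; only the query outputs
        # are kept, so each token summarizes global rather than local content.
        x = self.encoder(torch.cat([q, patch_emb], dim=1))[:, : q.size(1)]
        # Nearest-codebook-entry quantization with a straight-through estimator.
        dists = (x.unsqueeze(2) - self.codebook.weight).pow(2).sum(-1)  # (B, Q, K)
        indices = dists.argmin(dim=-1)                                  # discrete tokens
        quantized = self.codebook(indices)
        quantized = x + (quantized - x).detach()  # gradients pass through to x
        return quantized, indices
```

Because every query can attend to the whole patch sequence, the resulting tokens summarize global content instead of fixed local patches.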



🔥 Highlights

🌟 We present LARP, a novel video tokenizer that enables flexible, holistic tokenization, allowing for more semantic and global video representations.


🚀 LARP features a learned AR generative prior, achieved by co-training an AR prior model, which effectively aligns LARP's latent space with the downstream AR generation task.


🏆 LARP significantly improves video generation quality for AR models across varying token sequence lengths, achieving state-of-the-art FVD performance on the UCF101 class-conditional video generation benchmark and outperforming all AR methods on the K600 frame prediction benchmark.

Method Overview

Method overview figure:

Cubes represent video patches, circles indicate continuous embeddings, and squares denote discrete tokens. (a) The patchwise video tokenizer used in previous works. (b) Left: The LARP tokenizer tokenizes videos holistically, gathering information from the entire video using a set of learned queries. Right: The AR prior model, trained jointly with LARP, predicts the next holistic token, yielding a latent space optimized for AR generation. The AR prior model is forwarded in two rounds per iteration: the red arrow represents the first round, and the purple arrows represent the second round. The reconstruction loss ℒrec is omitted for simplicity.
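The sketch below shows, in simplified form, how a next-token prediction loss from a lightweight AR prior can be attached to tokenizer training. It collapses the two-round scheme from the figure into a single teacher-forced pass, and `ar_prior`, `bos_embedding`, and the loss weighting are illustrative assumptions, not the official implementation.

```python
import torch
import torch.nn.functional as F


def ar_prior_loss(quantized, indices, ar_prior, bos_embedding):
    """Next-token prediction loss of the lightweight AR prior (sketch).

    quantized:     (B, Q, D) straight-through token embeddings from the tokenizer.
    indices:       (B, Q)    discrete token ids, used as prediction targets.
    ar_prior:      a small causal transformer mapping (B, Q, D) -> (B, Q, K) logits.
    bos_embedding: a learned (1, 1, D) start-of-sequence embedding.
    """
    b = quantized.size(0)
    bos = bos_embedding.expand(b, -1, -1)
    # Teacher forcing: [BOS, t_1, ..., t_{Q-1}] is used to predict [t_1, ..., t_Q],
    # which also fixes a sequential order over the holistic tokens.
    inputs = torch.cat([bos, quantized[:, :-1]], dim=1)
    logits = ar_prior(inputs)
    # Gradients flow back to the tokenizer encoder through the straight-through
    # embeddings, nudging the latent space toward easy AR prediction.
    return F.cross_entropy(logits.flatten(0, 1), indices.flatten())


# During tokenizer training (sketch), this loss is added to the usual
# reconstruction / quantization objectives:
#   loss = rec_loss + vq_loss + lambda_prior * ar_prior_loss(z_q, ids, ar_prior, bos)
```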

SOTA Results

Results table:

Results are grouped by the type of generative model. The scores for MAGVIT-AR and MAGVIT-v2-AR are taken from the appendix of MAGVIT-v2. LARP-L-Long denotes LARP-L trained for more epochs. Our best results are obtained with a larger AR generator.

Visualization


UCF101 Class-conditional Generation


Kinetics-600 Frame Prediction

For each group, the left side shows the first 5 frames of the video used as input, and the right side shows the entire predicted video.

BibTeX

@article{larp,
    title={LARP: Tokenizing Videos with a Learned Autoregressive Generative Prior},
    author={Wang, Hanyu and Suri, Saksham and Ren, Yixuan and Chen, Hao and Shrivastava, Abhinav},
    journal={arXiv preprint arXiv:2410.21264},
    year={2024}
}