VDT: General-purpose Video Diffusion Transformers via Mask Modeling. This work introduces the Video Diffusion Transformer (VDT), which pioneers the use of transformers in diffusion-based video generation. It features transformer blocks with modularized temporal and spatial attention modules to leverage the rich spatial-temporal representations inherent in transformers.
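A minimal sketch of what modularized temporal and spatial attention could look like, using only the Python standard library. Spatial attention mixes the patches within one frame; temporal attention mixes the same patch position across frames. The function names, the identity query/key/value projections, and the temporal-then-spatial order are assumptions made for illustration, not details confirmed by the text above; residual connections, normalization, and feed-forward layers are left out.

```python
import math


def attention(tokens):
    """Single-head self-attention with identity projections (no learned weights)."""
    dim = len(tokens[0])
    out = []
    for q in tokens:
        # Scaled dot-product scores of this query against every token.
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        out.append(
            [sum(w * v[d] for w, v in zip(weights, tokens)) / total for d in range(dim)]
        )
    return out


def spatial_attention(video):
    # video[t][n] is the feature vector of patch n in frame t.
    # Each frame attends over its own patches only.
    return [attention(frame) for frame in video]


def temporal_attention(video):
    num_frames, num_patches = len(video), len(video[0])
    # Each patch position attends over its own history across frames.
    columns = [
        attention([video[t][n] for t in range(num_frames)])
        for n in range(num_patches)
    ]
    # Transpose back to the [frame][patch] layout.
    return [[columns[n][t] for n in range(num_patches)] for t in range(num_frames)]


def block(video):
    # Assumed order: temporal attention first, then spatial attention.
    return spatial_attention(temporal_attention(video))
```

Factorizing attention this way keeps each attention call small: it runs over either the patches of one frame or the frames of one patch position, never over all frame-patch pairs at once.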
VDT: GENERAL-PURPOSE VIDEO DIFFUSION TRANSFORMERS VIA MASK MODELING - OpenReview. Our VDT showcases strong video generation potential and can seamlessly extend to and perform well on a broader array of video generation tasks through our unified spatial-temporal mask modeling mechanism, without requiring modifications to the underlying architecture.
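One way to picture how a single mask mechanism can cover several tasks without architectural changes is sketched below. A binary mask marks which frames are supplied as conditions and which are left as noise for the model to generate, so only the mask changes from task to task. The task names, the frame-level (rather than patch-level) mask, and the helpers `make_mask` and `compose_input` are hypothetical choices for illustration; the text above does not specify the exact formulation.

```python
def make_mask(task, num_frames, num_cond=2):
    """Return a per-frame mask: 1 = frame given as a condition, 0 = frame to generate."""
    if task == "unconditional":
        return [0] * num_frames
    if task == "prediction":
        # The first num_cond frames are known; the rest are generated.
        return [1] * num_cond + [0] * (num_frames - num_cond)
    if task == "interpolation":
        # The first and last frames are known; the frames between are generated.
        return [1] + [0] * (num_frames - 2) + [1]
    raise ValueError(f"unknown task: {task}")


def compose_input(noisy_frames, cond_frames, mask):
    # Keep the conditional frame where the mask is 1, the noisy frame elsewhere.
    return [c if m else f for f, c, m in zip(noisy_frames, cond_frames, mask)]


# Usage: the same compose_input call serves every task.
mask = make_mask("interpolation", num_frames=5)
model_input = compose_input(
    ["n0", "n1", "n2", "n3", "n4"], ["c0", "c1", "c2", "c3", "c4"], mask
)
```

Because the model input always has the same layout, the network itself does not need to know which task it is solving; that information lives entirely in the mask.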