Hierarchical Spatial-Temporal Video Generation Architecture

Encoding video into multi-scale latent video tokens and decoding video tokens back into the pixel domain.
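
To make the encode/decode round trip concrete, here is a minimal sketch of a multi-scale decomposition of a video along time, height, and width. It is an illustration under strong assumptions, not the architecture itself: it uses fixed average pooling and nearest-neighbour upsampling in place of a learned encoder and decoder, it keeps continuous residuals rather than producing discrete tokens, and every name in it (`downsample`, `upsample`, `encode`, `decode`, `num_scales`, `f`) is hypothetical.

```python
from typing import List

Video = List[List[List[float]]]  # indexed as [frame][row][column]


def downsample(video: Video, f: int = 2) -> Video:
    """Average-pool by factor f along time, height, and width."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    return [[[sum(video[t * f + dt][y * f + dy][x * f + dx]
                  for dt in range(f) for dy in range(f) for dx in range(f)) / f ** 3
              for x in range(W // f)]
             for y in range(H // f)]
            for t in range(T // f)]


def upsample(video: Video, f: int = 2) -> Video:
    """Nearest-neighbour upsampling by factor f along all three axes."""
    T, H, W = len(video), len(video[0]), len(video[0][0])
    return [[[video[t // f][y // f][x // f]
              for x in range(W * f)]
             for y in range(H * f)]
            for t in range(T * f)]


def combine(a: Video, b: Video, sign: float) -> Video:
    """Element-wise a + sign * b for two videos of identical shape."""
    return [[[pa + sign * pb for pa, pb in zip(ra, rb)]
             for ra, rb in zip(fa, fb)]
            for fa, fb in zip(a, b)]


def encode(video: Video, num_scales: int = 2, f: int = 2) -> List[Video]:
    """Return [coarsest video, residual at next scale, ..., residual at full scale].

    Assumes every dimension is divisible by f ** (num_scales - 1).
    """
    residuals = []
    current = video
    for _ in range(num_scales - 1):
        coarse = downsample(current, f)
        # The residual stores the detail lost by pooling at this scale.
        residuals.append(combine(current, upsample(coarse, f), -1.0))
        current = coarse
    return [current] + residuals[::-1]


def decode(scales: List[Video], f: int = 2) -> Video:
    """Rebuild the full-resolution video from coarse to fine."""
    video = scales[0]
    for residual in scales[1:]:
        video = combine(upsample(video, f), residual, 1.0)
    return video
```

In this sketch, calling `encode` on a 4-frame 4×4 clip with `num_scales=3` yields three levels of shape 1×1×1, 2×2×2, and 4×4×4, and `decode` sums them back from coarse to fine to recover the original pixels. A learned tokenizer would additionally quantize each level into discrete tokens, which this sketch leaves out.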
