Overview. Left: Our model takes as input a set of observed pixel locations and values {(x_k, v_k)} and can then predict the value distribution for any query position x. Right: This model can be trained in a self-supervised manner by drawing random samples from training images and maximizing the likelihood of the true values at random query locations. Our approach allows us to autoregressively model distributions over many spatial signals (e.g. images, shapes, videos, polynomials) conditioned on a sparse set of sample observations (e.g. pixels).
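The self-supervised setup above can be sketched as follows. This is a minimal illustration, not the paper's architecture: the pixel-sampling step matches the description, while the predictive model is stood in by a trivial baseline (observed mean and standard deviation) purely to show where the likelihood objective enters. The function name `sample_training_pairs` and all constants are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_training_pairs(image, n_obs=32, n_query=16, rng=rng):
    """Draw random observed pixels and query pixels from one image.
    Coordinates are normalized to [0, 1]; values are pixel intensities."""
    h, w = image.shape[:2]
    idx = rng.choice(h * w, size=n_obs + n_query, replace=False)
    ys, xs = np.unravel_index(idx, (h, w))
    coords = np.stack([xs / (w - 1), ys / (h - 1)], axis=-1)  # (N, 2)
    values = image[ys, xs]                                    # (N,)
    return (coords[:n_obs], values[:n_obs]), (coords[n_obs:], values[n_obs:])

# Toy grayscale "image"
img = rng.random((8, 8))
(obs_x, obs_v), (qry_x, qry_v) = sample_training_pairs(img)

# Stand-in predictor: a Gaussian over each query value. A real model would
# condition on the observed (location, value) pairs and the query location.
mu = np.full(len(qry_v), obs_v.mean())
sigma = np.full(len(qry_v), obs_v.std() + 1e-6)

# Training objective: maximize likelihood of the true query values,
# i.e. minimize the Gaussian negative log-likelihood.
nll = 0.5 * np.mean(((qry_v - mu) / sigma) ** 2
                    + 2 * np.log(sigma) + np.log(2 * np.pi))
```

Because the loss only needs pixel values at query locations drawn from the same image, no labels are required; any image collection provides supervision for free.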
Image Completion. Top: Ground-truth image. Bottom: Three random samples generated by our approach given 32 observed pixels (visualized in the initial frame of the animation).
Shape Generation. Left: Ground-truth 3D shape and the locations of the 32 input SDF samples. Right: Sample shapes conditionally generated by our approach.
Polynomial Prediction. Given evaluations of a degree-6 polynomial (green) at a sparse set of points (red), our model allows sampling diverse possible functions (yellow).
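For intuition on why diverse samples arise from sparse observations, here is a classical baseline that exhibits the same behavior: Bayesian polynomial regression, where sampling coefficient vectors from the posterior yields many degree-6 polynomials consistent with the same few points. This is a hedged sketch of that baseline, not the paper's learned model; the observation points, prior precision `alpha`, and noise level are all assumed values.

```python
import numpy as np

rng = np.random.default_rng(1)

# Sparse observations of an underlying degree-6 polynomial (assumed data).
xs = np.array([-1.0, -0.4, 0.1, 0.7])
true_coefs = rng.normal(size=7)
noise = 0.05
ys = np.polyval(true_coefs, xs) + rng.normal(scale=noise, size=xs.shape)

# Bayesian linear regression over degree-6 monomial features:
# prior w ~ N(0, alpha^-1 I), likelihood y ~ N(Phi w, noise^2 I).
Phi = np.vander(xs, 7)               # (n_points, 7) design matrix
alpha = 1.0
A = alpha * np.eye(7) + Phi.T @ Phi / noise**2
cov = np.linalg.inv(A)               # posterior covariance over coefficients
mean = cov @ Phi.T @ ys / noise**2   # posterior mean

# Draw diverse polynomials, all consistent with the observations.
samples = rng.multivariate_normal(mean, cov, size=3)
grid = np.linspace(-1, 1, 50)
curves = np.array([np.polyval(w, grid) for w in samples])
```

With only four observations constraining seven coefficients, the posterior stays broad away from the observed points, so the sampled curves agree near the red points but diverge elsewhere, mirroring the diversity shown in the figure.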
Columns: GT · Nearest Neighbor Visualization of Initially Observed Pixels · Generated Videos.
Video Synthesis. Given a total of 1024 observed pixels spread across 30 frames, our model generates plausible videos that capture the coarse motion.
Acknowledgements
We would like to thank Deepak Pathak and the members of the CMU Visual Robot Learning lab for helpful discussions and feedback. This webpage template was borrowed from some colorful folks.