Paper ID | SMR-3.4 | ||
Paper Title | SELF-SUPERVISION BY PREDICTION FOR OBJECT DISCOVERY IN VIDEOS | ||
Authors | Beril Besbinar, Pascal Frossard, École Polytechnique Fédérale de Lausanne (EPFL), Switzerland | ||
Session | SMR-3: Image and Video Representation | ||
Location | Area F | ||
Session Time | Tuesday, 21 September, 15:30 - 17:00 | ||
Presentation Time | Tuesday, 21 September, 15:30 - 17:00 | ||
Presentation | Poster | ||
Topic | Image and Video Sensing, Modeling, and Representation: Image & video representation | ||
Abstract | Despite their irresistible success, deep learning algorithms still heavily rely on annotated data, and unsupervised settings pose many challenges, such as finding the right inductive bias in diverse scenarios. In this paper, we propose an object-centric model for image sequence representation that uses the prediction task for self-supervision. By disentangling object representation and motion dynamics, our novel compositional structure explicitly handles occlusion and inpaints inferred objects and background for the composition of the predicted frame. Using auxiliary losses to promote spatially and temporally consistent object representations, we train our self-supervised framework without the help of any annotation or pretrained network. Initial experiments confirm that our new pipeline is a promising step towards object-centric video prediction. |