Paper ID | 3D-4.12
Paper Title | RETHINKING TRAINING OBJECTIVE FOR SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION: SEMANTIC CUES TO RESCUE
Authors | Keyao Li, Ge Li, Peking University Shenzhen Graduate School, China; Thomas Li, Peking University, China
Session | 3D-4: 3D Image and Video Processing
Location | Area J
Session Time | Tuesday, 21 September, 13:30 - 15:00
Presentation Time | Tuesday, 21 September, 13:30 - 15:00
Presentation | Poster
Topic | Three-Dimensional Image and Video Processing: Image and video processing for augmented and virtual reality
Abstract | Monocular depth estimation finds wide application in modeling 3D scenes. Since collecting ground-truth depth labels for supervision is expensive, much work has been done in a self-supervised manner. A common practice is to train the network by optimizing a photometric objective (i.e., view synthesis) due to its effectiveness. However, this training objective is sensitive to optical changes and lacks consideration of object-level cues, which leads to sub-optimal results in some cases, e.g., artifacts in complex regions and depth discontinuities around thin structures. We summarize these failures as depth ambiguities. In this paper, we propose a simple yet effective architecture that introduces semantic cues into supervision to address these problems. First, through our study of the failure cases, we show that they stem from the limitations of the commonly applied photometric reconstruction objective. We then propose a method that uses semantic cues to encode the geometry constraint behind view synthesis. The proposed objective is more reliable for confusing pixels and also incorporates object-level perception. Experiments show that, without introducing extra inference complexity, our method greatly alleviates depth ambiguities and performs comparably with state-of-the-art methods on the KITTI benchmark.
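For context, the photometric (view-synthesis) objective that the abstract identifies as the source of depth ambiguities is commonly implemented as a weighted sum of an SSIM dissimilarity term and an L1 error between the target frame and the view synthesized from the predicted depth. The NumPy sketch below shows that standard formulation only; the `alpha = 0.85` weighting follows common practice in prior self-supervised depth work and is an assumption, not taken from this paper, and the paper's semantic-cue augmentation of the objective is not reproduced here.

```python
import numpy as np

def ssim(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """Simplified global SSIM between two grayscale images in [0, 1].

    Real pipelines compute SSIM over a small sliding window (e.g., 3x3);
    a single global statistic is used here to keep the sketch short.
    """
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = ((x - mu_x) * (y - mu_y)).mean()
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    )

def photometric_loss(target, reconstructed, alpha=0.85):
    """Standard photometric reconstruction objective: SSIM + L1.

    `reconstructed` is the target view synthesized from a source frame
    using the predicted depth and camera pose (warping not shown).
    """
    l1 = np.abs(target - reconstructed).mean()
    ssim_term = (1.0 - ssim(target, reconstructed)) / 2.0
    return alpha * ssim_term + (1.0 - alpha) * l1
```

A perfect reconstruction drives this loss to zero, while any photometric mismatch (including the lighting changes and thin structures the abstract mentions) increases it, which is precisely why the objective alone can be misleading for such pixels.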