
ICIP 2021


Paper Detail

Paper ID: 3D-4.12
Paper Title: RETHINKING TRAINING OBJECTIVE FOR SELF-SUPERVISED MONOCULAR DEPTH ESTIMATION: SEMANTIC CUES TO RESCUE
Authors: Keyao Li, Ge Li, Peking University Shenzhen Graduate School, China; Thomas Li, Peking University, China
Session: 3D-4: 3D Image and Video Processing
Location: Area J
Session Time: Tuesday, 21 September, 13:30 - 15:00
Presentation Time: Tuesday, 21 September, 13:30 - 15:00
Presentation: Poster
Topic: Three-Dimensional Image and Video Processing: Image and video processing for augmented and virtual reality
IEEE Xplore: Open Preview available
Abstract: Monocular depth estimation finds a wide range of applications in modeling 3D scenes. Since collecting ground-truth depth labels to supervise training is expensive, much work has been done in a self-supervised manner. A common practice is to train the network by optimizing a photometric objective (i.e., view synthesis) due to its effectiveness. However, this training objective is sensitive to optical changes and does not account for object-level cues, which leads to sub-optimal results in some cases, e.g., artifacts in complex regions and depth discontinuities around thin structures. We summarize these as depth ambiguities. In this paper, we propose a simple yet effective architecture that introduces semantic cues into the supervision to solve the problems mentioned above. First, through our study of these problems, we find that they stem from a limitation of the commonly applied photometric reconstruction training objective. We then propose a method that uses semantic cues to encode the geometric constraint behind view synthesis. The proposed objective is more reliable for confusing pixels and also incorporates object-level perception. Experiments show that, without introducing extra inference complexity, our method greatly alleviates depth ambiguities and performs comparably with state-of-the-art methods on the KITTI benchmark.
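The photometric objective the abstract refers to is typically a per-pixel reconstruction error between the target frame and a view synthesized from a neighboring frame. Below is a minimal sketch of one common formulation used in self-supervised depth work (an SSIM + L1 blend, as popularized by prior methods such as Monodepth2); the function names, the single-window SSIM, and the weighting are illustrative simplifications and are not this paper's implementation:

```python
import numpy as np

def l1_loss(target, recon):
    """Mean absolute photometric error between the target view and a view
    `recon` synthesized from a neighboring frame (names are illustrative)."""
    return np.abs(target - recon).mean()

def ssim_global(x, y, c1=0.01**2, c2=0.03**2):
    # Crude single-window SSIM computed over the whole image; real
    # pipelines use a small sliding window (e.g., 3x3 average pooling).
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx**2 + my**2 + c1) * (vx + vy + c2))

def photometric_loss(target, recon, alpha=0.85):
    # Common weighting: alpha on the SSIM term, (1 - alpha) on L1.
    return alpha * (1.0 - ssim_global(target, recon)) / 2.0 \
        + (1.0 - alpha) * l1_loss(target, recon)
```

A perfect reconstruction drives this loss to zero, but the abstract's point is that pixels violating the photometric-consistency assumption (specularities, occlusions, thin structures) can also score well here, which is the failure mode the paper's semantic cues are meant to address.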