Paper ID | ARS-1.4
Paper Title | GUIDANCE AND TEACHING NETWORK FOR VIDEO SALIENT OBJECT DETECTION
Authors | Yingxia Jiao, Wuhan University, China; Xiao Wang, Jiangxi University of Finance and Economics, China; Yu-Cheng Chou, Wuhan University, China; Shouyuan Yang, Jiangxi University of Finance and Economics, China; Ge-Peng Ji, Rong Zhu, Ge Gao, Wuhan University, China
Session | ARS-1: Object Detection
Location | Area I
Session Time | Tuesday, 21 September, 15:30 - 17:00
Presentation Time | Tuesday, 21 September, 15:30 - 17:00
Presentation | Poster
Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Interpretation and Understanding
Abstract | Owing to the difficulty of mining spatial-temporal cues, existing approaches for video salient object detection (VSOD) are limited in understanding complex and noisy scenarios and often fail to infer prominent objects. To alleviate these shortcomings, we propose a simple yet efficient architecture, termed Guidance and Teaching Network (GTNet), that independently distils effective spatial and temporal cues with implicit guidance and explicit teaching at the feature and decision levels, respectively. Specifically, we (a) introduce a temporal modulator to implicitly bridge features from the motion branch into the appearance branch, enabling collaborative cross-modal feature fusion, and (b) utilise a motion-guided mask to propagate explicit cues during feature aggregation. This novel learning strategy achieves satisfactory results by decoupling complex spatial-temporal cues and mapping informative cues across different modalities. Extensive experiments on three challenging benchmarks show that the proposed method runs at ∼28 fps on a single TITAN Xp GPU and performs competitively against 14 cutting-edge baselines.
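To make the two mechanisms described in the abstract concrete, the sketch below illustrates one plausible way a temporal modulator could inject motion features into an appearance branch (feature-level guidance) and a motion-predicted mask could gate feature aggregation (decision-level teaching). This is not the authors' implementation; all module names, tensor shapes, and layer choices are assumptions for illustration only.

```python
# Hedged sketch of GTNet-style feature-level guidance and decision-level teaching.
# Shapes, layer choices, and names are assumptions, not the paper's actual code.
import torch
import torch.nn as nn


class TemporalModulator(nn.Module):
    """Fuses a motion feature map into an appearance feature map (implicit guidance)."""

    def __init__(self, channels: int):
        super().__init__()
        # Predict a per-position scale and shift from the motion features.
        self.to_scale = nn.Conv2d(channels, channels, kernel_size=1)
        self.to_shift = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        scale = torch.sigmoid(self.to_scale(motion))  # learned gating from motion cues
        shift = self.to_shift(motion)
        return appearance * scale + shift


class MaskGuidedAggregation(nn.Module):
    """Uses a coarse motion-predicted mask to re-weight features during aggregation."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feature: torch.Tensor, motion_mask: torch.Tensor) -> torch.Tensor:
        # motion_mask: (B, 1, H, W) saliency probability from the motion branch,
        # broadcast over channels to emphasise moving regions before fusion.
        return self.fuse(feature * motion_mask + feature)


if __name__ == "__main__":
    app = torch.randn(2, 64, 56, 56)   # appearance-branch features
    mot = torch.randn(2, 64, 56, 56)   # motion (e.g. optical-flow) features
    mask = torch.rand(2, 1, 56, 56)    # coarse motion-guided mask
    fused = TemporalModulator(64)(app, mot)
    out = MaskGuidedAggregation(64)(fused, mask)
    print(out.shape)                   # torch.Size([2, 64, 56, 56])
```

The sketch only shows the shape of the idea: the appearance and motion streams stay decoupled, and motion information enters once as a learned modulation and once as an explicit mask, matching the guidance/teaching split described in the abstract.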