| Paper ID | ARS-1.4 |
| Paper Title | GUIDANCE AND TEACHING NETWORK FOR VIDEO SALIENT OBJECT DETECTION |
| Authors | Yingxia Jiao, Wuhan University, China; Xiao Wang, Jiangxi University of Finance and Economics, China; Yu-Cheng Chou, Wuhan University, China; Shouyuan Yang, Jiangxi University of Finance and Economics, China; Ge-Peng Ji, Rong Zhu, Ge Gao, Wuhan University, China |
| Session | ARS-1: Object Detection |
| Location | Area I |
| Session Time | Tuesday, 21 September, 15:30 - 17:00 |
| Presentation Time | Tuesday, 21 September, 15:30 - 17:00 |
| Presentation | Poster |
| Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Interpretation and Understanding |
| Abstract | Owing to the difficulties of mining spatial-temporal cues, existing approaches for video salient object detection (VSOD) are limited in understanding complex and noisy scenarios and often fail to infer prominent objects. To alleviate these shortcomings, we propose a simple yet efficient architecture, termed the Guidance and Teaching Network (GTNet), to independently distil effective spatial and temporal cues with implicit guidance and explicit teaching at the feature and decision levels, respectively. Specifically, we (a) introduce a temporal modulator to implicitly bridge features from the motion branch into the appearance branch, enabling collaborative fusion of cross-modal features, and (b) utilise a motion-guided mask to propagate explicit cues during feature aggregation. This novel learning strategy achieves satisfactory results by decoupling the complex spatial-temporal cues and mapping informative cues across different modalities. Extensive experiments on three challenging benchmarks show that the proposed method runs at ∼28 fps on a single TITAN Xp GPU and performs competitively against 14 cutting-edge baselines. |
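The abstract names two fusion mechanisms: a temporal modulator for implicit, feature-level guidance from the motion branch into the appearance branch, and a motion-guided mask for explicit, decision-level teaching during feature aggregation. Below is a minimal PyTorch-style sketch of how such components could be wired together; every module name, operation, and tensor shape is an assumption for illustration, not the authors' actual GTNet implementation.

```python
# Hedged sketch of the two mechanisms named in the abstract. All designs here
# (FiLM-style modulation, residual mask gating) are assumptions; the abstract
# does not specify the exact operations used in GTNet.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TemporalModulator(nn.Module):
    """Implicit feature-level guidance (assumed design): motion features
    predict a per-pixel scale and shift that modulate the appearance
    features, fusing the two modalities without explicit supervision."""

    def __init__(self, channels: int):
        super().__init__()
        self.scale = nn.Conv2d(channels, channels, kernel_size=1)
        self.shift = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, appearance: torch.Tensor, motion: torch.Tensor) -> torch.Tensor:
        # FiLM-style modulation: appearance * (1 + s(motion)) + t(motion).
        return appearance * (1.0 + torch.tanh(self.scale(motion))) + self.shift(motion)


class MaskGuidedAggregation(nn.Module):
    """Explicit decision-level teaching (assumed design): a coarse saliency
    mask predicted by the motion branch gates the features during
    aggregation, propagating explicit cues into the decoder."""

    def __init__(self, channels: int):
        super().__init__()
        self.fuse = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, feat: torch.Tensor, motion_mask: torch.Tensor) -> torch.Tensor:
        # Resize the coarse mask logits to the feature resolution.
        mask = torch.sigmoid(
            F.interpolate(motion_mask, size=feat.shape[-2:],
                          mode="bilinear", align_corners=False))
        # Residual gating keeps unmasked information available downstream.
        return self.fuse(feat * mask + feat)


if __name__ == "__main__":
    appearance = torch.randn(1, 64, 56, 56)   # appearance-branch feature map
    motion = torch.randn(1, 64, 56, 56)       # motion-branch feature map
    coarse_mask = torch.randn(1, 1, 28, 28)   # coarse motion saliency logits

    modulated = TemporalModulator(64)(appearance, motion)
    aggregated = MaskGuidedAggregation(64)(modulated, coarse_mask)
    print(aggregated.shape)  # torch.Size([1, 64, 56, 56])
```

The split mirrors the abstract's decoupling idea: the modulator acts silently at the feature level, while the mask carries an explicit decision-level signal, so each branch can be trained and inspected independently.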