Paper ID | ARS-2.10
Paper Title | CMF: CASCADED MULTI-MODEL FUSION FOR REFERRING IMAGE SEGMENTATION
Authors | Jianhua Yang, Beijing University of Posts and Telecommunications, China; Yan Huang, Institute of Automation, Chinese Academy of Sciences, China; Zhanyu Ma, Beijing University of Posts and Telecommunications, China; Liang Wang, Institute of Automation, Chinese Academy of Sciences, China
Session | ARS-2: Image and Video Segmentation
Location | Area I
Session Time | Monday, 20 September, 15:30 - 17:00
Presentation Time | Monday, 20 September, 15:30 - 17:00
Presentation | Poster
Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Mid-Level Analysis
Abstract | In this work, we address the task of referring image segmentation (RIS), which aims at predicting a segmentation mask for the object described by a natural language expression. Most existing methods focus on establishing unidirectional or bidirectional relationships between visual and linguistic features to associate the two modalities, while multi-scale context is ignored or insufficiently modeled. Multi-scale context is crucial for localizing and segmenting objects with large scale variations during the multi-modal fusion process. To solve this problem, we propose a simple yet effective Cascaded Multi-modal Fusion (CMF) module, which stacks multiple atrous convolutional layers in parallel and further introduces a cascaded branch to fuse visual and linguistic features. The cascaded branch can progressively integrate multi-scale contextual information and facilitate the alignment of the two modalities during the multi-modal fusion process. Experimental results on four benchmark datasets demonstrate that our method outperforms most state-of-the-art methods.
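The abstract describes the CMF module at a high level: visual and linguistic features are fused, passed through parallel atrous (dilated) convolutions to capture multi-scale context, and combined through a cascaded branch that integrates the scales progressively. The sketch below is a minimal PyTorch illustration of that idea based only on the abstract; the dilation rates, channel sizes, and fusion details are assumptions for illustration, not the authors' configuration.

```python
# Minimal sketch of a Cascaded Multi-modal Fusion (CMF) style module, assuming
# PyTorch. Based only on the abstract: parallel atrous convolutions over fused
# visual-linguistic features plus a cascaded branch that progressively
# integrates multi-scale context. All hyperparameters here are illustrative.
import torch
import torch.nn as nn


class CascadedMultiModalFusion(nn.Module):
    def __init__(self, vis_dim=512, lang_dim=512, out_dim=512,
                 dilations=(1, 3, 6, 12)):
        super().__init__()
        in_dim = vis_dim + lang_dim
        # Parallel atrous (dilated) convolution branches over the fused features.
        self.parallel = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(in_dim, out_dim, 3, padding=d, dilation=d),
                nn.BatchNorm2d(out_dim),
                nn.ReLU(inplace=True),
            )
            for d in dilations
        ])
        # Cascaded branch: each step fuses the previous cascade output with the
        # next parallel branch, progressively enlarging the receptive context.
        self.cascade = nn.ModuleList([
            nn.Sequential(
                nn.Conv2d(out_dim * 2, out_dim, 1),
                nn.BatchNorm2d(out_dim),
                nn.ReLU(inplace=True),
            )
            for _ in range(len(dilations) - 1)
        ])
        self.project = nn.Conv2d(out_dim, out_dim, 1)

    def forward(self, vis_feat, lang_feat):
        # vis_feat: (B, Cv, H, W) visual features from a CNN backbone.
        # lang_feat: (B, Cl) sentence embedding of the referring expression.
        B, _, H, W = vis_feat.shape
        lang_map = lang_feat.unsqueeze(-1).unsqueeze(-1).expand(-1, -1, H, W)
        fused = torch.cat([vis_feat, lang_map], dim=1)

        branch_outs = [branch(fused) for branch in self.parallel]
        x = branch_outs[0]
        for out, fuse in zip(branch_outs[1:], self.cascade):
            x = fuse(torch.cat([x, out], dim=1))
        return self.project(x)


# Usage: a (B, 512, 26, 26) visual map and a (B, 512) language vector yield a
# (B, 512, 26, 26) multi-modal feature map for a downstream segmentation head.
cmf = CascadedMultiModalFusion()
mask_feat = cmf(torch.randn(2, 512, 26, 26), torch.randn(2, 512))
print(mask_feat.shape)  # torch.Size([2, 512, 26, 26])
```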