Paper ID | ARS-7.4
Paper Title | CAPTIONING TRANSFORMER WITH SCENE GRAPH GUIDING
Authors | Haishun Chen, Ying Wang, Xin Yang, Jie Li; Xidian University, China
Session | ARS-7: Image and Video Interpretation and Understanding 2
Location | Area H
Session Time | Wednesday, 22 September, 08:00 - 09:30
Presentation Time | Wednesday, 22 September, 08:00 - 09:30
Presentation | Poster
Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Interpretation and Understanding
Abstract | Image captioning is a challenging task that aims to generate natural-language descriptions of images. Most existing approaches adopt the encoder-decoder architecture, where the encoder takes the image as input and the decoder predicts the corresponding word sequence. However, a common defect of these methods is that the rich semantic relationships between relevant image regions are ignored, which can mislead the decoder into producing inaccurate captions. To alleviate this issue, we propose a novel model that uses the semantic relationships provided by a scene graph to guide the word generation process. To some extent, the scene graph narrows the semantic gap between images and descriptions, and hence improves the quality of the generated sentences. Extensive experimental results demonstrate that our model achieves superior performance on various quantitative metrics.
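For readers who want a concrete picture of the general idea in the abstract, below is a minimal PyTorch sketch of a scene-graph-guided captioning transformer: the caption decoder cross-attends to a joint memory built from image region features and scene-graph relation (subject, predicate, object) embeddings. This is an illustrative assumption, not the authors' implementation; all module names, feature dimensions, and the fusion scheme are hypothetical.

    # Minimal sketch (NOT the paper's actual model): scene-graph relation
    # embeddings and region features form one encoder memory that guides
    # a Transformer caption decoder. Dimensions and fusion are assumptions.
    import torch
    import torch.nn as nn

    class SceneGraphGuidedCaptioner(nn.Module):
        def __init__(self, vocab_size, d_model=512, nhead=8, num_layers=3):
            super().__init__()
            self.region_proj = nn.Linear(2048, d_model)  # CNN region features -> d_model
            self.rel_proj = nn.Linear(3 * 300, d_model)  # (subj, pred, obj) embeddings -> d_model
            enc_layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
            self.encoder = nn.TransformerEncoder(enc_layer, num_layers)
            dec_layer = nn.TransformerDecoderLayer(d_model, nhead, batch_first=True)
            self.decoder = nn.TransformerDecoder(dec_layer, num_layers)
            self.word_emb = nn.Embedding(vocab_size, d_model)
            self.out = nn.Linear(d_model, vocab_size)

        def forward(self, regions, relations, captions):
            # regions:   (B, R, 2048) pooled region features
            # relations: (B, T, 900)  concatenated subject/predicate/object embeddings
            # captions:  (B, L)       token ids (teacher forcing)
            memory = torch.cat([self.region_proj(regions),
                                self.rel_proj(relations)], dim=1)
            memory = self.encoder(memory)  # joint visual + semantic memory
            tgt = self.word_emb(captions)
            L = captions.size(1)
            # Causal mask so each position only attends to earlier words
            mask = torch.triu(torch.full((L, L), float('-inf')), diagonal=1)
            hidden = self.decoder(tgt, memory, tgt_mask=mask)
            return self.out(hidden)  # (B, L, vocab_size) logits

    # Shape check with random tensors
    model = SceneGraphGuidedCaptioner(vocab_size=10000)
    logits = model(torch.randn(2, 36, 2048), torch.randn(2, 20, 900),
                   torch.randint(0, 10000, (2, 12)))
    print(logits.shape)  # torch.Size([2, 12, 10000])

In this sketch, the "guiding" effect comes simply from letting decoder cross-attention see relation embeddings alongside region features; the paper's actual fusion mechanism may differ.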