Paper ID | MLR-APPL-IVASR-6.11
Paper Title | SEMANTIC ROLE AWARE CORRELATION TRANSFORMER FOR TEXT TO VIDEO RETRIEVAL
Authors | Burak Satar, Hongyuan Zhu, Agency for Science, Technology and Research (A*STAR), Singapore; Xavier Bresson, Nanyang Technological University (NTU), Singapore; Joo-Hwee Lim, Agency for Science, Technology and Research (A*STAR), Singapore
Session | MLR-APPL-IVASR-6: Machine learning for image and video analysis, synthesis, and retrieval 6
Location | Area D
Session Time | Wednesday, 22 September, 08:00 - 09:30
Presentation Time | Wednesday, 22 September, 08:00 - 09:30
Presentation | Poster
Topic | Applications of Machine Learning: Machine learning for image & video analysis, synthesis, and retrieval
Abstract | With the emergence of social media, vast numbers of video clips are uploaded every day, and retrieving the most relevant visual content for a language query has become critical. Most approaches aim to learn a joint embedding space for plain textual and visual content without adequately exploiting intra-modality structures and inter-modality correlations. This paper proposes a novel transformer that explicitly disentangles text and video into three semantic roles, namely objects, spatial contexts, and temporal contexts, with an attention scheme that learns the intra- and inter-role correlations among the three roles to discover discriminative features for matching at different levels. Preliminary results on the popular YouCook2 dataset indicate that our approach surpasses a current state-of-the-art method by a large margin on all metrics, and also outperforms two other state-of-the-art methods on two metrics.
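Since the listing includes only the abstract, the following is a minimal PyTorch sketch of the role-correlation idea it describes: each modality is split into three role embeddings (objects, spatial context, temporal context), refined within each role, correlated across roles by attention, and matched in a joint space. All module names, dimensions, and the similarity objective are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of a semantic-role-aware correlation step (assumed design,
# reconstructed only from the abstract, not from the paper's code).
import torch
import torch.nn as nn
import torch.nn.functional as F


class RoleCorrelationEncoder(nn.Module):
    """Refines three role tokens with intra-role layers, then models
    inter-role correlations with self-attention over the role sequence."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        # Intra-role refinement: one projection per role (assumption; the
        # paper may use deeper per-role encoders).
        self.intra = nn.ModuleList(nn.Linear(dim, dim) for _ in range(3))
        # Inter-role correlation: attention across the 3-role sequence.
        self.inter = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, roles: torch.Tensor) -> torch.Tensor:
        # roles: (batch, 3, dim) -- [objects, spatial ctx, temporal ctx]
        refined = torch.stack(
            [f(roles[:, i]) for i, f in enumerate(self.intra)], dim=1
        )
        attended, _ = self.inter(refined, refined, refined)
        return self.norm(refined + attended)  # residual + layer norm


def match_score(text_roles: torch.Tensor,
                video_roles: torch.Tensor) -> torch.Tensor:
    """Cosine similarity per role, averaged over the three roles --
    a simple stand-in for matching 'at different levels'."""
    t = F.normalize(text_roles, dim=-1)
    v = F.normalize(video_roles, dim=-1)
    return (t * v).sum(-1).mean(-1)  # (batch,)


if __name__ == "__main__":
    text_enc, video_enc = RoleCorrelationEncoder(), RoleCorrelationEncoder()
    text = torch.randn(2, 3, 256)   # 3 role embeddings per caption
    video = torch.randn(2, 3, 256)  # 3 role embeddings per clip
    print(match_score(text_enc(text), video_enc(video)))
```

In a retrieval setting, scores like these would typically feed a contrastive or ranking loss over caption-clip pairs; that training objective is likewise an assumption here, as the listing does not specify it.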