Paper ID | MLR-APPL-IP-7.6 | ||
Paper Title | VISION AND TEXT TRANSFORMER FOR PREDICTING ANSWERABILITY ON VISUAL QUESTION ANSWERING | ||
Authors | Tung Le, Japan Advanced Institute of Science and Technology, Japan; Huy Tien Nguyen, University of Science, Vietnam National University Ho Chi Minh City, Zalo Research Center, Viet Nam; Minh Le Nguyen, Japan Advanced Institute of Science and Technology, Japan | ||
Session | MLR-APPL-IP-7: Machine learning for image processing 7 | ||
Location | Area E | ||
Session Time | Wednesday, 22 September, 08:00 - 09:30 | ||
Presentation Time | Wednesday, 22 September, 08:00 - 09:30 | ||
Presentation | Poster | ||
Topic | Applications of Machine Learning: Machine learning for image processing | ||
Abstract | Answerability on Visual Question Answering is a novel task that predicts an answerability score for an image-question pair in multi-modal data. Existing works often derive Answerability through a binary mapping from the output of visual question answering systems, which does not reflect the essence of the problem. We instead treat Answerability as a regression task and propose VT-Transformer, which exploits visual and textual features through a Transformer architecture. Experimental results on the VizWiz 2020 dataset show the effectiveness and robustness of VT-Transformer for Answerability on Visual Question Answering compared with competitive baselines. |