Paper ID | MLR-APPL-IVASR-2.1
Paper Title | TWO-PATHWAY TRANSFORMER NETWORK FOR VIDEO ACTION RECOGNITION
Authors | Bo Jiang, Jiahong Yu, Lei Zhou, Kailin Wu, Yang Yang; Netease Inc, China
Session | MLR-APPL-IVASR-2: Machine learning for image and video analysis, synthesis, and retrieval 2
Location | Area D
Session Time | Monday, 20 September, 15:30 - 17:00
Presentation Time | Monday, 20 September, 15:30 - 17:00
Presentation | Poster
Topic | Applications of Machine Learning: Machine learning for image & video analysis, synthesis, and retrieval
Abstract | Traditional two-stream neural networks have shown that both appearance and motion information are important for video action recognition. However, naively averaging the two streams' scores at the end of the framework neglects the underlying relationship between these two kinds of information. In this paper, we propose a two-pathway transformer network that uses memory-based attention to explore this relationship, which further improves classification performance. Specifically, a transformer-based decoder takes one pathway's features as the query and the other's as the key and value. Then, based on the similarity matrix estimated from the query and key, relevant information from the value is selected to enhance the query for the final classification task. Experiments demonstrate that our proposed method outperforms existing fusion strategies applied at the end of two-stream methods.
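
The query/key/value step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `cross_pathway_attention`, the feature shapes, the scaled dot-product similarity, the residual addition used to "enhance" the query, and the absence of learned projections are all assumptions made for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def cross_pathway_attention(query_feats, memory_feats):
    """Enhance one pathway's features with information from the other.

    query_feats:  (Tq, d) features of the pathway used as the query.
    memory_feats: (Tm, d) features of the other pathway, used as key and value.
    Assumption: scaled dot-product similarity and no learned projections.
    """
    d = query_feats.shape[-1]
    # Similarity matrix between every query step and every key step.
    scores = query_feats @ memory_feats.T / np.sqrt(d)  # (Tq, Tm)
    weights = softmax(scores, axis=-1)                  # each row sums to 1
    # Select relevant information from the value (here the same as the key).
    selected = weights @ memory_feats                   # (Tq, d)
    # Assumption: the query is enhanced by a residual addition.
    enhanced = query_feats + selected
    return enhanced, weights


# Hypothetical shapes: 8 appearance steps, 6 motion steps, 64-dim features.
rng = np.random.default_rng(0)
appearance = rng.standard_normal((8, 64))
motion = rng.standard_normal((6, 64))
enhanced, weights = cross_pathway_attention(appearance, motion)
```

Swapping the two arguments gives the opposite direction, with motion features as the query and appearance features as the key and value. A full transformer decoder would typically add learned projections, multiple heads, normalization, and a feed-forward layer, none of which are specified in the abstract.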