Paper ID | ARS-6.7
Paper Title | SPEAKER-INDEPENDENT LIPREADING BY DISENTANGLED REPRESENTATION LEARNING
Authors | Qun Zhang, Shilin Wang, Gongliang Chen, Shanghai Jiao Tong University, China
Session | ARS-6: Image and Video Interpretation and Understanding 1
Location | Area H
Session Time | Tuesday, 21 September, 15:30 - 17:00
Presentation Time | Tuesday, 21 September, 15:30 - 17:00
Presentation | Poster
Topic | Image and Video Analysis, Synthesis, and Retrieval: Image & Video Interpretation and Understanding
IEEE Xplore Open Preview | Available in IEEE Xplore
Abstract | With the development of deep learning technology, automatic lipreading based on deep neural networks can achieve reliable results for speakers that appear in the training dataset. However, speaker-independent lipreading, i.e., lipreading for unseen speakers, remains a challenging task, especially when training samples are limited. To improve recognition performance in the speaker-independent scenario, this paper proposes a new deep neural network structure, the Disentangled Visual Speech Recognition Network (DVSR-Net). DVSR-Net is designed to disentangle identity-related features and content-related features from the lip image sequence. To further eliminate identity information remaining in the content features, a content feature refinement stage is introduced during network optimization. In this way, the extracted features are closely tied to the content information and invariant to varying talking styles, so speech recognition performance for unseen speakers can be improved. Experiments on two widely used datasets demonstrate the effectiveness of the proposed network in the speaker-independent scenario.
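The abstract does not specify DVSR-Net's exact branch structure or refinement loss, so the sketch below is only a hedged illustration of the general disentanglement idea: split a per-frame embedding into a content part and an identity part, then score residual identity information in the content branch with a cosine-similarity penalty. All names (`split_features`, `identity_leakage`), the half/half split, and the penalty itself are assumptions for illustration, not the paper's actual design.

```python
import math

def split_features(frame_embedding, content_dim):
    """Split one frame embedding into (content, identity) parts.
    Illustrative only: DVSR-Net learns two encoder branches; this
    fixed slicing merely mimics that two-branch interface."""
    return frame_embedding[:content_dim], frame_embedding[content_dim:]

def identity_leakage(content_seq, identity_seq):
    """Mean |cosine similarity| between per-frame content and identity
    features -- a generic disentanglement penalty (an assumption, not
    the paper's exact refinement loss). Driving it toward 0 pushes
    residual identity information out of the content features."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv)
    sims = [abs(cos(c, i)) for c, i in zip(content_seq, identity_seq)]
    return sum(sims) / len(sims)

# Toy lip-sequence embeddings: 3 frames, 8-dim each, split 4/4.
frames = [
    [0.9, 0.1, -0.3, 0.5, 0.2, 0.2, 0.1, -0.4],
    [0.7, -0.2, 0.4, 0.1, 0.3, -0.1, 0.2, 0.6],
    [-0.1, 0.8, 0.2, -0.5, 0.1, 0.4, -0.3, 0.2],
]
content, identity = zip(*(split_features(f, 4) for f in frames))
leakage = identity_leakage(content, identity)
print(round(leakage, 3))  # in [0, 1]; lower = less identity in content
```

In a trained model such a penalty would be one term of the optimization objective, alongside the speech recognition loss, so that the content features become speaker-invariant.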