Visual Speech Recognition for European Portuguese


Student: Hélder Abreu
Supervisors: Carlos Silva (DEI) and Miguel Dias (Microsoft)

Abstract

Speech recognition based on visual features began in the early 1980s, embedded in Audio-Visual Speech Recognition systems. In fact, the initial purpose of using visual cues was to increase the robustness of Automatic Speech Recognition systems, which rapidly lose accuracy in noisy environments. However, the potential to maintain good accuracy whenever the acoustic stream cannot be used, and in any other situation where a human lip reader would be needed, led researchers to create and explore the Visual Speech Recognition (VSR) field.
Traditional VSR systems used only RGB information, following a unimodal approach, since the addition of other visual modalities could be expensive and present synchronization issues. The release of the Microsoft Kinect sensor brought new possibilities to the speech recognition field. This sensor includes a microphone array, an RGB camera and a depth sensor. Furthermore, all its input modalities can be synchronized using the features of its SDK. Recently, Microsoft released the new Kinect One, offering a better camera and a different, improved depth-sensing technology.
This thesis sets out the hypothesis that, by using the available HCI input modalities of such a sensor, namely RGB video and depth, as well as the skeletal tracking features available in its SDK, and by adopting a multimodal articulatory VSR approach, we can improve the word recognition accuracy of a VSR system compared to a unimodal approach that uses only RGB data.
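As a rough illustration of what such a multimodal approach implies at the feature level, the sketch below shows early (feature-level) fusion of per-frame RGB and depth descriptors into a single observation vector. The extractors, patch sizes and dimensions here are hypothetical placeholders, not the actual features used in ViKi:

```python
import numpy as np

# Hypothetical per-frame feature extractors; in a real system these would be
# computed from the Kinect's synchronized RGB and depth streams.
def rgb_features(rgb_patch):
    # Stand-in for appearance descriptors of the mouth region.
    return np.asarray(rgb_patch, dtype=float).ravel()

def depth_features(depth_patch):
    # Stand-in for depth-derived cues, e.g. lip protrusion statistics.
    return np.asarray(depth_patch, dtype=float).ravel()

def fuse(rgb_patch, depth_patch):
    """Feature-level (early) fusion: concatenate the per-frame vectors of
    both modalities into one observation vector for the recognizer."""
    return np.concatenate([rgb_features(rgb_patch), depth_features(depth_patch)])

# Toy usage with one synchronized RGB/depth frame pair.
rgb = np.random.rand(4, 4, 3)   # cropped RGB mouth patch (toy size)
depth = np.random.rand(4, 4)    # aligned depth patch (toy size)
observation = fuse(rgb, depth)
print(observation.shape)        # (64,) = 48 RGB values + 16 depth values
```

Dropping the `depth_features` half of the concatenation yields the RGB-only unimodal baseline that the hypothesis compares against.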
Regarding the feature extraction process, recent approaches based on articulatory features have shown promising results when compared to standard shape-based viseme approaches. In this thesis, we also aim to verify the hypothesis that an articulatory VSR approach can outperform a shape-based one in terms of word recognition rate.
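To make the contrast concrete, the toy sketch below shows the two feature paradigms side by side: a shape-based approach maps each mouth frame to a single viseme class, while an articulatory approach describes the same frame with several parallel articulator attributes. All labels and values here are hypothetical, not the actual feature set used in the thesis:

```python
# Shape-based paradigm: each frame is mapped to one viseme class,
# typically by a single classifier over mouth-shape features.
viseme_label = "bilabial_closure"

# Articulatory paradigm: the same frame is described by several parallel
# attributes of the visible articulators, each forming its own stream.
articulatory_features = {
    "lip_rounding":      "rounded",
    "lip_opening":       "closed",
    "teeth_visibility":  "hidden",
    "tongue_visibility": "hidden",
}

# A word recognizer can then combine the attribute streams instead of a
# single viseme stream, which is the articulatory hypothesis under test.
print(viseme_label)
print(articulatory_features)
```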
The VSR system developed in this thesis, named ViKi (Visual Speech Recognition for Kinect), achieved a 68% word recognition rate in a scenario where 8 speakers pronounced a vocabulary of 25 isolated words, outperforming our tested unimodal approach. The use of depth information proved to increase the system's accuracy for both the articulatory (+8%) and the shape-based approach (+2%). In a speaker-dependent context, ViKi also achieved an interesting average accuracy of ≈70%. The articulatory approach performed worse than the shape-based one, reaching 34% word accuracy, contrary to what previous research based on appearance approaches reported, and thus not confirming our third hypothesis.

Thesis Download (PDF)
