A comparison of neural‑based visual recognisers for speech activity detection

Raza, Sajjadali and Cuayahuitl, Heriberto (2022) A comparison of neural‑based visual recognisers for speech activity detection. International Journal of Speech Technology. ISSN 1572-8110

Full content URL: https://doi.org/10.1007/s10772-021-09956-3

Available under License Creative Commons Attribution 4.0 International.

Item Type: Article
Item Status: Live Archive


Existing literature on speech activity detection (SAD) highlights different approaches within neural networks but does not provide a comprehensive comparison of these methods. Such a comparison is important because neural approaches often require hardware-intensive resources. In this article, we provide a comparative analysis of three different approaches: classification with still images (CNN model), classification based on previous images (CRNN model), and classification of sequences of images (Seq2Seq model). Our experimental results using the Vid-TIMIT dataset show that the CNN model can achieve an accuracy of 97%, whereas the CRNN and Seq2Seq models increase the classification accuracy to 99%. Further experiments show that the CRNN model is almost as accurate as the Seq2Seq model (99.1% vs. 99.6% classification accuracy, respectively) but requires about 57% less time to train (326 vs. 761 seconds per epoch).
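
The training-time comparison in the abstract can be checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the article: the function names `relative_time_saving` and `speedup_factor` are hypothetical, and the only inputs are the two per-epoch times quoted above (326 and 761 seconds).

```python
def relative_time_saving(fast_secs: float, slow_secs: float) -> float:
    """Fraction of per-epoch training time saved by the faster model."""
    return (slow_secs - fast_secs) / slow_secs


def speedup_factor(fast_secs: float, slow_secs: float) -> float:
    """How many times faster one epoch of the faster model completes."""
    return slow_secs / fast_secs


# Per-epoch training times quoted in the abstract (seconds).
crnn_secs = 326
seq2seq_secs = 761

saving = relative_time_saving(crnn_secs, seq2seq_secs)
speedup = speedup_factor(crnn_secs, seq2seq_secs)

print(f"time saved per epoch: {saving:.0%}")   # 57%
print(f"speed-up factor: {speedup:.2f}x")      # 2.33x
```

Read this way, the 57% figure corresponds to a reduction in time per epoch, which is equivalent to the CRNN completing an epoch roughly 2.3 times as fast as the Seq2Seq model.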

Keywords: deep learning, speech activity detection
Subjects: G Mathematical and Computer Sciences > G700 Artificial Intelligence
G Mathematical and Computer Sciences > G760 Machine Learning
G Mathematical and Computer Sciences > G710 Speech and Natural Language Processing
Divisions: College of Science > School of Computer Science
ID Code: 49800
Deposited On: 30 Jun 2022 13:42
