Keep the user's data safe.
Recite specific Ayahs (verses) from a Surah (chapter).
Listen to recorded Quranic recitations by Imams.
Repeat and practice memorized verses.
Time Delay Neural Networks (TDNNs) are a cornerstone of traditional and hybrid speech recognition systems. Combining TDNNs with Convolutional Neural Networks (CNNs) has led to state-of-the-art results in end-to-end speech recognition. One such model, Jasper (Just Another SPEech Recognizer), developed by NVIDIA in 2019, uses Connectionist Temporal Classification (CTC) loss and a block architecture of convolutional sub-blocks with residual connections. The residual connections improve gradient and data flow compared to purely sequential models.
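To make the block structure concrete, here is a minimal sketch of a Jasper-style block, assuming PyTorch; the channel count, kernel size, and number of sub-blocks are illustrative placeholders, and the residual handling is simplified relative to the published model.

```python
import torch
import torch.nn as nn

class JasperBlock(nn.Module):
    """R convolutional sub-blocks wrapped in a residual connection."""

    def __init__(self, channels=256, kernel_size=11, repeat=3, dropout=0.2):
        super().__init__()
        self.sub_blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Dropout(dropout),
            )
            for _ in range(repeat)
        )

    def forward(self, x):                 # x: (batch, channels, time)
        out = x
        for sub in self.sub_blocks:
            out = sub(out)
        # residual connection: add the block input to the block output
        # (the published Jasper adds it before the final activation)
        return out + x

block = JasperBlock()
print(block(torch.randn(4, 256, 100)).shape)  # torch.Size([4, 256, 100])
```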
Jasper set benchmarks in English speech recognition but demanded significant computational resources due to its 200+ million parameters. To address this, QuartzNet was introduced as a leaner variant: it retains Jasper's block architecture but replaces standard convolutions with depthwise separable convolutions, which factor each convolution into a per-channel (depthwise) filter over time and a pointwise (1x1) convolution that mixes channels, sharply reducing parameter count and computation.
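As an illustration of the efficiency gain, the sketch below implements a 1-D depthwise separable convolution of the kind QuartzNet uses, assuming PyTorch; the channel and kernel sizes are illustrative. For 64-to-128 channels with a kernel of 33, a standard convolution needs about 270k weights, while the depthwise-plus-pointwise pair needs roughly 10k.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        # depthwise: one filter per input channel (groups=in_ch),
        # covering the time dimension only
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 convolution that mixes information across channels
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):                  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

conv = DepthwiseSeparableConv1d(64, 128, kernel_size=33)
print(conv(torch.randn(8, 64, 200)).shape)  # torch.Size([8, 128, 200])
# standard Conv1d(64, 128, 33): ~270k weights; this pair: ~10k weights
```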
In our project, QuartzNet is applied to recognize Classical Arabic speech, a pioneering use of the model in this context. Its architecture comprises an encoder of 15 repeated blocks and a linear decoder that maps output probabilities to character sequences, trained with CTC loss. The Novograd optimizer, which normalizes gradients layer-wise and decouples weight decay, keeps training efficient and stable.
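At inference time, the decoder's per-frame probabilities can be turned into text with greedy CTC decoding: pick the most likely label per frame, collapse repeated labels, and drop blanks. Below is a minimal sketch assuming PyTorch; the toy vocabulary is a hypothetical stand-in for our actual Arabic character set.

```python
import torch

BLANK = 0
VOCAB = {1: "a", 2: "b", 3: "c"}  # hypothetical id-to-character map

def ctc_greedy_decode(log_probs: torch.Tensor) -> str:
    """Pick the best label per frame, collapse repeats, drop blanks."""
    ids = log_probs.argmax(dim=-1).tolist()
    collapsed = [k for i, k in enumerate(ids) if i == 0 or k != ids[i - 1]]
    return "".join(VOCAB[k] for k in collapsed if k != BLANK)

log_probs = torch.randn(50, 4).log_softmax(dim=-1)  # 50 frames, 4 labels
print(ctc_greedy_decode(log_probs))
```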
Recurrent Neural Networks (RNNs) are pivotal in end-to-end speech recognition, converting audio spectrograms into text. The original Deep Speech model exemplifies this with a mix of non-recurrent and recurrent layers, but it scales poorly to large datasets.
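For context, the sketch below shows how raw audio is typically converted into the mel-spectrogram features such models consume, assuming torchaudio; the parameter values are common defaults, not necessarily those used by either model.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
features = to_mel(waveform)       # (1, 80, frames) mel-spectrogram
print(features.shape)             # torch.Size([1, 80, 101])
```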
Deep Speech 2 improves upon this with multiple CNN and RNN layers, achieving the lowest Word Error Rate (WER) among similar models. Our RNN-CTC model, inspired by Deep Speech 2, features an encoder with four CNN layers, five Bidirectional GRU (BiGRU) layers, and a fully connected layer. This architecture balances feature extraction and frame prediction while being computationally efficient.
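The sketch below wires up this architecture in PyTorch; the channel counts, hidden size, and vocabulary size are illustrative placeholders rather than our trained configuration.

```python
import torch
import torch.nn as nn

class RNNCTCModel(nn.Module):
    def __init__(self, n_mels=80, hidden=512, vocab_size=40):
        super().__init__()
        # four CNN layers extracting local features from the spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # five bidirectional GRU layers modelling temporal context
        self.bigru = nn.GRU(input_size=32 * n_mels, hidden_size=hidden,
                            num_layers=5, bidirectional=True,
                            batch_first=True)
        # fully connected layer producing per-frame character scores
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                    # (batch, 32, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.bigru(x)
        return self.fc(x).log_softmax(dim=-1)  # log-probs for nn.CTCLoss

model = RNNCTCModel()
print(model(torch.randn(2, 1, 80, 100)).shape)  # torch.Size([2, 100, 40])
```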
The AdamW optimizer, a variant of Adam that decouples weight decay from the gradient update, is employed to improve convergence and generalization.
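A minimal usage sketch, assuming PyTorch's built-in torch.optim.AdamW and an illustrative stand-in model; the hyperparameter values are placeholders, not our tuned settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

# one illustrative step: weight decay is applied to the weights directly,
# not folded into the Adam gradient moments
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```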
These models represent significant advancements in speech recognition technology, demonstrating robust performance in various applications.
Our Team
Our Supervisors
Dr. Hanaa Bayomi
Assistant Professor, Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University
h.mobarz@fci-cu.edu.eg
Amany M. Hesham
Assistant Lecturer, Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University
amany@fci-cu.edu.eg