Keep the user's data safe.
Recite specific Ayahs (verses) from a Surah (chapter).
Listen to recorded Quranic recitations by Imams.
Repeat and practice memorized verses.
Time Delay Neural Networks (TDNNs) are a cornerstone of traditional and hybrid speech recognition systems. Combining TDNNs with Convolutional Neural Networks (CNNs) has led to state-of-the-art results in end-to-end speech recognition. One such model, Jasper (Just Another SPEech Recognizer), developed by NVIDIA in 2019, uses Connectionist Temporal Classification (CTC) loss and a block architecture of convolutional sub-blocks with residual connections. The residual connections improve gradient and data flow compared to purely sequential models.
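To make the block structure concrete, here is a minimal sketch of a Jasper-style block, assuming PyTorch; the channel count, kernel size, and number of sub-blocks are illustrative placeholders, and the residual handling is simplified relative to the published model.

```python
import torch
import torch.nn as nn

class JasperBlock(nn.Module):
    """R convolutional sub-blocks wrapped in a residual connection."""

    def __init__(self, channels=256, kernel_size=11, repeat=3, dropout=0.2):
        super().__init__()
        self.sub_blocks = nn.ModuleList(
            nn.Sequential(
                nn.Conv1d(channels, channels, kernel_size,
                          padding=kernel_size // 2),
                nn.BatchNorm1d(channels),
                nn.ReLU(),
                nn.Dropout(dropout),
            )
            for _ in range(repeat)
        )

    def forward(self, x):                 # x: (batch, channels, time)
        out = x
        for sub in self.sub_blocks:
            out = sub(out)
        # residual connection: add the block input to the block output
        # (the published Jasper adds it before the final activation)
        return out + x

block = JasperBlock()
print(block(torch.randn(4, 256, 100)).shape)  # torch.Size([4, 256, 100])
```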
Jasper set benchmarks in English speech recognition but demanded significant computational resources due to its 200+ million parameters. To address this, QuartzNet was introduced as a leaner variant: it retains Jasper's block architecture but replaces standard convolutions with depthwise separable convolutions, which factor each convolution into a per-channel (depthwise) filter over time and a pointwise (1x1) convolution that mixes channels, sharply reducing parameter count and computation.
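As an illustration of the efficiency gain, the sketch below implements a 1-D depthwise separable convolution of the kind QuartzNet uses, assuming PyTorch; the channel and kernel sizes are illustrative. For 64-to-128 channels with a kernel of 33, a standard convolution needs about 270k weights, while the depthwise-plus-pointwise pair needs roughly 10k.

```python
import torch
import torch.nn as nn

class DepthwiseSeparableConv1d(nn.Module):
    def __init__(self, in_ch, out_ch, kernel_size):
        super().__init__()
        # depthwise: one filter per input channel (groups=in_ch),
        # covering the time dimension only
        self.depthwise = nn.Conv1d(in_ch, in_ch, kernel_size,
                                   padding=kernel_size // 2, groups=in_ch)
        # pointwise: 1x1 convolution that mixes information across channels
        self.pointwise = nn.Conv1d(in_ch, out_ch, kernel_size=1)

    def forward(self, x):                  # x: (batch, channels, time)
        return self.pointwise(self.depthwise(x))

conv = DepthwiseSeparableConv1d(64, 128, kernel_size=33)
print(conv(torch.randn(8, 64, 200)).shape)  # torch.Size([8, 128, 200])
# standard Conv1d(64, 128, 33): ~270k weights; this pair: ~10k weights
```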
In our project, QuartzNet is applied to recognize Classical Arabic speech, a pioneering use of the model in this context. Its architecture comprises an encoder of 15 repeated blocks and a linear decoder that maps output probabilities to character sequences, trained with CTC loss. The Novograd optimizer, which normalizes gradients layer-wise and decouples weight decay, keeps training efficient and stable.
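At inference time, the decoder's per-frame probabilities can be turned into text with greedy CTC decoding: pick the most likely label per frame, collapse repeated labels, and drop blanks. Below is a minimal sketch assuming PyTorch; the toy vocabulary is a hypothetical stand-in for our actual Arabic character set.

```python
import torch

BLANK = 0
VOCAB = {1: "a", 2: "b", 3: "c"}  # hypothetical id-to-character map

def ctc_greedy_decode(log_probs: torch.Tensor) -> str:
    """Pick the best label per frame, collapse repeats, drop blanks."""
    ids = log_probs.argmax(dim=-1).tolist()
    collapsed = [k for i, k in enumerate(ids) if i == 0 or k != ids[i - 1]]
    return "".join(VOCAB[k] for k in collapsed if k != BLANK)

log_probs = torch.randn(50, 4).log_softmax(dim=-1)  # 50 frames, 4 labels
print(ctc_greedy_decode(log_probs))
```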
Recurrent Neural Networks (RNNs) are pivotal in end-to-end speech recognition, converting audio spectrograms into text. The original Deep Speech model exemplifies this with a mix of non-recurrent and recurrent layers, but it scales poorly to large datasets.
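For context, the sketch below shows how raw audio is typically converted into the mel-spectrogram features such models consume, assuming torchaudio; the parameter values are common defaults, not necessarily those used by either model.

```python
import torch
import torchaudio

waveform = torch.randn(1, 16000)  # one second of dummy 16 kHz audio
to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=16000, n_fft=400, hop_length=160, n_mels=80)
features = to_mel(waveform)       # (1, 80, frames) mel-spectrogram
print(features.shape)             # torch.Size([1, 80, 101])
```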
Deep Speech 2 improves upon this with multiple CNN and RNN layers, achieving the lowest Word Error Rate (WER) among similar models. Our RNN-CTC model, inspired by Deep Speech 2, features an encoder with four CNN layers, five Bidirectional GRU (BiGRU) layers, and a fully connected layer. This architecture balances feature extraction and frame prediction while being computationally efficient.
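The sketch below wires up this architecture in PyTorch; the channel counts, hidden size, and vocabulary size are illustrative placeholders rather than our trained configuration.

```python
import torch
import torch.nn as nn

class RNNCTCModel(nn.Module):
    def __init__(self, n_mels=80, hidden=512, vocab_size=40):
        super().__init__()
        # four CNN layers extracting local features from the spectrogram
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, kernel_size=3, padding=1), nn.ReLU(),
        )
        # five bidirectional GRU layers modelling temporal context
        self.bigru = nn.GRU(input_size=32 * n_mels, hidden_size=hidden,
                            num_layers=5, bidirectional=True,
                            batch_first=True)
        # fully connected layer producing per-frame character scores
        self.fc = nn.Linear(2 * hidden, vocab_size)

    def forward(self, spec):                  # spec: (batch, 1, n_mels, time)
        x = self.cnn(spec)                    # (batch, 32, n_mels, time)
        b, c, f, t = x.shape
        x = x.permute(0, 3, 1, 2).reshape(b, t, c * f)
        x, _ = self.bigru(x)
        return self.fc(x).log_softmax(dim=-1)  # log-probs for nn.CTCLoss

model = RNNCTCModel()
print(model(torch.randn(2, 1, 80, 100)).shape)  # torch.Size([2, 100, 40])
```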
The AdamW optimizer, a variant of Adam that decouples weight decay from the gradient update, is employed to improve convergence and generalization.
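A minimal usage sketch, assuming PyTorch's built-in torch.optim.AdamW and an illustrative stand-in model; the hyperparameter values are placeholders, not our tuned settings.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)  # stand-in model for illustration
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=1e-2)

# one illustrative step: weight decay is applied to the weights directly,
# not folded into the Adam gradient moments
loss = model(torch.randn(4, 10)).sum()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```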
These models represent significant advancements in speech recognition technology, demonstrating robust performance in various applications.
Our Team
Our Supervisors
Dr. Hanaa Bayomi
Assistant Professor, Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University
h.mobarz@fci-cu.edu.eg
Amany M. Hesham
Assistant Lecturer, Computer Science Department, Faculty of Computers and Artificial Intelligence, Cairo University
amany@fci-cu.edu.eg