Conventional speech recognition systems have been studied for decades and traditionally use GMM-HMM systems for acoustic modelling and n-grams for language modelling. However, these models rest on stringent and sometimes questionable assumptions (e.g. the conditional-independence assumption on observations in HMMs). Recently, deep bidirectional LSTM language models and pure RNN-CTC acoustic models have been shown to outperform the classical systems. I plan to study these end-to-end ASR systems in detail and implement them as part of my Bachelor's Thesis Project.
First, I present a detailed literature survey, asking and answering questions ranging from what the problems with conventional systems were and what the first proposed end-to-end systems looked like, to what the current state of the art in this field is.
Slides 1: Here I discuss the various codes/libraries available for end-to-end ASR and briefly review some recent system implementations.
Slides 2: To start the presentation, I go all the way back to 2006, when CTC was first introduced. Building from that, I discuss the following papers in detail:
1) Connectionist Temporal Classification: Labelling Unsegmented Sequence Data with Recurrent Neural Networks (https://www.cs.toronto.edu/~graves/icml_2006.pdf)
2) Sequence Transduction with Recurrent Neural Networks (https://arxiv.org/pdf/1211.3711.pdf)
Following is a list of some interesting resources pertaining to CTC that I found while researching:
- https://distill.pub/2017/ctc/
- https://gab41.lab41.org/speech-recognition-you-down-with-ctc-8d3b558943f0
- https://www.youtube.com/watch?v=c86gfVGcvh4
- https://thomasmesnard.github.io/Thomas_Mesnard_Ecole_Normale_Superieure_file/CTC_Report_Mesnard_Auvolat.pdf
- https://github.com/baidu-research/warp-ctc#introduction
- https://web.stanford.edu/class/cs224s/lectures/224s.17.lec8.pdf
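To make the CTC objective from the Graves et al. (2006) paper concrete, here is a minimal pure-Python sketch of the forward recursion that computes P(l|x) from per-frame label distributions. This is my own toy illustration, not code from any of the papers or libraries above; the blank index and function names are assumptions. A brute-force path enumeration is included as a sanity check that the dynamic program sums exactly the paths that collapse to the target labelling.

```python
from itertools import product

BLANK = 0  # assumed index reserved for the CTC blank symbol


def ctc_forward(probs, label):
    """P(label | x) via the CTC forward recursion.

    probs: T x V list of per-frame distributions (rows sum to 1).
    label: list of target symbol indices (without blanks).
    """
    ext = [BLANK]
    for c in label:
        ext += [c, BLANK]              # interleave blanks: l' of length 2L+1
    S, T = len(ext), len(probs)
    alpha = [[0.0] * S for _ in range(T)]
    alpha[0][0] = probs[0][BLANK]      # start in blank ...
    if S > 1:
        alpha[0][1] = probs[0][ext[1]]  # ... or in the first label
    for t in range(1, T):
        for s in range(S):
            a = alpha[t - 1][s]                     # stay
            if s > 0:
                a += alpha[t - 1][s - 1]            # advance one step
            if s > 1 and ext[s] != BLANK and ext[s] != ext[s - 2]:
                a += alpha[t - 1][s - 2]            # skip a blank
            alpha[t][s] = a * probs[t][ext[s]]
    # paths may end on the last label or the trailing blank
    return alpha[T - 1][S - 1] + (alpha[T - 1][S - 2] if S > 1 else 0.0)


def collapse(path):
    """The CTC map B: merge repeated symbols, then drop blanks."""
    out, prev = [], None
    for c in path:
        if c != prev:
            out.append(c)
        prev = c
    return [c for c in out if c != BLANK]


def brute_force(probs, label):
    """Sum the probability of every frame-level path collapsing to label."""
    T, V = len(probs), len(probs[0])
    total = 0.0
    for path in product(range(V), repeat=T):
        if collapse(path) == list(label):
            p = 1.0
            for t, c in enumerate(path):
                p *= probs[t][c]
            total += p
    return total
```

With T = 4 frames and a 3-symbol vocabulary, `ctc_forward(probs, [1, 2])` agrees with `brute_force(probs, [1, 2])` to floating-point precision, which is exactly the efficiency argument of the paper: the forward recursion replaces an exponential sum over paths with an O(T·S) dynamic program.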
Slides 3: I discuss, in chronological order, the developments in end-to-end speech recognition systems. The following papers are covered in the slides given below:
1) Speech Recognition with Deep Recurrent Neural Networks (https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=6638947)
2) End-to-end Continuous Speech Recognition using Attention-based Recurrent NN: First Results (https://arxiv.org/pdf/1412.1602.pdf)
3) Deep Speech: Scaling up end-to-end speech recognition (https://arxiv.org/pdf/1412.5567.pdf)
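Deep Speech decodes its network's per-frame character distributions with a beam search constrained by a language model, but the simplest alternative, best-path (greedy) decoding, just takes the argmax at every frame and collapses the result. A toy sketch of that baseline (my own illustration, with an assumed blank index of 0):

```python
BLANK = 0  # assumed index of the CTC blank symbol


def greedy_decode(probs):
    """Best-path CTC decoding: argmax per frame, merge repeats,
    then drop blanks. probs is a T x V list of distributions."""
    best = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    out, prev = [], None
    for c in best:
        if c != prev and c != BLANK:   # new symbol that is not blank
            out.append(c)
        prev = c
    return out
```

Greedy decoding picks the single most probable path rather than the most probable labelling, which is why a beam search over labellings (optionally rescored with an n-gram LM, as in Deep Speech) usually gives lower error rates.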
Slides 4: Discussion of the following papers:
1) Towards End-to-End Speech Recognition with Recurrent Neural Networks (link: http://proceedings.mlr.press/v32/graves14.pdf)
2) Listen, Attend and Spell: A Neural Network for LVCSR (link: https://ai.google/research/pubs/pub44926)
Slides 5: Digging deep into attention - why it is useful, how it works, and why it has become an integral part of end-to-end encoder-decoder based ASR systems.
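As a concrete illustration of the mechanism those slides cover, here is a minimal pure-Python sketch of scaled dot-product attention, my own toy example rather than code from any of the systems above. The decoder state (query) is scored against each encoder output (key), the scores are softmax-normalised into alignment weights, and the context vector is the weighted sum of the encoder outputs (values):

```python
import math


def attend(query, keys, values):
    """Scaled dot-product attention over a sequence of encoder states.

    query: list[float] (decoder state), keys/values: list[list[float]]
    (one vector per encoder time step). Returns (context, weights).
    """
    d = len(query)
    # alignment scores: dot(query, key) / sqrt(d)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # softmax over the time axis (shifted by the max for stability)
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    # context vector: attention-weighted sum of the value vectors
    context = [sum(w * v[i] for w, v in zip(weights, values))
               for i in range(len(values[0]))]
    return context, weights
```

The weights form a distribution over encoder time steps, so at each decoding step the model can "look back" at the most relevant frames instead of compressing the whole utterance into one fixed vector, which is the core advantage the attention-based papers above exploit.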
My B.Tech. thesis, discussing the experiments performed and their corresponding results in detail, is presented below. We were able to achieve state-of-the-art results, and our work was accepted as two papers at NCC'20 and SPCOM'20.
Following is a poster we prepared for our thesis defence: