End-to-end Automatic Speech Recognition Systems - PyTorch Implementation

Module Code: CS5242

Semester: AY2021-22 Sem 1

Group 40

Liu Shiru (A0187939A)
Lim Yu Rong, Samuel (A0183921A)
Yee Xun Wei (A0228597L)

Dependencies

Python 3
Computing power (a high-end GPU) and memory (both system RAM and GPU memory) are extremely important if you'd like to train your own model.
Required packages and their uses are listed in requirements.txt.

Dataset

Data is collected from the Ted2srt webpage.

Run python3 scraper/preprocess.py from the root directory to scrape the data and generate the dataset. The script will:

Scrape data from website.
Preprocess the data.
Split the data into train, dev, and test sets.

Scraped data is saved at scraper/data/, and processed data is saved to data/. Alternatively, download preprocessed data here.
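The split step can be pictured with a minimal sketch like the one below. It is not the code in scraper/preprocess.py: the function name split_dataset, the 80/10/10 ratio, the fixed seed, and the talk IDs are all assumptions made for illustration.

```python
import random


def split_dataset(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle items reproducibly and split them into train/dev/test lists.

    The fractions and the seed are illustrative assumptions, not values
    taken from the repository.
    """
    if dev_frac < 0 or test_frac < 0 or dev_frac + test_frac >= 1:
        raise ValueError("dev_frac and test_frac must be >= 0 and sum to < 1")
    shuffled = list(items)
    # A private Random instance keeps the split independent of global state.
    random.Random(seed).shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test


# Hypothetical talk identifiers standing in for the scraped items.
talk_ids = [f"talk_{i:03d}" for i in range(100)]
train, dev, test = split_dataset(talk_ids)
print(len(train), len(dev), len(test))
```

Fixing the seed means repeated runs produce the same three sets, which keeps results comparable across models.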

Training

To train each model:

In the root directory, run the command python3 main.py --config config/<dataset>/<config_file>.yaml --njobs 8.

Configuration Files

Set <dataset> to ted to use our scraped data, or to libri to use public data from OpenSLR.

Configuration files are stored as:

Train Extractors

Extractor	Classifier	Configuration file
MLP	RNN	mlp_rnn.yaml
CNN	RNN	cnn_rnn.yaml
ANN	RNN	ann_rnn.yaml
RNN	RNN	rnn_rnn.yaml

Train Classifiers

Extractor	Classifier	Configuration file
CNN	MLP	cnn_mlp.yaml
CNN	CNN	cnn_cnn.yaml
CNN	ANN	cnn_ann.yaml
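To train every combination in the two tables, the training command can be assembled per configuration file. The following is a small sketch that only builds and prints the commands; the helper name build_train_command is hypothetical, while the file names, the config/<dataset>/ layout, and the --njobs 8 flag come from the sections above.

```python
import shlex

# Configuration files from the two tables above.
EXTRACTOR_CONFIGS = ["mlp_rnn.yaml", "cnn_rnn.yaml", "ann_rnn.yaml", "rnn_rnn.yaml"]
CLASSIFIER_CONFIGS = ["cnn_mlp.yaml", "cnn_cnn.yaml", "cnn_ann.yaml"]


def build_train_command(dataset, config_file, njobs=8):
    """Return the training command as an argument list (hypothetical helper)."""
    if dataset not in ("ted", "libri"):
        raise ValueError("dataset must be 'ted' or 'libri'")
    return [
        "python3", "main.py",
        "--config", f"config/{dataset}/{config_file}",
        "--njobs", str(njobs),
    ]


commands = [
    build_train_command("ted", config_file)
    for config_file in EXTRACTOR_CONFIGS + CLASSIFIER_CONFIGS
]
for command in commands:
    # shlex.join renders the argument list as a copy-pasteable shell line.
    print(shlex.join(command))
```

Each argument list can be passed to subprocess.run(command, check=True) from the root directory to launch the runs one after another.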

Experiment results

Experiment results are stored at experiment_results.md.

Model Architecture

There are two main subcomponents. The first is the extractor, which further extracts the audio features of every frame into a latent representation $h$. The second is the classifier, which takes the latent representation and makes a prediction for each frame by classifying it into a predefined set of word tokens such as “a”, “the”, “-tion”, etc. Lastly, the beam search decoding algorithm decodes the raw classification results into a sentence. A typical ASR system has a CNN extractor and an RNN classifier.
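The decoding step can be illustrated with a simplified beam search over per-frame classification scores. This is a sketch under stated assumptions, not the decoder used in this repository: it scores each frame independently, has no blank-token handling and no language model, and the function name, vocabulary, and probabilities are made up for the example.

```python
import math


def beam_search(frame_log_probs, vocab, beam_size=2):
    """Return the beam_size best (tokens, log_score) pairs, best first.

    frame_log_probs holds one list of log-probabilities per frame,
    aligned with vocab. Simplified for illustration.
    """
    beams = [((), 0.0)]  # start with one empty hypothesis
    for log_probs in frame_log_probs:
        # Extend every surviving hypothesis by every token.
        candidates = [
            (tokens + (vocab[i],), score + lp)
            for tokens, score in beams
            for i, lp in enumerate(log_probs)
        ]
        # Keep only the highest-scoring hypotheses.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_size]
    return beams


vocab = ["a", "the", "-tion"]
# Made-up classifier outputs for two frames.
frame_probs = [[0.6, 0.3, 0.1], [0.2, 0.7, 0.1]]
frame_log_probs = [[math.log(p) for p in frame] for frame in frame_probs]
best = beam_search(frame_log_probs, vocab, beam_size=2)
for tokens, score in best:
    print(tokens, round(math.exp(score), 2))
```

With independent per-frame scores the top hypothesis matches a greedy pick per frame. Keeping several hypotheses becomes useful once the score also depends on previously chosen tokens, for example when a language model is added.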

For our experiments, we first fix the classifier to be an RNN and compare how the 4 NN variants perform as the extractor.

Second, we fix the extractor to be a CNN and replace the classifier with the 4 NN variants.

Original README can be accessed here.

Reference

Liu, A., Lee, H.-Y., & Lee, L.-S. (2019). Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model. Acoustics, Speech and Signal Processing (ICASSP). IEEE.
Liu, A. H., Sung, T.-W., Chuang, S.-P., Lee, H.-Y., & Lee, L.-S. (2019). Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding. arXiv [cs.CL]. Retrieved from http://arxiv.org/abs/1910.12740
