- Python 3
- Substantial computing power (a high-end GPU) and memory (both system RAM and GPU memory) are important if you would like to train your own model.
- Required packages and their uses are listed in `requirements.txt`.
Data is collected from the Ted2srt website. Run `python3 scraper/preprocess.py` from the root directory to scrape and generate the dataset.
The script will:
- Scrape data from the website.
- Preprocess the data.
- Split the data into train-dev-test sets.
Scraped data is saved at `scraper/data/`, and processed data is saved to `data/`. Alternatively, download the preprocessed data here.
To train each model:
- In the root directory, run the command `python3 main.py --config config/<dataset>/<config_file>.yaml --njobs 8`.

Set `<dataset>` to `ted` to use our scraped data, or to `libri` to use the public data from OpenSLR.
Configuration files are organized as follows.

RNN classifier with different extractors:

| Extractor | Classifier | Configuration file |
|---|---|---|
| MLP | RNN | `mlp_rnn.yaml` |
| CNN | RNN | `cnn_rnn.yaml` |
| ANN | RNN | `ann_rnn.yaml` |
| RNN | RNN | `rnn_rnn.yaml` |

CNN extractor with different classifiers:

| Extractor | Classifier | Configuration file |
|---|---|---|
| CNN | MLP | `cnn_mlp.yaml` |
| CNN | CNN | `cnn_cnn.yaml` |
| CNN | ANN | `cnn_ann.yaml` |
Experiment results are stored at `experiment_results.md`.
There are two main subcomponents: the extractor and the classifier. The extractor maps the audio features of each frame into a latent representation.
For our experiments, we first fix the classifier to be an RNN and compare how the four NN variants (MLP, CNN, ANN, RNN) perform as the extractor. Second, we fix the extractor to be a CNN and replace the classifier with the NN variants.
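For orientation, here is a minimal PyTorch sketch of how an extractor and a classifier might be composed, mirroring the first set of experiments (a fixed RNN classifier with a swappable extractor). The class names, layer sizes, and the vocabulary size of 29 are illustrative assumptions, not the project's actual code.

```python
import torch
import torch.nn as nn

class MLPExtractor(nn.Module):
    """Toy extractor: maps each 40-dim filter-bank frame to a latent vector."""
    def __init__(self, feat_dim=40, latent_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(feat_dim, latent_dim), nn.ReLU())

    def forward(self, x):              # x: (batch, time, feat_dim)
        return self.net(x)             # (batch, time, latent_dim)

class RNNClassifier(nn.Module):
    """Toy classifier: emits per-timestep token logits from the latent sequence."""
    def __init__(self, latent_dim=128, vocab_size=29):   # 29 is an assumed vocab size
        super().__init__()
        self.rnn = nn.LSTM(latent_dim, 256, batch_first=True, bidirectional=True)
        self.out = nn.Linear(2 * 256, vocab_size)

    def forward(self, z):
        h, _ = self.rnn(z)
        return self.out(h)

class ASRModel(nn.Module):
    """Extractor + classifier; the experiments swap one part and fix the other."""
    def __init__(self, extractor, classifier):
        super().__init__()
        self.extractor, self.classifier = extractor, classifier

    def forward(self, x):
        return self.classifier(self.extractor(x))

model = ASRModel(MLPExtractor(), RNNClassifier())
logits = model(torch.randn(2, 523, 40))   # -> (2, 523, 29)
```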
Preprocesses the scraped data for input into `Dataset` and `DataLoader`. This includes cleaning the data, cutting the audio into multiple slices according to the SRT-annotated times, and preparing a label for each audio slice.
- Input: scraped audio files and SRT files.
- Output: a preprocessed dataset ready for ASR.
Symbols are removed from the labels and the text is converted to lowercase. Less accurate data are removed: SRT entries that start at the same time as another entry are checked (e.g. `00:00:12,820 --> 00:00:14,820`), SRT files that do not include the introduction music time are filtered out, and laughter and applause annotations are removed.
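A minimal sketch of this kind of cleaning, with hypothetical helper names; the exact rules live in the project's preprocessing code.

```python
import re

def clean_transcript(text: str) -> str:
    """Lowercase the label, drop (Laughter)/(Applause) annotations and symbols."""
    text = re.sub(r"\((?:laughter|applause)\)", " ", text, flags=re.IGNORECASE)
    text = re.sub(r"[^a-z' ]", " ", text.lower())   # keep letters, apostrophes, spaces
    return " ".join(text.split())

def has_duplicate_start_times(entries) -> bool:
    """entries: (start, end, text) tuples parsed from one SRT file."""
    starts = [start for start, _, _ in entries]
    return len(starts) != len(set(starts))

print(clean_transcript("Six months ago, I got an email (Laughter)"))
# -> six months ago i got an email
```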
Raw SRT snippet:

```
1
00:00:12,312 --> 00:00:14,493
Six months ago, I got an email

2
00:00:14,493 --> 00:00:15,900
from a man in Israel
```

is converted to

```
<audio_id>-1 six months ago i got an email
<audio_id>-2 from a man in israel
```

and stored at `<audio_id>.trans.txt`. The corresponding sliced audio files are named `<audio_id>_<audio_index>.mp3`.
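A sketch of how one talk could be sliced and written out in this format, assuming pydub for audio slicing and a pre-parsed list of SRT entries; the project's actual slicing code may differ.

```python
from pydub import AudioSegment   # assumption: pydub is used for slicing

def slice_talk(audio_path, entries, audio_id, out_dir):
    """entries: (start_ms, end_ms, cleaned_text) tuples parsed from the SRT."""
    audio = AudioSegment.from_file(audio_path)
    lines = []
    for i, (start_ms, end_ms, text) in enumerate(entries, start=1):
        # Cut the audio slice for this subtitle and save it as <audio_id>_<i>.mp3.
        audio[start_ms:end_ms].export(f"{out_dir}/{audio_id}_{i}.mp3", format="mp3")
        lines.append(f"{audio_id}-{i} {text}")
    # One transcript file per talk: <audio_id>.trans.txt.
    with open(f"{out_dir}/{audio_id}.trans.txt", "w") as f:
        f.write("\n".join(lines) + "\n")
```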
After `build_dataset()` has preprocessed the data, the data is split into train-dev-test sets.
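A minimal sketch of such a split; the 80/10/10 ratio and the random seed are assumptions, not the project's settings.

```python
import random

def train_dev_test_split(items, dev_frac=0.1, test_frac=0.1, seed=0):
    """Shuffle the samples and split them into train/dev/test lists."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_dev, n_test = int(len(items) * dev_frac), int(len(items) * test_frac)
    dev, test = items[:n_dev], items[n_dev:n_dev + n_test]
    train = items[n_dev + n_test:]
    return train, dev, test

train, dev, test = train_dev_test_split(range(100))
print(len(train), len(dev), len(test))   # 80 10 10
```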
- Sample Rate: 44100
- Shape: (1, 84055)
- Dtype: torch.float32
- Max: 0.523
- Min: -0.319
- Mean: -0.000
- Std Dev: 0.081
Waveform plot of a sample audio signal with a length of 1.9 s. The duration can be obtained as `signal_frames / sample_rate`.
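A minimal sketch of loading an audio slice with torchaudio and computing these statistics and the duration; the file path is a placeholder.

```python
import torchaudio

# Placeholder path for one of the sliced audio files.
signal, sample_rate = torchaudio.load("data/<audio_id>_1.mp3")

print("Sample Rate:", sample_rate)       # e.g. 44100
print("Shape:", tuple(signal.shape))     # (channels, frames), e.g. (1, 84055)
print("Dtype:", signal.dtype)            # torch.float32
print("Max: %.3f, Min: %.3f, Mean: %.3f, Std Dev: %.3f"
      % (signal.max(), signal.min(), signal.mean(), signal.std()))
print("Duration: %.2fs" % (signal.shape[1] / sample_rate))   # frames / sample rate
```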
The steps to compute filter banks are motivated by how humans perceive audio signals [1]; a sketch of these steps follows the list.
- Apply a pre-emphasis filter to the audio signal (amplify the high frequencies, since high frequencies have smaller magnitudes).
- Cut the signal into window frames (assume the signal is stationary over a short period of time).
- Compute the power spectrum of the signal (Fourier transform) for each window.
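An illustrative numpy sketch of these three steps (the mel filtering itself is handled by the Kaldi `fbank` call below); the 25 ms window and 10 ms hop are taken from that call, everything else is an assumption.

```python
import numpy as np

def power_spectrum_frames(signal, sample_rate, frame_ms=25, hop_ms=10, pre_emphasis=0.97):
    # 1. Pre-emphasis: amplify the high frequencies.
    emphasized = np.append(signal[0], signal[1:] - pre_emphasis * signal[:-1])

    # 2. Framing: cut into short, overlapping, windowed frames that can be
    #    assumed stationary.
    frame_len, hop_len = int(sample_rate * frame_ms / 1000), int(sample_rate * hop_ms / 1000)
    n_frames = 1 + (len(emphasized) - frame_len) // hop_len
    frames = np.stack([emphasized[i * hop_len:i * hop_len + frame_len]
                       for i in range(n_frames)]) * np.hamming(frame_len)

    # 3. Power spectrum of each frame via the FFT.
    return (np.abs(np.fft.rfft(frames)) ** 2) / frame_len

spec = power_spectrum_frames(np.random.randn(44100), 44100)
print(spec.shape)   # (n_frames, frame_len // 2 + 1)
```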
The Kaldi filter bank transformation is applied to the audio signal; 40 mel coefficients are kept.

```python
import torchaudio

feat_dim = 40  # number of mel bins kept per frame
# 25 ms frames with a 10 ms shift -> one 40-dim filter-bank vector per frame
waveform_trans = torchaudio.compliance.kaldi.fbank(signal, frame_length=25, frame_shift=10, num_mel_bins=feat_dim)
plot_spectrogram(waveform_trans.transpose(0, 1).detach(), title="Filter Banks", ylabel='mel bins')
```
The extractor generates a sequence of feature vectors; each feature vector is extracted from a small, overlapping window of audio frames, so the extractor transforms the input feature sequence into a shorter latent sequence. That is, the extractor downsamples the timesteps: for example, the RNN extractor downsamples by a factor of 4, from 523 timesteps to 130. Downsampling is also achieved by the MaxPooling layers of the CNN extractors.
The classifier then generates the output sequence from the extracted features. A sketch of the downsampling is shown below.
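A minimal sketch of timestep downsampling by a factor of 4 using MaxPooling, as in the CNN extractors; the channel sizes are assumptions.

```python
import torch
import torch.nn as nn

# Two MaxPool layers halve the timesteps twice: 523 -> 261 -> 130.
extractor = nn.Sequential(
    nn.Conv1d(40, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
    nn.Conv1d(128, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.MaxPool1d(kernel_size=2),
)

feats = torch.randn(1, 40, 523)      # (batch, mel bins, timesteps)
print(extractor(feats).shape)        # torch.Size([1, 128, 130])
```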
Original README can be accessed here.
- Liu Shiru (A0187939A)
- Lim Yu Rong, Samuel (A0183921A)
- Yee Xun Wei (A0228597L)
- Liu, A., Lee, H.-Y., & Lee, L.-S. (2019). Adversarial Training of End-to-end Speech Recognition Using a Criticizing Language Model. Acoustics, Speech and Signal Processing (ICASSP). IEEE.
- Liu, A. H., Sung, T.-W., Chuang, S.-P., Lee, H.-Y., & Lee, L.-S. (2019). Sequence-to-sequence Automatic Speech Recognition with Word Embedding Regularization and Fused Decoding. arXiv [cs.CL]. Retrieved from http://arxiv.org/abs/1910.12740