forked from facebookresearch/CPC_audio

Commit 323f1df, committed by Molugan on Feb 3, 2020 (0 parents): 52 changed files with 5,731 additions and 0 deletions.
**.gitignore**

```
__pycache__
*.pyc
.idea
```
# Code of Conduct

## Our Pledge

In the interest of fostering an open and welcoming environment, we as
contributors and maintainers pledge to make participation in our project and
our community a harassment-free experience for everyone, regardless of age, body
size, disability, ethnicity, sex characteristics, gender identity and expression,
level of experience, education, socio-economic status, nationality, personal
appearance, race, religion, or sexual identity and orientation.

## Our Standards

Examples of behavior that contributes to creating a positive environment
include:

* Using welcoming and inclusive language
* Being respectful of differing viewpoints and experiences
* Gracefully accepting constructive criticism
* Focusing on what is best for the community
* Showing empathy towards other community members

Examples of unacceptable behavior by participants include:

* The use of sexualized language or imagery and unwelcome sexual attention or
  advances
* Trolling, insulting/derogatory comments, and personal or political attacks
* Public or private harassment
* Publishing others' private information, such as a physical or electronic
  address, without explicit permission
* Other conduct which could reasonably be considered inappropriate in a
  professional setting

## Our Responsibilities

Project maintainers are responsible for clarifying the standards of acceptable
behavior and are expected to take appropriate and fair corrective action in
response to any instances of unacceptable behavior.

Project maintainers have the right and responsibility to remove, edit, or
reject comments, commits, code, wiki edits, issues, and other contributions
that are not aligned to this Code of Conduct, or to ban temporarily or
permanently any contributor for other behaviors that they deem inappropriate,
threatening, offensive, or harmful.

## Scope

This Code of Conduct applies within all project spaces, and it also applies when
an individual is representing the project or its community in public spaces.
Examples of representing a project or community include using an official
project e-mail address, posting via an official social media account, or acting
as an appointed representative at an online or offline event. Representation of
a project may be further defined and clarified by project maintainers.

## Enforcement

Instances of abusive, harassing, or otherwise unacceptable behavior may be
reported by contacting the project team at <[email protected]>. All
complaints will be reviewed and investigated and will result in a response that
is deemed necessary and appropriate to the circumstances. The project team is
obligated to maintain confidentiality with regard to the reporter of an incident.
Further details of specific enforcement policies may be posted separately.

Project maintainers who do not follow or enforce the Code of Conduct in good
faith may face temporary or permanent repercussions as determined by other
members of the project's leadership.

## Attribution

This Code of Conduct is adapted from the [Contributor Covenant][homepage], version 1.4,
available at https://www.contributor-covenant.org/version/1/4/code-of-conduct.html

[homepage]: https://www.contributor-covenant.org

For answers to common questions about this code of conduct, see
https://www.contributor-covenant.org/faq
# Contributing to CPC_audio
We want to make contributing to this project as easy and transparent as
possible.

## Pull Requests
We actively welcome your pull requests.

1. Fork the repo and create your branch from `master`.
2. If you've added code that should be tested, add tests.
3. If you've changed APIs, update the documentation.
4. Ensure the test suite passes.
5. Make sure your code lints.
6. If you haven't already, complete the Contributor License Agreement ("CLA").

## Contributor License Agreement ("CLA")
In order to accept your pull request, we need you to submit a CLA. You only need
to do this once to work on any of Facebook's open source projects.

Complete your CLA here: <https://code.facebook.com/cla>

## Issues
We use GitHub issues to track public bugs. Please ensure your description is
clear and has sufficient instructions to be able to reproduce the issue.

Facebook has a [bounty program](https://www.facebook.com/whitehat/) for the safe
disclosure of security bugs. In those cases, please go through the process
outlined on that page and do not file a public issue.

## Coding Style
* 2 spaces for indentation rather than tabs
* 80 character line length
* ...

## License
By contributing to CPC_audio, you agree that your contributions will be licensed
under the LICENSE file in the root directory of this source tree.
MIT License

Copyright (c) Facebook, Inc. and its affiliates.

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
# CPC_audio

This code implements the Contrastive Predictive Coding algorithm on audio data, as described in the paper [Unsupervised Pretraining Transfers well Across Languages](FILLME). CPC is an unsupervised method for learning audio features directly from the raw waveform.

The code also implements all the evaluation metrics used in the paper:
- [ABX discriminability](https://zerospeech.com/2017/track_1.html)
- [Phone and speaker linear separability](https://arxiv.org/abs/1807.03748)
- Transfer learning to other languages, using the [Common Voice datasets](https://voice.mozilla.org/en/datasets)

## Setup instructions

The installation is slightly involved due to the torch-audio dependency.

0/ Clone the repo:
`git clone [email protected]:facebookresearch/CPC_audio.git && cd CPC_audio`

1/ Install the system libraries required by torch-audio (https://github.com/pytorch/audio):
* MacOS: `brew install sox`
* Linux: `sudo apt-get install sox libsox-dev libsox-fmt-all`

2/ Create and activate the conda environment:
`conda env create -f environment.yml && conda activate cpc37`

3/ Run setup.py:
`python setup.py develop`

You can test your installation with:
`nosetests -d`

### CUDA driver

This setup assumes CUDA 9.2. If you use a different version of CUDA, please change the version of cudatoolkit in environment.yml accordingly.
For more information on which cudatoolkit version to use, please check https://pytorch.org/

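For illustration, a hypothetical excerpt of environment.yml (the file shipped with the repo is authoritative; only the cudatoolkit line should need to change):

```yaml
# Hypothetical excerpt: pin cudatoolkit to the version matching your driver.
dependencies:
  - pytorch
  - cudatoolkit=9.2   # e.g. change to 10.1 for a CUDA 10.1 driver
```
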
### Standard datasets

We suggest training the model on either [Librispeech](http://www.openslr.org/12/) or [libri-light](https://github.com/facebookresearch/libri-light).

## How to run a session

To run a new training session, use:

```bash
python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION
```

Where:
- $PATH_AUDIO_FILES is the directory containing the audio files, arranged as below:
```
PATH_AUDIO_FILES
│
└───speaker1
│    └───...
│         │ seq_11.{$EXTENSION}
│         │ seq_12.{$EXTENSION}
│         │ ...
│
└───speaker2
     └───...
          │ seq_21.{$EXTENSION}
          │ seq_22.{$EXTENSION}
```

Please note that each speaker directory can contain an arbitrary number of subdirectories: the speaker label is always retrieved from the top one. The names of the files are not relevant. For a concrete example, you can look at the organization of the [Librispeech](http://www.openslr.org/12/) dataset.

- $PATH_CHECKPOINT_DIR is the directory where the checkpoints will be saved
- $TRAINING_SET is the path to a .txt file containing the list of the training sequences (see [here](https://drive.google.com/drive/folders/1BhJ2umKH3whguxMwifaKtSra0TgAbtfb) for an example, and the hypothetical excerpt after this list)
- $VAL_SET is the path to a .txt file containing the list of the validation sequences
- $EXTENSION is the extension of each audio file

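For illustration only, such a file is assumed to contain one sequence name per line, i.e. the audio file name without its extension:

```
seq_11
seq_12
seq_21
seq_22
```
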
## Custom architectures

The code allows you to train a wide range of architectures. For example, to train the CPC method exactly as described in [van den Oord's paper](https://arxiv.org/abs/1807.03748), just run:

```bash
python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --normMode batchNorm --rnnMode linear
```

Or, if you want to train a model with a feed-forward (FFD) prediction network instead of a transformer:
```bash
python cpc/train.py --pathDB $PATH_AUDIO_FILES --pathCheckpoint $PATH_CHECKPOINT_DIR --pathTrain $TRAINING_SET --pathVal $VAL_SET --file_extension $EXTENSION --rnnMode ffd --schedulerRamp 10
```

The --schedulerRamp option adds a learning rate ramp at the beginning of training: it barely affects the performance of a model with a transformer predictor, but it is necessary with other models.

Run `python cpc/train.py -h` to see all the available options.

## How to restart a session

To restart a session from the last saved checkpoint, just run:
```bash
python cpc/train.py --pathCheckpoint $PATH_CHECKPOINT_DIR
```

## How to run an evaluation session

All evaluation scripts can be found in cpc/eval/.

### Linear separability

After training, the CPC model can output high-level features for a variety of tasks. For an input audio file sampled at 16kHz, the provided baseline model outputs a 256-dimensional feature vector every 10ms. We provide two linear separability tests, one for speakers and one for phonemes, in which a linear classifier is trained on top of the CPC features with aligned labels and evaluated on a held-out test set.

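To make the protocol concrete, here is a minimal sketch of the linear-probe idea, with random placeholder features and a placeholder phone-class count (illustrative only; not the repository's exact implementation):

```python
import torch

# 1,000 frames of 256-dimensional CPC features (one frame per 10ms), kept frozen.
features = torch.randn(1000, 256)
labels = torch.randint(0, 40, (1000,))   # 40 phone classes: a placeholder count

probe = torch.nn.Linear(256, 40)         # the only trainable module
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)

for _ in range(10):                      # a few probe-training steps
    loss = torch.nn.functional.cross_entropy(probe(features), labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```
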
Train / val splits as well as phone alignments for librispeech-100h can be found [here](https://drive.google.com/drive/folders/1BhJ2umKH3whguxMwifaKtSra0TgAbtfb).

Speaker separability:

```bash
python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT
```

Phone separability:
```bash
python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET $CHECKPOINT_TO_LOAD --pathCheckpoint $PATH_CHECKPOINT --pathPhone $PATH_TO_PHONE_LABELS
```

You can also concatenate the output features of several models by providing several checkpoints. For example, the following command line:

```bash
python cpc/eval/linear_separability.py $PATH_DB $TRAINING_SET $VAL_SET model1.pt model2.pt --pathCheckpoint $PATH_CHECKPOINT
```

will evaluate the speaker separability of the concatenation of the features from model1 and model2.

### ABX score

You can compute the ABX score on the [Zerospeech2017 dataset](https://zerospeech.com/2017/index.html). To begin, download the dataset [here](https://download.zerospeech.com/). Then run the ABX evaluation on a given checkpoint with:

```bash
python ABX.py from_checkpoint $PATH_CHECKPOINT $PATH_ITEM_FILE $DATASET_PATH --seq_norm --strict --file_extension .wav --out $PATH_OUT
```
Where:
- $PATH_CHECKPOINT is the path to the checkpoint to evaluate
- $PATH_ITEM_FILE is the path to the .item file containing the triplet annotations
- $DATASET_PATH is the path to the directory containing the audio files
- $PATH_OUT is the path to the directory into which the results should be dumped
- --seq_norm normalizes each batch of features across the time channel before computing ABX
- --strict forces each batch of features to contain exactly the same number of frames

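For intuition, an ABX test asks whether a token X of category a is closer to a token A of the same category than to a token B of a different category b; the ABX error rate is the fraction of triplets for which this fails. A minimal sketch under simplifying assumptions (equal-length sequences and a mean frame-wise cosine distance, instead of the DTW-based distance typically used):

```python
import torch

def abx_correct(A, B, X):
    """Return 1.0 if X (same category as A) is closer to A than to B."""
    def dist(u, v):
        # Mean frame-wise cosine distance between (frames, dim) feature sequences.
        return 1 - torch.nn.functional.cosine_similarity(u, v, dim=1).mean()
    return float(dist(X, A) < dist(X, B))

A, B, X = (torch.randn(20, 256) for _ in range(3))
print(abx_correct(A, B, X))
```
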
### Cross-lingual transfer

To begin, download the Common Voice datasets [here](https://voice.mozilla.org/en/datasets). You will also need to download our phoneme annotations and our train / val / test splits for each language [here](https://dl.fbaipublicfiles.com/cpc_audio/common_voices_splits.tar.gz). Then unzip your data at PATH_COMMON_VOICES.
Unfortunately, the audio files in Common Voice don't have the same sampling rate as in Librispeech. Thus you'll need to convert them to 16kHz audio using the command:

```bash
DIR_CC=$PATH_COMMON_VOICES
for x in fr zh it ru nl sv es tr tt ky; do python cpc/eval/utils/adjust_sample_rate.py ${DIR_CC}/${x}/clips ${DIR_CC}/${x}/validated_phones_reduced.txt ${DIR_CC}/${x}/clips_16k; done
```

You can now run the experiments described in the paper. To begin, you must train the linear classifier. You will find below the instructions for the Spanish dataset; you can run the experiments on any other language in the same fashion.

#### Frozen features

To run the training on frozen features with the one-hour dataset, just run:

```bash
python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_1.0_uniform_new_version.txt --freeze -o $OUTPUT_DIR
```

#### Fine tuning

The command to run the fine-tuning experiments on the five-hour dataset is quite similar; note that the --freeze flag is dropped so that the CPC features are updated as well. For example, for Spanish you would run:
```bash
python cpc/eval/common_voices_eval.py train $PATH_COMMON_VOICES/es/clips_16k $PATH_COMMON_VOICES/es/validated_phones_reduced.txt $CHECKPOINT_TO_TEST --pathTrain $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt --pathVal $PATH_COMMON_VOICES/es/trainSeqs_5.0_uniform_new_version.txt -o $OUTPUT_DIR
```

#### PER

Once the training is done, you can compute the associated phone error rate (PER) on the test subset. To do so, just run:

```bash
python cpc/eval/common_voices_eval.py per $OUTPUT_DIR --pathVal $PATH_COMMON_VOICES/es/testSeqs_uniform_new_version.txt --pathPhone $PATH_COMMON_VOICES/es/validated_phones_reduced.txt
```

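For reference, PER is the Levenshtein (edit) distance between the predicted and the reference phone sequences, divided by the length of the reference. A minimal Python sketch of that computation (illustrative only; not the repository's implementation):

```python
def per(ref, hyp):
    """Phone error rate: edit distance between phone sequences / reference length."""
    # Dynamic-programming edit distance over insertions, deletions, substitutions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            d[i][j] = min(d[i - 1][j] + 1,                  # deletion
                          d[i][j - 1] + 1,                  # insertion
                          d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]))  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

print(per(["a", "b", "c"], ["a", "c"]))  # one deletion over 3 phones -> 0.33
```
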
## torch hub

This model is also available via [torch.hub](https://pytorch.org/docs/stable/hub.html). For more details, have a look at hubconf.py.

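As a sketch, loading the model through torch.hub looks roughly like the following; the entry-point name used here is an assumption on our part, so check hubconf.py for the exact one:

```python
import torch

# Entry-point name assumed for illustration; see hubconf.py for the real one.
model = torch.hub.load('facebookresearch/CPC_audio', 'CPC_audio', pretrained=True)
print(model)  # inspect the encoder and autoregressive modules
```
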
## License

CPC_audio is MIT licensed, as found in the LICENSE file.
# Repository architecture

- train.py: main script.
- dataset.py: definition of the Librispeech dataset format.
- model.py: basic encoders and AR models.
- feature_loader.py: tools to load and save a CPC model.
- transformers.py: an implementation of transformers.
- unit_tests.py: unit tests.
- criterion/: definition of the training criteria. Three criteria are currently available: CPC (unsupervised, sketched after this list), speaker classification, and phone classification.
- eval/: evaluation scripts.
- utils/: system utilities and misc.

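For orientation, the unsupervised CPC criterion is an InfoNCE-style contrastive loss: the predictor must pick the true future encoding out of a set of sampled negatives. A minimal sketch under stated assumptions (random tensors, generic shapes; the actual code lives in criterion/):

```python
import torch
import torch.nn.functional as F

def info_nce(pred, pos, negs):
    """InfoNCE-style CPC loss sketch.

    pred: (batch, dim)         predicted future representation
    pos:  (batch, dim)         true future encoding (the positive)
    negs: (batch, n_neg, dim)  sampled negative encodings
    """
    pos_score = (pred * pos).sum(-1, keepdim=True)      # (batch, 1)
    neg_score = torch.einsum('bd,bnd->bn', pred, negs)  # (batch, n_neg)
    logits = torch.cat([pos_score, neg_score], dim=1)   # positive sits at index 0
    return F.cross_entropy(logits, torch.zeros(len(pred), dtype=torch.long))

loss = info_nce(torch.randn(8, 256), torch.randn(8, 256), torch.randn(8, 16, 256))
```
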
```python
# Copyright (c) Facebook, Inc. and its affiliates.
#
# This source code is licensed under the MIT license found in the
# LICENSE file in the root directory of this source tree.
```