
Commit 0f6e64c

Minor improvements (#20)
- Flores dataset importer
- Custom dataset importer
- Ability to use a pre-trained backward model
- Save experiment config on start
- Stubs for dataset caching (decided to sync the implementation with workflow manager integration)
- Use best BLEU models instead of best ce-mean-words
- Fix linting warnings
1 parent ec783cf commit 0f6e64c

15 files changed: +182 -47 lines

README.md

Lines changed: 4 additions & 1 deletion
@@ -133,14 +133,17 @@ TRAIN_DATASETS="opus_OPUS-ParaCrawl/v7.1 mtdata_newstest2019_ruen"
 TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18"
 ```
 
-Data source | Prefix | Name example | Type | Comments
+Data source | Prefix | Name examples | Type | Comments
 --- | --- | --- | ---| ---
 [MTData](https://github.com/thammegowda/mtdata) | mtdata | newstest2017_ruen | corpus | Supports many datasets. Run `mtdata list -l ru-en` to see datasets for a specific language pair.
 [OPUS](opus.nlpl.eu/) | opus | ParaCrawl/v7.1 | corpus | Many open source datasets. Go to the website, choose a language pair, check links under Moses column to see what names and version is used in a link.
 [SacreBLEU](https://github.com/mjpost/sacrebleu) | sacrebleu | wmt20 | corpus | Official evaluation datasets available in SacreBLEU tool. Recommended to use in `TEST_DATASETS`. Look up supported datasets and language pairs in `sacrebleu.dataset` python module.
+[Flores](https://github.com/facebookresearch/flores) | flores | dev, devtest | corpus | Evaluation dataset from Facebook that supports 100 languages.
+Custom parallel | custom-corpus | /tmp/test-corpus | corpus | Custom parallel dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"
 [Paracrawl](https://paracrawl.eu/) | paracrawl-mono | paracrawl8 | mono | Datasets that are crawled from the web. Only [mono datasets](https://paracrawl.eu/index.php/moredata) are used in this importer. Parallel corpus is available using opus importer.
 [News crawl](http://data.statmt.org/news-crawl) | news-crawl | news.2019 | mono | Some news monolingual datasets from [WMT21](https://www.statmt.org/wmt21/translation-task.html)
 [Common crawl](https://commoncrawl.org/) | commoncrawl | wmt16 | mono | Huge web crawl datasets. The links are posted on [WMT21](https://www.statmt.org/wmt21/translation-task.html)
+Custom mono | custom-mono | /tmp/test-mono | mono | Custom monolingual dataset that is already downloaded to a local disk. The dataset name is an absolute path prefix without ".lang.gz"
 
 You can also use [find-corpus](pipeline/utils/find-corpus.py) tool to find all datasets for an importer and get them formatted to use in config.
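For reference, the new importers follow the same `<prefix>_<name>` dataset naming used by the existing ones in the config; a minimal sketch (the dataset mixes and paths below are illustrative placeholders, not part of this commit):

# Flores evaluation sets: the dataset name is the split ("dev" or "devtest")
DEVTEST_DATASETS="flores_dev mtdata_newstest2019_ruen"
TEST_DATASETS="flores_devtest sacrebleu_wmt20"

# Custom datasets: the name is an absolute path prefix without ".<lang>.gz",
# so /tmp/test-corpus.ru.gz and /tmp/test-corpus.en.gz must already exist on disk
TRAIN_DATASETS="custom-corpus_/tmp/test-corpus opus_ParaCrawl/v7.1"
MONO_DATASETS_SRC="custom-mono_/tmp/test-mono news-crawl_news.2020"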

config.sh

Lines changed: 9 additions & 4 deletions
@@ -11,8 +11,10 @@ set -a
 
 WORKDIR=$(pwd)
 CUDA_DIR=/usr/local/cuda-11.2
-DATA_DIR=${DATA_DIR:-${WORKDIR}/data}
-MODELS_DIR=${MODELS_DIR:-${WORKDIR}/models}
+DATA_ROOT_DIR=${DATA_ROOT_DIR:-${WORKDIR}}
+DATA_DIR=${DATA_ROOT_DIR}/data
+MODELS_DIR=${DATA_ROOT_DIR}/models
+EXPERIMENTS_DIR=${DATA_ROOT_DIR}/experiments
 MARIAN=${MARIAN:-${WORKDIR}/3rd_party/marian-dev/build}
 CLEAN_TOOLS=${WORKDIR}/pipeline/clean/tools
 BIN=${WORKDIR}/bin
@@ -23,11 +25,14 @@ EXPERIMENT=test
 SRC=ru
 TRG=en
 
+# path to a pretrained backward model (optional)
+BACKWARD_MODEL=""
+
 # parallel corpus
 TRAIN_DATASETS="opus_ada83/v1 opus_UN/v20090831 opus_GNOME/v1 opus_wikimedia/v20210402 opus_CCMatrix/v1 opus_Wikipedia/v1.0 opus_tico-19/v2020-10-28 opus_KDE4/v2 opus_OpenSubtitles/v2018 opus_MultiUN/v1 opus_GlobalVoices/v2018q4 opus_ELRC_2922/v1 opus_PHP/v1 opus_Tatoeba/v2021-03-10 opus_Tanzil/v1 opus_XLEnt/v1.1 opus_TildeMODEL/v2018 opus_Ubuntu/v14.10 opus_TED2013/v1.1 opus_infopankki/v1 opus_EUbookshop/v2 opus_ParaCrawl/v8 opus_Books/v1 opus_WMT-News/v2019 opus_bible-uedin/v1 opus_WikiMatrix/v1 opus_QED/v2.0a opus_CCAligned/v1 opus_TED2020/v1 opus_News-Commentary/v16 opus_UNPC/v1.0"\
 " mtdata_cc_aligned mtdata_airbaltic mtdata_GlobalVoices_2018Q4 mtdata_UNv1_test mtdata_neulab_tedtalksv1_train mtdata_neulab_tedtalksv1_dev mtdata_wmt13_commoncrawl mtdata_czechtourism mtdata_paracrawl_bonus mtdata_worldbank mtdata_wiki_titles_v1 mtdata_WikiMatrix_v1 mtdata_wmt18_news_commentary_v13 mtdata_wiki_titles_v2 mtdata_news_commentary_v14 mtdata_UNv1_dev mtdata_neulab_tedtalksv1_test mtdata_JW300"
-DEVTEST_DATASETS="mtdata_newstest2019_ruen mtdata_newstest2017_ruen mtdata_newstest2015_ruen mtdata_newstest2014_ruen"
-TEST_DATASETS="sacrebleu_wmt20 sacrebleu_wmt18 sacrebleu_wmt16 sacrebleu_wmt13"
+DEVTEST_DATASETS="flores_dev mtdata_newstest2019_ruen mtdata_newstest2017_ruen mtdata_newstest2015_ruen mtdata_newstest2014_ruen"
+TEST_DATASETS="flores_devtest sacrebleu_wmt20 sacrebleu_wmt18 sacrebleu_wmt16 sacrebleu_wmt13"
 # monolingual datasets (ex. paracrawl-mono_paracrawl8, commoncrawl_wmt16, news-crawl_news.2020)
 # to be translated by the teacher model
 MONO_DATASETS_SRC="news-crawl_news.2020 news-crawl_news.2019 news-crawl_news.2018 news-crawl_news.2017 "\
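In practice the new DATA_ROOT_DIR and BACKWARD_MODEL settings would be overridden per experiment; a minimal sketch, assuming the variables are set before the pipeline is launched (the paths are placeholders, and the expected layout of a pretrained backward model is not shown in this commit):

# keep data/, models/ and experiments/ on a large disk instead of the repo checkout
DATA_ROOT_DIR=/mnt/storage

# reuse an existing backward model instead of training one from scratch;
# leave empty (the default) to train it as part of the pipeline
BACKWARD_MODEL=/mnt/storage/models/ru-en/backward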

pipeline/clean/ce-filter.sh

Lines changed: 2 additions & 2 deletions
@@ -29,7 +29,7 @@ fi
 
 # Part of the data to be removed (0.05 is 5%)
 remove=0.05
-model="${model_dir}/model.npz.best-ce-mean-words.npz"
+model="${model_dir}/model.npz.best-bleu-detok.npz"
 vocab="${model_dir}/vocab.spm"
 output_dir=$(dirname "${output_prefix}")
 dir="${output_dir}/scored"
@@ -68,7 +68,7 @@ echo "### Sorting scores"
 if [ ! -s "${dir}/sorted.gz" ]; then
   buffer_size="$(echo "$(grep MemTotal /proc/meminfo | awk '{print $2}')"*0.9 | bc | cut -f1 -d.)"
   paste "${dir}/scores.nrm.txt" "${dir}/corpus.${SRC}" "${dir}/corpus.${TRG}" |
-    LC_ALL=C sort -n -k1,1 -S "${buffer_size}K" |
+    LC_ALL=C sort -n -k1,1 -S "${buffer_size}K" -T "${dir}" |
     pigz >"${dir}/sorted.gz"
 fi

pipeline/clean/clean-mono.sh

Lines changed: 1 addition & 1 deletion
@@ -35,7 +35,7 @@ test -s "${output}.${lang}.gz" || test -s "${output}.${lang}.nrm.gz" ||
 echo "### Deduplication"
 test -s "${output}.${lang}.gz" || test -s "${output}.${lang}.nrm.uniq.gz" ||
   pigz -dc "${output}.${lang}.nrm.gz" |
-  LC_ALL=C sort -S 10G |
+  LC_ALL=C sort -S 10G -T "${output}" |
   uniq |
   pigz >"${output}.${lang}.nrm.uniq.gz"

pipeline/data/download-corpus.sh

Lines changed: 2 additions & 1 deletion
@@ -15,6 +15,7 @@ test -v SRC
 test -v TRG
 
 prefix=$1
+cache=$2
 
 src_corpus="${prefix}.${SRC}.gz"
 trg_corpus="${prefix}.${TRG}.gz"
@@ -25,7 +26,7 @@ mkdir -p "${dir}"
 if [ ! -e "${trg_corpus}" ]; then
   echo "### Downloading datasets"
 
-  for dataset in "${@:2}"; do
+  for dataset in "${@:3}"; do
     echo "### Downloading dataset ${dataset}"
     name=${dataset#*_}
     type=${dataset%%_*}
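With the new cache argument, the positional interface is now prefix, cache, then the dataset list; a hypothetical invocation (paths are placeholders, SRC/TRG and the other config variables are assumed to be exported, and caching itself is still a stub in this commit):

bash pipeline/data/download-corpus.sh \
  /data/ru-en/original/corpus \
  /data/cache \
  opus_ParaCrawl/v7.1 mtdata_newstest2019_ruen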

pipeline/data/download-eval.sh

Lines changed: 3 additions & 3 deletions
@@ -15,11 +15,11 @@ test -v WORKDIR
 test -v TEST_DATASETS
 
 dir=$1
+cache=$2
 
-
-for dataset in "${@:2}"; do
+for dataset in "${@:3}"; do
   name="${dataset//[^A-Za-z0-9_- ]/_}"
-  bash "${WORKDIR}/pipeline/data/download-corpus.sh" "${dir}/${name}" "${dataset}"
+  bash "${WORKDIR}/pipeline/data/download-corpus.sh" "${dir}/${name}" "${cache}" "${dataset}"
 
   test -e "${dir}/${name}.${SRC}" || pigz -dk "${dir}/${name}.${SRC}.gz"
   test -e "${dir}/${name}.${TRG}" || pigz -dk "${dir}/${name}.${TRG}.gz"

pipeline/data/download-mono.sh

Lines changed: 2 additions & 1 deletion
@@ -14,6 +14,7 @@ echo "###### Downloading monolingual data"
 lang=$1
 max_sent=$2
 prefix=$3
+cache=$4
 
 file_name="${prefix}.${lang}.gz"
 dir=$(dirname "${prefix}")/mono
@@ -23,7 +24,7 @@ if [ ! -e "${file_name}" ]; then
   mkdir -p "${dir}"
   coef=0.1
 
-  for dataset in "${@:4}"; do
+  for dataset in "${@:5}"; do
     echo "### Downloading dataset ${dataset}"
     source_prefix="${dir}/${dataset}.original.${lang}"
     gz_path="${dir}/${dataset}.${lang}.gz"
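The monolingual downloader now takes the cache directory as its fourth argument, before the dataset list; a hypothetical invocation (the paths and the sentence limit are placeholders, and the config variables are assumed to be exported):

bash pipeline/data/download-mono.sh \
  ru \
  100000000 \
  /data/ru-en/original/mono \
  /data/cache \
  news-crawl_news.2020 news-crawl_news.2019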

custom-corpus.sh (new file)

Lines changed: 24 additions & 0 deletions

#!/bin/bash
##
# Use custom dataset that is already downloaded to a local disk
# Local path prefix without `.<lang_code>.gz` should be specified as a "dataset" parameter
#
# Usage:
# bash custom-corpus.sh source target dir dataset
#

set -x
set -euo pipefail

echo "###### Copying custom corpus"

src=$1
trg=$2
dir=$3
dataset=$4

cp "${dataset}.${src}.gz" "${dir}/"
cp "${dataset}.${trg}.gz" "${dir}/"

echo "###### Done: Copying custom corpus"

flores.sh (new file)

Lines changed: 52 additions & 0 deletions

#!/bin/bash
##
# Downloads flores dataset
# Dataset type can be "dev" or "devtest"
#
# Usage:
# bash flores.sh source target dir dataset
#

set -x
set -euo pipefail

echo "###### Downloading flores corpus"

src=$1
trg=$2
dir=$3
dataset=$4

tmp="${dir}/flores"
mkdir -p "${tmp}"

test -s "${tmp}/flores101_dataset.tar.gz" ||
  wget -O "${tmp}/flores101_dataset.tar.gz" "https://dl.fbaipublicfiles.com/flores101/dataset/flores101_dataset.tar.gz"

tar -xzf "${tmp}/flores101_dataset.tar.gz" -C "${tmp}" --no-same-owner

source "${WORKDIR}/pipeline/setup/activate-python.sh"

flores_code() {
  code=$1

  if [ "${code}" == "zh" ] || [ "${code}" == "zh-Hans" ]; then
    flores_code="zho_simpl"
  elif [ "${code}" == "zh-Hant" ]; then
    flores_code="zho_trad"
  else
    flores_code=$(python -c "from mtdata.iso import iso3_code; print(iso3_code('${code}', fail_error=True))")
  fi

  echo "${flores_code}"
}

src_flores=$(flores_code "${src}")
trg_flores=$(flores_code "${trg}")

cp "${tmp}/flores101_dataset/${dataset}/${src_flores}.${dataset}" "${dir}/flores.${src}"
cp "${tmp}/flores101_dataset/${dataset}/${trg_flores}.${dataset}" "${dir}/flores.${trg}"

rm -rf "${tmp}"

echo "###### Done: Downloading flores corpus"

custom-mono.sh (new file)

Lines changed: 22 additions & 0 deletions

#!/bin/bash
##
# Use custom monolingual dataset that is already downloaded to a local disk
# Local path prefix without `.<lang_code>.gz` should be specified as a "dataset" parameter
#
# Usage:
# bash custom-mono.sh lang output_prefix dataset
#

set -x
set -euo pipefail

echo "###### Copying custom monolingual dataset"

lang=$1
output_prefix=$2
dataset=$3

cp "${dataset}.${lang}.gz" "${output_prefix}.${lang}.gz"

echo "###### Done: Copying custom monolingual dataset"
