Replies: 3 comments
-
Train for Python150k

The script:
-
Currently running the following command to apply BPE on the vocabulary:
Getting the following error:
Sizes of the generated files:
-
Train

A folder which contains binaries for training:
A command for training:
-
To use TransCoder it is necessary to understand the data. After Python150k preprocessing we get functions:
And their docstrings:
The following TO-DO list describes how to handle this data with TransCoder:
- `.tok` and `.pth` data formats;
- Consider the BigQuery dataset described here: TransCoder.
Let's consider `src/data/loader.py` to get into the inner contents of `valid.python.pth`.
In the same way we can consider an example of a Python dataset in TransCoder:
`/source-code-summarization/transcoder/transcoder/data/test_dataset/python`
After running `pytest preprocessing/test_preprocess.py` I got the following type of data:
In order to run preprocess:
Current problem:
Fixed: store both JSONs in `gzip`-compressed mode.
Then after preprocess we get the following `.XLM-syml` folder:
Now transform it to be put onto a single GPU. That should be done at the end, see the issue.
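The fix above — storing both JSONs in gzip-compressed mode — can be sketched as follows (the file name and record layout here are hypothetical; the text only says "both JSONs"):

```python
import gzip
import json

# Hypothetical records standing in for the real function/docstring JSONs.
records = [{"function": "def f(x): return x", "docstring": "toy example"}]

# Write the JSON through a gzip stream instead of a plain file.
with gzip.open("functions.json.gz", "wt", encoding="utf-8") as fh:
    json.dump(records, fh)

# Reading is symmetric: gzip.open decompresses transparently.
with gzip.open("functions.json.gz", "rt", encoding="utf-8") as fh:
    restored = json.load(fh)

assert restored == records
```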
For now just create a symlink from `train.python.0.pth` to `train.python.pth`.
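Assuming both files live in the same preprocessed data folder, the symlink can be created with:

```shell
# Point train.python.pth at the single shard train.python.0.pth
# (-s makes a symbolic link, -f replaces an existing target).
ln -sf train.python.0.pth train.python.pth
```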
A command to pretrain with MLM:
Current problem:
Reduced the model size from 77M to 19M parameters via the embeddings and `n_heads`, moved `batch_size` from 32 to 16, and started training.
Current: `Volatile GPU-Util ERR!`. Explanation: `nvidia-smi` is not supported on WSL2 yet.

TransCoder Preprocessing
Obtain the JSONs:
Apply tokenization:
How does preprocess work:
How does binarization work:
Operating with the class `Dictionary` located in `XLM/src/data/dictionary.py`. The static function `Dictionary.index_data` fills the following fields:
Where:
- `dico` is an instance of the `Dictionary` object. It stores `id2word`, `word2id`, `counts`, and special-token indices such as `bos_index`, `eos_index`, `pad_index`, `unk_index`;
- `positions` is an `np.ndarray` of tuples storing the `(beginning, length)` of sentences;
- `sentences` is an `np.ndarray` storing word indices for every sentence, without padding;
- `unk_words` counts the number of occurrences of each unknown word.

Afterwards `data` is saved with `torch.save`.
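A minimal sketch of the structure these fields form (only the field names and types come from the text; the toy tokenizer, vocabulary, and special-token ids are assumptions):

```python
import numpy as np

# Toy stand-in for Dictionary: word2id/id2word plus special-token indices.
word2id = {"<bos>": 0, "<eos>": 1, "<pad>": 2, "<unk>": 3,
           "def": 4, "f": 5, "(": 6, "x": 7, ")": 8, ":": 9}
id2word = {i: w for w, i in word2id.items()}
unk_index = word2id["<unk>"]

corpus = ["def f ( x ) :", "return x"]  # "return" is out-of-vocabulary here

positions, flat, unk_words = [], [], {}
for sent in corpus:
    tokens = sent.split()
    positions.append((len(flat), len(tokens)))  # (beginning, length)
    for tok in tokens:
        idx = word2id.get(tok, unk_index)
        if idx == unk_index:
            unk_words[tok] = unk_words.get(tok, 0) + 1
        flat.append(idx)

data = {
    "positions": np.array(positions),  # np.ndarray of (beginning, length)
    "sentences": np.array(flat),       # word indices, no padding
    "unk_words": unk_words,            # occurrence counts of unknown words
}
# Afterwards a dict like this (together with the dictionary itself)
# is what gets saved with torch.save.
```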
A closer look at the file structure after preprocessing:
The main suffixes are `.functions_class` and `.functions_standalone`. Consider samples from both of them:
`.functions_class`:
`.functions_standalone`:
`standalone` refers to functions defined outside any class; `functions_class` entries are methods of certain classes.
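To make that distinction concrete, a small illustrative example (not taken from the dataset):

```python
# functions_standalone: defined at module level, outside any class.
def area(width, height):
    return width * height

# functions_class: the same computation as a method of a class.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height
```

Both compute the same thing; the suffix only records where the definition lives.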