Replies: 3 comments
Train for Python150k
The script
Currently running the following command to apply BPE on the vocabulary:
Getting the following error:
Sizes of generated files like
Train
A folder that contains the binaries for training:
A command for training:
To use TransCoder it is necessary to understand the data. After Python150k preprocessing we get functions:
And their docstrings:
The following TO-DO list describes how to handle this data with TransCoder:
.tok and .pth data formats;
Consider the BigQuery dataset described here: TransCoder.
Let's look at src/data/loader.py to get at the inner contents of valid.python.pth.
In the same way, we can look at an example Python dataset in TransCoder:
/source-code-summarization/transcoder/transcoder/data/test_dataset/python
After running pytest preprocessing/test_preprocess.py I got the following type of data:
In order to run preprocess:
Current problem:
Fixed: store both JSONs in gzip-compressed mode.
Then after preprocess we get the following .XLM-syml folder:
Now transform it so that it fits onto a single GPU. That should be done at the end, see issue.
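The gzip fix mentioned above can be illustrated with a minimal sketch. It assumes the JSON files hold one object per line; the file name and the record fields below are hypothetical placeholders, not the ones used in this thread.

```python
import gzip
import json
from pathlib import Path


def write_json_gz(records, path):
    """Write an iterable of dicts as gzip-compressed JSON, one object per line."""
    path = Path(path)
    with gzip.open(path, "wt", encoding="utf-8") as handle:
        for record in records:
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return path


def read_json_gz(path):
    """Read a gzip-compressed JSON Lines file back into a list of dicts."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


if __name__ == "__main__":
    # Hypothetical records; the real field names depend on the preprocessing script.
    sample = [{"name": "add", "content": "def add(a, b):\n    return a + b"}]
    out = write_json_gz(sample, "sample.json.gz")
    print(read_json_gz(out) == sample)
```

Opening the file in text mode ("wt"/"rt") lets gzip handle the encoding, so the rest of the code can treat it like an ordinary text file.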
For now just create a symlink from train.python.0.pth to train.python.pth.
A command to pretrain with MLM:
Current problem:
Reduced model size from 77M to 19M by shrinking the embeddings and n_heads, lowered batch_size from 32 to 16, started training.
Current: Volatile GPU-Util ERR!. Explanation: nvidia-smi is not supported on WSL2 yet.
TransCoder Preprocessing
Obtain JSONs:
Apply tokenization:
How does preprocess work:
How does binarization work:
Binarization operates on the class Dictionary located in XLM/src/data/dictionary.py. The static function Dictionary.index_data fills in the following fields:
Where:
dico -- an instance of the Dictionary object. Stores id2word, word2id, counts and special token indices such as: bos_index, eos_index, pad_index, unk_index.
positions is an np.ndarray of tuples storing (beginning, length) of sentences;
sentences is an np.ndarray storing word indices for every sentence without padding;
unk_words counts the number of occurrences of each word that is unknown.
Afterwards data is saved with torch.save.
A closer look at the file structure after preprocessing:
The main suffixes are .functions_class and .functions_standalone. Consider samples from both of them:
.functions_class:
.functions_standalone:
standalone refers to functions defined outside of any class; functions_class are methods of certain classes.
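To make the distinction concrete, here is a minimal sketch that separates top-level functions from class methods using Python's standard ast module. It is only an illustration of the two categories, not the extraction code TransCoder itself uses; the helper name split_functions and the sample source are made up for this example.

```python
import ast


def split_functions(source):
    """Return (standalone, methods): names of top-level functions and of
    functions defined directly inside a top-level class body."""
    tree = ast.parse(source)
    func_types = (ast.FunctionDef, ast.AsyncFunctionDef)
    standalone = [node.name for node in tree.body if isinstance(node, func_types)]
    methods = [
        f"{cls.name}.{item.name}"
        for cls in tree.body
        if isinstance(cls, ast.ClassDef)
        for item in cls.body
        if isinstance(item, func_types)
    ]
    return standalone, methods


SAMPLE = '''
def add(a, b):
    return a + b

class Stack:
    def push(self, x):
        self.items.append(x)

    def pop(self):
        return self.items.pop()
'''

standalone, methods = split_functions(SAMPLE)
print(standalone)  # ['add']
print(methods)     # ['Stack.push', 'Stack.pop']
```

Nested functions and nested classes are ignored here on purpose, to keep the sketch short.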
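Returning to the binarization step: the layout of positions, sentences and unk_words described earlier can be sketched in a few lines. This is a toy re-implementation that follows the description in this thread, not XLM's Dictionary.index_data itself. The vocabulary is hypothetical, plain lists stand in for np.ndarray, and the real code may differ in details such as whether the second element of a position is a length or an end offset, or whether sentences are separated by an end-of-sentence id.

```python
def index_sentences(lines, word2id, unk_index):
    """Flatten whitespace-tokenized lines into one list of word ids."""
    positions = []  # one (beginning, length) pair per sentence
    sentences = []  # all word ids, concatenated without padding
    unk_words = {}  # unknown word -> number of occurrences
    for line in lines:
        start = len(sentences)
        words = line.split()
        for word in words:
            if word in word2id:
                sentences.append(word2id[word])
            else:
                sentences.append(unk_index)
                unk_words[word] = unk_words.get(word, 0) + 1
        positions.append((start, len(words)))
    return {"positions": positions, "sentences": sentences, "unk_words": unk_words}


def get_sentence(data, i):
    """Recover the word ids of sentence i from the flat layout."""
    start, length = data["positions"][i]
    return data["sentences"][start:start + length]


# Hypothetical toy vocabulary; ids 0-3 are left free for special tokens.
TOY_WORD2ID = {"def": 4, "f": 5, "(": 6, ")": 7, ":": 8}
TOY_UNK_INDEX = 3

data = index_sentences(["def f ( ) :", "def g ( ) :"], TOY_WORD2ID, TOY_UNK_INDEX)
print(data["positions"])      # [(0, 5), (5, 5)]
print(get_sentence(data, 1))  # [4, 3, 6, 7, 8]
print(data["unk_words"])      # {'g': 1}
```

Storing one flat id list plus offsets avoids padding every sentence to the same length, which keeps the saved file compact.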