Replies: 3 comments
-
Train for Python150k

The script:
-
Currently running the following command to apply BPE on the vocabulary:
Getting the following error:
Sizes of the generated files:
-
Train

A folder which contains binaries for training:
A command for training:
-
To use TransCoder it is necessary to understand the data. After Python150k preprocessing we get functions:
And their docstrings:
The following TO-DO list describes how to handle this data with TransCoder:
- `.tok` and `.pth` data formats;
- Consider the BigQuery dataset described here: TransCoder.
Let's consider `src/data/loader.py` to get into the inner contents of `valid.python.pth`.
In the same way we can consider an example of a Python dataset in TransCoder:
`/source-code-summarization/transcoder/transcoder/data/test_dataset/python`
After running `pytest preprocessing/test_preprocess.py` I got the following type of data:
In order to run preprocess:
Current problem:
Fixed: store both JSONs in `gzip`-compressed mode.
Then after preprocess we get the following `.XLM-syml` folder:
Now transform it to be put onto a single GPU. That should be done at the end, see the issue.
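The fix above — storing both JSONs in gzip-compressed mode — can be sketched as follows (the file name and record layout here are hypothetical; the text only says "both JSONs"):

```python
import gzip
import json

# Hypothetical records standing in for the real function/docstring JSONs.
records = [{"function": "def f(x): return x", "docstring": "toy example"}]

# Write the JSON through a gzip stream instead of a plain file.
with gzip.open("functions.json.gz", "wt", encoding="utf-8") as fh:
    json.dump(records, fh)

# Reading is symmetric: gzip.open decompresses transparently.
with gzip.open("functions.json.gz", "rt", encoding="utf-8") as fh:
    restored = json.load(fh)

assert restored == records
```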
For now just create a symlink from `train.python.0.pth` to `train.python.pth`.
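Assuming both files live in the same preprocessed data folder, the symlink can be created with:

```shell
# Point train.python.pth at the single shard train.python.0.pth
# (-s makes a symbolic link, -f replaces an existing target).
ln -sf train.python.0.pth train.python.pth
```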
A command to pretrain with MLM:
Current problem:
Reduced the model size from 77M to 19M parameters via the embeddings and `n_heads`, moved `batch_size` from 32 to 16, and started training.
Current: `Volatile GPU-Util ERR!`. Explanation: `nvidia-smi` is not supported on WSL2 yet.

TransCoder Preprocessing
Obtain the JSONs:
Apply tokenization:
How does preprocess work:
How does binarization work:
Operating with the class `Dictionary` located in `XLM/src/data/dictionary.py`. The static function `Dictionary.index_data` fills the following fields:
Where:
- `dico` is an instance of the `Dictionary` object. It stores `id2word`, `word2id`, `counts`, and special-token indices such as `bos_index`, `eos_index`, `pad_index`, `unk_index`;
- `positions` is an `np.ndarray` of tuples storing the `(beginning, length)` of sentences;
- `sentences` is an `np.ndarray` storing word indices for every sentence, without padding;
- `unk_words` counts the number of occurrences of each unknown word.

Afterwards `data` is saved with `torch.save`.
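A minimal sketch of the structure these fields form (only the field names and types come from the text; the toy tokenizer, vocabulary, and special-token ids are assumptions):

```python
import numpy as np

# Toy stand-in for Dictionary: word2id/id2word plus special-token indices.
word2id = {"<bos>": 0, "<eos>": 1, "<pad>": 2, "<unk>": 3,
           "def": 4, "f": 5, "(": 6, "x": 7, ")": 8, ":": 9}
id2word = {i: w for w, i in word2id.items()}
unk_index = word2id["<unk>"]

corpus = ["def f ( x ) :", "return x"]  # "return" is out-of-vocabulary here

positions, flat, unk_words = [], [], {}
for sent in corpus:
    tokens = sent.split()
    positions.append((len(flat), len(tokens)))  # (beginning, length)
    for tok in tokens:
        idx = word2id.get(tok, unk_index)
        if idx == unk_index:
            unk_words[tok] = unk_words.get(tok, 0) + 1
        flat.append(idx)

data = {
    "positions": np.array(positions),  # np.ndarray of (beginning, length)
    "sentences": np.array(flat),       # word indices, no padding
    "unk_words": unk_words,            # occurrence counts of unknown words
}
# Afterwards a dict like this (together with the dictionary itself)
# is what gets saved with torch.save.
```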
A closer look at the file structure after preprocessing:
The main suffixes are `.functions_class` and `.functions_standalone`. Consider samples from both of them:
`.functions_class`:
`.functions_standalone`:
`standalone` refers to functions defined outside any class; `functions_class` entries are methods of certain classes.
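To make that distinction concrete, a small illustrative example (not taken from the dataset):

```python
# functions_standalone: defined at module level, outside any class.
def area(width, height):
    return width * height

# functions_class: the same computation as a method of a class.
class Rectangle:
    def __init__(self, width, height):
        self.width = width
        self.height = height

    def area(self):
        return self.width * self.height
```

Both compute the same thing; the suffix only records where the definition lives.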