@@ -282,7 +282,7 @@ have to reconstruct those words from context. We call this a "masked LM" but it
python 3+ tensorflow 1.10
## Implementation Details
- 1 . what share and not share beteween pre-train and fine-tuning stages?
+ 1 . what is shared and what is not shared between the pre-train and fine-tuning stages?
1) basically, all parameters of the backbone network are shared between the pre-train and fine-tuning stages.
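Below is a minimal sketch of how that sharing might be realized in TensorFlow 1.x: restore only the shared backbone variables from the pre-training checkpoint before fine-tuning, and leave the task-specific output layer freshly initialized. The scope name `backbone` and the checkpoint path are illustrative assumptions, not the repository's actual names.

```python
import tensorflow as tf

# Illustrative sketch (assumed names, not the repo's actual code).
# After the fine-tuning graph has been built, collect the backbone variables
# and restore them from the pre-training checkpoint; the new output layer
# keeps its random initialization.
backbone_vars = tf.get_collection(tf.GraphKeys.GLOBAL_VARIABLES,
                                  scope='backbone')        # assumed scope name
saver = tf.train.Saver(var_list=backbone_vars)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())            # init all variables
    saver.restore(sess, 'checkpoint_pretrain/model.ckpt')  # assumed ckpt path
    # ... continue with fine-tuning on the downstream task ...
```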
@@ -296,7 +296,7 @@ python 3+ tensorflow 1.10
to make things easy, we generate sentences from documents by splitting each document into sentences. for each sentence
- we trancuate and padding it to same length, and random select a word, then replace it with [ MASK] , its self and a random
+ we truncate and pad it to the same length, then randomly select a word and replace it with [MASK], itself, or a random
word.
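The masking step can be sketched in a few lines of plain Python. The 80/10/10 split between [MASK], a random word, and the original word follows the common BERT recipe and is an assumption here, as are the [PAD] token and the helper names; the repository may use different choices.

```python
import random

def pad_or_truncate(tokens, max_len, pad_token='[PAD]'):
    """Force every sentence to the same length (assumed pad token)."""
    tokens = tokens[:max_len]
    return tokens + [pad_token] * (max_len - len(tokens))

def mask_one_token(tokens, vocab, mask_token='[MASK]'):
    """Pick one position at random and corrupt it for the masked LM task.

    Returns the corrupted tokens, the chosen position, and the original word
    (the label the model has to reconstruct from context).
    """
    pos = random.randrange(len(tokens))
    label = tokens[pos]
    dice = random.random()
    if dice < 0.8:                      # assumed 80%: replace with [MASK]
        tokens[pos] = mask_token
    elif dice < 0.9:                    # assumed 10%: replace with a random word
        tokens[pos] = random.choice(vocab)
    # remaining assumed 10%: keep the original word unchanged
    return tokens, pos, label

# toy usage: corrupt a sentence, then pad it to a fixed length
corrupted, pos, label = mask_one_token(['the', 'cat', 'sat', 'on', 'the', 'mat'],
                                       vocab=['dog', 'tree', 'runs'])
corrupted = pad_or_truncate(corrupted, max_len=8)
```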
@@ -308,7 +308,7 @@ python 3+ tensorflow 1.10
1 . why do we need self-attention?
- self-attention a new type of network recently gain more and more attention. traditonally we use
+ self-attention is a new type of network that has recently gained more and more attention. traditionally we use
rnn or cnn to solve such problems. however, rnn is hard to parallelize, and cnn is not good at modeling position-sensitive tasks.
@@ -317,13 +317,13 @@ python 3+ tensorflow 1.10
2 . what is multi-head self-attention, and what do q, k and v stand for?
- mulit-heads self-attention is a self-attention, while it divide and project q and k into serveral different subspace,
+ multi-head self-attention is self-attention in which q, k and v are divided and projected into several different subspaces,
and attention is then computed in each subspace; the results are concatenated.
q stands for queries and k stands for keys. for a machine translation task, q is the previous hidden state of the decoder, and k represents
- hidden states of encoder. each of element of k will compute a similiarity score with q. and then softmax will be used
+ the hidden states of the encoder. each element of k computes a similarity score with q, and then softmax is used
to normalize the scores into weights. finally, a weighted sum is computed by applying the weights to v.
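A small numpy sketch of that computation follows (single head for clarity; the multi-head variant runs the same routine on several learned projections of q, k and v and concatenates the results). The 1/sqrt(d_k) scaling follows the standard Transformer formulation, and the shapes below are just toy values.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """q: [n_q, d_k], k: [n_k, d_k], v: [n_k, d_v] -> output: [n_q, d_v]."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)     # similarity of every query with every key
    weights = softmax(scores, axis=-1)  # normalize scores into attention weights
    return weights @ v                  # weighted sum over the values

# toy shapes: 2 queries, 5 keys/values, d_k = d_v = 64
q = np.random.randn(2, 64)
k = np.random.randn(5, 64)
v = np.random.randn(5, 64)
out = scaled_dot_product_attention(q, k, v)  # shape (2, 64)
```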
@@ -367,7 +367,7 @@ inside model/transform_model.py, there is a train and predict method.
first you can run train() to start training, then run predict() to make predictions with the trained model.
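A minimal usage sketch, under the assumption that train and predict can be imported as module-level entry points; the exact import path and signatures should be checked against model/transform_model.py.

```python
# illustrative only; verify the real entry points in model/transform_model.py
from model.transform_model import train, predict  # assumed import path

train()    # trains the transformer and saves checkpoints as it goes
predict()  # restores the trained checkpoint and runs prediction
```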
- as the model is pretty big, with default hyperparamter(d_model=512, h=8,d_v=d_k=64,num_layer=6), it require lots of data before it can converage .
+ as the model is pretty big, with the default hyperparameters (d_model=512, h=8, d_v=d_k=64, num_layer=6), it requires lots of data before it can converge.
at least 10k steps are needed before the loss becomes less than 0.1. if you want to train it fast with small