Commit f187e12

2 parents: 0e38c84 + 0905cf4

File tree: 1 file changed (+6 −6)

README.md

Lines changed: 6 additions & 6 deletions
@@ -282,7 +282,7 @@ have to reconstruct those words from context. We call this a "masked LM" but it
 python 3+ tensorflow 1.10
 
 ## Implementation Details
-1. what share and not share beteween pre-train and fine-tuning stages?
+1. what share and not share between pre-train and fine-tuning stages?
 
 1).basically, all of parameters of backbone network used by pre-train and fine-tuning stages are shared each other.
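
The shared-backbone idea this hunk's README text describes can be pictured with a small TensorFlow 1.x variable-scope sketch. This is not the repository's actual code; the scope names, layer sizes, and head functions below are illustrative assumptions, and only the reuse pattern matters.

```python
import tensorflow as tf  # the README above targets python 3+ / tensorflow 1.10

def backbone(x):
    # every variable created here is reused by both stages, so the weights
    # learned during pre-training are the ones fine-tuning starts from
    with tf.variable_scope("backbone", reuse=tf.AUTO_REUSE):
        h = tf.layers.dense(x, 512, activation=tf.nn.relu, name="layer_1")
        return tf.layers.dense(h, 512, activation=tf.nn.relu, name="layer_2")

def pretrain_logits(x, vocab_size):
    # pre-training head (masked-word prediction): its own, unshared variables
    with tf.variable_scope("masked_lm_head"):
        return tf.layers.dense(backbone(x), vocab_size)

def finetune_logits(x, num_classes):
    # fine-tuning head (task classifier): also unshared, but it sits on top of
    # the very same backbone variables that pre-training updated
    with tf.variable_scope("task_head"):
        return tf.layers.dense(backbone(x), num_classes)
```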

@@ -296,7 +296,7 @@ python 3+ tensorflow 1.10
 
 to make things easily, we generate sentences from documents, split them into sentences. for each sentence
 
-we trancuate and padding it to same length, and random select a word, then replace it with [MASK], its self and a random
+we truncate and padding it to same length, and randomly select a word, then replace it with [MASK], its self and a random
 
 word.
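
The data-preparation step the changed line describes (truncate/pad each sentence, pick one word at random, then replace it with [MASK], keep it as-is, or swap in a random word) can be sketched roughly as below. The 80/10/10 split, the helper name, and the padding symbol are assumptions for illustration, not the repository's exact recipe.

```python
import random

MASK_TOKEN = "[MASK]"   # mask symbol mentioned in the README text
PAD_TOKEN = "[PAD]"     # assumed padding symbol

def make_masked_example(tokens, vocab, max_len=128):
    """Truncate/pad one sentence and mask a single randomly chosen position."""
    tokens = tokens[:max_len]
    tokens = tokens + [PAD_TOKEN] * (max_len - len(tokens))

    # pick a real (non-padding) position whose word the model must recover
    real_positions = [i for i, t in enumerate(tokens) if t != PAD_TOKEN]
    pos = random.choice(real_positions)
    original = tokens[pos]

    r = random.random()
    if r < 0.8:
        tokens[pos] = MASK_TOKEN             # usually replace with [MASK]
    elif r < 0.9:
        pass                                 # sometimes keep the word itself
    else:
        tokens[pos] = random.choice(vocab)   # sometimes use a random word

    # the training target is `original` at position `pos`
    return tokens, pos, original
```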

@@ -308,7 +308,7 @@ python 3+ tensorflow 1.10
 
 1. why we need self-attention?
 
-self-attention a new type of network recently gain more and more attention. traditonally we use
+self-attention a new type of network recently gain more and more attention. traditionally we use
 
 rnn or cnn to solve problem. however rnn has a problem in parallel, and cnn is not good at model position sensitive tasks.

@@ -317,13 +317,13 @@ python 3+ tensorflow 1.10
 
 2. what is multi-heads self-attention, what does q,k,v stand for? add something here.
 
-mulit-heads self-attention is a self-attention, while it divide and project q and k into serveral different subspace,
+mulit-heads self-attention is a self-attention, while it divide and project q and k into several different subspace,
 
 then do attention.
 
 q stand for question, k stand for keys. for machine translation task, q is previous hidden state of decodes, k represent
 
-hidden states of encoder. each of element of k will compute a similiarity score with q. and then softmax will be used
+hidden states of encoder. each of element of k will compute a similarity score with q. and then softmax will be used
 
 to do normalize score, we will get weights. finally a weighted sum is computed by using weights apply to v.
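
The mechanics spelled out in this hunk (split q and k into several subspaces, score each element of k against q, softmax the scores into weights, then take a weighted sum of v) look roughly like the NumPy sketch below. It omits the learned projection matrices and is only an illustration, not the code in model/transform_model.py.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(q, k, v, num_heads=8):
    """q: [batch, len_q, d_model]; k, v: [batch, len_k, d_model]."""
    batch, len_q, d_model = q.shape
    d_head = d_model // num_heads  # e.g. 512 / 8 = 64, matching d_k = d_v = 64

    def split_heads(x):
        # divide d_model into several subspaces: [batch, heads, len, d_head]
        return x.reshape(batch, -1, num_heads, d_head).transpose(0, 2, 1, 3)

    qh, kh, vh = split_heads(q), split_heads(k), split_heads(v)

    # each element of k gets a similarity score with q (scaled dot product)
    scores = np.matmul(qh, kh.transpose(0, 1, 3, 2)) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)  # softmax normalizes scores into weights
    context = np.matmul(weights, vh)    # weighted sum applied to v

    # merge the heads back together: [batch, len_q, d_model]
    return context.transpose(0, 2, 1, 3).reshape(batch, len_q, d_model)
```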

@@ -367,7 +367,7 @@ inside model/transform_model.py, there is a train and predict method.
 
 first you can run train() to start training, then run predict() to start prediction using trained model.
 
-as the model is pretty big, with default hyperparamter(d_model=512, h=8,d_v=d_k=64,num_layer=6), it require lots of data before it can converage.
+as the model is pretty big, with default hyperparamter(d_model=512, h=8,d_v=d_k=64,num_layer=6), it require lots of data before it can converge.
 
 at least 10k steps is need, before loss become less than 0.1. if you want to train it fast with small
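
As a rough usage note for the workflow in this last hunk: the README names a train and predict method inside model/transform_model.py plus the default hyperparameters, but it does not show whether they are free functions or methods on a model object, so the snippet below is only an assumed shape of the calls.

```python
# hypothetical driver script; the real entry points live in model/transform_model.py
# and may be methods on a model class rather than importable functions.
from model.transform_model import train, predict  # assumed import path

if __name__ == "__main__":
    # defaults quoted above: d_model=512, h=8, d_v=d_k=64, num_layer=6
    train()    # expect on the order of 10k+ steps before the loss drops below 0.1
    predict()  # then run prediction with the trained checkpoint
```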
