
Commit 2975b51 ("update")

1 parent 9bf7402 commit 2975b51

6 files changed: +103 -20 lines

README.md

+41-8
@@ -1,14 +1,17 @@
 # bert_language_understanding
 Pre-train is all you need!

-An tensorflow implementation of Pre-training of Deep Bidirectional Transformers for Language Understanding
+BERT has recently achieved new state-of-the-art results on more than 10 NLP tasks.

-(Bert) and Attention is all you need(Transformer). BERT achieve new state of art result on more than 10 nlp tasks recently.
+This is a TensorFlow implementation of Pre-training of Deep Bidirectional Transformers for Language Understanding
+(BERT) and Attention Is All You Need (the Transformer).

 Update: most of the work of replicating the main ideas of these two papers is done; there is a clear performance gain
 from pre-training a model and then fine-tuning it, compared to training the model from scratch.
+
 We ran an experiment replacing BERT's backbone network (the Transformer) with TextCNN, and found that
 pre-training the model with a masked language model on lots of raw data can boost performance by a notable amount.
@@ -17,7 +20,14 @@ More generally, we believe that pre-train and fine-tuning strategy is model inde
 That being said, you can replace the backbone network as you like, and add more pre-train tasks or define new pre-train tasks as
-you can, pre-train will not be limited to masked language model and or predict next sentence task.
+you wish; pre-training is not limited to the masked language model or the next-sentence prediction task. What surprised us is that,
+with a middle-sized data set of, say, one million examples, even without using any external data, a pre-train task such as the masked
+language model can boost performance by a big margin, and the model converges faster; sometimes only a few epochs are needed
+in the fine-tuning stage.

 While there is an open-source implementation (<a href='https://github.com/tensorflow/tensor2tensor'>tensor2tensor</a>) and an official
@@ -125,9 +135,13 @@ if you want to try BERT with pre-train of masked language model and fine-tuning.
 or, to train a small model, use d_model=128, h=8, d_k=d_v=16 (small), or d_model=64, h=8, d_k=d_v=8 (tiny).

-## Data Format and Sample Data
+## Sample Data, Data Format & Suggestions to the User

-##### for train transform:
+##### for the pre-train stage
+each line is a document (several sentences) or a single sentence; this is free text you can obtain easily.
+
+##### for data used in the fine-tuning stage:

 input and output are on the same line, and each label starts with '__label__'.
@@ -138,11 +152,30 @@ token1 token2 token3 __label__l1 __label__l5 __label__l3

 token1 token2 token3 __label__l2 __label__l4

-##### for pre-train masked language with BERT:

-each line is a sentence or serveral sentences( that is raw data you can get easily)
+Check the 'data' folder for sample data. <a href='https://pan.baidu.com/s/1HUzBXB_-zzqv-abWZ74w2Q'>Download a middle-sized data set here
+(450k examples, 206 classes)</a>. Each input is a document with an average length of around 300; one or more labels are associated with each input.
+
+##### Suggestions to the User

-check 'data' folder for sample data.
+1. Things can be easy: 1) download the dataset (around 200M), 2) run step 1 for pre-training, and 3) run step 2 for fine-tuning.
+
+2. I have finished the three steps above and want better performance. What can I do next? Do I need to find a bigger dataset?
+
+No. You can generate a big data set for the pre-train stage yourself by downloading some free text; make sure each line is a
+document or a sentence, then replace data/bert_train2.txt with your new data file.
+
+3. What's more?
+
+Try bigger hyper-parameters or a bigger model (by replacing the backbone network) until it can take advantage of all your pre-train data.
+Play around with the model in model/bert_cnn_model.py, or check the pre-processing in data_util_hdf5.py.
+
+##### for pre-train masked language with BERT:

 ## Pretrain Language Understanding Task
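The fine-tuning format above is the fastText-style one-line-per-example layout. For readers who want to sanity-check their own files against it, here is a minimal parsing sketch (not the repository's loader, which lives in data_util_hdf5.py):

```python
# Minimal sketch: split one fine-tuning line into (tokens, labels).
# Assumes the format described above: tokens first, then labels
# marked with the '__label__' prefix.
def split_tokens_and_labels(line):
    tokens, labels = [], []
    for item in line.strip().split():
        if item.startswith('__label__'):
            labels.append(item[len('__label__'):])
        else:
            tokens.append(item)
    return tokens, labels

print(split_tokens_and_labels("token1 token2 token3 __label__l1 __label__l5 __label__l3"))
# -> (['token1', 'token2', 'token3'], ['l1', 'l5', 'l3'])
```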

data_util_hdf5.py

+2
@@ -301,6 +301,8 @@ def get_lable2index(data_path,training_data_path,tokenize_style='word'):
         return pickle.load(data_f)
     file_object = codecs.open(training_data_path, mode='r', encoding='utf-8')
     lines=file_object.readlines()
+    random.shuffle(lines)
+    lines=lines[0:60000]  # only use 60k lines, to make training fast
     c_labels=Counter()
     for i,line in enumerate(lines):
         _,input_label=get_input_strings_and_labels(line, tokenize_style=tokenize_style)
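The two added lines subsample the training file before the label vocabulary is counted; shuffling first matters because a plain truncation could miss labels that only occur late in the file. A toy sketch of the same idea, with a made-up corpus and cap, is shown below.

```python
# Toy sketch of the subsampling idea in get_lable2index: shuffle so a truncated
# read still sees labels from the whole file, then count them. The 5-line
# corpus and the cap of 3 are invented for illustration.
import random
from collections import Counter

lines = ["a b __label__l1", "c d __label__l2", "e __label__l1",
         "f __label__l3", "g h __label__l2"]
random.shuffle(lines)   # spread labels across the kept prefix
lines = lines[0:3]      # keep only a subsample to speed things up

c_labels = Counter()
for line in lines:
    c_labels.update(t[len('__label__'):] for t in line.split() if t.startswith('__label__'))
print(c_labels)
```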

model/bert_cnn_model.py

+1-1
@@ -12,7 +12,7 @@
 from model.encoder import Encoder
 from model.config import Config
 import os
-os.environ["CUDA_VISIBLE_DEVICES"] = "7"
+#os.environ["CUDA_VISIBLE_DEVICES"] = "7"

 class BertCNNModel:
     def __init__(self,config):
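With the hard-coded assignment commented out, the model no longer pins itself to GPU 7; the device can be chosen by whoever launches the training script. One possible pattern (a sketch, with "0" as an arbitrary default) is:

```python
import os
# Sketch: respect a CUDA_VISIBLE_DEVICES value exported by the caller, and only
# fall back to "0" (an arbitrary example) if nothing was set.
os.environ.setdefault("CUDA_VISIBLE_DEVICES", "0")
```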

temp_covert.py

+48
@@ -0,0 +1,48 @@
+# -*- coding: utf-8 -*-
+import json
+import random
+
+dict_unique={}
+
+dict_type_ignore_count={'train':0,'valid':0,'test':0}
+def transform_data_to_fasttext_format(file_path,target_path,data_type):
+    file_object=open(file_path,'r')
+    target_object=open(target_path,'w')
+    lines=file_object.readlines()
+    print("length of lines:",len(lines))
+    random.shuffle(lines)
+    for i,line in enumerate(lines):
+        json_string=json.loads(line)
+        accusation_list=json_string['meta']['accusation']
+        fact=json_string['fact'].strip('\n\r').replace("\n","").replace("\r","")
+        unique_value=dict_unique.get(fact,None)
+        if unique_value is None: # if not exist, put to unique dict, then process
+            dict_unique[fact] = fact
+        else: # otherwise, ignore
+            print("going to ignore.",data_type,fact)
+            dict_type_ignore_count[data_type]=dict_type_ignore_count[data_type]+1
+            continue
+        length_accusation=len(accusation_list)
+        #if length_accusation>1:
+            #print("accusation_list:",str(accusation_list))
+            #print("json_string:",json_string)
+        accusation_strings=''
+        for i,accusation in enumerate(accusation_list):
+            accusation_strings+=' __label__'+accusation
+        target_object.write(fact+accusation_strings+"\n")
+    target_object.close()
+    file_object.close()
+    print("dict_type_ignore_count:",dict_type_ignore_count[data_type])
+
+file_path='./data/cail2018/data_valid_checked.json'
+target_path='./data/data_valid2.txt'
+transform_data_to_fasttext_format(file_path,target_path,'valid')
+
+file_path='./data/cail2018/data_test.json'
+target_path='./data/data_test2.txt'
+transform_data_to_fasttext_format(file_path,target_path,'test')
+
+file_path='./data/cail2018/cail2018_big_downsmapled.json'
+target_path='./data/data_train2.txt'
+transform_data_to_fasttext_format(file_path,target_path,'train')
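To make the conversion concrete, the snippet below applies the same fact-plus-'__label__' transformation to a single made-up record; the fact and accusation values are invented and are not taken from the CAIL2018 data.

```python
# Sketch of the per-record transformation in temp_covert.py on an invented record.
import json

line = json.dumps({"fact": "token1 token2 token3",
                   "meta": {"accusation": ["theft", "fraud"]}})
record = json.loads(line)
labels = ''.join(' __label__' + a for a in record['meta']['accusation'])
print(record['fact'] + labels)
# -> token1 token2 token3 __label__theft __label__fraud
```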

train_bert_fine_tuning.py

+9-9
@@ -23,9 +23,9 @@
 FLAGS=tf.app.flags.FLAGS

 tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
-tf.app.flags.DEFINE_string("training_data_file","./data/bert_train.txt","path of traning data.") #./data/cail2018_bi.json
-tf.app.flags.DEFINE_string("valid_data_file","./data/bert_train.txt","path of validation data.")
-tf.app.flags.DEFINE_string("test_data_file","./data/bert_test.txt","path of validation data.")
+tf.app.flags.DEFINE_string("training_data_file","./data/bert_train2.txt","path of traning data.") #./data/cail2018_bi.json
+tf.app.flags.DEFINE_string("valid_data_file","./data/bert_valid2.txt","path of validation data.")
+tf.app.flags.DEFINE_string("test_data_file","./data/bert_test2.txt","path of validation data.")
 tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model for restore from pre-train") #save to here, so make it easy to upload for test
 tf.app.flags.DEFINE_string("ckpt_dir_save","./checkpoint_lm_save/","checkpoint location for the model for save fine-tuning") #save to here, so make it easy to upload for test

@@ -35,21 +35,21 @@
 tf.app.flags.DEFINE_float("learning_rate",0.00001,"learning rate") #0.001
 tf.app.flags.DEFINE_integer("batch_size", 64, "Batch size for training/evaluating.") # 32-->128
 tf.app.flags.DEFINE_integer("decay_steps", 10000, "how many steps before decay learning rate.") # 32-->128
-tf.app.flags.DEFINE_float("decay_rate", 0.9, "Rate of decay for learning rate.") #0.65
+tf.app.flags.DEFINE_float("decay_rate", 0.8, "Rate of decay for learning rate.") #0.65
 tf.app.flags.DEFINE_float("dropout_keep_prob", 0.9, "percentage to keep when using dropout.") #0.65
 tf.app.flags.DEFINE_integer("sequence_length",200,"max sentence length")#400
 tf.app.flags.DEFINE_integer("sequence_length_lm",10,"max sentence length for masked language model")

 tf.app.flags.DEFINE_boolean("is_training",True,"is training.true:tranining,false:testing/inference")
 tf.app.flags.DEFINE_boolean("is_fine_tuning",True,"is_finetuning.ture:this is fine-tuning stage")

-tf.app.flags.DEFINE_integer("num_epochs",30,"number of epochs to run.")
-tf.app.flags.DEFINE_integer("process_num",3,"number of cpu used")
+tf.app.flags.DEFINE_integer("num_epochs",35,"number of epochs to run.")
+tf.app.flags.DEFINE_integer("process_num",35,"number of cpu used")

 tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.") #
 tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
 tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
-tf.app.flags.DEFINE_boolean("test_mode",True,"whether it is test mode. if it is test mode, only small percentage of data will be used. test mode for test purpose.")
+tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used. test mode for test purpose.")

 tf.app.flags.DEFINE_integer("d_model", 64, "dimension of model") # 512-->128
 tf.app.flags.DEFINE_integer("num_layer", 6, "number of layer")

@@ -81,7 +81,7 @@ def main(_):
     if os.path.exists(FLAGS.ckpt_dir+"checkpoint"):
         print("Restoring Variables from Checkpoint.")
         sess.run(tf.global_variables_initializer())
-        for i in range(2): #decay learning rate if necessary.
+        for i in range(6): #decay learning rate if necessary.
             print(i,"Going to decay learning rate by a factor of "+str(FLAGS.decay_rate))
             sess.run(model.learning_rate_decay_half_op)
         # restore those variables that names and shapes exists in your model from checkpoint. for detail check: https://gist.github.com/iganichev/d2d8a0b1abc6b15d4a07de83171163d4

@@ -110,7 +110,7 @@ def main(_):
             current_loss,lr,l2_loss,_=sess.run([model.loss_val,model.learning_rate,model.l2_loss,model.train_op],feed_dict)
             loss_total,counter=loss_total+current_loss,counter+1
             if counter %30==0:
-                print("Learning rate:%.5f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss))
+                print("Learning rate:%.7f\tLoss:%.3f\tCurrent_loss:%.3f\tL2_loss%.3f\t"%(lr,float(loss_total)/float(counter),current_loss,l2_loss))
             if start!=0 and start%(4000*FLAGS.batch_size)==0:
                 loss_valid, f1_macro_valid, f1_micro_valid= do_eval(sess, model, valid,num_classes,label2index)
                 f1_score_valid=((f1_macro_valid+f1_micro_valid)/2.0) #*100.0
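The hyper-parameter changes interact: with the base learning rate of 0.00001, applying the decay op six times at the new decay_rate of 0.8 starts fine-tuning from a much smaller value, which is presumably why the log format was widened from %.5f to %.7f. A quick arithmetic check, assuming the op multiplies the learning rate by decay_rate on each call (as the surrounding print suggests):

```python
# Assumed behaviour: each call to learning_rate_decay_half_op multiplies the
# learning rate by FLAGS.decay_rate (0.8), as the print statement suggests.
base_lr = 0.00001
effective_lr = base_lr * 0.8 ** 6      # ~2.6e-06 after the six decay steps
print("%.5f" % effective_lr)           # 0.00000  (invisible at 5 decimal places)
print("%.7f" % effective_lr)           # 0.0000026
```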

train_bert_lm.py

+2-2
@@ -23,7 +23,7 @@
 #configuration
 FLAGS=tf.app.flags.FLAGS

-tf.app.flags.DEFINE_boolean("test_mode",True,"whether it is test mode. if it is test mode, only small percentage of data will be used")
+tf.app.flags.DEFINE_boolean("test_mode",False,"whether it is test mode. if it is test mode, only small percentage of data will be used")
 tf.app.flags.DEFINE_string("data_path","./data/","path of traning data.")
 tf.app.flags.DEFINE_string("mask_lm_source_file","./data/bert_train2.txt","path of traning data.")
 tf.app.flags.DEFINE_string("ckpt_dir","./checkpoint_lm/","checkpoint location for the model") #save to here, so make it easy to upload for test

@@ -49,7 +49,7 @@
 tf.app.flags.DEFINE_integer("validate_every", 1, "Validate every validate_every epochs.")
 tf.app.flags.DEFINE_boolean("use_pretrained_embedding",False,"whether to use embedding or not.")#
 tf.app.flags.DEFINE_string("word2vec_model_path","./data/Tencent_AILab_ChineseEmbedding_100w.txt","word2vec's vocabulary and vectors") # data/sgns.target.word-word.dynwin5.thr10.neg5.dim300.iter5--->data/news_12g_baidubaike_20g_novel_90g_embedding_64.bin--->sgns.merge.char
-tf.app.flags.DEFINE_integer("process_num",20,"number of cpu process")
+tf.app.flags.DEFINE_integer("process_num",35,"number of cpu process")

 def main(_):
     vocab_word2index, _= create_or_load_vocabulary(FLAGS.data_path,FLAGS.mask_lm_source_file,FLAGS.vocab_size,test_mode=FLAGS.test_mode,tokenize_style=FLAGS.tokenize_style)
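train_bert_lm.py drives the masked-language-model pre-training stage described in the README, reading raw text from mask_lm_source_file. As a rough illustration of the masking idea only (the repository's actual routine lives in data_util_hdf5.py and may differ in details such as the 80/10/10 replacement rule), here is a minimal sketch that hides a random 15% of tokens and keeps the originals as prediction targets:

```python
# Rough sketch of masked-language-model input creation, not the repository's code:
# hide about 15% of tokens behind a [MASK] symbol and remember what was hidden.
import random

def mask_tokens(tokens, mask_rate=0.15, mask_symbol='[MASK]'):
    masked, targets = list(tokens), {}
    num_to_mask = max(1, int(len(tokens) * mask_rate))
    for idx in random.sample(range(len(tokens)), num_to_mask):
        targets[idx] = masked[idx]   # position -> original token to predict
        masked[idx] = mask_symbol
    return masked, targets

print(mask_tokens("the quick brown fox jumps over the lazy dog".split()))
```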
