# # Fine-tuning a model on your own data
#
# For many use cases, one of the existing public models trained on SQuAD or other public QA datasets
# (e.g. Natural Questions) is sufficient.
# However, if you have domain-specific questions, fine-tuning your model on custom examples will very likely boost
# your performance. While this varies by domain, we saw that ~2000 examples can easily increase performance by 5-20%.
#
# This tutorial shows you how to fine-tune a pretrained model on your own dataset.
from haystack.nodes import FARMReader
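

# The fine-tuning step in this tutorial reads a SQuAD-like JSON file. As a rough sketch (not part of the
# original tutorial), the helpers below build one minimal record and check that every `answer_start`
# offset really points at the answer text. The function names are made up for illustration and are not
# part of Haystack; the field names follow the public SQuAD v2.0 layout.
import json


def make_squad_record(title, context, question, answer_text, qa_id):
    """Build a one-document, one-question dataset in SQuAD v2.0 layout (illustrative helper)."""
    answer_start = context.find(answer_text)
    if answer_start < 0:
        raise ValueError(f"Answer {answer_text!r} does not occur in the context")
    return {
        "version": "v2.0",
        "data": [
            {
                "title": title,
                "paragraphs": [
                    {
                        "context": context,
                        "qas": [
                            {
                                "id": qa_id,
                                "question": question,
                                "is_impossible": False,
                                "answers": [{"text": answer_text, "answer_start": answer_start}],
                            }
                        ],
                    }
                ],
            }
        ],
    }


def find_misaligned_answers(dataset):
    """Return the ids of questions whose answer offsets do not match the context."""
    bad_ids = []
    for document in dataset["data"]:
        for paragraph in document["paragraphs"]:
            context = paragraph["context"]
            for qa in paragraph["qas"]:
                for answer in qa.get("answers", []):
                    start = answer["answer_start"]
                    end = start + len(answer["text"])
                    if context[start:end] != answer["text"]:
                        bad_ids.append(qa["id"])
                        break  # report each question at most once
    return bad_ids


def save_squad_file(dataset, path):
    """Write the dataset as UTF-8 JSON so it can be passed to training as `train_filename`."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(dataset, handle, ensure_ascii=False, indent=2)


# Possible usage: build or load a dataset, make sure `find_misaligned_answers(dataset)` returns an
# empty list, then write it with `save_squad_file(dataset, "PATH/TO_YOUR/TRAIN_DATA/train.json")`.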


def tutorial2_finetune_a_model_on_your_data():
    # ## Create Training Data
    #
    # There are two ways to generate training data:
    #
    # 1. **Annotation**: You can use the annotation tool (https://github.com/deepset-ai/haystack#labeling-tool) to
    #    label your data, i.e. highlight answers to your questions in a document. The tool supports structuring
    #    your workflow with organizations, projects, and users. The labels can be exported in SQuAD format,
    #    which is compatible with training in Haystack.
    #
    # 2. **Feedback**: For production systems, you can collect training data from direct user feedback via
    #    Haystack's REST API. This includes a customizable user feedback API for providing feedback on the
    #    answer returned by the API. The API provides a feedback export endpoint to obtain the feedback data
    #    for fine-tuning your model further.
    #
    # ## Fine-tune your model
    #
    # Once you have collected training data, you can fine-tune your base models.
    # We initialize a reader as a base model and fine-tune it on our own custom dataset (which should be in a
    # SQuAD-like format). We recommend using a base model that was already trained on SQuAD or a similar QA dataset
    # to benefit from transfer learning effects.
    #
    # **Recommendation**: Run training on a GPU. The `use_gpu` arguments below are already set to `True`; set them
    # to `False` if you need to train on a CPU instead.
    reader = FARMReader(model_name_or_path="distilbert-base-uncased-distilled-squad", use_gpu=True)
    data_dir = "data/squad20"
    # data_dir = "PATH/TO_YOUR/TRAIN_DATA"
    reader.train(data_dir=data_dir, train_filename="dev-v2.0.json", use_gpu=True, n_epochs=1, save_dir="my_model")

    # Saving the model happens automatically at the end of training into the `save_dir` you specified.
    # However, you could also save a reader manually again via:
    reader.save(directory="my_model")

    # If you want to load it at a later point, just do:
    new_reader = FARMReader(model_name_or_path="my_model")
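

# The introduction mentions roughly 2000 examples as a useful amount of training data. As a rough,
# optional sketch (not part of the original tutorial), the helper below counts what a SQuAD-style file
# contains before it is handed to `reader.train`. `summarize_squad_file` is an illustrative name, not a
# Haystack function, and it assumes the file follows the public SQuAD layout.
import json


def summarize_squad_file(path):
    """Count paragraphs and answerable / unanswerable questions in a SQuAD-style JSON file."""
    with open(path, encoding="utf-8") as handle:
        dataset = json.load(handle)

    summary = {"paragraphs": 0, "answerable": 0, "unanswerable": 0}
    for document in dataset["data"]:
        for paragraph in document["paragraphs"]:
            summary["paragraphs"] += 1
            for qa in paragraph["qas"]:
                # SQuAD v2.0 marks unanswerable questions with `is_impossible`;
                # v1.1 files lack the flag, so fall back to "has no answers".
                if qa.get("is_impossible", not qa.get("answers")):
                    summary["unanswerable"] += 1
                else:
                    summary["answerable"] += 1
    return summary


# Possible usage with the file trained on above: summarize_squad_file("data/squad20/dev-v2.0.json")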


if __name__ == "__main__":
    tutorial2_finetune_a_model_on_your_data()

# This Haystack script was made with love by deepset in Berlin, Germany
# Haystack: https://github.com/deepset-ai/haystack
# deepset: https://deepset.ai/