Commit a91263b

Merge pull request #1800 from oneapi-src/release/2023.2
2023.2 Release
2 parents 75fbc01 + 700a729 commit a91263b

1,520 files changed (+223,103 / -30,462 lines)


Diff for: AI-and-Analytics/Features-and-Functionality/INC_QuantizationAwareTraining_TextClassification/INC_QuantizationAwareTraining_TextClassification.ipynb

+1,110 lines (large diffs are not rendered by default)
@@ -0,0 +1,253 @@
#!/usr/bin/env python
# coding: utf-8

# In[ ]:


# =============================================================
# Copyright © 2023 Intel Corporation
#
# SPDX-License-Identifier: MIT
# =============================================================

# # Fine-tuning text classification model with Intel® Neural Compressor (INC) Quantization Aware Training
#
# This code sample shows how to fine-tune a BERT model for a multi-class text classification task using Quantization Aware Training, provided as part of Intel® Neural Compressor (INC).
#
# Before we start, please make sure you have installed all the libraries needed to run this code sample.

# ## Loading model
#
# We decided to use a really small model for this code sample, `prajjwal1/bert-tiny`, but feel free to use a different model by changing `model_id` to another name from the Hugging Face Hub or to a local model path (as long as it is compatible with the Hugging Face API).
#
# Keep in mind that using bigger models like `bert-base-uncased` can improve the final classification results after fine-tuning, but they are also much more resource- and time-consuming.

# In[ ]:


from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_id = "prajjwal1/bert-tiny"
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=6)
tokenizer = AutoTokenizer.from_pretrained(model_id, model_max_length=512)

# The directory where the quantized model will be saved
save_dir = "quantized_model"

# ## Dataset
#
# We are using the `emotion` [dataset from Hugging Face](https://huggingface.co/datasets/dair-ai/emotion). This dataset has 2 different configurations - **split** and **unsplit**.
#
# In this code sample we are using the split configuration. It contains 20,000 examples in total, divided into train (16,000 texts), test (2,000 texts) and validation (2,000 texts) datasets. We decided to use the split configuration instead of the unsplit one, as the latter contains over 400,000 texts, which is overkill for fine-tuning.
#
# After loading the selected dataset, we will take a look at the first 10 rows of the train split. You can always switch to a different dataset; just remember to also change the number-of-labels parameter provided when loading the model.

# In[ ]:


from datasets import load_dataset

dataset = load_dataset("emotion", name="split")
dataset['train'][:10]

# The dataset contains 6 different labels represented by digits from 0 to 5. Each digit corresponds to a different emotion, as follows:
#
# * 0 - sadness
# * 1 - joy
# * 2 - love
# * 3 - anger
# * 4 - fear
# * 5 - surprise
#
# In the cell below we run a few computations on the training dataset to better understand what the data looks like. We analyze only the train split, as the test and validation splits have a similar data distribution.
#
# As you can see, the distribution of classes in the dataset is not equal. Since the train, test and validation distributions are similar, this is not a problem in our case.

# In[ ]:


import matplotlib.pyplot as plt

# Count how many examples in the train split belong to each emotion
sadness = dataset['train']['label'].count(0)
joy = dataset['train']['label'].count(1)
love = dataset['train']['label'].count(2)
anger = dataset['train']['label'].count(3)
fear = dataset['train']['label'].count(4)
surprise = dataset['train']['label'].count(5)

fig = plt.figure()
ax = fig.add_axes([0, 0, 1, 1])
labels = ['joy', 'sadness', 'anger', 'fear', 'love', 'surprise']
frames = [joy, sadness, anger, fear, love, surprise]
ax.bar(labels, frames)
plt.show()

# # Tokenization
#
# The next step is to tokenize the dataset.
#
# **Tokenization** is a way of separating a piece of text into smaller units called tokens. Tokens can be words, characters, etc. It means that the tokenizer breaks unstructured data (natural language text) into chunks of information that can be treated as discrete elements. These tokens can later be used in a vector representation of the document.
#
# In other words, tokenization turns a text document into a numerical data structure suitable for machine and deep learning.
#
# To do that, we create a function that takes every text from the dataset and tokenizes it with a maximum token length of 128. After that, we can see how the structure of the dataset changes.

# In[ ]:


def tokenize_data(example):
    # truncation=True is needed so that max_length=128 is actually enforced
    return tokenizer(example['text'], padding=True, truncation=True, max_length=128)

dataset = dataset.map(tokenize_data, batched=True)
dataset
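
# For illustration, here is a minimal sketch that runs the same tokenizer on a single example sentence (the sentence is our own; the exact ids depend on the `prajjwal1/bert-tiny` vocabulary) so you can see the tokens and the numeric ids they map to.

# In[ ]:


# Illustration only: tokenize one sentence and inspect the result
sample_encoding = tokenizer("i feel great today", truncation=True, max_length=128)
print(sample_encoding['input_ids'])  # numeric ids, including the special [CLS] and [SEP] tokens
print(tokenizer.convert_ids_to_tokens(sample_encoding['input_ids']))  # the corresponding tokens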

# Before we start fine-tuning, let's see how the model in its current state performs on the validation dataset.
#
# First, we need to prepare a metric showing model performance. We decided to use accuracy as the performance measure for this specific task. As the model was not created for this specific task, we can assume that the accuracy will not be high.

# In[ ]:


import evaluate
import numpy as np

metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return metric.compute(predictions=predictions, references=labels)

# Then, we create an evaluator to see how the pre-trained model classifies emotions.
# We have to specify:
# * the model on which the evaluation will happen - provide the same `model_id` as before,
# * the dataset - in our case this is the validation dataset,
# * the metric - as specified before, in our case accuracy,
# * the label mapping - to map label names to the corresponding digits.
#
# After the evaluation, we simply show the results, which, as expected, are not the best. At this point the model is not prepared for the emotion classification task.

# In[ ]:


from evaluate import evaluator

task_evaluator = evaluator("text-classification")

eval_results = task_evaluator.compute(
    model_or_pipeline=model_id,
    data=dataset['validation'],
    metric=metric,
    label_mapping={"LABEL_0": 0, "LABEL_1": 1, "LABEL_2": 2, "LABEL_3": 3, "LABEL_4": 4, "LABEL_5": 5}
)
eval_results

# # Quantization Aware Training
#
# Now we can move on to fine-tuning with quantization. But first, let's review the definitions of quantization and quantization aware training.
#
# **Quantization** is a systematic reduction of the precision of all or several layers within the model. This means that a higher-precision type, such as single-precision floating point (FP32), is converted into a lower-precision type, such as FP16 (16 bits) or INT8 (8 bits).
#
# **Quantization Aware Training** replicates inference-time quantization during training, resulting in a model that downstream tools can use to generate an actually quantized model. In other words, it applies quantization to the model during training (or fine-tuning, as in our case) based on the provided quantization configuration.
#
# With that in mind, we can provide the configuration for Quantization Aware Training from Intel® Neural Compressor.

# In[ ]:


from neural_compressor import QuantizationAwareTrainingConfig

# The configuration detailing the quantization process
quantization_config = QuantizationAwareTrainingConfig()
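
# To build some intuition for what INT8 quantization does to a tensor, below is a minimal, self-contained sketch of symmetric linear quantization in NumPy. This is an illustration of the general idea only: the scale choice and variable names are our own, not the scheme Intel® Neural Compressor uses internally.

# In[ ]:


import numpy as np

# Illustration only: symmetric linear quantization of an FP32 tensor to INT8
fp32_weights = np.array([0.42, -1.30, 0.07, 0.95], dtype=np.float32)

scale = np.abs(fp32_weights).max() / 127.0  # map the largest magnitude onto the INT8 range
int8_weights = np.clip(np.round(fp32_weights / scale), -127, 127).astype(np.int8)
dequantized = int8_weights.astype(np.float32) * scale

print(int8_weights)                # values an INT8 model would store
print(dequantized - fp32_weights)  # small rounding error introduced by quantization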

# The next step is to create a trainer for our model. We will use the Intel® Neural Compressor optimized trainer from the `optimum.intel` package.
# We need to provide all the necessary parameters to the trainer:
#
# * the initialized model and tokenizer
# * the configuration for quantization aware training
# * training arguments, which include the directory where the model will be saved and the number of epochs
# * the datasets for training and evaluation
# * the prepared metrics that allow us to see the progress in training
#
# For the purpose of this code sample, we decided to train the model for just 2 epochs, to show you how quantization aware training works and that the fine-tuning really improves the classification results. If you want better accuracy results, you can easily increase the number of epochs up to 5 and observe how the model learns. Keep in mind that the process may take some time - the more epochs you use, the longer the training will take.

# In[ ]:


from optimum.intel import INCModelForSequenceClassification, INCTrainer
from transformers import TrainingArguments

trainer = INCTrainer(
    model=model,
    quantization_config=quantization_config,
    args=TrainingArguments(save_dir, num_train_epochs=2.0, do_train=True, do_eval=False),
    train_dataset=dataset["train"],
    eval_dataset=dataset["validation"],
    compute_metrics=compute_metrics,
    tokenizer=tokenizer,
)

# ## Train the model
#
# Now, let's train the model. We will use the prepared trainer by executing its `train` method.
#
# You can see that after the training, information about the model is printed under `*****Mixed Precision Statistics*****`.
#
# Now, the model uses INT8 instead of FP32 in every layer.

# In[ ]:


train_result = trainer.train()

# ## Evaluate the model
#
# After the training, we should evaluate our model using the `evaluate()` method on the prepared trainer. It will show the results for the evaluation metrics prepared earlier - evaluation accuracy and loss. Additionally, we will get information about the evaluation time, samples and steps per second, and the number of epochs the model was trained for.

# In[ ]:


metrics = trainer.evaluate()
metrics

# After the training, it is important to save the model. Once again we will use the prepared trainer, this time its `save_model()` method. Our model will be saved in the location provided before.
# After that, to use this model in the future, you just need to load it similarly as at the beginning, using the dedicated Intel® Neural Compressor optimized method `INCModelForSequenceClassification.from_pretrained(...)`.

# In[ ]:


# To use the model in the future - save it!
trainer.save_model()
model = INCModelForSequenceClassification.from_pretrained(save_dir)
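
# As a quick sanity check, the sketch below runs the reloaded quantized model on one example sentence (the sentence and the tokenize-forward-argmax flow are our own addition); it reuses the `tokenizer` and the label order defined earlier in this sample.

# In[ ]:


import torch

# Illustration only: classify a single text with the reloaded quantized model
label_names = ['sadness', 'joy', 'love', 'anger', 'fear', 'surprise']
inputs = tokenizer("i am so happy today", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(label_names[int(logits.argmax(dim=-1))])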

# In this code sample, we used BERT-tiny and the emotion dataset to create a text classification model with Intel® Neural Compressor Quantization Aware Training. We encourage you to experiment with this code sample, changing the model and the dataset to build text models for different classification tasks.

# In[ ]:


print("[CODE_SAMPLE_COMPLETED_SUCCESFULLY]")
@@ -0,0 +1,7 @@
Copyright Intel Corporation

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
