
AI training report


Introduction

Our goal was to create a model capable of accurately extracting information from incoming text and formatting it into predefined ticket fields: title, description, category, service, customerPriority, priority, and requestType. After evaluating the T5, BART, and GPT-2 models, we chose to fine-tune T5 due to its versatility and strong performance across a range of tasks.
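As an illustration of the target structure (not code from the project), the expected output can be represented as a small record with these seven fields:

```python
from dataclasses import dataclass

@dataclass
class Ticket:
    """Target structure the model should fill from free-form input text."""
    title: str
    description: str
    category: str
    service: str
    customerPriority: str
    priority: str
    requestType: str
```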

Preliminary Training

To estimate the resources and time necessary for training, we conducted an initial training spike. Training Information:

  • Model: T5-Small
  • GPU: NVIDIA GeForce GTX 960
  • Parameters: 60 million
  • Dataset Size: 100 samples

This preliminary phase provided us with essential insights for our project's resource and time planning.
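The spike amounted to a small sequence-to-sequence fine-tuning run. The following is a minimal sketch of such a run with the Hugging Face transformers library; the dataset file, column names, and hyperparameters are illustrative assumptions, not the exact values we used.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSeq2SeqLM,
                          DataCollatorForSeq2Seq, Seq2SeqTrainer,
                          Seq2SeqTrainingArguments)

model_name = "t5-small"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

# Hypothetical JSON file with a "text" (input) and a "ticket" (target) column.
dataset = load_dataset("json", data_files="spike_dataset.json")["train"]

def preprocess(batch):
    model_inputs = tokenizer(batch["text"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["ticket"], max_length=256, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="t5-small-spike",
        per_device_train_batch_size=4,   # small batch to fit limited GPU memory
        num_train_epochs=3,
        logging_steps=10,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```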

First Training Strategy

After completing the spike, we decided to train the T5-Large model (roughly 770 million parameters) on a dataset of 100,000 entries to achieve optimal training results. We chose the Azure ML platform for training. The compute instance (VM) we selected is the Standard_NC6s_v3, equipped with 6 cores, 112 GB of RAM, a 336 GB disk, and one NVIDIA Tesla V100 GPU.
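For reference, the sketch below shows how such a fine-tuning script could be submitted as a job with the Azure ML Python SDK (v2); the workspace identifiers, environment, compute-target name, and training command are placeholders rather than our actual configuration.

```python
from azure.ai.ml import MLClient, command
from azure.identity import DefaultAzureCredential

# Connect to the workspace (identifiers are placeholders).
ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Submit the fine-tuning script as a command job on the V100-backed compute target.
job = command(
    code="./src",                          # folder containing the training script
    command="python train.py --model t5-large --epochs 3",
    environment="AzureML-pytorch-1.10-ubuntu18.04-py38-cuda11-gpu@latest",
    compute="nc6s-v3-cluster",             # compute target backed by Standard_NC6s_v3
    display_name="t5-ticket-finetuning",
)
ml_client.jobs.create_or_update(job)
```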

Unfortunately, we did not get access to a dataset from our industry partner. Consequently, we opted to train our model on a dataset of 10,000 entries to be generated with ChatGPT.

First Training

Each of our team members manually created data using ChatGPT, and together we generated a dataset of 1,000 entries. With this dataset, we trained the T5-Small model. The model can be tested on Hugging Face: https://huggingface.co/TalkTix/t5-ticket-creator
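The published checkpoint can be tried with a text2text-generation pipeline; the sample request and the expected output format below are assumptions for illustration.

```python
from transformers import pipeline

# Load the fine-tuned checkpoint from the Hugging Face Hub.
generator = pipeline("text2text-generation", model="TalkTix/t5-ticket-creator")

# Hypothetical support request; the model is expected to emit the ticket fields as text.
request = "My laptop won't connect to the office VPN since this morning, please help."
print(generator(request, max_length=256)[0]["generated_text"])
```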

Second Training

Unfortunately, the accuracy of the fine-tuned model was poor due to the small dataset. We then found a dataset of 8,000 entries that fit our task and used it to train the T5-Large model. The model can be tested on Hugging Face: https://huggingface.co/TalkTix/t5-ticket-creator-8k

Second Training Strategy

The accuracy after the second training also did not meet our requirements. As a result, we decided to change our strategy: instead of generating the whole ticket with a single model, we chose to train a separate model for each ticket field, switching from text-generation training to text classification. We decided to use RoBERTa-base, as it is better suited for classification tasks. RoBERTa is specifically optimized for text classification; its architecture excels at understanding context and nuance in language. T5, in contrast, is a powerful and flexible model for text generation, but its strengths lie more in tasks like translation and summarization than in the specific nuances of text classification. We also switched from manually generating data with ChatGPT to generating it via the ChatGPT API, which allowed us to produce approximately 53,000 entries for our dataset.
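As an illustration of the API-based generation, the sketch below uses the OpenAI Python client to request synthetic ticket examples; the prompt wording, model name, and output handling are assumptions for illustration, not our exact generation script.

```python
from openai import OpenAI

client = OpenAI()  # reads the API key from the OPENAI_API_KEY environment variable

# Hypothetical prompt asking for synthetic ticket examples as JSON lines.
prompt = (
    "Generate 10 realistic IT support requests. For each, return one JSON object "
    "with the fields: title, description, category, service, customerPriority, "
    "priority, requestType. Output one JSON object per line."
)

response = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": prompt}],
    temperature=0.9,  # higher temperature for more varied synthetic data
)

# Append the generated examples to a local JSONL file.
with open("generated_tickets.jsonl", "a", encoding="utf-8") as f:
    f.write(response.choices[0].message.content + "\n")
```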

For the title, affected person, and keywords, we found pre-trained models that we could use to generate them directly.
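The exact checkpoints are not listed here, but conceptually these fields can be produced with off-the-shelf pipelines. The models in the sketch below are illustrative assumptions (a summarization model for the title, a NER model for the affected person), not necessarily the ones we deployed.

```python
from transformers import pipeline

# Illustrative off-the-shelf models; the checkpoints we actually used may differ.
title_generator = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")
ner = pipeline("ner", model="dslim/bert-base-NER", aggregation_strategy="simple")

text = "Hi, this is Jane Doe from accounting. Outlook keeps crashing when I open attachments."

# A short summary can serve as the ticket title.
title = title_generator(text, max_length=20, min_length=5)[0]["summary_text"]

# Person entities can serve as the affected person.
persons = [e["word"] for e in ner(text) if e["entity_group"] == "PER"]
print(title, persons)
```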

The input text itself served as the description. For the Service, Request Type, Category, Priority, and Customer Priority fields, we fine-tuned a separate text-classification model per field.
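Below is a minimal sketch of fine-tuning one such per-field classifier on top of roberta-base, using the category field as a hypothetical example; the label set, dataset file, and hyperparameters are illustrative assumptions.

```python
from datasets import load_dataset
from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          Trainer, TrainingArguments)

# Hypothetical label set for the "category" field.
labels = ["Hardware", "Software", "Network", "Access", "Other"]
label2id = {l: i for i, l in enumerate(labels)}

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

# Hypothetical JSON file with "text" and "category" columns.
dataset = load_dataset("json", data_files="category_dataset.json")["train"]

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=512)
    enc["labels"] = [label2id[c] for c in batch["category"]]
    return enc

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="roberta-category",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=tokenized,
    tokenizer=tokenizer,
)
trainer.train()
```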

Final Training

With the second strategy, we achieved far better accuracy than with the first.
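To make the resulting setup concrete, the sketch below shows how the per-field classifiers could be combined at inference time to assemble a complete ticket; the checkpoint names are placeholders, and the actual TalkTix pipeline may differ.

```python
from transformers import pipeline

# Checkpoint names are placeholders to be replaced with the actual fine-tuned models.
FIELD_MODELS = {
    "category": "<category-classifier-checkpoint>",
    "service": "<service-classifier-checkpoint>",
    "requestType": "<request-type-classifier-checkpoint>",
    "priority": "<priority-classifier-checkpoint>",
    "customerPriority": "<customer-priority-classifier-checkpoint>",
}

def build_ticket(text: str) -> dict:
    """Classify the input text with each per-field model and assemble the ticket."""
    ticket = {"description": text}
    for field, checkpoint in FIELD_MODELS.items():
        clf = pipeline("text-classification", model=checkpoint)
        ticket[field] = clf(text)[0]["label"]
    return ticket

# Example call (requires real checkpoint names above):
# build_ticket("The shared printer on the 3rd floor has been offline since Monday.")
```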