Commit 8078949

Updated documentation
1 parent b995286 commit 8078949

File tree

3 files changed, +135 −26 lines changed


README.md

+99-7
@@ -1,12 +1,29 @@
-# neural-wave-hackathon
-This repository contains the code for the Neural Wave Hackathon.
+# Neural Wave - Hackathon 2024 - Lugano
 
-## Create a `.env` file. Start copying the `.env.example` file and rename it to `.env`. Fill in the required values.
+This repository contains the code produced by the `Molise.ai` team in the Neural Wave Hackathon 2024 competition in
+Lugano.
+
+## Challenge
+
+Here is a brief explanation of the challenge:
+The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections**
+in the rapidly evolving world of **Artificial Intelligence**.
+The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally
+Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic
+dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The
+solution should aim for high accuracy while maintaining the **usability** of the underlying data.
+The final solution could be integrated into various systems to enhance privacy protections across industries,
+including client support, legal, and general data anonymization tools. Success in this project will contribute to
+scaling privacy-conscious AI systems without compromising UX or operational performance.
+
+## Getting Started
+Create a `.env` file: copy the `.env.example` file, rename it to `.env`, and fill in the required values.
 ```bash
 cp .env.example .env
 ```
 
-## Install the dependencies
+### Install the dependencies
+
 ```bash
 pip install -r requirements.txt
 ```
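
To make the Challenge section added above concrete, here is a small illustration of the masking task. It is a sketch only: the placeholder style and label names are assumptions, but the span fields (label/value/start/end) mirror the `privacy_mask` entries parsed by `experiment/gliner_prepare.py` later in this commit.

```python
# Illustration only: label names and the [LABEL] placeholder style are assumptions;
# the span fields mirror the privacy_mask entries used elsewhere in this commit.
source_text = "Hi, I'm Maria Rossi and my email is maria.rossi@example.com."
privacy_mask = [
    {"label": "FIRSTNAME", "value": "Maria", "start": 8, "end": 13},
    {"label": "LASTNAME", "value": "Rossi", "start": 14, "end": 19},
    {"label": "EMAIL", "value": "maria.rossi@example.com", "start": 36, "end": 59},
]


def mask_pii(text, spans):
    """Replace each annotated span with a [LABEL] placeholder, working right to left
    so earlier offsets stay valid ('end' is treated as exclusive here)."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


print(mask_pii(source_text, privacy_mask))
# Hi, I'm [FIRSTNAME] [LASTNAME] and my email is [EMAIL].
```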
@@ -20,30 +37,105 @@ export PYTHONPATH="${PYTHONPATH}:$PWD"
 ## Inference
 
 ### Inference on full dataset
+
+You can run inference on the complete test dataset using the following command:
+
 ```bash
 python inference.py -s ./dataset/test
 ```
 
 ### Inference on small dataset
+
+To perform inference on a small subset of the dataset, use the `--subsample` flag:
+
 ```bash
 python inference.py -s ./dataset/test --subsample
 ```
 
 ## Run ui
+
+To run the UI for interacting with the models and viewing results, use Streamlit:
+
 ```bash
 streamlit run ui.py
 ```
 
 ## Run api
+
+To start the API for the model, you'll need FastAPI. Run the following command:
+
 ```bash
 fastapi run api.py
 ```
 
 ## Experiments
 
-There are basically two macro types of experiments, both consists in pre-train models.
-The first one regards `BERT`, while the second one regards `GLiNER`.
+This repository supports two main types of experiments:
+
+1. Fine-tuning models from the BERT family.
+2. Fine-tuning models from the GLiNER family.
+
+Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific
+arguments related to model choices, datasets, output directories, and optional alternative dataset columns.
+
+### BERT Fine-Tuning
+
+The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you
+can use alternative columns that are preprocessed during the data preparation phase.
 
 ```bash
+python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
+```
+
+#### Available BERT models
+
+Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer
+may also work with minimal modifications:
+
+- BERT classic
+  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
+- DistilBERT
+  + `distilbert-base-uncased`, `distilbert-base-cased`
+- RoBERTa
+  + `roberta-base`, `roberta-large`
+- ALBERT
+  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
+- Electra
+  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
+- DeBERTa
+  + `microsoft/deberta-base`, `microsoft/deberta-large`
+
+### GLiNER Fine-Tuning
+
+The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process
+happens in two stages:
+
+1. Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:
+
+```bash
+python experiments/gliner_prepare.py --dataset path/to/dataset
+```
+
+This will create a new JSON-formatted dataset file with the same name in the specified output directory.
+
+2. Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:
+
+```bash
+python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
+```
+
+#### Available GLiNER models
+
+You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:
 
-```
+- `gliner-community/gliner_xxl-v2.5`
+- `gliner-community/gliner_large-v2.5`
+- `gliner-community/gliner_medium-v2.5`
+- `gliner-community/gliner_small-v2.5`
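
For context on how a fine-tuned GLiNER checkpoint is typically used at prediction time, here is a minimal sketch based on the `gliner` package's `GLiNER.from_pretrained` / `predict_entities` interface. The checkpoint choice, label list, and threshold are assumptions, and the repository's own entry point for evaluation remains `inference.py`.

```python
from gliner import GLiNER

# Assumption: a gliner-community checkpoint (or a fine-tuned output directory)
# can be loaded this way.
model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

# Hypothetical subset of labels; the 17 PII types are not enumerated in this commit.
labels = ["firstname", "lastname", "email"]

text = "Contact Maria Rossi at maria.rossi@example.com."
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["start"], entity["end"], entity["label"], entity["text"])
```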

experiment/gliner_finetune.py

+12-10
@@ -1,4 +1,5 @@
 import os
+
 os.environ["TOKENIZERS_PARALLELISM"] = "true"
 
 import torch
@@ -27,23 +28,24 @@ def get_argparser() -> argparse.ArgumentParser:
     required = parser.add_argument_group('required arguments')
     required.add_argument('--model', '-m',
                           metavar='NAME',
-                        dest='model',
-                        required=True,
-                        help='Model name to be used for training')
+                          dest='model',
+                          required=True,
+                          help='Model name to be used for training')
     required.add_argument('--dataset', '-d',
                           metavar='NAME',
                           dest='dataset',
                           required=True,
                           help='Dataset name to be used for training')
-
     return parser
 
 
 if __name__ == "__main__":
-
     # Read arg parameters
     parser = get_argparser()
     args = parser.parse_args()
+    text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
+    label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
+
 
     # Load the dataset
     with open(args.dataset, 'r') as f:
@@ -65,17 +67,17 @@ def get_argparser() -> argparse.ArgumentParser:
         weight_decay=0.01,
         others_lr=1e-5,
         others_weight_decay=0.01,
-        lr_scheduler_type="linear", #cosine
+        lr_scheduler_type="linear",  # cosine
         warmup_ratio=0.1,
         per_device_train_batch_size=16,
         num_train_epochs=10,
         evaluation_strategy="no",
        save_strategy="epoch",
         save_total_limit=100,
-        dataloader_num_workers = 0,
-        use_cpu = False,
+        dataloader_num_workers=0,
+        use_cpu=False,
         report_to="none",
-        )
+    )
 
     trainer = Trainer(
         model=model,
@@ -85,4 +87,4 @@ def get_argparser() -> argparse.ArgumentParser:
         data_collator=data_collator,
     )
 
-    trainer.train()
+    trainer.train()
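
The two `*_column_name` lines added above pair with the `--alternative_columns` flag that this commit defines via `argparse.BooleanOptionalAction` in `experiment/gliner_prepare.py` (the next file). A standalone sketch of that toggle, assuming the same flag is also wired into this script's parser:

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) provides both --alternative_columns and
# --no-alternative_columns.
parser.add_argument('--alternative_columns', '-a',
                    default=False, action=argparse.BooleanOptionalAction,
                    help='Use alternative column names for source text and privacy mask')
args = parser.parse_args(['--alternative_columns'])

# Same selection logic as in the diff above.
text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
print(text_column_name, label_column_name)  # source_text_preprocessed privacy_mask_preprocessed
```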

experiment/gliner_prepare.py

+24-9
@@ -1,8 +1,12 @@
 from pathlib import Path
-from sklearn.model_selection import train_test_split
 import os, ast, json, spacy, argparse, pandas as pd
+
 os.environ["TOKENIZERS_PARALLELISM"] = "true"
 
+text_column_name = None
+label_column_name = None
+
+
 def get_argparser() -> argparse.ArgumentParser:
     """
     Get the configured argument parser
@@ -16,6 +20,10 @@ def get_argparser() -> argparse.ArgumentParser:
                           required=True,
                           help='Dataset name to be used for training')
 
+    required.add_argument('--alternative_columns', '-a',
+                          default=False, action=argparse.BooleanOptionalAction,
+                          help='Use alternative column names for source text and privacy mask')
+
     return parser
 
 
@@ -25,16 +33,17 @@ def save_data(data, file_path, overwrite):
     assets_dir = path.parent
     if not assets_dir.exists():
         assets_dir.mkdir(parents=True, exist_ok=True)
-
+
     if not overwrite and path.exists():
         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
         file_path = f"{path.stem}_{timestamp}{path.suffix}"
-
+
     with open(file_path, 'w') as f:
         json.dump(data, f)
-
+
     print(f"Data saved to {file_path}")
 
+
 def format_dataset(df: pd.DataFrame) -> list:
     """
     Format the dataset to the required format
@@ -44,16 +53,16 @@ def format_dataset(df: pd.DataFrame) -> list:
     for r in df.to_dict(orient='records'):
         patterns = []
         try:
-            for pm in ast.literal_eval(r["privacy_mask"]):
+            for pm in ast.literal_eval(r[label_column_name]):
                 patterns.append({"label": pm['label'], "pattern": pm['value'], "start": pm['start'], "end": pm['end']})
         except:
             print(json.dumps(r, indent=4))
             raise Exception()
         ft_dataset.append({
-            "text": r['source_text'],
+            "text": r[text_column_name],
             "patterns": patterns,
             "lang": r["language"].lower()
-            })
+        })
 
     return ft_dataset
 
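To make the expected data shape concrete: `format_dataset` assumes the privacy-mask column holds a stringified list of `{label, value, start, end}` dicts, which `ast.literal_eval` turns back into Python objects. A hedged sketch with made-up values:

```python
import ast

# Hypothetical row, shaped like the columns this script reads.
row = {
    "source_text": "Contact Maria Rossi at maria.rossi@example.com.",
    "privacy_mask": "[{'label': 'FIRSTNAME', 'value': 'Maria', 'start': 8, 'end': 13}]",
    "language": "English",
}

# Same transformation as the loop above, written as a comprehension.
patterns = [
    {"label": pm["label"], "pattern": pm["value"], "start": pm["start"], "end": pm["end"]}
    for pm in ast.literal_eval(row["privacy_mask"])
]

record = {"text": row["source_text"], "patterns": patterns, "lang": row["language"].lower()}
print(record)
# {'text': 'Contact Maria Rossi at ...', 'patterns': [{'label': 'FIRSTNAME', ...}], 'lang': 'english'}
```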
@@ -65,7 +74,10 @@ def spacy_tokenizer(text, tokenizer_lang):
     doc = tokenizer_lang(text)
     return [token.text for token in doc]
 
+
 TOKENIZER_SUPPORTED_LANGUAGES = ["en", "fr", "it", "es", "nl", "de"]
+
+
 def extract_annotations(text, patterns, tokenizer_lang):
     """
     Extract annotations from the text using the patterns
@@ -82,7 +94,7 @@ def extract_annotations(text, patterns, tokenizer_lang):
         splitted_text.extend(tokenized_text)
 
         tokenized_mask = spacy_tokenizer(mask, tokenizer_lang)
-        ner_list.append([len(splitted_text), len(splitted_text)+len(tokenized_mask)-1, tm["label"].upper()])
+        ner_list.append([len(splitted_text), len(splitted_text) + len(tokenized_mask) - 1, tm["label"].upper()])
         splitted_text.extend(tokenized_mask)
 
         if cnt == len(patterns) - 1:
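
The span bookkeeping in `extract_annotations` is easiest to see with numbers: each NER entry records the inclusive token indices of the masked value within the running token list. A small sketch, with a made-up label and token lists for brevity:

```python
# Tokens accumulated before the PII value is appended.
splitted_text = ["Contact", "customer"]              # len == 2
tokenized_mask = ["Maria", "Rossi"]                  # tokens of the PII value

# Same arithmetic as the line changed above: inclusive [start, end] token span.
start = len(splitted_text)                           # 2
end = len(splitted_text) + len(tokenized_mask) - 1   # 3
ner_entry = [start, end, "NAME"]                     # "NAME" is a hypothetical label

splitted_text.extend(tokenized_mask)
print(ner_entry)       # [2, 3, 'NAME']
print(splitted_text)   # ['Contact', 'customer', 'Maria', 'Rossi']
```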
@@ -103,7 +115,7 @@ def train_eval_test_split(data_patterns, out_file="eval.json", overwrite=True):
 
     training_data = list()
     for idx, d in enumerate(data_patterns):
-        print(f"Processing {idx+1}/{len(data_patterns)}")
+        print(f"Processing {idx + 1}/{len(data_patterns)}")
         data = extract_annotations(d["text"], d["patterns"], models_lang[d["lang"]])
         training_data.append(data)
 
@@ -112,11 +124,14 @@ def train_eval_test_split(data_patterns, out_file="eval.json", overwrite=True):
 
     return training_data
 
+
 if __name__ == "__main__":
 
     # Read arg parameters
     parser = get_argparser()
     args = parser.parse_args()
+    text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
+    label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
 
     # Load the dataset
     filename = args.dataset.split("/")[-1].split(".")[0] + ".json"