Commit 8078949

Updated documentation
1 parent b995286 commit 8078949

File tree

3 files changed, +135 −26 lines changed


README.md

+99-7
@@ -1,12 +1,29 @@
-# neural-wave-hackathon
-This repository contains the code for the Neural Wave Hackathon.
+# Neural Wave - Hackathon 2024 - Lugano
 
-## Create a `.env` file. Start copying the `.env.example` file and rename it to `.env`. Fill in the required values.
+This repository contains the code produced by the `Molise.ai` team in the Neural Wave Hackathon 2024 competition in
+Lugano.
+
+## Challenge
+
+Here is a brief explanation of the challenge:
+The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections**
+in the rapidly evolving world of **Artificial Intelligence**.
+The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally
+Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic
+dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The
+solution should aim for high accuracy while maintaining the **usability** of the underlying data.
+The final solution could be integrated into various systems to enhance privacy protections across industries,
+including client support, legal, and general data anonymization tools. Success in this project will contribute to
+scaling privacy-conscious AI systems without compromising UX or operational performance.
+
+## Getting Started
+Create a `.env` file: copy the `.env.example` file, rename it to `.env`, and fill in the required values.
 ```bash
 cp .env.example .env
 ```
 
-## Install the dependencies
+### Install the dependencies
+
 ```bash
 pip install -r requirements.txt
 ```
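
To make the Challenge section added above concrete, here is a small illustration of the masking task. It is a sketch only: the placeholder style and label names are assumptions, but the span fields (label/value/start/end) mirror the `privacy_mask` entries parsed by `experiment/gliner_prepare.py` later in this commit.

```python
# Illustration only: label names and the [LABEL] placeholder style are assumptions;
# the span fields mirror the privacy_mask entries used elsewhere in this commit.
source_text = "Hi, I'm Maria Rossi and my email is maria.rossi@example.com."
privacy_mask = [
    {"label": "FIRSTNAME", "value": "Maria", "start": 8, "end": 13},
    {"label": "LASTNAME", "value": "Rossi", "start": 14, "end": 19},
    {"label": "EMAIL", "value": "maria.rossi@example.com", "start": 36, "end": 59},
]


def mask_pii(text, spans):
    """Replace each annotated span with a [LABEL] placeholder, working right to left
    so earlier offsets stay valid ('end' is treated as exclusive here)."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        text = text[:span["start"]] + f"[{span['label']}]" + text[span["end"]:]
    return text


print(mask_pii(source_text, privacy_mask))
# Hi, I'm [FIRSTNAME] [LASTNAME] and my email is [EMAIL].
```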
@@ -20,30 +37,105 @@ export PYTHONPATH="${PYTHONPATH}:$PWD"
 ## Inference
 
 ### Inference on full dataset
+
+You can run inference on the complete test dataset using the following command:
+
 ```bash
 python inference.py -s ./dataset/test
 ```
 
 ### Inference on small dataset
+
+To perform inference on a small subset of the dataset, use the `--subsample` flag:
+
 ```bash
 python inference.py -s ./dataset/test --subsample
 ```
 
 ## Run ui
+
+To run the UI for interacting with the models and viewing results, use Streamlit:
+
 ```bash
 streamlit run ui.py
 ```
 
 ## Run api
+
+To start the API for the model, you'll need FastAPI. Run the following command:
+
 ```bash
 fastapi run api.py
 ```
 
 ## Experiments
 
-There are basically two macro types of experiments, both consists in pre-train models.
-The first one regards `BERT`, while the second one regards `GLiNER`.
+This repository supports two main types of experiments:
+
+1. Fine-tuning models from the BERT family.
+2. Fine-tuning models from the GLiNER family.
+
+Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific
+arguments related to model choices, datasets, output directories, and optional alternative dataset columns.
+
+### BERT Fine-Tuning
+
+The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you
+can use alternative columns that are preprocessed during the data preparation phase.
 
 ```bash
+python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
+```
+
+#### Available BERT models
+
+Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer
+may also work with minimal modifications:
+
+- BERT classic
+  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
+- DistilBERT
+  + `distilbert-base-uncased`, `distilbert-base-cased`
+- RoBERTa
+  + `roberta-base`, `roberta-large`
+- ALBERT
+  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
+- Electra
+  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
+- DeBERTa
+  + `microsoft/deberta-base`, `microsoft/deberta-large`
+
+### GLiNER Fine-Tuning
+
+The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process
+happens in two stages:
+
+1. Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:
+
+```bash
+python experiments/gliner_prepare.py --dataset path/to/dataset
+```
+
+This will create a new JSON-formatted dataset file with the same name in the specified output directory.
+
+2. Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:
+
+```bash
+python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
+```
+
+#### Available GLiNER models
+
+You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:
 
-```
+- `gliner-community/gliner_xxl-v2.5`
+- `gliner-community/gliner_large-v2.5`
+- `gliner-community/gliner_medium-v2.5`
+- `gliner-community/gliner_small-v2.5`
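
For context on how a fine-tuned GLiNER checkpoint is typically used at prediction time, here is a minimal sketch based on the `gliner` package's `GLiNER.from_pretrained` / `predict_entities` interface. The checkpoint choice, label list, and threshold are assumptions, and the repository's own entry point for evaluation remains `inference.py`.

```python
from gliner import GLiNER

# Assumption: a gliner-community checkpoint (or a fine-tuned output directory)
# can be loaded this way.
model = GLiNER.from_pretrained("gliner-community/gliner_small-v2.5")

# Hypothetical subset of labels; the 17 PII types are not enumerated in this commit.
labels = ["firstname", "lastname", "email"]

text = "Contact Maria Rossi at maria.rossi@example.com."
for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["start"], entity["end"], entity["label"], entity["text"])
```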

experiment/gliner_finetune.py

+12-10
@@ -1,4 +1,5 @@
 import os
+
 os.environ["TOKENIZERS_PARALLELISM"] = "true"
 
 import torch
@@ -27,23 +28,24 @@ def get_argparser() -> argparse.ArgumentParser:
     required = parser.add_argument_group('required arguments')
     required.add_argument('--model', '-m',
                           metavar='NAME',
-                        dest='model',
-                        required=True,
-                        help='Model name to be used for training')
+                          dest='model',
+                          required=True,
+                          help='Model name to be used for training')
     required.add_argument('--dataset', '-d',
                           metavar='NAME',
                           dest='dataset',
                           required=True,
                           help='Dataset name to be used for training')
-
     return parser
 
 
 if __name__ == "__main__":
-
     # Read arg parameters
     parser = get_argparser()
     args = parser.parse_args()
+    text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
+    label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
+
 
     # Load the dataset
     with open(args.dataset, 'r') as f:
@@ -65,17 +67,17 @@ def get_argparser() -> argparse.ArgumentParser:
         weight_decay=0.01,
         others_lr=1e-5,
         others_weight_decay=0.01,
-        lr_scheduler_type="linear", #cosine
+        lr_scheduler_type="linear",  # cosine
         warmup_ratio=0.1,
         per_device_train_batch_size=16,
         num_train_epochs=10,
         evaluation_strategy="no",
        save_strategy="epoch",
         save_total_limit=100,
-        dataloader_num_workers = 0,
-        use_cpu = False,
+        dataloader_num_workers=0,
+        use_cpu=False,
         report_to="none",
-        )
+    )
 
     trainer = Trainer(
         model=model,
@@ -85,4 +87,4 @@ def get_argparser() -> argparse.ArgumentParser:
         data_collator=data_collator,
     )
 
-    trainer.train()
+    trainer.train()
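
The two `*_column_name` lines added above pair with the `--alternative_columns` flag that this commit defines via `argparse.BooleanOptionalAction` in `experiment/gliner_prepare.py` (the next file). A standalone sketch of that toggle, assuming the same flag is also wired into this script's parser:

```python
import argparse

parser = argparse.ArgumentParser()
# BooleanOptionalAction (Python 3.9+) provides both --alternative_columns and
# --no-alternative_columns.
parser.add_argument('--alternative_columns', '-a',
                    default=False, action=argparse.BooleanOptionalAction,
                    help='Use alternative column names for source text and privacy mask')
args = parser.parse_args(['--alternative_columns'])

# Same selection logic as in the diff above.
text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
print(text_column_name, label_column_name)  # source_text_preprocessed privacy_mask_preprocessed
```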

experiment/gliner_prepare.py

+24-9
@@ -1,8 +1,12 @@
 from pathlib import Path
-from sklearn.model_selection import train_test_split
 import os, ast, json, spacy, argparse, pandas as pd
+
 os.environ["TOKENIZERS_PARALLELISM"] = "true"
 
+text_column_name = None
+label_column_name = None
+
+
 def get_argparser() -> argparse.ArgumentParser:
     """
     Get the configured argument parser
@@ -16,6 +20,10 @@ def get_argparser() -> argparse.ArgumentParser:
                           required=True,
                           help='Dataset name to be used for training')
 
+    required.add_argument('--alternative_columns', '-a',
+                          default=False, action=argparse.BooleanOptionalAction,
+                          help='Use alternative column names for source text and privacy mask')
+
     return parser
 
 
@@ -25,16 +33,17 @@ def save_data(data, file_path, overwrite):
     assets_dir = path.parent
     if not assets_dir.exists():
         assets_dir.mkdir(parents=True, exist_ok=True)
-
+
     if not overwrite and path.exists():
         timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
         file_path = f"{path.stem}_{timestamp}{path.suffix}"
-
+
     with open(file_path, 'w') as f:
         json.dump(data, f)
-
+
     print(f"Data saved to {file_path}")
 
+
 def format_dataset(df: pd.DataFrame) -> list:
     """
     Format the dataset to the required format
@@ -44,16 +53,16 @@ def format_dataset(df: pd.DataFrame) -> list:
     for r in df.to_dict(orient='records'):
         patterns = []
         try:
-            for pm in ast.literal_eval(r["privacy_mask"]):
+            for pm in ast.literal_eval(r[label_column_name]):
                 patterns.append({"label": pm['label'], "pattern": pm['value'], "start": pm['start'], "end": pm['end']})
         except:
             print(json.dumps(r, indent=4))
             raise Exception()
         ft_dataset.append({
-            "text": r['source_text'],
+            "text": r[text_column_name],
             "patterns": patterns,
             "lang": r["language"].lower()
-            })
+        })
 
     return ft_dataset
 
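To make the expected data shape concrete: `format_dataset` assumes the privacy-mask column holds a stringified list of `{label, value, start, end}` dicts, which `ast.literal_eval` turns back into Python objects. A hedged sketch with made-up values:

```python
import ast

# Hypothetical row, shaped like the columns this script reads.
row = {
    "source_text": "Contact Maria Rossi at maria.rossi@example.com.",
    "privacy_mask": "[{'label': 'FIRSTNAME', 'value': 'Maria', 'start': 8, 'end': 13}]",
    "language": "English",
}

# Same transformation as the loop above, written as a comprehension.
patterns = [
    {"label": pm["label"], "pattern": pm["value"], "start": pm["start"], "end": pm["end"]}
    for pm in ast.literal_eval(row["privacy_mask"])
]

record = {"text": row["source_text"], "patterns": patterns, "lang": row["language"].lower()}
print(record)
# {'text': 'Contact Maria Rossi at ...', 'patterns': [{'label': 'FIRSTNAME', ...}], 'lang': 'english'}
```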
@@ -65,7 +74,10 @@ def spacy_tokenizer(text, tokenizer_lang):
     doc = tokenizer_lang(text)
     return [token.text for token in doc]
 
+
 TOKENIZER_SUPPORTED_LANGUAGES = ["en", "fr", "it", "es", "nl", "de"]
+
+
 def extract_annotations(text, patterns, tokenizer_lang):
     """
     Extract annotations from the text using the patterns
@@ -82,7 +94,7 @@ def extract_annotations(text, patterns, tokenizer_lang):
         splitted_text.extend(tokenized_text)
 
         tokenized_mask = spacy_tokenizer(mask, tokenizer_lang)
-        ner_list.append([len(splitted_text), len(splitted_text)+len(tokenized_mask)-1, tm["label"].upper()])
+        ner_list.append([len(splitted_text), len(splitted_text) + len(tokenized_mask) - 1, tm["label"].upper()])
         splitted_text.extend(tokenized_mask)
 
         if cnt == len(patterns) - 1:
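
The span bookkeeping in `extract_annotations` is easiest to see with numbers: each NER entry records the inclusive token indices of the masked value within the running token list. A small sketch, with a made-up label and token lists for brevity:

```python
# Tokens accumulated before the PII value is appended.
splitted_text = ["Contact", "customer"]              # len == 2
tokenized_mask = ["Maria", "Rossi"]                  # tokens of the PII value

# Same arithmetic as the line changed above: inclusive [start, end] token span.
start = len(splitted_text)                           # 2
end = len(splitted_text) + len(tokenized_mask) - 1   # 3
ner_entry = [start, end, "NAME"]                     # "NAME" is a hypothetical label

splitted_text.extend(tokenized_mask)
print(ner_entry)       # [2, 3, 'NAME']
print(splitted_text)   # ['Contact', 'customer', 'Maria', 'Rossi']
```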
@@ -103,7 +115,7 @@ def train_eval_test_split(data_patterns, out_file="eval.json", overwrite=True):
 
     training_data = list()
     for idx, d in enumerate(data_patterns):
-        print(f"Processing {idx+1}/{len(data_patterns)}")
+        print(f"Processing {idx + 1}/{len(data_patterns)}")
         data = extract_annotations(d["text"], d["patterns"], models_lang[d["lang"]])
         training_data.append(data)
 
@@ -112,11 +124,14 @@ def train_eval_test_split(data_patterns, out_file="eval.json", overwrite=True):
 
     return training_data
 
+
 if __name__ == "__main__":
 
     # Read arg parameters
     parser = get_argparser()
     args = parser.parse_args()
+    text_column_name = 'source_text' if not args.alternative_columns else 'source_text_preprocessed'
+    label_column_name = 'privacy_mask' if not args.alternative_columns else 'privacy_mask_preprocessed'
 
     # Load the dataset
     filename = args.dataset.split("/")[-1].split(".")[0] + ".json"