# Neural Wave - Hackathon 2024 - Lugano

This repository contains the code produced by the `Molise.ai` team in the Neural Wave Hackathon 2024 competition in Lugano.

## Challenge

Here is a brief explanation of the challenge:
The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections** in the rapidly evolving world of **Artificial Intelligence**.
The goal of the challenge is to create a machine learning model capable of detecting and masking **PII** (Personally Identifiable Information) in text data across several languages and locales. The task requires working with a synthetic dataset to train models that can automatically identify and redact **17 types of PII** in natural language texts. The solution should aim for high accuracy while maintaining the **usability** of the underlying data.
The final solution could be integrated into various systems and enhance privacy protections across industries, including client support, legal, and general data anonymization tools. Success in this project will contribute to scaling privacy-conscious AI systems without compromising user experience or operational performance.
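
To make the task concrete, here is a purely illustrative example of the kind of masking the challenge asks for. The placeholder labels below are assumptions for illustration and do not necessarily match the dataset's exact label set:

```python
# Illustrative only: the real label names and mask format come from the
# Ai4Privacy dataset, not from this snippet.
original = "Hi, I'm Maria Rossi, you can reach me at maria.rossi@example.com."
masked = "Hi, I'm [FIRSTNAME] [LASTNAME], you can reach me at [EMAIL]."
```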

## Getting Started

Create a `.env` file by copying the `.env.example` file, renaming it to `.env`, and filling in the required values:

```bash
cp .env.example .env
```

### Install the dependencies

```bash
pip install -r requirements.txt
```

Add the repository root to your `PYTHONPATH` so the project modules can be imported (e.g. `export PYTHONPATH="${PYTHONPATH}:$PWD"`).

## Inference

### Inference on full dataset

You can run inference on the complete test dataset using the following command:

```bash
python inference.py -s ./dataset/test
```

### Inference on small dataset

To perform inference on a small subset of the dataset, use the `--subsample` flag:

```bash
python inference.py -s ./dataset/test --subsample
```

## Run UI

To run the UI for interacting with the models and viewing results, use Streamlit:

```bash
streamlit run ui.py
```

## Run API

To start the API for the model, you'll need FastAPI. Run the following command:

```bash
fastapi run api.py
```
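
Once the server is running, you can query it from any HTTP client. The route name, port, and payload shape below are assumptions for illustration only; check `api.py` for the actual endpoints:

```python
# Hypothetical client call; the "/mask" route and the JSON payload are assumptions,
# not taken from api.py.
import requests

resp = requests.post(
    "http://127.0.0.1:8000/mask",  # `fastapi run` serves on port 8000 by default
    json={"text": "Hi, I'm Maria Rossi, reach me at maria.rossi@example.com."},
)
resp.raise_for_status()
print(resp.json())
```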

## Experiments

This repository supports two main types of experiments:

1. Fine-tuning models from the BERT family.
2. Fine-tuning models from the GLiNER family.

Both experiment types are located in the `experiments/` folder, and each fine-tuning script allows you to pass specific arguments related to model choices, datasets, output directories, and optional alternative dataset columns.

### BERT Fine-Tuning

The BERT fine-tuning script enables you to fine-tune models from the BERT family on a specific dataset. Optionally, you can utilize alternative columns that are preprocessed during the data preparation phase.

```bash
python experiments/bert_finetune.py --dataset path/to/dataset --model model_name --output_dir /path/to/output [--alternative_columns]
```

#### Available BERT models

Here is a list of available BERT models that can be used for fine-tuning. Additional models based on the BERT tokenizer may also work with minimal modifications:

- BERT classic
  + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
  + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
  + `roberta-base`, `roberta-large`
- ALBERT
  + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
  + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
  + `microsoft/deberta-base`, `microsoft/deberta-large`
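
Under the hood, fine-tuning any of these models for PII detection amounts to token classification. The following is a minimal, self-contained sketch of that setup with Hugging Face `transformers`; the model choice, label set, and toy example are illustrative assumptions and do not mirror the exact implementation of `experiments/bert_finetune.py`:

```python
# Minimal sketch of BERT-style token-classification fine-tuning for PII tagging.
# The model name, label set, and toy example are illustrative assumptions and do
# not mirror the exact logic of experiments/bert_finetune.py.
import torch
from transformers import AutoModelForTokenClassification, AutoTokenizer

labels = ["O", "B-FIRSTNAME", "I-FIRSTNAME", "B-EMAIL", "I-EMAIL"]  # assumed subset of the 17 PII types
model_name = "bert-base-uncased"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForTokenClassification.from_pretrained(
    model_name,
    num_labels=len(labels),
    id2label=dict(enumerate(labels)),
    label2id={label: i for i, label in enumerate(labels)},
)

# One toy training example: word-level tokens with BIO tags.
words = ["Contact", "Maria", "at", "maria@example.com"]
word_tags = ["O", "B-FIRSTNAME", "O", "B-EMAIL"]

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt", truncation=True)
# Align word-level tags to sub-word tokens; special tokens get -100 so the loss ignores them.
aligned = [
    -100 if word_id is None else labels.index(word_tags[word_id])
    for word_id in enc.word_ids(batch_index=0)
]
enc["labels"] = torch.tensor([aligned])

# Single optimization step; a real run would loop over a DataLoader for several epochs.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
model.train()
loss = model(**enc).loss  # token-level cross-entropy
loss.backward()
optimizer.step()
```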

### GLiNER Fine-Tuning

The GLiNER models require an additional dataset preparation step before starting the fine-tuning process. The process happens in two stages:

1. Prepare the dataset for GLiNER models. Run the GLiNER dataset preparation script to pre-process your dataset:

   ```bash
   python experiments/gliner_prepare.py --dataset path/to/dataset
   ```

   This will create a new JSON-formatted dataset file with the same name in the specified output directory.

2. Fine-tune the GLiNER model. After the dataset preparation, run the GLiNER fine-tuning script:

   ```bash
   python experiments/gliner_finetune.py --dataset path/to/prepared/dataset.json --model model_name --output_dir /path/to/output [--alternative_columns]
   ```

#### Available GLiNER models

You can use the following GLiNER models for fine-tuning, though additional compatible models may work similarly:

- `gliner-community/gliner_xxl-v2.5`
- `gliner-community/gliner_large-v2.5`
- `gliner-community/gliner_medium-v2.5`
- `gliner-community/gliner_small-v2.5`
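
After fine-tuning, a GLiNER checkpoint can be loaded and used to predict entities for an arbitrary label set. A minimal usage sketch, assuming the model was saved to the output directory passed to the script and that the labels match those used during training:

```python
# Minimal usage sketch; the checkpoint path and label names are assumptions.
from gliner import GLiNER

model = GLiNER.from_pretrained("/path/to/output")  # directory of the fine-tuned checkpoint
text = "Hi, I'm Maria Rossi, you can reach me at maria.rossi@example.com."
labels = ["firstname", "lastname", "email"]  # illustrative subset of the PII types

for entity in model.predict_entities(text, labels, threshold=0.5):
    print(entity["text"], "->", entity["label"])
```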