## Challenge

Here is a brief explanation of the challenge:
- The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections**
+ The challenge was proposed by **Ai4Privacy**, a company that builds global solutions that enhance **privacy protections
+ **
in the rapidly evolving world of **Artificial Intelligence**.
The challenge goal is to create a machine learning model capable of detecting and masking **PII** (Personally Identifiable
Information) in text data across several languages and locales. The task requires working with a synthetic dataset to
@@ -17,7 +18,9 @@ including client support, legal, and general data anonymization tools. Success i
scaling privacy-conscious AI systems without compromising the UX or operational performance.

## Getting Started
+
Create a `.env` file. Start by copying the `.env.example` file and renaming the copy to `.env`, then fill in the required values.
+
```bash
cp .env.example .env
```
@@ -93,17 +96,17 @@ Here is a list of available BERT models that can be used for fine-tuning. Additi
may also work with minimal modifications:

- BERT classic
- + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
+ + `bert-base-uncased`, `bert-large-uncased`, `bert-base-cased`, `bert-large-cased`
- DistilBERT
- + `distilbert-base-uncased`, `distilbert-base-cased`
+ + `distilbert-base-uncased`, `distilbert-base-cased`
- RoBERTa
- + `roberta-base`, `roberta-large`
+ + `roberta-base`, `roberta-large`
- ALBERT
- + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
+ + `albert-base-v2`, `albert-large-v2`, `albert-xlarge-v2`, `albert-xxlarge-v2`
- Electra
- + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
+ + `google/electra-small-discriminator`, `google/electra-base-discriminator`, `google/electra-large-discriminator`
- DeBERTa
- + `microsoft/deberta-base`, `microsoft/deberta-large`
+ + `microsoft/deberta-base`, `microsoft/deberta-large`
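
As a rough illustration of how one of the checkpoints listed above could be prepared for PII token classification, the sketch below builds BIO label maps and passes them to the Hugging Face `transformers` auto classes. This is not the repository's own training code: the helper names `build_label_maps` and `load_for_token_classification` and the entity types in the usage note are hypothetical, chosen only for the example.

```python
from typing import Dict, List, Tuple


def build_label_maps(entity_types: List[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Build BIO label maps: 'O' plus a B-/I- pair for each entity type."""
    labels = ["O"]
    for entity in entity_types:
        labels += [f"B-{entity}", f"I-{entity}"]
    label2id = {label: idx for idx, label in enumerate(labels)}
    id2label = {idx: label for label, idx in label2id.items()}
    return label2id, id2label


def load_for_token_classification(checkpoint: str, entity_types: List[str]):
    """Load a tokenizer and a token-classification head for `checkpoint`.

    Hypothetical helper: the repository may configure its models differently.
    """
    # Imported lazily so build_label_maps works without transformers installed.
    from transformers import AutoModelForTokenClassification, AutoTokenizer

    label2id, id2label = build_label_maps(entity_types)
    tokenizer = AutoTokenizer.from_pretrained(checkpoint)
    model = AutoModelForTokenClassification.from_pretrained(
        checkpoint,
        num_labels=len(label2id),
        id2label=id2label,
        label2id=label2id,
    )
    return tokenizer, model
```

For instance, `load_for_token_classification("distilbert-base-uncased", ["EMAIL", "GIVENNAME"])` would download that checkpoint and attach a five-label head (`O` plus two tags per entity type); the entity names here are placeholders, not the dataset's actual label set.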
### GLiNER Fine-Tuning
@@ -141,4 +144,12 @@ You can use the following GLiNER models for fine-tuning, though additional compa
- `gliner-community/gliner_small-v2.5`

## Results
+
A results folder is available in the repository to store the results of the various experiments and related metrics.
+
+ ## Other Information
+
+ We also provide a solution to the issue in
+ the [pii-masking-400k](https://huggingface.co/datasets/ai4privacy/pii-masking-400k/discussions/3) repository.
+ We created a method to transform the natural language text into a token-tag format that can be used to train a Named
+ Entity Recognition (NER) model using the Hugging Face `AutoTrain` API.
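
To make the token-tag idea concrete, here is a minimal sketch of one way such a conversion could work, assuming the input is raw text plus character-offset spans with a label each. It is an illustration only and not necessarily the method provided in the linked discussion: the function name `to_token_tags`, the `(start, end, label)` span layout, whitespace tokenization, and the BIO scheme are all assumptions made for this example.

```python
import re
from typing import List, Tuple

# (start, end, label) with `end` exclusive; an assumed span layout.
Span = Tuple[int, int, str]


def to_token_tags(text: str, spans: List[Span]) -> Tuple[List[str], List[str]]:
    """Split text on whitespace and assign a BIO tag to every token."""
    tokens: List[str] = []
    tags: List[str] = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in spans:
            # A token belongs to a span when their character ranges overlap.
            if match.start() < end and match.end() > start:
                # The token covering the span's first character gets "B-".
                prefix = "B" if match.start() <= start else "I"
                tag = f"{prefix}-{label}"
                break
        tokens.append(match.group())
        tags.append(tag)
    return tokens, tags


text = "Contact Jane Doe at jane@example.com today"
spans = [(8, 16, "NAME"), (20, 36, "EMAIL")]
tokens, tags = to_token_tags(text, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Whitespace splitting keeps the sketch short, but punctuation stays attached to neighbouring words, so a real pipeline would likely align spans against the tokenizer's own offsets instead.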