Commit 649f68d

Update README.md
1 parent e8d2767 commit 649f68d

1 file changed: +20 -2 lines changed

Diff for: README.md

@@ -1,6 +1,22 @@
-# Parallel Detoxification Dataset
-This repository contains a parallel detoxification dataset for the task of eliminating toxicity from texts. The pipeline used to collect this dataset was presented in the paper "Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification" at the [VLDB 2021 Crowd Science Workshop](https://crowdscience.ai/conference_events/vldb21).
+# Parallel Text Detoxification Dataset
+This repository contains a parallel text detoxification dataset for the task of eliminating toxicity from texts. The pipeline used to collect this dataset was presented in the paper "Crowdsourcing of Parallel Corpora: the Case of Style Transfer for Detoxification" at the [VLDB 2021 Crowd Science Workshop](https://crowdscience.ai/conference_events/vldb21).

+***
+📰 **Updates**
+
+Check out **TextDetox** 🤗 https://huggingface.co/collections/textdetox/ -- the continuation of the ParaDetox project!
+
+**[2025] !!!NOW OPEN!!! TextDetox CLEF 2025 shared task: for even more -- 15 languages!** [website](https://pan.webis.de/clef25/pan25-web/text-detoxification.html) 🤗[Starter Kit](https://huggingface.co/collections/textdetox/)
+
+**[2025] COLING 2025**: Daryna Dementieva, Nikolay Babakov, Amit Ronen, Abinew Ali Ayele, Naquee Rizwan, Florian Schneider, Xintong Wang, Seid Muhie Yimam, Daniil Alekhseevich Moskovskiy, Elisei Stakovskii, Eran Kaufman, Ashraf Elnagar, Animesh Mukherjee, and Alexander Panchenko. 2025. ***Multilingual and Explainable Text Detoxification with Parallel Corpora***. In Proceedings of the 31st International Conference on Computational Linguistics, pages 7998–8025, Abu Dhabi, UAE. Association for Computational Linguistics. [pdf](https://aclanthology.org/2025.coling-main.535/)
+
+**[2024]** We have also created versions of ParaDetox in more languages. You can check out the [RuParaDetox](https://huggingface.co/datasets/s-nlp/ru_paradetox) dataset as well as the [Multilingual TextDetox](https://huggingface.co/textdetox) project, which includes 9 languages.
+
+Corresponding papers:
+* [MultiParaDetox: Extending Text Detoxification with Parallel Data to New Languages](https://aclanthology.org/2024.naacl-short.12/) (NAACL 2024)
+* [Overview of the Multilingual Text Detoxification Task at PAN 2024](https://ceur-ws.org/Vol-3740/paper-223.pdf) (CLEF Shared Task 2024)
+
+**[2022] ParaDetox** for English, the full version with experiments, was presented at ACL 2022! [repo](https://github.com/s-nlp/paradetox/tree/main) [paper](https://aclanthology.org/2022.acl-long.469/)
 ***

 ## Data Collection Methodology
@@ -9,6 +25,8 @@ The whole pipeline of the collection was divided into three tasks:
 - Task 2: content preservation check of the results obtained from Task 1;
 - Task 3: toxicity check of the results obtained from Task 1;

+The crowdsourcing was conducted on the [Toloka.ai](https://toloka.ai) crowdsourcing platform.
+
 Here you can see a schematic illustration of the collection pipeline:

 ![Alt text](https://github.com/skoltech-nlp/parallel_detoxification_dataset/blob/main/collection_pipeline_small.jpg)
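
The second hunk mentions two verification tasks that check the results of Task 1. A minimal Python sketch of how such checks might filter candidates into parallel pairs is given here. It is an illustration only, not the authors' implementation. Task 1 itself is not shown in this diff, so the sketch simply assumes it yields candidate rewrites. The `Candidate` record, its field names, and the majority-vote aggregation rule are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """Hypothetical record for one Task 1 result and its crowd checks."""
    source: str                 # original text shown to workers
    rewrite: str                # candidate rewrite obtained in Task 1
    content_votes: List[bool]   # Task 2 votes: True = content is preserved
    toxicity_votes: List[bool]  # Task 3 votes: True = rewrite is still toxic


def majority(votes: List[bool]) -> bool:
    """Return True when strictly more than half of the votes are True."""
    return sum(votes) * 2 > len(votes)


def accept(candidate: Candidate) -> bool:
    """Keep a rewrite that passes the content check and is judged non-toxic.

    Majority voting is an assumed aggregation rule, chosen for illustration.
    """
    return majority(candidate.content_votes) and not majority(candidate.toxicity_votes)


def build_parallel_pairs(candidates: List[Candidate]) -> List[Tuple[str, str]]:
    """Collect (source, rewrite) pairs for every accepted candidate."""
    return [(c.source, c.rewrite) for c in candidates if accept(c)]


if __name__ == "__main__":
    demo = [
        Candidate("s1", "r1", [True, True, False], [False, False, True]),
        Candidate("s2", "r2", [True, False, False], [False, False, False]),
        Candidate("s3", "r3", [True, True, True], [True, True, False]),
    ]
    print(build_parallel_pairs(demo))
```

In the demo, only the first candidate passes both checks. The second fails the content preservation vote and the third is still judged toxic.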
