
add the classification process for EN WIKI #96

Open · wants to merge 3 commits into base: main

Conversation

liniiiiii (Collaborator)

@i-be-snek, this is low priority; when you have time, please have a look at the code. I'd like to add it to the main branch because we describe it in the paper. Thanks!

@liniiiiii liniiiiii self-assigned this Sep 4, 2024
@liniiiiii liniiiiii linked an issue Sep 4, 2024 that may be closed by this pull request
@i-be-snek (Collaborator)

@liniiiiii

Thanks! I'll look at it after we close the other open PRs, possibly next week.

@@ -0,0 +1,105 @@
# -*- coding: utf-8 -*-

It's good to think about why you want this directory to be in the root folder. Is it related to the Database? Maybe that would be a more appropriate location.

@@ -0,0 +1,10 @@
*** This is the classification process for English Wikipedia articles related to climate disasters.***
#Files description
[] Classfication_wikipedia.py is a script used for training the BERT model, and the training data is shuffled_training_dataset.csv

Use `-` instead of `[]` for markdown to show bullets.

```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
```
It takes a long time to run for all the articles we collected, and we recommend running it only for new Wikipedia articles after day 20240229.

Use properly written dates rather than timestamps (which are harder to read)
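For example (a stdlib sketch; the variable names are illustrative, not part of the PR), a compact `YYYYMMDD` stamp like the one in the file name can be rendered as a readable date:

```python
from datetime import datetime

# Parse a compact YYYYMMDD stamp and render it readably
stamp = "20240229"
readable = datetime.strptime(stamp, "%Y%m%d").strftime("%d %B %Y")
print(readable)  # 29 February 2024
```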

[] wikipedia_dataset_preforclassify_20240229.csv contains all articles we collected using the keywords searching
[] Classifier_implement.py is a script to implement the classification model, the command you can refer to use this model is:
```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia

Correct the directory name:

Suggested change
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
poetry run python3 BERT_Classification_EN_WIKI/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_WIKI

model = AutoModelForSequenceClassification.from_pretrained(
"liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster"
)
tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")

Great job pushing the model to HF :)
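The loading snippet above could be paired with a small inference helper. A minimal sketch under stated assumptions: the helper name `classify_text` is not part of the PR, and the pure-Python argmax stands in for whatever post-processing the script actually does:

```python
def classify_text(text: str, tokenizer, model) -> int:
    """Return the index of the highest-scoring class for one article."""
    # Truncate to DistilBERT's 512-token input limit
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    # Take the logits for the single input and pick the best class
    logits = model(**inputs).logits[0].tolist()
    return max(range(len(logits)), key=lambda i: logits[i])
```

With the objects loaded from the Hub as above, `classify_text(article_text, tokenizer, model)` would return the predicted label index.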

@@ -0,0 +1,74 @@
# -*- coding: utf-8 -*-

The script names Classifier_implement.py and Classification_wikipedia.py could have more descriptive names, such as classifier.py or trainer.py.

text = str(text) # Convert to string if not already

# Split the text into segments of 512 tokens
tokenized_text = tokenizer.encode_plus(text, add_special_tokens=True, truncation=True, max_length=512)

This variable is never used anywhere. Maybe check why?
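One way the tokenized output could actually be used is to window long articles into chunks that fit the model's limit, rather than silently truncating. A rough sketch (the function name and non-overlapping windows are assumptions, not the PR's logic):

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split a long token-id sequence into windows of at most max_len
    tokens, so each window fits the model's 512-token input limit."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```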

from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
logger = Logging.get_logger("classification training")

It could help to add more useful logs to both .py files


Why is the name of this file BERT_Classification_EN_WIKI/wikipedia_dataset_preforclassify_20240229.csv? What does prefor mean?

tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")


from Database.scr.normalize_utils import Logging

It's much better to keep all imports at the top to make the code easy to read


# Classify each text in the dataset
results = []
for _, row in df.iterrows():
@i-be-snek (Collaborator, Sep 8, 2024)

To show the progress to users:

Suggested change
for _, row in df.iterrows():
for _, row in tqdm(df.iterrows(), total=df.shape[0]):

⚠️ don't forget to import tqdm at the top!

from tqdm import tqdm

That way you get a visual progress bar:

(screenshot: tqdm progress bar)


Did you also mean to push wikipedia_dataset_preforclassify_20240229_classified.csv? I have it since I ran the script to test it, so let me know and I can push it to the branch.

Successfully merging this pull request may close these issues.

Upload the classification for English Wikipedia