
add the classification process for EN WIKI #96

Open · wants to merge 3 commits into base: main

Conversation

liniiiiii (Collaborator)

@i-be-snek, this is low priority; when you have time, please have a look at the code. I'd like to add it to the main branch because we describe it in the paper. Thanks!

@liniiiiii liniiiiii self-assigned this Sep 4, 2024
@liniiiiii liniiiiii linked an issue Sep 4, 2024 that may be closed by this pull request
@i-be-snek (Collaborator)

@liniiiiii

Thanks! I'll look at it after we close the other open PRs, possibly next week.

@@ -0,0 +1,105 @@
# -*- coding: utf-8 -*-

It's good to think about why you want this directory to be in the root folder. Is it related to the Database? Maybe that would be a more appropriate location.

@@ -0,0 +1,10 @@
*** This is the classification process for English Wikipedia articles related to climate disasters.***
#Files description
[] Classfication_wikipedia.py is a script used for training the BERT model, and the training data is shuffled_training_dataset.csv

Use `-` instead of `[]` for markdown to show bullets.

```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
```
It takes a long time to run for all the articles we collected, and we recommend running it only for new Wikipedia articles after day 20240229.

Use properly written dates rather than timestamps (which are harder to read)
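For example (a stdlib sketch; the variable names are illustrative, not part of the PR), a compact `YYYYMMDD` stamp like the one in the file name can be rendered as a readable date:

```python
from datetime import datetime

# Parse a compact YYYYMMDD stamp and render it readably
stamp = "20240229"
readable = datetime.strptime(stamp, "%Y%m%d").strftime("%d %B %Y")
print(readable)  # 29 February 2024
```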

[] wikipedia_dataset_preforclassify_20240229.csv contains all articles we collected using the keywords searching
[] Classifier_implement.py is a script to implement the classification model, the command you can refer to use this model is:
```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia

Correct the directory name:

Suggested change
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
poetry run python3 BERT_Classification_EN_WIKI/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_WIKI

model = AutoModelForSequenceClassification.from_pretrained(
"liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster"
)
tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")

Great job pushing the model to HF :)
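The loading snippet above could be paired with a small inference helper. A minimal sketch under stated assumptions: the helper name `classify_text` is not part of the PR, and the pure-Python argmax stands in for whatever post-processing the script actually does:

```python
def classify_text(text: str, tokenizer, model) -> int:
    """Return the index of the highest-scoring class for one article."""
    # Truncate to DistilBERT's 512-token input limit
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    # Take the logits for the single input and pick the best class
    logits = model(**inputs).logits[0].tolist()
    return max(range(len(logits)), key=lambda i: logits[i])
```

With the objects loaded from the Hub as above, `classify_text(article_text, tokenizer, model)` would return the predicted label index.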

@@ -0,0 +1,74 @@
# -*- coding: utf-8 -*-

The script names Classifier_implement.py and Classification_wikipedia.py could have more descriptive names, such as classifier.py or trainer.py.

text = str(text) # Convert to string if not already

# Split the text into segments of 512 tokens
tokenized_text = tokenizer.encode_plus(text, add_special_tokens=True, truncation=True, max_length=512)

This variable is never used anywhere. Maybe check why?
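One way the tokenized output could actually be used is to window long articles into chunks that fit the model's limit, rather than silently truncating. A rough sketch (the function name and non-overlapping windows are assumptions, not the PR's logic):

```python
def chunk_token_ids(token_ids, max_len=512):
    """Split a long token-id sequence into windows of at most max_len
    tokens, so each window fits the model's 512-token input limit."""
    return [token_ids[i:i + max_len] for i in range(0, len(token_ids), max_len)]
```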

from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
logger = Logging.get_logger("classification training")

It could help to add more useful logs to both .py files


Why is the name of this file BERT_Classification_EN_WIKI/wikipedia_dataset_preforclassify_20240229.csv? What does prefor mean?

tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")


from Database.scr.normalize_utils import Logging

It's much better to keep all imports at the top to make the code easy to read


# Classify each text in the dataset
results = []
for _, row in df.iterrows():
@i-be-snek (Collaborator, Sep 8, 2024)

To show the progress to users:

Suggested change
for _, row in df.iterrows():
for _, row in tqdm(df.iterrows(), total=df.shape[0]):

⚠️ don't forget to import tqdm at the top!

from tqdm import tqdm

That way you get a visual progress bar:

(screenshot: tqdm progress bar)


Did you also mean to push wikipedia_dataset_preforclassify_20240229_classified.csv? I have it since I ran the script to test it, so let me know and I can push it to the branch.

Successfully merging this pull request may close these issues.

Upload the classification for English Wikipedia