add the classification process for EN WIKI #96
base: main
Conversation
Thanks! I'll look at it after we close the other open PRs, possibly next week.
@@ -0,0 +1,105 @@
# -*- coding: utf-8 -*-
It's good to think about why you want this directory to be in the root folder. Is it related to the Database? Maybe that would be a more appropriate location.
@@ -0,0 +1,10 @@
*** This is the classification process for English Wikipedia articles related to climate disasters.***
#Files description
[] Classfication_wikipedia.py is a script used for training the BERT model, and the training data is shuffled_training_dataset.csv
Use `-` instead of `[]` for markdown to show bullets.
```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
```
It takes long time to run for the all articles we collected, and we recommand to run it for new articles in Wikipedia after day 20240229.
Use properly written dates rather than timestamps (which are harder to read)
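To illustrate the reviewer's point, a minimal sketch of turning a compact `YYYYMMDD` stamp like the one in the filename into a readable date (the `readable_date` helper name is hypothetical, not from the PR):

```python
from datetime import datetime

def readable_date(stamp: str) -> str:
    """Convert a compact YYYYMMDD stamp (e.g. from a filename) into a readable date."""
    return datetime.strptime(stamp, "%Y%m%d").strftime("%d %B %Y")

print(readable_date("20240229"))  # → 29 February 2024
```

Writing "29 February 2024" in prose (and keeping the stamp only in machine-facing filenames) addresses the readability concern.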
[] wikipedia_dataset_preforclassify_20240229.csv contains all articles we collected using the keywords searching
[] Classifier_implement.py is a script to implement the classification model, the command you can refer to use this model is:
```shell
poetry run python3 BERT_Classification_EN_Wikipedia/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_Wikipedia
Correct the directory name:
poetry run python3 BERT_Classification_EN_WIKI/Classifier_implement.py --filename wikipedia_dataset_preforclassify_20240229.csv --file_dir BERT_Classification_EN_WIKI
model = AutoModelForSequenceClassification.from_pretrained(
    "liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster"
)
tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")
Great job pushing the model to HF :)
@@ -0,0 +1,74 @@
# -*- coding: utf-8 -*-
The script names Classifier_implement.py and Classification_wikipedia.py could have more descriptive names, such as classifier.py or trainer.py.
text = str(text)  # Convert to string if not already

# Split the text into segments of 512 tokens
tokenized_text = tokenizer.encode_plus(text, add_special_tokens=True, truncation=True, max_length=512)
This variable is never used anywhere. Maybe check why?
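Beyond the unused variable, note that the comment promises splitting into 512-token segments while `encode_plus(..., truncation=True, max_length=512)` simply discards everything after the first 512 tokens. A minimal sketch of actual segmentation over a token-id list (the `chunk_tokens` helper is hypothetical, not from the PR):

```python
def chunk_tokens(token_ids, max_length=512):
    """Split a token-id sequence into consecutive segments of at most max_length tokens."""
    return [token_ids[i:i + max_length] for i in range(0, len(token_ids), max_length)]

segments = chunk_tokens(list(range(1200)), max_length=512)
print([len(s) for s in segments])  # → [512, 512, 176]
```

Each segment could then be classified separately and the per-segment predictions aggregated, instead of silently dropping the tail of long articles.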
from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
    logger = Logging.get_logger("classification training")
It could help to add more useful logs to both .py files.
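As a sketch of the kind of progress logging the comment suggests, using the standard library's `logging` module (the repo's `Logging.get_logger` helper presumably returns a compatible logger; `classify_all` and its placeholder body are hypothetical):

```python
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("classification training")

def classify_all(rows):
    """Classify every row, logging progress as we go."""
    logger.info("Starting classification of %d articles", len(rows))
    results = []
    for i, row in enumerate(rows, start=1):
        results.append(len(row))  # placeholder for the real model call
        if i % 100 == 0:
            logger.info("Classified %d/%d articles", i, len(rows))
    logger.info("Done: %d results", len(results))
    return results
```

Periodic progress lines are especially useful here given that the run over all collected articles is reported to take a long time.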
Why is the name of this file BERT_Classification_EN_WIKI/wikipedia_dataset_preforclassify_20240229.csv? What does prefor mean?
tokenizer = AutoTokenizer.from_pretrained("liniiiiii/DistilBertForSequenceClassification_WIKI_Natural_disaster")

from Database.scr.normalize_utils import Logging
It's much better to keep all imports at the top to make the code easy to read
# Classify each text in the dataset
results = []
for _, row in df.iterrows():
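The loop body is elided in this diff view. A minimal self-contained sketch of the pattern, collecting one prediction per row and attaching it as a new column (the `classify_text` stand-in and the toy data are hypothetical, not the PR's actual model call):

```python
import pandas as pd

def classify_text(text: str) -> int:
    # Hypothetical stand-in for the real model call.
    return int("disaster" in text.lower())

df = pd.DataFrame({"text": ["A disaster struck the region", "Local bake sale this weekend"]})

# Classify each text in the dataset
results = []
for _, row in df.iterrows():
    results.append(classify_text(row["text"]))

df["label"] = results
print(df["label"].tolist())  # → [1, 0]
```

For large datasets, `df["text"].map(classify_text)` (or batched model inference) is usually faster than `iterrows()`, which creates a Series per row.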
Did you also mean to push wikipedia_dataset_preforclassify_20240229_classified.csv? I have it since I ran the script to test it, so let me know and I can push it to the branch.
@i-be-snek, this is low priority; if you have time, just have a look at the code. I want to add it to the main branch because we describe it in the paper. Thanks!