add the script and original result of keywords searching in EN WIKI #98
I left some suggestions, please have a look!
from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
    logger = Logging.get_logger("keywords_searching")
I think `keyword_search` might be a more commonly used phrase to refer to this kind of filtered keyword search. I recommend you change it to that, including the filename.
# Initialize an empty list for each unique keyword
keywords_urls[keyword] = []
# Create the search URL
# html = f"https://en.wikipedia.org/w/index.php?title=Special:Search&limit=5000&offset=0&ns0=1&search=intitle%3A{keyword}&advancedSearch-current={%22fields%22:{%22intitle%22:%22{keyword}%22}}"
Is this commented-out bit needed?
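If the URL is still needed, note that the commented-out f-string would not work as written: the literal braces in the `advancedSearch-current` value would have to be doubled inside an f-string. A minimal sketch of an alternative, assuming the same query parameters as in the commented-out line and using only the standard library; `build_search_url` and `SEARCH_ENDPOINT` are hypothetical names, not part of this PR:

```python
import json
from urllib.parse import urlencode

# Endpoint taken from the commented-out URL in the script.
SEARCH_ENDPOINT = "https://en.wikipedia.org/w/index.php"


def build_search_url(keyword: str, limit: int = 5000, offset: int = 0) -> str:
    """Build a title-only search URL for one keyword (hypothetical helper)."""
    params = {
        "title": "Special:Search",
        "limit": limit,
        "offset": offset,
        "ns0": 1,
        "search": f"intitle:{keyword}",
        # json.dumps produces the {"fields": {...}} value; urlencode escapes it.
        "advancedSearch-current": json.dumps(
            {"fields": {"intitle": keyword}}, separators=(",", ":")
        ),
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url("heat wave"))
```

Letting `urlencode` handle the escaping avoids hand-written `%22` and `%3A` sequences and also copes with multi-word keyphrases such as "heat wave".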
import pandas as pd
import requests
from bs4 import BeautifulSoup
You forgot to install bs4 via poetry. Please do that and commit the `.toml` and `.lock` files.
df_keywords = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in keywords.items()]))

keywords_urls = {}
for column in df_keywords.columns:
It could be useful to add more `logger.info()` statements here to let the user know what is happening. It took the script 15 minutes to run, so more logs would help: if something goes wrong, one could tell exactly where.
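A minimal sketch of what such progress logging could look like. It assumes the standard-library `logging` module rather than the project's own `Logging.get_logger`, and the `collect_urls` function and its `search` callable are hypothetical stand-ins for the script's loop and request code:

```python
import logging

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
logger = logging.getLogger("keyword_search")


def collect_urls(keywords, search):
    """Run `search` for every keyword, logging progress per category and keyword.

    `keywords` maps a category name to a list of keyphrases; `search` is any
    callable that takes one keyphrase and returns a list of URLs.
    """
    keywords_urls = {}
    for category, phrases in keywords.items():
        logger.info("Category %r: %d keywords to search", category, len(phrases))
        for i, keyword in enumerate(phrases, start=1):
            logger.info("[%d/%d] Searching for %r", i, len(phrases), keyword)
            try:
                keywords_urls[keyword] = search(keyword)
            except Exception:
                # Record the failure with a traceback and keep going.
                logger.exception("Search failed for %r", keyword)
                keywords_urls[keyword] = []
                continue
            logger.info("Found %d URLs for %r", len(keywords_urls[keyword]), keyword)
    return keywords_urls
```

Logging the category, the position within it, and the result count for each keyword should make it possible to see where a long run stalls or fails.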
}

keyphrases_wildfire = {
    "keyphrases": [
The "keyphrases" key here is redundant because each keyphrase category really only represents one list anyway. To save lines and make the code more readable, I suggest:
# Aggregate the keyphrases into a dictionary
keywords = {
    "drought": [
        "drought",
        "droughts",
        "dryness",
        "dry spell",
        "dry spells",
        "rain scarcity",
        "rain scarcities",
        "rainfall deficit",
        "rainfall deficits",
        "water stress",
        "water shortage",
        "water shortages",
        "water insecurity",
        "water insecurities",
        "limited water availability",
        "limited water availabilities",
        "scarce water resources",
        "groundwater depletion",
        "groundwater depletions",
        "reservoir depletion",
        "reservoir depletions",
    ],
    "windstorm": [
        "windstorm",
        "windstorms",
        "storm",
        "storms",
        "cyclone",
        "cyclones",
        "typhoon",
        "typhoons",
        "hurricane",
        "hurricanes",
        "blizzard",
        "strong winds",
        "low pressure",
        "gale",
        "gales",
        "wind gust",
        "wind gusts",
        "tornado",
        "tornadoes",
        "wind",
        "winds",
        "lighting",
        "lightings",
        "thunderstorm",
        "thunderstorms",
        "hail",
        "hails",
    ],
    "rainfall": [
        "extreme rain",
        "extreme rains",
        "heavy rain",
        "heavy rains",
        "hard rain",
        "hard rains",
        "torrential rain",
        "torrential rains",
        "extreme precipitation",
        "extreme precipitations",
        "heavy precipitation",
        "heavy precipitations",
        "torrential precipitation",
        "torrential precipitations",
        "cloudburst",
        "cloudbursts",
    ],
    "heatwave": [
        "heatwave",
        "heatwaves",
        "heat wave",
        "heat waves",
        "extreme heat",
        "hot weather",
        "high temperature",
        "high temperatures",
    ],
    "flood": [
        "floodwater",
        "floodwaters",
        "flood",
        "floods",
        "inundation",
        "inundations",
        "storm surge",
        "storm surges",
        "storm tide",
        "storm tides",
    ],
    "wildfire": [
        "wildfire",
        "forest fire",
        "bushfire",
        "wildland fire",
        "rural fire",
        "desert fire",
        "grass fire",
        "hill fire",
        "peat fire",
        "prairie fire",
        "vegetation fire",
        "veld fire",
    ],
    "coldwave": [
        "cold wave",
        "cold waves",
        "coldwave",
        "coldwaves",
        "cold snap",
        "cold spell",
        "Arctic Snap",
        "low temperature",
        "low temperatures",
        "extreme cold",
        "cold weather",
    ],
}
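To illustrate the difference between the two shapes, here is a small sketch that flattens per-category dictionaries with a nested "keyphrases" key into a flat category-to-list mapping. Only `keyphrases_wildfire` appears in the diff; `keyphrases_flood`, the shortened lists, and the `nested` mapping are assumptions made for the example:

```python
# Shortened stand-ins for the per-category dictionaries in the script.
keyphrases_wildfire = {"keyphrases": ["wildfire", "forest fire", "bushfire"]}
keyphrases_flood = {"keyphrases": ["flood", "floods"]}  # hypothetical name

# Hypothetical aggregate that still carries the redundant inner key.
nested = {"wildfire": keyphrases_wildfire, "flood": keyphrases_flood}

# Flat form: each category maps directly to its list of keyphrases.
keywords = {category: entry["keyphrases"] for category, entry in nested.items()}

# With the flat form, iterating needs no extra key lookup.
for category, phrases in keywords.items():
    print(f"{category}: {len(phrases)} keyphrases")
```

With the flat mapping, code that walks the categories reads `keywords[category]` directly instead of `nested[category]["keyphrases"]`.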
@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
[] keywords_searching_wiki.py is the script for searching
[] wiki_keyword_mining_result_2024_02_29_updated_duplicates_cleaned.csv is the result we collected by 20240229
It would be good to mention the date and not a timestamp. Also it would be good to link the release for which this script was used.
We can look into a way to make it part of the release later since it should have been uploaded earlier, but at least we have now documented it.
@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
[] keywords_searching_wiki.py is the script for searching
It would be nice to explain more here. Searching what? A short description of its functions would help.
@i-be-snek, this is also low priority. If you have time next week, please take a look at the code, and then we can merge it into main. Thanks!