add the script and original result of keywords searching in EN WIKI #98

Open
wants to merge 3 commits into
base: main
from

Conversation

liniiiiii
Collaborator

@i-be-snek, this is also low priority. If you have time next week, please take a look at the code so we can merge it into main. Thanks!

@liniiiiii liniiiiii requested a review from i-be-snek September 5, 2024 08:30
@liniiiiii liniiiiii self-assigned this Sep 5, 2024
@liniiiiii liniiiiii linked an issue Sep 5, 2024 that may be closed by this pull request
3 tasks
Collaborator

@i-be-snek i-be-snek left a comment

I left some suggestions, please have a look!

from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
    logger = Logging.get_logger("keywords_searching")
Collaborator

I think keyword_search might be a more commonly used phrase to refer to this kind of filtered keyword search. I recommend you change it to that, including the filename.

# Initialize an empty list for each unique keyword
keywords_urls[keyword] = []
# Create the search URL
# html = f"https://en.wikipedia.org/w/index.php?title=Special:Search&limit=5000&offset=0&ns0=1&search=intitle%3A{keyword}&advancedSearch-current={%22fields%22:{%22intitle%22:%22{keyword}%22}}"
Collaborator

Is this commented-out line still needed?
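If the URL is still needed, one option is to build it from a parameter dictionary instead of a hand-escaped f-string. The following is only a sketch: the parameter names and values (`title`, `limit`, `offset`, `ns0`, `search`) are taken from the commented-out line, the `advancedSearch-current` parameter is left out, and `build_search_url` is a hypothetical helper name, not something in the PR.

```python
from urllib.parse import parse_qs, urlencode, urlparse

WIKI_SEARCH_ENDPOINT = "https://en.wikipedia.org/w/index.php"


def build_search_url(keyword: str, limit: int = 5000, offset: int = 0) -> str:
    """Build a Special:Search URL that looks for `keyword` in article titles.

    Hypothetical helper: parameter names mirror the commented-out URL.
    """
    params = {
        "title": "Special:Search",
        "limit": limit,
        "offset": offset,
        "ns0": 1,  # restrict to the main (article) namespace
        "search": f"intitle:{keyword}",
    }
    # urlencode percent-escapes ':' and encodes spaces, so no manual %3A/%22 is needed
    return f"{WIKI_SEARCH_ENDPOINT}?{urlencode(params)}"


url = build_search_url("forest fire")
print(url)
# Decoding the query string again shows the search term survived the round trip
print(parse_qs(urlparse(url).query)["search"])
```

Letting `urlencode` handle the escaping avoids the literal `{%22...}` braces in the original string, which would not be valid inside an f-string.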


import pandas as pd
import requests
from bs4 import BeautifulSoup
Collaborator

You forgot to add bs4 as a dependency via poetry. Please do that and commit the updated pyproject.toml and poetry.lock files.

df_keywords = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in keywords.items()]))

keywords_urls = {}
for column in df_keywords.columns:
Collaborator

Could be useful to add more logger.info() statements here to let the user know what is happening. It took the script 15 minutes to run, so it would help to add more logs so that if something goes wrong, one could know where exactly.
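As a rough illustration of the kind of logging meant here, the sketch below logs per-category and per-keyword progress plus failures. It makes several assumptions: the standard-library `logging` module stands in for the project's `Logging.get_logger`, `search_keyword` is a hypothetical placeholder for the real Wikipedia request, and `collect_urls` is an invented name for the loop.

```python
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
# Stand-in for the project's Logging.get_logger(...)
logger = logging.getLogger("keyword_search")


def search_keyword(keyword: str) -> list[str]:
    """Hypothetical placeholder for the real per-keyword search request."""
    return [f"https://en.wikipedia.org/wiki/{keyword.replace(' ', '_')}"]


def collect_urls(keywords: dict[str, list[str]]) -> dict[str, list[str]]:
    """Search every keyphrase and log progress so a long run can be followed."""
    keywords_urls: dict[str, list[str]] = {}
    for category, phrases in keywords.items():
        logger.info("Category %r: searching %d keyphrases", category, len(phrases))
        for i, keyword in enumerate(phrases, start=1):
            started = time.perf_counter()
            try:
                keywords_urls[keyword] = search_keyword(keyword)
            except Exception:
                # Record which keyword failed (with traceback) and keep going
                logger.exception("Search failed for %r in category %r", keyword, category)
                keywords_urls[keyword] = []
                continue
            logger.info(
                "[%d/%d] %r: %d URLs in %.2fs",
                i, len(phrases), keyword,
                len(keywords_urls[keyword]), time.perf_counter() - started,
            )
    logger.info("Done: %d keywords searched", len(keywords_urls))
    return keywords_urls


result = collect_urls({"flood": ["flood", "storm surge"]})
```

With a counter and the elapsed time in each message, a run that stalls or fails partway shows which category and keyword it had reached.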

}

keyphrases_wildfire = {
    "keyphrases": [
Collaborator

The "keyphrases" key here is redundant because each keyphrase category only holds a single list anyway. To save lines and make the code more readable, I suggest:

Suggested change
"keyphrases": [
# Aggregate the keyphrases into a dictionary
keywords = {
    "drought": [
        "drought",
        "droughts",
        "dryness",
        "dry spell",
        "dry spells",
        "rain scarcity",
        "rain scarcities",
        "rainfall deficit",
        "rainfall deficits",
        "water stress",
        "water shortage",
        "water shortages",
        "water insecurity",
        "water insecurities",
        "limited water availability",
        "limited water availabilities",
        "scarce water resources",
        "groundwater depletion",
        "groundwater depletions",
        "reservoir depletion",
        "reservoir depletions",
    ],
    "windstorm": [
        "windstorm",
        "windstorms",
        "storm",
        "storms",
        "cyclone",
        "cyclones",
        "typhoon",
        "typhoons",
        "hurricane",
        "hurricanes",
        "blizzard",
        "strong winds",
        "low pressure",
        "gale",
        "gales",
        "wind gust",
        "wind gusts",
        "tornado",
        "tornadoes",
        "wind",
        "winds",
        "lighting",
        "lightings",
        "thunderstorm",
        "thunderstorms",
        "hail",
        "hails",
    ],
    "rainfall": [
        "extreme rain",
        "extreme rains",
        "heavy rain",
        "heavy rains",
        "hard rain",
        "hard rains",
        "torrential rain",
        "torrential rains",
        "extreme precipitation",
        "extreme precipitations",
        "heavy precipitation",
        "heavy precipitations",
        "torrential precipitation",
        "torrential precipitations",
        "cloudburst",
        "cloudbursts",
    ],
    "heatwave": [
        "heatwave",
        "heatwaves",
        "heat wave",
        "heat waves",
        "extreme heat",
        "hot weather",
        "high temperature",
        "high temperatures",
    ],
    "flood": [
        "floodwater",
        "floodwaters",
        "flood",
        "floods",
        "inundation",
        "inundations",
        "storm surge",
        "storm surges",
        "storm tide",
        "storm tides",
    ],
    "wildfire": [
        "wildfire",
        "forest fire",
        "bushfire",
        "wildland fire",
        "rural fire",
        "desert fire",
        "grass fire",
        "hill fire",
        "peat fire",
        "prairie fire",
        "vegetation fire",
        "veld fire",
    ],
    "coldwave": [
        "cold wave",
        "cold waves",
        "coldwave",
        "coldwaves",
        "cold snap",
        "cold spell",
        "Arctic Snap",
        "low temperature",
        "low temperatures",
        "extreme cold",
        "cold weather",
    ],
}
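To make the effect of the suggestion concrete, here is a small before/after sketch. The shortened lists and the `keyphrases_flood` variable are illustrative assumptions (only `keyphrases_wildfire` appears in the diff), and `iter_keyphrases` is a hypothetical helper name.

```python
# Before (assumed shape): one dictionary per category, each wrapping its list
# in a redundant "keyphrases" key.
keyphrases_wildfire = {"keyphrases": ["wildfire", "forest fire", "bushfire"]}
keyphrases_flood = {"keyphrases": ["flood", "floods", "inundation"]}

# After: a single dictionary mapping each category directly to its list.
keywords = {
    "wildfire": keyphrases_wildfire["keyphrases"],
    "flood": keyphrases_flood["keyphrases"],
}


def iter_keyphrases(keywords: dict[str, list[str]]):
    """Yield (category, keyphrase) pairs without any extra nesting level."""
    for category, phrases in keywords.items():
        for phrase in phrases:
            yield category, phrase


pairs = list(iter_keyphrases(keywords))
for category, phrase in pairs:
    print(f"{category}: {phrase}")
```

A flat mapping like this is also the shape that the script's `keywords.items()` call in the `pd.DataFrame(...)` line iterates over, so no extra unpacking step should be needed there.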

@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
Collaborator

The README is hard to read. Here is what it looks like when it's rendered:

[Screenshot of the rendered README]

I think it would be better to use this kind of bullet syntax for better spacing in markdown:

If you type:

- this
- or that

it becomes:

  • this
  • or that

@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
[] keywords_searching_wiki.py is the script for searching
[] wiki_keyword_mining_result_2024_02_29_updated_duplicates_cleaned.csv is the result we collected by 20240229
Collaborator

It would be good to give the collection date in a readable form (for example, 2024-02-29) rather than the compact timestamp 20240229. It would also be good to link the release for which this script was used.

We can look into a way to make it part of the release later since it should have been uploaded earlier, but at least we have now documented it.

@@ -0,0 +1,7 @@
***This is the process for original keywords searching, and suitable for EN Wiki articles. ***
[] keywords_searching_wiki.py is the script for searching
Collaborator

It would be nice to explain more here. Searching what? A short description of its functions would help.

Development

Successfully merging this pull request may close these issues.

Upload the web scraping process for keywords searching
2 participants