
Add the script and original results of keyword searching in EN Wiki #98

Open

wants to merge 3 commits into base: main

Changes from 2 commits
7 changes: 7 additions & 0 deletions Keywords_searching/README.md
@@ -0,0 +1,7 @@
***This is the process for the original keyword searching, suitable for EN Wiki articles.***
Collaborator

The README is hard to read. Here is what it looks like when it's rendered:

[image: screenshot of the rendered README]

I think it would be better to use this kind of bullet syntax for better spacing in markdown:

if you type:

- this
- or that

it becomes ...

  • this
  • or that

[] keywords_searching_wiki.py is the script for searching
Collaborator

It would be nice to explain more here. Searching what? A short description of its functions would help.

[] wiki_keyword_mining_result_2024_02_29_updated_duplicates_cleaned.csv is the result we collected by 20240229
Collaborator

It would be good to mention the date and not a timestamp. Also it would be good to link the release for which this script was used.

We can look into a way to make it part of the release later since it should have been uploaded earlier, but at least we have now documented it.

[] reference command to run the script
```shell
poetry run python3 Keywords_searching/keywords_searching_wiki.py --filename wiki_keyword_mining_result_2024_02_29_updated_duplicates_cleaned.csv --output_dir Keywords_searching
```
242 changes: 242 additions & 0 deletions Keywords_searching/keywords_searching_wiki.py
@@ -0,0 +1,242 @@
import argparse
import pathlib

import pandas as pd
import requests
from bs4 import BeautifulSoup
Collaborator

You forgot to install bs4 via poetry. Please do that and commit the .toml and .lock files.
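A sketch of the commands that comment describes, assuming the repo already manages dependencies with Poetry (the commit message is illustrative):

```shell
# beautifulsoup4 is the PyPI package that provides the bs4 module;
# `poetry add` installs it and updates pyproject.toml and poetry.lock
poetry add beautifulsoup4

# commit both dependency files so CI and other contributors pick it up
git add pyproject.toml poetry.lock
git commit -m "Add beautifulsoup4 dependency"
```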


from Database.scr.normalize_utils import Logging

if __name__ == "__main__":
    logger = Logging.get_logger("keywords_searching")
Collaborator

I think keyword_search might be a more commonly used phrase to refer to this kind of filtered keyword search. I recommend you change it to that, including the filename.


    parser = argparse.ArgumentParser()
    parser.add_argument(
        "-f",
        "--filename",
        dest="filename",
        help="The name of the output file",
        type=str,
    )
    parser.add_argument(
        "-o",
        "--output_dir",
        dest="output_dir",
        help="The directory where the output will land (as .csv)",
        type=str,
    )
    args = parser.parse_args()
    logger.info(f"Passed args: {args}")

    logger.info(f"Creating {args.output_dir} if it does not exist!")
    pathlib.Path(args.output_dir).mkdir(parents=True, exist_ok=True)
    # Define the keyphrases
    keyphrases_drought = {
        "keyphrases": [
            "drought",
            "droughts",
            "dryness",
            "dry spell",
            "dry spells",
            "rain scarcity",
            "rain scarcities",
            "rainfall deficit",
            "rainfall deficits",
            "water stress",
            "water shortage",
            "water shortages",
            "water insecurity",
            "water insecurities",
            "limited water availability",
            "limited water availabilities",
            "scarce water resources",
            "groundwater depletion",
            "groundwater depletions",
            "reservoir depletion",
            "reservoir depletions",
        ]
    }

    keyphrases_storm = {
        "keyphrases": [
            "windstorm",
            "windstorms",
            "storm",
            "storms",
            "cyclone",
            "cyclones",
            "typhoon",
            "typhoons",
            "hurricane",
            "hurricanes",
            "blizzard",
            "strong winds",
            "low pressure",
            "gale",
            "gales",
            "wind gust",
            "wind gusts",
            "tornado",
            "tornadoes",
            "wind",
            "winds",
            "lightning",
            "lightnings",
            "thunderstorm",
            "thunderstorms",
            "hail",
            "hails",
        ]
    }

    keyphrases_rainfall = {
        "keyphrases": [
            "extreme rain",
            "extreme rains",
            "heavy rain",
            "heavy rains",
            "hard rain",
            "hard rains",
            "torrential rain",
            "torrential rains",
            "extreme precipitation",
            "extreme precipitations",
            "heavy precipitation",
            "heavy precipitations",
            "torrential precipitation",
            "torrential precipitations",
            "cloudburst",
            "cloudbursts",
        ]
    }

    keyphrases_heatwave = {
        "keyphrases": [
            "heatwave",
            "heatwaves",
            "heat wave",
            "heat waves",
            "extreme heat",
            "hot weather",
            "high temperature",
            "high temperatures",
        ]
    }

    keyphrases_flood = {
        "keyphrases": [
            "floodwater",
            "floodwaters",
            "flood",
            "floods",
            "inundation",
            "inundations",
            "storm surge",
            "storm surges",
            "storm tide",
            "storm tides",
        ]
    }

    keyphrases_wildfire = {
        "keyphrases": [
Collaborator

The "keyphrases" key here is redundant because each keyphrase category really only represents one list anyway. To save lines and make the code more readable, I suggest:

Suggested change
"keyphrases": [
    # Aggregate the keyphrases into a dictionary
    keywords = {
        "drought": [
            "drought",
            "droughts",
            "dryness",
            "dry spell",
            "dry spells",
            "rain scarcity",
            "rain scarcities",
            "rainfall deficit",
            "rainfall deficits",
            "water stress",
            "water shortage",
            "water shortages",
            "water insecurity",
            "water insecurities",
            "limited water availability",
            "limited water availabilities",
            "scarce water resources",
            "groundwater depletion",
            "groundwater depletions",
            "reservoir depletion",
            "reservoir depletions",
        ],
        "windstorm": [
            "windstorm",
            "windstorms",
            "storm",
            "storms",
            "cyclone",
            "cyclones",
            "typhoon",
            "typhoons",
            "hurricane",
            "hurricanes",
            "blizzard",
            "strong winds",
            "low pressure",
            "gale",
            "gales",
            "wind gust",
            "wind gusts",
            "tornado",
            "tornadoes",
            "wind",
            "winds",
            "lightning",
            "lightnings",
            "thunderstorm",
            "thunderstorms",
            "hail",
            "hails",
        ],
        "rainfall": [
            "extreme rain",
            "extreme rains",
            "heavy rain",
            "heavy rains",
            "hard rain",
            "hard rains",
            "torrential rain",
            "torrential rains",
            "extreme precipitation",
            "extreme precipitations",
            "heavy precipitation",
            "heavy precipitations",
            "torrential precipitation",
            "torrential precipitations",
            "cloudburst",
            "cloudbursts",
        ],
        "heatwave": [
            "heatwave",
            "heatwaves",
            "heat wave",
            "heat waves",
            "extreme heat",
            "hot weather",
            "high temperature",
            "high temperatures",
        ],
        "flood": [
            "floodwater",
            "floodwaters",
            "flood",
            "floods",
            "inundation",
            "inundations",
            "storm surge",
            "storm surges",
            "storm tide",
            "storm tides",
        ],
        "wildfire": [
            "wildfire",
            "forest fire",
            "bushfire",
            "wildland fire",
            "rural fire",
            "desert fire",
            "grass fire",
            "hill fire",
            "peat fire",
            "prairie fire",
            "vegetation fire",
            "veld fire",
        ],
        "coldwave": [
            "cold wave",
            "cold waves",
            "coldwave",
            "coldwaves",
            "cold snap",
            "cold spell",
            "Arctic Snap",
            "low temperature",
            "low temperatures",
            "extreme cold",
            "cold weather",
        ],
    }

            "wildfire",
            "forest fire",
            "bushfire",
            "wildland fire",
            "rural fire",
            "desert fire",
            "grass fire",
            "hill fire",
            "peat fire",
            "prairie fire",
            "vegetation fire",
            "veld fire",
        ]
    }

    keyphrases_coldwave = {
        "keyphrases": [
            "cold wave",
            "cold waves",
            "coldwave",
            "coldwaves",
            "cold snap",
            "cold spell",
            "Arctic Snap",
            "low temperature",
            "low temperatures",
            "extreme cold",
            "cold weather",
        ]
    }

    # Aggregate the keyphrases into a dictionary
    keywords = {
        "flood": keyphrases_flood["keyphrases"],
        "wildfire": keyphrases_wildfire["keyphrases"],
        "storm": keyphrases_storm["keyphrases"],
        "drought": keyphrases_drought["keyphrases"],
        "heatwave": keyphrases_heatwave["keyphrases"],
        "rainfall": keyphrases_rainfall["keyphrases"],
        "coldwave": keyphrases_coldwave["keyphrases"],
    }

    # Convert the dictionary to a DataFrame for better visualization and export
    df_keywords = pd.DataFrame(dict([(k, pd.Series(v)) for k, v in keywords.items()]))
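An aside on the line above: building the frame from `pd.Series` objects pads the unequal-length keyword lists with NaN so they can share one DataFrame. A dependency-free sketch of the same padding idea (`pad_columns` is a hypothetical helper, not part of this PR):

```python
def pad_columns(keywords: dict[str, list]) -> dict[str, list]:
    """Pad unequal-length lists with None so every key maps to an
    equal-length column (what pd.DataFrame over pd.Series does with NaN)."""
    width = max(len(v) for v in keywords.values())
    return {k: v + [None] * (width - len(v)) for k, v in keywords.items()}
```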

    keywords_urls = {}
    for column in df_keywords.columns:
Collaborator

Could be useful to add more logger.info() statements here to let the user know what is happening. It took the script 15 minutes to run, so it would help to add more logs so that if something goes wrong, one could know where exactly.
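A sketch of what that could look like, using the stdlib `logging` module as a stand-in for the project's `Logging` helper (`collect_urls` and `fetch_urls` are hypothetical names; `fetch_urls` stands in for the requests/BeautifulSoup lookup in the script):

```python
import logging

logger = logging.getLogger("keyword_search")

def collect_urls(keywords: dict[str, list], fetch_urls) -> dict[str, list]:
    """Fetch result URLs per unique keyword, logging progress per category.

    keywords: mapping of category -> list of keyphrase strings
    fetch_urls: callable returning a list of result URLs for one keyword
    """
    keywords_urls = {}
    for category, phrases in keywords.items():
        logger.info("Searching %d keyphrases for category %r", len(phrases), category)
        for keyword in phrases:
            # skip padding entries and keywords already searched
            if keyword is not None and keyword not in keywords_urls:
                keywords_urls[keyword] = fetch_urls(keyword)
                logger.info("  %r -> %d result(s)", keyword, len(keywords_urls[keyword]))
    return keywords_urls
```

With a log line per keyword, a 15-minute run at least shows where it is stuck if something goes wrong.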

        for keyword in df_keywords[column]:
            if keyword is not None and keyword not in keywords_urls:
                # Initialize an empty list for each unique keyword
                keywords_urls[keyword] = []
                # Create the search URL
                # html = f"https://en.wikipedia.org/w/index.php?title=Special:Search&limit=5000&offset=0&ns0=1&search=intitle%3A{keyword}&advancedSearch-current={%22fields%22:{%22intitle%22:%22{keyword}%22}}"
Collaborator

Is this commented out bit needed?
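If the literal URL below stays, one hedged alternative is to let the standard library do the escaping instead of hand-written %22 sequences. `build_search_url` is a hypothetical helper, not part of this PR, and assumes the same Special:Search query parameters as the line that follows:

```python
import json
from urllib.parse import urlencode

def build_search_url(keyword: str, limit: int = 5000) -> str:
    """Build the Wikipedia Special:Search URL for an intitle keyword search."""
    params = {
        "title": "Special:Search",
        "limit": limit,
        "offset": 0,
        "ns0": 1,
        "search": f"intitle:{keyword}",
        # urlencode percent-escapes the JSON, so no manual %22 needed
        "advancedSearch-current": json.dumps({"fields": {"intitle": keyword}}),
    }
    return "https://en.wikipedia.org/w/index.php?" + urlencode(params)
```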

                html = f"https://en.wikipedia.org/w/index.php?title=Special:Search&limit=5000&offset=0&ns0=1&search=intitle%3A{keyword}&advancedSearch-current={{%22fields%22:{{%22intitle%22:%22{keyword}%22}}}}"

                # Make the GET request to Wikipedia
                resp = requests.get(html)
                resp.encoding = "utf-8"

                # Parse the response content with BeautifulSoup
                bs = BeautifulSoup(resp.text, "html.parser")

                # Find all the search result headings
                for news in bs.select("div.mw-search-result-heading"):
                    # Construct the URL for the Wikipedia page from the search result
                    url = "https://en.wikipedia.org" + news.select("a")[0]["href"]

                    # Append the URL to the list in the dictionary for the keyword
                    keywords_urls[keyword].append(url)

    # Create a list of dictionaries
    keyword_list = []
    for keyword, urls in keywords_urls.items():
        for url in urls:
            keyword_list.append({"keyword": keyword, "url": url})

    # Find the maximum length of the lists in the dictionary
    max_length = max(len(lst) for lst in keywords_urls.values())

    # Pad each list in the dictionary to have the same length
    for keyword, urls in keywords_urls.items():
        if len(urls) < max_length:
            keywords_urls[keyword].extend([None] * (max_length - len(urls)))

    # Create a DataFrame for it
    keywords_urls_df = pd.DataFrame(keywords_urls)
    # Initialize an empty DataFrame with columns 'Keyword' and 'URL'
    consolidated_urls_df = pd.DataFrame(columns=["Keyword", "URL"])

    for keyword in keywords_urls_df.columns:
        # Extract the column as a Series, dropping NA values which represent empty cells
        urls = keywords_urls_df[keyword].dropna()
        # Create a temporary DataFrame for the current keyword
        temp_df = pd.DataFrame({"Keyword": keyword, "URL": urls})
        # Append the temporary DataFrame to the consolidated DataFrame
        consolidated_urls_df = pd.concat([consolidated_urls_df, temp_df], ignore_index=True)
    # Remove duplicate URLs, keeping the first occurrence of each URL
    consolidated_urls_df_unique = consolidated_urls_df.drop_duplicates(subset="URL", keep="first")
    logger = Logging.get_logger("save the result")
    consolidated_urls_df_unique.to_csv(f"{args.output_dir}/{args.filename}")
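The pad/melt/`drop_duplicates` steps at the end boil down to "keep the first (keyword, url) pair per URL". A minimal dependency-free sketch of that logic (`consolidate` is a hypothetical name, not part of this PR):

```python
def consolidate(keywords_urls: dict[str, list]) -> list[tuple[str, str]]:
    """Flatten {keyword: [urls]} into (keyword, url) rows, keeping only the
    first keyword that produced each URL; mirrors
    drop_duplicates(subset="URL", keep="first") on the padded DataFrame."""
    seen = set()
    rows = []
    for keyword, urls in keywords_urls.items():
        for url in urls:
            # skip padding entries and URLs already claimed by an earlier keyword
            if url is not None and url not in seen:
                seen.add(url)
                rows.append((keyword, url))
    return rows
```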
Git LFS file not shown