[![Run Unit Tests via Pytest](https://github.com/VUB-HYDR/Wikimpacts/actions/workflows/run_tests.yml/badge.svg)](https://github.com/VUB-HYDR/Wikimpacts/actions/workflows/run_tests.yml)

# Wikimpacts

Wikimpacts is the first version of a climate impact dataset created with generative AI (GPT-4).

## Dependencies

Prerequisite:

- Install [`poetry`](https://python-poetry.org/docs/#installation)

Then activate a virtual environment and install the dependencies with `poetry install`.
```shell
pre-commit installed at .git/hooks/pre-commit

# set up git large file storage
git lfs install
```
## Development

Always pull a fresh copy of the `main` branch first! To add a new feature, check out a new branch from `main`, make your changes there, and push the new branch upstream to open a PR. PRs should result in a **squash commit** in the `main` branch. **Code responsibly and ask someone to review your code. You can always tag [i-be-snek](https://github.com/i-be-snek) as a reviewer.**

Always _**rebase**_ your branch on the latest changes in `main` (`git rebase main`) instead of merging. If you have trouble resolving merge conflicts when rebasing, consult [i-be-snek](https://github.com/i-be-snek).

And don't forget to pull large files from Git Large File Storage!
```shell
# always pull first
git pull origin main

# fetch and merge LFS files for your current branch
git lfs pull

# fetch and merge LFS files for ALL branches
git lfs pull --all
```
> [!TIP]
> Consult this [StackOverflow answer on how to use `git lfs`](https://stackoverflow.com/a/72610495/14123992)

Make sure any new dependencies are handled by `poetry`. You should track and push both the `poetry.lock` and `pyproject.toml` files. There is no need to add dependencies to the `pyproject.toml` file manually. Instead, use `poetry` commands:

```shell
# add pandas as a main dependency
poetry add pandas -G main

# add a specific version of ipykernel as a dev dependency
poetry add [email protected] -G dev
```
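For reference, after commands like these, `pyproject.toml` should contain entries along the following lines (version numbers are illustrative; the exact constraints depend on what `poetry` resolves):

```toml
[tool.poetry.dependencies]
pandas = "^2.2"

[tool.poetry.group.dev.dependencies]
ipykernel = "6.29.5"
```

`poetry` records the resolved, pinned versions in `poetry.lock`, which is why both files must be committed together.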
## Pipeline Docs

The process below describes crucial parts of our pipeline.

### Running prompts and experiments on OpenAI models

#### Run prompt experiments on OpenAI models

If you use OpenAI models, you can save costs by running experiments in batches. We developed a series of prompt versions for our database.
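If it helps to see the mechanics: OpenAI's Batch API consumes a JSONL file in which every line is a self-contained request. A sketch of preparing such a file is below (the request fields follow OpenAI's public Batch API documentation; the model name, event names, and prompt text are placeholders, not the project's actual prompts):

```python
import json

# Placeholder event names; in the real pipeline these come from the raw file.
events = ["1999 Sydney hailstorm", "2003 European heat wave"]

requests = [
    {
        "custom_id": f"event-{i}",   # used later to match outputs back to inputs
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": "gpt-4o",       # placeholder model name
            "messages": [
                {"role": "user", "content": f"Extract impacts for: {name}"}
            ],
        },
    }
    for i, name in enumerate(events)
]

# One JSON object per line, as the Batch API expects.
batch_jsonl = "\n".join(json.dumps(r) for r in requests)
```

Submitting the file and retrieving results is handled by the project's scripts below; the point here is only the one-request-per-line structure that makes batched (discounted) processing possible.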
Before you run our pipeline, please choose a version of prompts to proceed with, for example:

```python
from Database.Prompts.prompts import V_3 as target_prompts
```

##### Step 1: Experiment Settings
1. **Raw File Selection**: Choose a clear and descriptive filename for the raw file you want to process, such as `wiki_dev_whole_infobox_20240729_70single_events`.
```shell
poetry run python3 Database/Prompts/batch_output_retrivel.py \
  --description all_categories_V3
```
### Parsing and evaluation pipeline

If you have generated some LLM output and would like to test it against the dev and test gold sets, here is a list of commands to enable you to experiment with this yourself.

Choose a new experiment name! You will use this <EXPERIMENT_NAME> for the whole pipeline.

#### (Step 2-a) Optional pre-step
If the system output is split across several files (such as Mixtral and Mistral system outputs), first merge it:

- Normalizing JSON output for Mistral/Mixtral

GPT-4o sometimes produces inconsistent JSON where it nests keys like "Location" under "Location_Information" and the start and end dates under "Time_Information". In this case, you need to unnest these using the script below:
```shell
poetry run python3 Database/fix_json_inconsistencies.py \
  -i "Database/raw/<EXPERIMENT_NAME>/<INPUT_FILE.JSON>"
```

> Your raw system output files should always land in the `Database/raw/<EXPERIMENT_NAME>` directory!
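To make the unnesting concrete, here is a minimal sketch of the transformation described above. This is an illustration only, not the repository's `fix_json_inconsistencies.py`; the key names follow the description in the text:

```python
# Lift "Location" out of "Location_Information" and the start/end dates
# out of "Time_Information", so all keys sit at the top level.
def unnest_event(event: dict) -> dict:
    fixed = dict(event)
    loc = fixed.pop("Location_Information", None)
    if isinstance(loc, dict):
        fixed.update(loc)            # exposes e.g. "Location" at the top level
    time_info = fixed.pop("Time_Information", None)
    if isinstance(time_info, dict):
        fixed.update(time_info)      # exposes e.g. "Start_Date", "End_Date"
    return fixed

nested = {
    "Event_Name": "Storm X",
    "Location_Information": {"Location": "Belgium"},
    "Time_Information": {"Start_Date": "2021-07-13", "End_Date": "2021-07-16"},
}
flat = unnest_event(nested)
```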
> [!TIP]
> JSON files can be formatted easily with pre-commit:
>
> ```shell
> pre-commit run --files Database/raw/<EXPERIMENT_NAME>/<JSON_FILE_THAT_NEEDS_FORMATTING>
> ```
```shell
poetry run python3 Database/parse_events.py --help
```
You can also parse L1 first and store a raw file. Later on, you can parse L2 or L3 from the same file without having to re-run all of the L1 parsing. This allows splitting jobs. You can try the example below on dummy data. It's best to delete the contents of [Database/output/dummy](Database/output/dummy) first to re-run the example!

```shell
poetry run python3 Database/parse_events.py \
  ...
```
> [!WARNING]
> Normalizing countries is slow the first time because we are (currently!) using a free API. However, geopy caches the results each time this script runs locally, so later runs on your local branch are faster. Allow 15-20 minutes the first time.

#### (Step 2) Evaluate against the dev and test sets

##### (A) Choose your config and columns

The Python dictionary in [`weights.py`](Evaluation/weights.py) contains different weight configs. For example, the `nlp4climate` experiment weighs all column types equally but excludes "Event_Name" from evaluation. This config also restricts evaluation to a smaller set of columns, so the list doubles as the set of columns included in the evaluation script for this experiment.
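For orientation, a weights config in this style is just a mapping from column name to weight. A hypothetical entry might look like this (column names and values invented for illustration; see `Evaluation/weights.py` for the real configs):

```python
# Hypothetical weights config: equal weights for every evaluated column,
# with "Event_Name" simply absent so it is excluded from evaluation.
nlp4climate_example = {
    "Main_Event": 1.0,
    "Location": 1.0,
    "Start_Date": 1.0,
    "End_Date": 1.0,
    "Total_Deaths": 1.0,
}

# The keys double as the set of columns included in the evaluation run.
evaluated_columns = set(nlp4climate_example)
```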
##### (B) Evaluate L1 (Total Summary) events

When your config is ready, run the evaluation script:

```shell
poetry run python3 Evaluation/evaluator.py \
  ...
  --weights_config specific_instance \
  --impact_type deaths
```
If run properly, you should see the results in `Database/evaluation_results/specific_instance_eval_test`:

```shell
Database/evaluation_results/specific_instance_eval_test
...
```

> [!WARNING]
> Do not commit these files to your branch or to `main`, big thanks!
### Evaluating L1, L2, and L3 in a single run

To evaluate all event levels and categories, **make sure your gold files and sys output files are in the correct structure as described in [Evaluation/evaluate_full_run.sh](Evaluation/evaluate_full_run.sh)**, then run this script:

```shell
poetry run python3 Database/fill_data_gap.py -i Database/output/full_run_25_dedu
```
### Database-related

- To generate the database according to [`Database/schema.sql`](Database/schema.sql):

```shell
...
```

#### SPECIAL USE CASE: Converting the manual annotation table from a flat format to Events and Specific Impacts

1. Download the latest copy of the Excel sheet. _The Excel sheet must have the column names described in `Database/gold/ImpactDB_DataTable_Validation.xlsx`, sheet `ImpactDB_v2_gold_template`._
2. Choose the correct Excel sheet and run the script:
```shell
poetry run python3 Database/gold_from_excel.py \
  ...
```

> Please don't track or push Excel sheets into the repository.
> The file `Database/gold/ImpactDB_DataTable_Validation.xlsx` has the latest gold annotations from 01/06/2024 and will be updated in the future.
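Conceptually, that conversion groups the flat annotation rows by event and splits each group into one event record plus its specific impact rows. A toy sketch with invented column names (the real column set is defined by the gold template and `Database/gold_from_excel.py`):

```python
from collections import defaultdict

# Toy flat table: one row per (event, impact) pair; column names are made up.
flat_rows = [
    {"Event_ID": "e1", "Event_Name": "Flood A", "Impact_Type": "deaths", "Num": 3},
    {"Event_ID": "e1", "Event_Name": "Flood A", "Impact_Type": "injuries", "Num": 10},
    {"Event_ID": "e2", "Event_Name": "Storm B", "Impact_Type": "deaths", "Num": 1},
]

events = {}                      # one record per event
impacts = defaultdict(list)      # specific impact rows, grouped by event
for row in flat_rows:
    eid = row["Event_ID"]
    events[eid] = {"Event_ID": eid, "Event_Name": row["Event_Name"]}
    impacts[eid].append({"Impact_Type": row["Impact_Type"], "Num": row["Num"]})
```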
### Problems?

Start an Issue on GitHub if you find a bug in the code or have suggestions for a feature you need. If you run into an error or problem, please include the error trace or logs! :D

> [!TIP]
> Consult this [GitHub Cheat Sheet](https://education.github.com/git-cheat-sheet-education.pdf)
### Sources & Citations

- GADM world data | `Database/data/gadm_world.csv`
  <https://gadm.org/license.html>
- Regions by UNSD | `Database/data/UNSD — Methodology.csv`
  United Nations Standard Country Code, Series M: Miscellaneous Statistical Papers, No. 49, New York: United Nations. ST/ESA/STAT/SER.M/49
  <https://unstats.un.org/unsd/classifications/Family/Detail/12>