process full run database (to do branch) #173

Closed
6 tasks done
liniiiiii opened this issue Oct 17, 2024 · 5 comments · Fixed by #181, #183, #174, #184 or #179
Labels
impactDB v1 Relating to the first release of ImpactDB (frozen schema and guidelines) IN PROGRESS

Comments

@liniiiiii
Collaborator

liniiiiii commented Oct 17, 2024

With the full-run experiment finished, we have the raw parsed output. To make the database consistent and provide a version that users can use directly, we will process the database as follows:

  • Some events list country1 or country2 as an area name; these do not capture real events (resolved in 173 filter invalid area names #184)

  • The GID column is sometimes None but it should be [] (resolved in GID missing in L3 #179)

  • Only allow specific values of event types; anything outside that list should be removed before being placed into the database (resolved in 173 Validate categorical fields Main_Event and Hazards #181)

  • Currency conversion and inflation adjustment for L1-L3 (resolved in Currency Conversion and Inflation Adjustment #180 and Currency Convesion and Inflation Adjustment (db release) #191)

  • Captured "null" locations (edit: added 20/oct/2024); this may be a systematic error (resolved in 173 null dam error #183)

  • The data gap: make sure that for all fields, L1>=L2>=L3 (resolved in Data Gap  #174)

  • Time: the *_Year values in L1 should cover those in L2 and L3. E.g., if the *_Year values in L1 are 2020 and 2021, L2 and L3 cannot have *_Year values of 2019 or 2023

  • Location: the Admin_Areas in L1 should cover all Admin_Areas in L2, and the Admin_Areas in L2 should cover all Admin_Areas in L3

  • Impact values: the *_Min in L1 should be smaller than or equal to the sum of *_Min in L2, and the *_Max in L1 should be larger than or equal to the sum of *_Max in L2. The *_Min in L2 for one Admin_Area (ignoring records where several countries share one impact value) should be smaller than or equal to the sum of *_Min for the same Admin_Area in L3, and the *_Max in L2 for one Admin_Area (ignoring the same records) should be larger than or equal to the sum of *_Max for the same Admin_Area in L3.

@liniiiiii
Collaborator Author

The progress so far (17/10/2024)

  1. Collected an inflation adjustment table, same as EM-DAT
  2. Collected a EUR-USD conversion table
  3. Found two main events outside our design

@liniiiiii liniiiiii added impactDB v1 Relating to the first release of ImpactDB (frozen schema and guidelines) IN PROGRESS labels Oct 18, 2024
@liniiiiii liniiiiii mentioned this issue Oct 18, 2024
12 tasks
@i-be-snek
Collaborator

i-be-snek commented Oct 19, 2024

Here are my comments on the todo list.

(1)

[ ] some events have country1 or country2, they are not capturing real events

This could best be tackled at the stage where we insert data into the database (insert_events.py). I think those events are from articles that are not climate articles, and they would mostly have NULL throughout. We need to find a precise way to drop them. This deserves its own branch. You can assign this to me.
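To make the idea concrete, here is a minimal sketch of such a filter. It assumes the bad rows can be recognised by area names of the form `country<number>`; the regex, the function names, and the default column name `Administrative_Area_Norm` are illustrative assumptions, not what insert_events.py actually does.

```python
import re

# Assumption: invalid area names look like "country1", "country2", ...
PLACEHOLDER_AREA = re.compile(r"^country\d+$", re.IGNORECASE)


def is_placeholder_area(name):
    """Return True if the area name looks like 'country1' / 'country2'."""
    return isinstance(name, str) and bool(PLACEHOLDER_AREA.match(name.strip()))


def drop_placeholder_events(events, area_key="Administrative_Area_Norm"):
    """Keep only events whose area value contains no placeholder-like name.

    The area value may be a single string, a list of strings, or None.
    """
    kept = []
    for event in events:
        value = event.get(area_key)
        names = value if isinstance(value, (list, tuple)) else [value]
        if any(is_placeholder_area(name) for name in names):
            continue  # skip events that reference country1, country2, ...
        kept.append(event)
    return kept
```

Rows with a missing area are kept here on purpose; whether mostly-NULL rows should also be dropped is a separate decision.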

(2)

[ ] The GID column is sometimes None but it should be []

I can take this part since it's related to the data type. I need to trace where this problem starts, but I have a good idea about that. This deserves its own branch.
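Independently of where the None originates, the target behaviour can be sketched as a small normalisation step. This is only an illustration: the helper names are hypothetical, and wrapping a lone scalar in a list is my assumption rather than something the schema states.

```python
def normalize_gid(value):
    """Coerce a GID value to a list: None becomes [], lists are copied."""
    if value is None:
        return []
    if isinstance(value, (list, tuple)):
        return list(value)
    # Assumption: a single scalar GID should be wrapped in a list.
    return [value]


def normalize_rows(rows, gid_key="GID"):
    """Return copies of the rows with the GID column normalised."""
    return [{**row, gid_key: normalize_gid(row.get(gid_key))} for row in rows]
```

Rows where the GID key is absent altogether also end up with `[]`, which may or may not be desirable.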

(3)

[ ] Only allow specific values of event types, anything outside that list should be removed before being placed into the database

This should also be in insert_events.py. This deserves its own branch, or to be merged with number 1 on this list (this is basically called 'validation').
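A rough sketch of what this validation could look like is below. The allowed values are made-up examples only (the real list is defined by the frozen schema), and the function name is hypothetical.

```python
# Illustrative values only; the real allowed list comes from the schema.
ALLOWED_MAIN_EVENTS = frozenset({"Flood", "Wildfire", "Drought"})


def split_by_validity(rows, allowed=ALLOWED_MAIN_EVENTS, field="Main_Event"):
    """Split rows into (valid, rejected) based on a categorical field."""
    valid, rejected = [], []
    for row in rows:
        if row.get(field) in allowed:
            valid.append(row)
        else:
            rejected.append(row)  # keep rejected rows for inspection
    return valid, rejected
```

Returning the rejected rows instead of silently discarding them makes it easier to check whether the filter is too strict before anything reaches the database.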

(4)

[ ] currency conversion and inflation adjusted for L1-L3

This already has its own branch and has already been assigned to @i-be-snek though I want to re-iterate that it will take me some time because I have never worked with inflation and honestly I'm a bit allergic to anything related to the economy at large. I do not have an estimate.
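For orientation, the arithmetic usually involved is small: convert the nominal amount to USD with the exchange rate of the event year, then scale it by the ratio of a price index between a base year and the event year. The sketch below assumes that approach; the table shapes, the function name, and every number in the usage example are invented for illustration and are not taken from the collected tables.

```python
def to_adjusted_usd(amount, currency, year, usd_per_unit, cpi, base_year):
    """Convert a nominal amount to USD and express it in base-year prices.

    usd_per_unit maps (currency, year) to USD per one unit of that currency.
    cpi maps year to a price index value.
    """
    rate = 1.0 if currency == "USD" else usd_per_unit[(currency, year)]
    nominal_usd = amount * rate
    # Scale by the index ratio between the base year and the event year.
    return nominal_usd * cpi[base_year] / cpi[year]


# Made-up numbers, only to show the call shape.
example_rates = {("EUR", 2000): 2.0}
example_cpi = {2000: 50.0, 2020: 100.0}
print(to_adjusted_usd(100, "EUR", 2000, example_rates, example_cpi, 2020))
```

A missing rate or index year raises a KeyError here, which is probably preferable to quietly writing an unadjusted value.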

(5)

[ ] the data gap, make sure all fields, L1>=L2>=L3

  • Time: the *_Year values in L1 should cover those in L2 and L3. E.g., if the *_Year values in L1 are 2020 and 2021, L2 and L3 cannot have *_Year values of 2019 or 2023
  • Location: the Admin_Areas in L1 should cover all Admin_Areas in L2, and the Admin_Areas in L2 should cover all Admin_Areas in L3
  • Impact values: the *_Min in L1 should be smaller than or equal to the sum of *_Min in L2, and the *_Max in L1 should be larger than or equal to the sum of *_Max in L2. The *_Min in L2 for one Admin_Area (ignoring records where several countries share one impact value) should be smaller than or equal to the sum of *_Min for the same Admin_Area in L3, and the *_Max in L2 for one Admin_Area (ignoring the same records) should be larger than or equal to the sum of *_Max for the same Admin_Area in L3.

Already in its own branch #174 and already assigned to @i-be-snek.
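The three rules above can be written as small predicates, which may help when reviewing the branch. This is a sketch under my reading of the rules (the year rule is treated as a range check), with hypothetical function names; it is not the code in #174.

```python
def years_within(parent_years, child_years):
    """True if every child year lies inside the parent's year range."""
    if not child_years:
        return True
    if not parent_years:
        return False
    low, high = min(parent_years), max(parent_years)
    return all(low <= year <= high for year in child_years)


def areas_covered(parent_areas, child_areas):
    """True if every child Admin_Area also appears in the parent level."""
    return set(child_areas) <= set(parent_areas)


def bounds_consistent(parent_min, parent_max, child_mins, child_maxes):
    """Check parent *_Min <= sum(child *_Min) and parent *_Max >= sum(child *_Max)."""
    if not child_mins and not child_maxes:
        return True  # nothing to compare against at the lower level
    return parent_min <= sum(child_mins) and parent_max >= sum(child_maxes)
```

For the L2 to L3 comparison, the same `bounds_consistent` call would be made once per Admin_Area, after excluding L2 records where several countries share one impact value.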

@liniiiiii are you available to review these as they drop?

@liniiiiii
Collaborator Author

liniiiiii commented Oct 19, 2024 via email

@liniiiiii
Collaborator Author

@i-be-snek, in the first task, is it possible to filter this outlier as shown in the screenshot? I suppose it's from the "NULL" response from the model. I found a record where the Administrative_Area_Norm is Bahamas, and the Location_Norm is t-c-null-dam. I manually checked this dam, which is in the US https://mapcarta.com/20531080. And in the full-run database, I saw 4175 t-c-null-dam records, which is not realistic. Thanks!
image
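One possible way to surface and filter such records is sketched below. It assumes the raw model output is kept in a column I am calling `Location` (a hypothetical name) next to `Location_Norm`, that both hold plain strings, and that the literal string "null" is what ends up normalised to t-c-null-dam; none of this is confirmed by the pipeline.

```python
from collections import Counter

# Assumption: these raw strings are model "null" answers, not place names.
NULL_STRINGS = {"null", "none", "nan"}
# Normalised values already known to be artefacts.
SUSPECT_NORMS = {"t-c-null-dam"}


def is_null_string(value):
    """True for strings such as 'NULL' or ' none '; a real None is not matched."""
    return isinstance(value, str) and value.strip().lower() in NULL_STRINGS


def drop_null_locations(records, raw_key="Location", norm_key="Location_Norm"):
    """Drop records whose raw location is a null string or whose norm is suspect."""
    kept = []
    for record in records:
        if is_null_string(record.get(raw_key)):
            continue
        if record.get(norm_key) in SUSPECT_NORMS:
            continue
        kept.append(record)
    return kept


def most_common_locations(records, norm_key="Location_Norm", top=10):
    """Count normalised locations to spot implausibly frequent values."""
    return Counter(record.get(norm_key) for record in records).most_common(top)
```

Running `most_common_locations` on the full-run database first would show whether other normalised values are suspiciously frequent in the same way.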

@i-be-snek
Collaborator

@liniiiiii

I'll be working on this today.

@i-be-snek i-be-snek linked a pull request Oct 25, 2024 that will close this issue
@i-be-snek i-be-snek reopened this Oct 29, 2024
@i-be-snek i-be-snek linked a pull request Oct 29, 2024 that will close this issue
@i-be-snek i-be-snek reopened this Oct 31, 2024
@i-be-snek i-be-snek pinned this issue Oct 31, 2024
@liniiiiii liniiiiii unpinned this issue Nov 4, 2024
@liniiiiii liniiiiii pinned this issue Nov 4, 2024
@i-be-snek i-be-snek reopened this Nov 25, 2024
@liniiiiii liniiiiii unpinned this issue Jan 24, 2025