
Before putting data into the impact_DB_V1.db file, consistency checking of the data is mandatory #101

Closed
liniiiiii opened this issue Sep 5, 2024 · 18 comments · Fixed by #174
Labels: impactDB v1 (Relating to the first release of ImpactDB, frozen schema and guidelines), IN PROGRESS

Comments

liniiiiii (Collaborator) commented Sep 5, 2024

This is an issue related to the database: since we divide the database into 3 levels and the models are required not to sum up numbers for us, we can make an assumption to fill the data gap, namely to sum up the information from L2 and L3 and enter it as a range. The rules need double confirmation from the team:

  • script
  • rules
liniiiiii self-assigned this Sep 5, 2024
liniiiiii (Collaborator, Author) commented Sep 5, 2024

Proposed rules
@i-be-snek, this is up to date /Ni 2024/11/03
The top rule: L1 >= L2 >= L3 (a code sketch of rules 2b/2c follows the list below)

  1. Admin_Areas: if the union of the L2/L3 areas is larger than the set we get in L1, the extra areas need to be filled in at L1 automatically

  2. Impact categories: if L1 is NULL but impact information is found in L2 or L3, there are several cases:
    a. L2 is not NULL and L3 is NULL: sum up the numbers from L2 and fill them in at L1
    b. L2 is NULL and L3 is not NULL:

    • (i) for the same admin_area with impacts in different locations, sum up the numbers and add a new record in L2
    • (ii) next, sum up the numbers from L2 to fill in L1

    c. L2 and L3 are not NULL:

    • (i) for the same admin_area with impacts in different locations, sum up the numbers in L3 and compare with the number in L2; if the L3 sum is smaller than the L2 number, revise the "Num_Min" to this number, and if it is larger, change the "Num_Max" in L2 (ideally this should not happen, because we require the model to give the total in L2)
    • (ii) if an admin_area in L3 is not in L2, sum up the numbers and create a new record in L2
    • (iii) next, sum up all the numbers from L2 and fill in L1 with the range
  3. Time: normally the time information in L2 and L3 is missing; for any case without a year, fill in the year in L2 and L3 using the same year as in L1
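
A minimal sketch of how rules 2b/2c above could be automated. The record layout (plain dicts with a single area plus num_min/num_max) and the function names are illustrative assumptions, not the actual ImpactDB schema, and the fill-upwards behaviour follows the Germany/Austria example further down this thread:

```python
# Illustrative sketch of rules 2b/2c; not the actual ImpactDB schema.
# Assumes each record is a dict with a single "area" plus "num_min"/"num_max".
from collections import defaultdict


def sum_per_area(records):
    """Sum Num_Min/Num_Max per admin area (rules 2b-i and 2c-i)."""
    totals = defaultdict(lambda: {"num_min": 0, "num_max": 0})
    for rec in records:
        totals[rec["area"]]["num_min"] += rec["num_min"]
        totals[rec["area"]]["num_max"] += rec["num_max"]
    return totals


def fill_l2_from_l3(l2, l3):
    """Revise or create L2 records from the L3 sums (rules 2c-i and 2c-ii)."""
    l2_by_area = {rec["area"]: rec for rec in l2}
    for area, total in sum_per_area(l3).items():
        rec = l2_by_area.get(area)
        if rec is None:
            l2.append({"area": area, **total})  # 2c-ii: new L2 record
        elif total["num_max"] > rec["num_max"]:
            # L2 looks incomplete: fill upwards with the L3 sums (2c-i)
            rec["num_min"], rec["num_max"] = total["num_min"], total["num_max"]
        # otherwise the L3 sum sits inside the L2 range, so L2 is kept as it is
    return l2


def fill_l1_from_l2(l2):
    """Sum all L2 records to get the L1 range (rules 2a and 2c-iii)."""
    return sum(r["num_min"] for r in l2), sum(r["num_max"] for r in l2)
```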

i-be-snek added the impactDB v1 label Sep 5, 2024
i-be-snek (Collaborator) commented:
Some comments:

  1. what will we be doing during the "manual" check?
    I agree with adding any missing areas to the L1 list "Administrative_Areas"

  2. I'll get back to this later...

  3. sounds good, but did we previously not say that the minimum requirement is to have a start year? But in the case you describe, L2/L3 may not have a start date but only some impact information. Am I understanding this right?

koffiworou (Collaborator) commented:
  1. I agree.
  2. I agree. We must make sure we do not count numbers twice when filling in the L1 information created from L2 and L3.
  3. I agree as well. By default, L2 and L3 time information is inferred from L1, unless otherwise specified. @i-be-snek, the minimum requirement (starting year) must be respected by L1. We must update our statement in the next version of the guidelines.

liniiiiii (Collaborator, Author) commented:
>   1. I agree.
>   2. I agree. We must make sure we do not count numbers twice when filling in the L1 information created from L2 and L3.
>   3. I agree as well. By default, L2 and L3 time information is inferred from L1, unless otherwise specified. @i-be-snek, the minimum requirement (starting year) must be respected by L1. We must update our statement in the next version of the guidelines.

Thanks @koffiworou. I think the start year is defined in the guideline we have now, if I'm not mistaken. For the 2nd point, what do you mean by counting twice?

liniiiiii (Collaborator, Author) commented:
> Some comments:
>
>   1. what will we be doing during the "manual" check?
>     I agree with adding any missing areas to the L1 list "Administrative_Areas"
>   2. I'll get back to this later...
>   3. sounds good, but did we previously not say that the minimum requirement is to have a start year? But in the case you describe, L2/L3 may not have a start date but only some impact information. Am I understanding this right?

@i-be-snek thanks. 1st point: the manual check is just to see where the error comes from and for the updated version; it has nothing to do with the database correction. 3rd point: yes, I mean we try to provide time information at L2/L3 when the time is not available. Since it is not mandatory, we could choose to leave it as it is, but we can at least fill in the year; that's the initial idea.

i-be-snek (Collaborator) commented:
Here is the list written by @jnivre during our meeting yesterday (a rough decision sketch follows the list):

  1. Make sure that all countries mentioned at L2 or L3 are also listed on L1.
  2. If |countries at L2| = |countries at L3|, adjust range at L2 upwards or downwards depending on sum from L3.
  3. If |countries at L2| > |countries at L3|, only adjust range upwards at L2 (if sum from L3 is larger).
  4. If |countries at L1| = |countries at L2|, adjust range at L1 upwards or downwards depending on sum from L2.
  5. If |countries at L1| > |countries at L2|, only adjust range upwards at L1 (if sum from L2 is larger).
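
For reference, a rough decision sketch of rules 2-5, under one reading of "adjust upwards/downwards"; the parent/child dicts (a countries set plus min/max totals) are illustrative, not the real schema:

```python
# Hypothetical sketch of rules 2-5: the child level is L3 when the parent is L2,
# or L2 when the parent is L1. Inputs are illustrative dicts, not the real schema:
# {"countries": set, "min": int, "max": int}, where child_sum already holds the
# summed range over the child level.
def adjust_parent_range(parent: dict, child_sum: dict) -> dict:
    if parent["countries"] == child_sum["countries"]:
        # same country coverage: move the parent range up or down to the child's sum
        parent["min"], parent["max"] = child_sum["min"], child_sum["max"]
    elif parent["countries"] > child_sum["countries"]:
        # parent covers more countries: only adjust upwards if the child sum is larger
        parent["min"] = max(parent["min"], child_sum["min"])
        parent["max"] = max(parent["max"], child_sum["max"])
    return parent
```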

i-be-snek (Collaborator) commented:
I'll be picking this task up

@liniiiiii just to confirm, is this a step that we should do before evaluation? or only for inserting data into the database?

liniiiiii (Collaborator, Author) commented:
> I'll be picking this task up
>
> @liniiiiii just to confirm, is this a step that we should do before evaluation? or only for inserting data into the database?

@i-be-snek, we need to do it after evaluation, only for inserting data into the database.

i-be-snek (Collaborator) commented:
@liniiiiii Okay, that's good.
So for now, we have the full eval pipeline in place and this will be a later step.
I have to dedicate some time to another project at the moment so I can get to this task early next week.

liniiiiii (Collaborator, Author) commented:
> @liniiiiii Okay, that's good. So for now, we have the full eval pipeline in place and this will be a later step. I have to dedicate some time to another project at the moment so I can get to this task early next week.

Thanks, good to know your schedule 👍

i-be-snek (Collaborator) commented Oct 2, 2024
@liniiiiii

I noticed that in the gold data, most L2 and L3 records actually have a date. If I'm not mistaken, this is often the start and end time of the event as a whole (L1). However, the LLM outputs we are getting now don't have a date. This can cause problems when trying to match events, because they would be penalized for having no dates.

Perhaps the date-filling part of this should actually come before the evaluation? Or perhaps the model could be prompted to use the L1 start and end dates?

liniiiiii (Collaborator, Author) commented:
> I noticed that in the gold data, most L2 and L3 records actually have a date. If I'm not mistaken, this is often the start and end time of the event as a whole (L1). However, the LLM outputs we are getting now don't have a date. This can cause problems when trying to match events, because they would be penalized for having no dates.
>
> Perhaps the date-filling part of this should actually come before the evaluation? Or perhaps the model could be prompted to use the L1 start and end dates?

I manually checked many single evaluation results. The date is not the main issue of the LLM output; the problem is the location fields, where the model generates more locations than the human annotator. For example, for the L2 affected category we have 10 records in the gold, but the model has 140 records, so adjusting the prompt for the date would not be super helpful in this case.

liniiiiii (Collaborator, Author) commented:
In the visualization process, I found that some GeoJSONs in Administrative_Area/s are at a finer scale and need to be filtered out, and also the other way around: Locations may include countries, which is not applicable. In this step, we need to filter them before inserting into the db file.

i-be-snek (Collaborator) commented:
> In the visualization process, I found that some GeoJSONs in Administrative_Area/s are at a finer scale and need to be filtered out, and also the other way around: Locations may include countries, which is not applicable. In this step, we need to filter them before inserting into the db file.

The current approach we are using is to look at the GIDs and extract the countries based on the level_0 GID, so there is a filter in place. Can you show me some examples where the GeoJSONs contain locations instead of countries?
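
For context, a hedged sketch of that kind of filter, assuming GADM-style GIDs where a country-level (level-0) GID has no subdivision part (e.g. "DEU") while finer levels look like "DEU.1_1" or "DEU.1.2_1"; the actual GID format and filtering code in the repo may differ:

```python
# Assumes GADM-style GIDs: "DEU" (level 0), "DEU.1_1" (level 1), "DEU.1.2_1" (level 2).
# This only illustrates the idea and is not the repo's actual filter.

def is_country_gid(gid: str) -> bool:
    """True if the GID refers to a whole country (level-0 GID)."""
    return "." not in gid


def split_by_level(gids: list[str]) -> tuple[list[str], list[str]]:
    """Separate country-level GIDs from finer-scale subdivision GIDs."""
    countries = [g for g in gids if is_country_gid(g)]
    subdivisions = [g for g in gids if not is_country_gid(g)]
    return countries, subdivisions


print(split_by_level(["DEU", "DEU.1_1", "CHE.3.2_1"]))
# -> (['DEU'], ['DEU.1_1', 'CHE.3.2_1'])
```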

i-be-snek linked a pull request on Oct 17, 2024 that will close this issue
i-be-snek mentioned this issue on Nov 3, 2024
i-be-snek (Collaborator) commented:
Consider these two examples:

Working example:

BEFORE DATA GAP FIX

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | NULL | NULL | NULL | [Germany, Belgium, France] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 6 | True | [Germany] |
| 1 | 70 | 70 | False | [France, Belgium] |
| 1 | 2 | 6 | True | [Austria] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

AFTER DATA GAP FIX

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}149}$$ | $${\color{red}164}$$ | $${\color{red}True}$$ | [Germany, Belgium, France, $${\color{red}Austria}$$, $${\color{red}Switzerland}$$] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}17}$$ | $${\color{red}21}$$ | $${\color{red}True}$$ | $${\color{red}[Germany]}$$ |
| 1 | 70 | 70 | False | [France, Belgium] |
| 1 | 2 | 6 | True | [Austria] |
| 1 | $${\color{red}60}$$ | $${\color{red}67}$$ | $${\color{red}True}$$ | $${\color{red}[Switzerland]}$$ |

L3 (unchanged)

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

The parts in $${\color{red}red}$$ are the ones that changed. Here:

  1. we create a new record for Switzerland in L2 because it is only found in L3
  2. we revise the numbers for L2's record for Germany by adding up the numbers from L3 (16+1 for the min, 20+1 for the max); since the total in L2 is smaller than that in L3, we assume data is missing and fill upwards.
  3. we sum up all the numbers from L2 and replace the NULL values in L1 with these numbers from L2 -- we also append the extra locations: Switzerland and Austria (worked check below).
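
Worked check against the tables above: the new L1 range is the sum over the four L2 rows, min = 17 + 70 + 2 + 60 = 149 and max = 21 + 70 + 6 + 67 = 164, which match the red values in the L1 row.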

Problematic example

Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | NULL | NULL | NULL | [Germany, Belgium, France] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 6 | True | [Germany] |
| 1 | 70 | 70 | False | [ $${\color{green}Germany}$$, Belgium] |
| 1 | 2 | 6 | True | [Austria] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

This is exactly the same example except that one L2 record happened in Belgium and Germany (in $${\color{green}green}$$), instead of Belgium and France. This is a case we are supposed to ignore.

@liniiiiii
My questions are:

  1. The total for [Germany] in L2 is smaller than the combined sum in L3, but we have a record with 2 countries ([ $${\color{green}Germany}$$, Belgium]). How do we sum things up now?
  2. Could you show me what the output from the second example should look like after applying the data gap fix?

liniiiiii (Collaborator, Author) commented:
@i-be-snek, thanks for giving the example. I checked the raw output we have now: 13719 out of 13839 records in L2 are single-country entries. For the questions you mentioned:

  1. The total for [Germany] in L2 is smaller than the combined sum in L3, but we have a record with 2 countries ([ $${\color{green}Germany}$$, Belgium]). How do we sum things up now? -- We don't deal with these; we just leave them as what the model captured from the text.
  2. Could you show me what the output from the second example should look like after applying the data gap fix? -- As you see below: first, from L3 to L2, update the Germany record in L2 and add a record for Switzerland in L2; for Austria, keep it the same because the L3 sum is within its range. Then, from L2 to L1, neglect the multi-country entries and sum up only the single-country entries to replace the NULL in L1, to avoid double counting (worked check after the tables). Thanks!

Problematic example after fix, example by Ni

Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}79}$$ | $${\color{red}94}$$ | True | [Germany, Belgium, France, Austria, Switzerland] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}17}$$ | $${\color{red}21}$$ | True | [Germany] |
| 1 | 70 | 70 | False | [ $${\color{green}Germany}$$, Belgium] $${\color{red}StayTheSame}$$ |
| 1 | 2 | 6 | True | [Austria] |
| 1 | $${\color{red}60}$$ | $${\color{red}67}$$ | True | [$${\color{red}Switzerland}$$] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |
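
Worked check against the tables above: only the single-country L2 rows are summed into L1, so min = 17 + 2 + 60 = 79 and max = 21 + 6 + 67 = 94; the multi-country [Germany, Belgium] row is left out to avoid double counting.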

liniiiiii (Collaborator, Author) commented:
Let's just neglect these cases; they are only 0.86% of L2. Entries with more than one item in 'Administrative_Areas_Norm':

```
      Event_ID            Administrative_Areas_Norm  \
899    uMv9v6y                ['Portugal', 'Spain']   
1175   pQPzcLu          ['United States', 'Canada']   
1343   DVvfgTv          ['United States', 'Canada']   
1350   vaO12VB          ['United States', 'Canada']   
1458   ZfIdxiD          ['United States', 'Mexico']   
...        ...                                  ...   
13206  qWwg1p0  ['Philippines', 'China', 'Vietnam']   
13320  kB2ttu1    ['Indian subcontinent', 'Middle']   
13706  mYdbGR0               ['Sri Lanka', 'India']   
13748  CylcsoW                ['Canada', 'Bermuda']   
13831  PuhFhbD          ['United States', 'Mexico'] 
```
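
A minimal pandas sketch for isolating these multi-country rows; the input file name is hypothetical, and the list-vs-string representation of 'Administrative_Areas_Norm' is an assumption:

```python
import ast

import pandas as pd

# Hypothetical input file; the column may hold real lists or their string form
# such as "['Portugal', 'Spain']", so parse defensively.
df = pd.read_csv("l2_records.csv")


def as_list(value):
    return value if isinstance(value, list) else ast.literal_eval(value)


areas = df["Administrative_Areas_Norm"].apply(as_list)
multi = df[areas.apply(len) > 1]    # the ~0.86% of L2 entries to neglect
single = df[areas.apply(len) == 1]  # entries used for the L2 -> L1 sums

print(f"{len(single)} of {len(df)} L2 records are single-country entries")
```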

i-be-snek (Collaborator) commented:
> I checked the raw output we have now: 13719 out of 13839 records in L2 are single-country entries. For the questions you mentioned:

Perfect, then it's possible to ignore them easily :)
Thank you so much for taking the time to dig into this.
