
Before putting data into the impact_DB_V1.db file, consistency checking of the data is mandatory #101

Closed
liniiiiii opened this issue Sep 5, 2024 · 18 comments · Fixed by #174
Labels: impactDB v1 (Relating to the first release of ImpactDB, frozen schema and guidelines), IN PROGRESS

Comments

liniiiiii (Collaborator) commented Sep 5, 2024

This is an issue related to the database: since we divide the database into 3 levels and the models are required not to sum up numbers for us, we can make an assumption to fill the data gap, namely to sum up the information from L2 and L3 and enter it as a range. The rules need double confirmation from the team:

  • script
  • rules
liniiiiii self-assigned this Sep 5, 2024
liniiiiii (Collaborator, Author) commented Sep 5, 2024

Proposed rules
@i-be-snek, this is up to date /Ni 2024/11/03
The top rule: L1 >= L2 >= L3 (a code sketch of rules 2b/2c follows the list below)

  1. Admin_Areas: if the union of the L2/L3 areas is larger than the set we get in L1, the extra areas need to be filled in at L1 automatically

  2. Impact categories: if L1 is NULL but impact information is found in L2 or L3, there are several cases:
    a. L2 is not NULL and L3 is NULL: sum up the numbers from L2 and fill them in at L1
    b. L2 is NULL and L3 is not NULL:

    • (i) for the same admin_area with impacts in different locations, sum up the numbers and add a new record in L2
    • (ii) next, sum up the numbers from L2 to fill in L1

    c. L2 and L3 are not NULL:

    • (i) for the same admin_area with impacts in different locations, sum up the numbers in L3 and compare with the number in L2; if the L3 sum is smaller than the L2 number, revise the "Num_Min" to this number, and if it is larger, change the "Num_Max" in L2 (ideally this should not happen, because we require the model to give the total in L2)
    • (ii) if an admin_area in L3 is not in L2, sum up the numbers and create a new record in L2
    • (iii) next, sum up all the numbers from L2 and fill in L1 with the range
  3. Time: normally the time information in L2 and L3 is missing; for any case without a year, fill in the year in L2 and L3 using the same year as in L1
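
A minimal sketch of how rules 2b/2c above could be automated. The record layout (plain dicts with a single area plus num_min/num_max) and the function names are illustrative assumptions, not the actual ImpactDB schema, and the fill-upwards behaviour follows the Germany/Austria example further down this thread:

```python
# Illustrative sketch of rules 2b/2c; not the actual ImpactDB schema.
# Assumes each record is a dict with a single "area" plus "num_min"/"num_max".
from collections import defaultdict


def sum_per_area(records):
    """Sum Num_Min/Num_Max per admin area (rules 2b-i and 2c-i)."""
    totals = defaultdict(lambda: {"num_min": 0, "num_max": 0})
    for rec in records:
        totals[rec["area"]]["num_min"] += rec["num_min"]
        totals[rec["area"]]["num_max"] += rec["num_max"]
    return totals


def fill_l2_from_l3(l2, l3):
    """Revise or create L2 records from the L3 sums (rules 2c-i and 2c-ii)."""
    l2_by_area = {rec["area"]: rec for rec in l2}
    for area, total in sum_per_area(l3).items():
        rec = l2_by_area.get(area)
        if rec is None:
            l2.append({"area": area, **total})  # 2c-ii: new L2 record
        elif total["num_max"] > rec["num_max"]:
            # L2 looks incomplete: fill upwards with the L3 sums (2c-i)
            rec["num_min"], rec["num_max"] = total["num_min"], total["num_max"]
        # otherwise the L3 sum sits inside the L2 range, so L2 is kept as it is
    return l2


def fill_l1_from_l2(l2):
    """Sum all L2 records to get the L1 range (rules 2a and 2c-iii)."""
    return sum(r["num_min"] for r in l2), sum(r["num_max"] for r in l2)
```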

i-be-snek added the impactDB v1 label Sep 5, 2024
i-be-snek (Collaborator) commented:
Some comments:

  1. what will we be doing during the "manual" check?
    I agree with adding any missing areas to the L1 list "Administrative_Areas"

  2. I'll get back to this later...

  3. sounds good, but did we previously not say that the minimum requirement is to have a start year? But in the case you describe, L2/L3 may not have a start date but only some impact information. Am I understanding this right?

koffiworou (Collaborator) commented:
  1. I agree.
  2. I agree. We must make sure we do not count numbers twice when filling in the L1 information created from L2 and L3.
  3. I agree as well. By default, L2 and L3 time information is inferred from L1, unless otherwise specified. @i-be-snek, the minimum requirement (starting year) must be respected by L1. We must update our statement in the next version of the guidelines.

liniiiiii (Collaborator, Author) commented:
>   1. I agree.
>   2. I agree. We must make sure we do not count numbers twice when filling in the L1 information created from L2 and L3.
>   3. I agree as well. By default, L2 and L3 time information is inferred from L1, unless otherwise specified. @i-be-snek, the minimum requirement (starting year) must be respected by L1. We must update our statement in the next version of the guidelines.

Thanks @koffiworou. I think the start year is defined in the guideline we have now, if I'm not mistaken. For the 2nd point, what do you mean by counting twice?

liniiiiii (Collaborator, Author) commented:
> Some comments:
>
>   1. what will we be doing during the "manual" check?
>     I agree with adding any missing areas to the L1 list "Administrative_Areas"
>   2. I'll get back to this later...
>   3. sounds good, but did we previously not say that the minimum requirement is to have a start year? But in the case you describe, L2/L3 may not have a start date but only some impact information. Am I understanding this right?

@i-be-snek thanks. 1st point: the manual check is just to see where the error comes from and for the updated version; it has nothing to do with the database correction. 3rd point: yes, I mean we try to provide time information at L2/L3 when the time is not available. Since it is not mandatory, we could choose to leave it as it is, but we can at least fill in the year; that's the initial idea.

i-be-snek (Collaborator) commented:
Here is the list written by @jnivre during our meeting yesterday (a rough decision sketch follows the list):

  1. Make sure that all countries mentioned at L2 or L3 are also listed on L1.
  2. If |countries at L2| = |countries at L3|, adjust range at L2 upwards or downwards depending on sum from L3.
  3. If |countries at L2| > |countries at L3|, only adjust range upwards at L2 (if sum from L3 is larger).
  4. If |countries at L1| = |countries at L2|, adjust range at L1 upwards or downwards depending on sum from L2.
  5. If |countries at L1| > |countries at L2|, only adjust range upwards at L1 (if sum from L2 is larger).
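
For reference, a rough decision sketch of rules 2-5, under one reading of "adjust upwards/downwards"; the parent/child dicts (a countries set plus min/max totals) are illustrative, not the real schema:

```python
# Hypothetical sketch of rules 2-5: the child level is L3 when the parent is L2,
# or L2 when the parent is L1. Inputs are illustrative dicts, not the real schema:
# {"countries": set, "min": int, "max": int}, where child_sum already holds the
# summed range over the child level.
def adjust_parent_range(parent: dict, child_sum: dict) -> dict:
    if parent["countries"] == child_sum["countries"]:
        # same country coverage: move the parent range up or down to the child's sum
        parent["min"], parent["max"] = child_sum["min"], child_sum["max"]
    elif parent["countries"] > child_sum["countries"]:
        # parent covers more countries: only adjust upwards if the child sum is larger
        parent["min"] = max(parent["min"], child_sum["min"])
        parent["max"] = max(parent["max"], child_sum["max"])
    return parent
```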

i-be-snek (Collaborator) commented:
I'll be picking this task up

@liniiiiii just to confirm, is this a step that we should do before evaluation? or only for inserting data into the database?

liniiiiii (Collaborator, Author) commented:
> I'll be picking this task up
>
> @liniiiiii just to confirm, is this a step that we should do before evaluation? or only for inserting data into the database?

@i-be-snek, we need to do it after evaluation, only for inserting data into the database.

i-be-snek (Collaborator) commented:
@liniiiiii Okay, that's good.
So for now, we have the full eval pipeline in place and this will be a later step.
I have to dedicate some time to another project at the moment so I can get to this task early next week.

liniiiiii (Collaborator, Author) commented:
> @liniiiiii Okay, that's good. So for now, we have the full eval pipeline in place and this will be a later step. I have to dedicate some time to another project at the moment so I can get to this task early next week.

Thanks, good to know your schedule 👍

i-be-snek (Collaborator) commented Oct 2, 2024
@liniiiiii

I noticed that in the gold data, most L2 and L3 records actually have a date. If I'm not mistaken, this is often the start and end time of the event as a whole (L1). However, the LLM outputs we are getting now don't have a date. This can cause problems when trying to match events, because they would be penalized for having no dates.

Perhaps the date-filling part of this should actually come before the evaluation? Or perhaps the model could be prompted to use the L1 start and end dates?

liniiiiii (Collaborator, Author) commented:
> I noticed that in the gold data, most L2 and L3 records actually have a date. If I'm not mistaken, this is often the start and end time of the event as a whole (L1). However, the LLM outputs we are getting now don't have a date. This can cause problems when trying to match events, because they would be penalized for having no dates.
>
> Perhaps the date-filling part of this should actually come before the evaluation? Or perhaps the model could be prompted to use the L1 start and end dates?

I manually checked many single evaluation results. The date is not the main issue of the LLM output; the problem is the location fields, where the model generates more locations than the human annotator. For example, for the L2 affected category we have 10 records in the gold, but the model has 140 records, so adjusting the prompt for the date would not be super helpful in this case.

liniiiiii (Collaborator, Author) commented:
In the visualization process, I found that some GeoJSONs in Administrative_Area/s are at a finer scale and need to be filtered out, and also the other way around: Locations may include countries, which is not applicable. In this step, we need to filter them before inserting into the db file.

i-be-snek (Collaborator) commented:
> In the visualization process, I found that some GeoJSONs in Administrative_Area/s are at a finer scale and need to be filtered out, and also the other way around: Locations may include countries, which is not applicable. In this step, we need to filter them before inserting into the db file.

The current approach we are using is to look at the GIDs and extract the countries based on the level_0 GID, so there is a filter in place. Can you show me some examples where the GeoJSONs contain locations instead of countries?
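
For context, a hedged sketch of that kind of filter, assuming GADM-style GIDs where a country-level (level-0) GID has no subdivision part (e.g. "DEU") while finer levels look like "DEU.1_1" or "DEU.1.2_1"; the actual GID format and filtering code in the repo may differ:

```python
# Assumes GADM-style GIDs: "DEU" (level 0), "DEU.1_1" (level 1), "DEU.1.2_1" (level 2).
# This only illustrates the idea and is not the repo's actual filter.

def is_country_gid(gid: str) -> bool:
    """True if the GID refers to a whole country (level-0 GID)."""
    return "." not in gid


def split_by_level(gids: list[str]) -> tuple[list[str], list[str]]:
    """Separate country-level GIDs from finer-scale subdivision GIDs."""
    countries = [g for g in gids if is_country_gid(g)]
    subdivisions = [g for g in gids if not is_country_gid(g)]
    return countries, subdivisions


print(split_by_level(["DEU", "DEU.1_1", "CHE.3.2_1"]))
# -> (['DEU'], ['DEU.1_1', 'CHE.3.2_1'])
```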

i-be-snek linked a pull request on Oct 17, 2024 that will close this issue
i-be-snek mentioned this issue on Nov 3, 2024
i-be-snek (Collaborator) commented:
Consider these two examples:

Working example:

BEFORE DATA GAP FIX

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | NULL | NULL | NULL | [Germany, Belgium, France] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 6 | True | [Germany] |
| 1 | 70 | 70 | False | [France, Belgium] |
| 1 | 2 | 6 | True | [Austria] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

AFTER DATA GAP FIX

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}149}$$ | $${\color{red}164}$$ | $${\color{red}True}$$ | [Germany, Belgium, France, $${\color{red}Austria}$$, $${\color{red}Switzerland}$$] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}17}$$ | $${\color{red}21}$$ | $${\color{red}True}$$ | $${\color{red}[Germany]}$$ |
| 1 | 70 | 70 | False | [France, Belgium] |
| 1 | 2 | 6 | True | [Austria] |
| 1 | $${\color{red}60}$$ | $${\color{red}67}$$ | $${\color{red}True}$$ | $${\color{red}[Switzerland]}$$ |

L3 (unchanged)

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

The parts in $${\color{red}red}$$ are the ones that changed. Here:

  1. we create a new record for Switzerland in L2 because it is only found in L3
  2. we revise the numbers for L2's record for Germany by adding up the numbers from L3 (16+1 for the min, 20+1 for the max); since the total in L2 is smaller than that in L3, we assume data is missing and fill upwards.
  3. we sum up all the numbers from L2 and replace the NULL values in L1 with these numbers from L2 -- we also append the extra locations: Switzerland and Austria (worked check below).
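
Worked check against the tables above: the new L1 range is the sum over the four L2 rows, min = 17 + 70 + 2 + 60 = 149 and max = 21 + 70 + 6 + 67 = 164, which match the red values in the L1 row.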

Problematic example

Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | NULL | NULL | NULL | [Germany, Belgium, France] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 6 | True | [Germany] |
| 1 | 70 | 70 | False | [ $${\color{green}Germany}$$, Belgium] |
| 1 | 2 | 6 | True | [Austria] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |

This is exactly the same example except that one L2 record happened in Belgium and Germany (in $${\color{green}green}$$), instead of Belgium and France. This is a case we are supposed to ignore.

@liniiiiii
My questions are:

  1. The total for [Germany] in L2 is smaller than the combined sum in L3, but we have a record with 2 countries ([ $${\color{green}Germany}$$, Belgium]). How do we sum things up now?
  2. Could you show me what the output from the second example should look like after applying the data gap fix?

liniiiiii (Collaborator, Author) commented:
@i-be-snek, thanks for giving the example. I checked the raw output we have now: 13719 out of 13839 records in L2 are single-country entries. For the questions you mentioned:

  1. The total for [Germany] in L2 is smaller than the combined sum in L3, but we have a record with 2 countries ([ $${\color{green}Germany}$$, Belgium]). How do we sum things up now? -- We don't deal with these; we just leave them as what the model captured from the text.
  2. Could you show me what the output from the second example should look like after applying the data gap fix? -- As you see below: first, from L3 to L2, update the Germany record in L2 and add a record for Switzerland in L2; for Austria, keep it the same because the L3 sum is within its range. Then, from L2 to L1, neglect the multi-country entries and sum up only the single-country entries to replace the NULL in L1, to avoid double counting (worked check after the tables). Thanks!

Problematic example after fix, example by Ni

Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:

L1

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}79}$$ | $${\color{red}94}$$ | True | [Germany, Belgium, France, Austria, Switzerland] |

L2

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | $${\color{red}17}$$ | $${\color{red}21}$$ | True | [Germany] |
| 1 | 70 | 70 | False | [ $${\color{green}Germany}$$, Belgium] $${\color{red}StayTheSame}$$ |
| 1 | 2 | 6 | True | [Austria] |
| 1 | $${\color{red}60}$$ | $${\color{red}67}$$ | True | [$${\color{red}Switzerland}$$] |

L3

| ID | Deaths_Min | Deaths_Max | Deaths_Approx | Areas |
|----|------------|------------|---------------|-------|
| 1 | 2 | 5 | True | Austria -> [Salzburg] |
| 1 | 60 | 67 | True | Switzerland -> [Bern, Geneva] |
| 1 | 16 | 20 | True | Germany -> [Berlin, Stuttgart] |
| 1 | 1 | 1 | False | Germany -> [Cologne] |
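
Worked check against the tables above: only the single-country L2 rows are summed into L1, so min = 17 + 2 + 60 = 79 and max = 21 + 6 + 67 = 94; the multi-country [Germany, Belgium] row is left out to avoid double counting.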

liniiiiii (Collaborator, Author) commented:
Let's just neglect these cases; they are only 0.86% of L2. Entries with more than one item in 'Administrative_Areas_Norm':

```
      Event_ID            Administrative_Areas_Norm  \
899    uMv9v6y                ['Portugal', 'Spain']   
1175   pQPzcLu          ['United States', 'Canada']   
1343   DVvfgTv          ['United States', 'Canada']   
1350   vaO12VB          ['United States', 'Canada']   
1458   ZfIdxiD          ['United States', 'Mexico']   
...        ...                                  ...   
13206  qWwg1p0  ['Philippines', 'China', 'Vietnam']   
13320  kB2ttu1    ['Indian subcontinent', 'Middle']   
13706  mYdbGR0               ['Sri Lanka', 'India']   
13748  CylcsoW                ['Canada', 'Bermuda']   
13831  PuhFhbD          ['United States', 'Mexico'] 
```
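
A minimal pandas sketch for isolating these multi-country rows; the input file name is hypothetical, and the list-vs-string representation of 'Administrative_Areas_Norm' is an assumption:

```python
import ast

import pandas as pd

# Hypothetical input file; the column may hold real lists or their string form
# such as "['Portugal', 'Spain']", so parse defensively.
df = pd.read_csv("l2_records.csv")


def as_list(value):
    return value if isinstance(value, list) else ast.literal_eval(value)


areas = df["Administrative_Areas_Norm"].apply(as_list)
multi = df[areas.apply(len) > 1]    # the ~0.86% of L2 entries to neglect
single = df[areas.apply(len) == 1]  # entries used for the L2 -> L1 sums

print(f"{len(single)} of {len(df)} L2 records are single-country entries")
```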

i-be-snek (Collaborator) commented:
> I checked the raw output we have now: 13719 out of 13839 records in L2 are single-country entries. For the questions you mentioned:

Perfect, then it's possible to ignore them easily :)
Thank you so much for taking the time to dig into this.
