Before putting data into the impact_DB_V1.db file, consistency checking of the data is mandatory #101
Comments
Proposed rules
Some comments:
Thanks @koffiworou. I think the start year is defined in the guideline we have now, if I'm not mistaken. For the 2nd point, what do you mean by counting twice?
@i-be-snek thanks. On the 1st point, the manual checking is just to see where the error comes from, for the updated version; it has nothing to do with the database correction. On the 3rd point, yes: I mean we try to provide time information at L2/L3 when the time is not available. The gap is not because the field is non-mandatory. We could choose to leave it like that, but we can at least fill in the year. That was the initial idea.
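The year-filling idea could look roughly like the sketch below. It assumes each record is a dict and that the year fields are called `Start_Date_Year` and `End_Date_Year`; both key names and the helper `fill_missing_years` are hypothetical and may not match the actual schema.

```python
# Hypothetical key names; adjust to the real column names in the database.
YEAR_KEYS = ("Start_Date_Year", "End_Date_Year")


def fill_missing_years(l1_record: dict, sub_records: list) -> list:
    """Return copies of L2/L3 records with missing years taken from L1.

    Only the year is filled, and only when the L2/L3 value is None
    and the L1 record actually has a value for that key.
    """
    filled = []
    for rec in sub_records:
        new_rec = dict(rec)  # copy so the input records stay untouched
        for key in YEAR_KEYS:
            if new_rec.get(key) is None and l1_record.get(key) is not None:
                new_rec[key] = l1_record[key]
        filled.append(new_rec)
    return filled
```

Existing L2/L3 years are never overwritten, so records that already carry their own date keep it.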
Here is the list written by @jnivre during our meeting yesterday
I'll be picking this task up. @liniiiiii, just to confirm: is this a step that we should do before evaluation, or only for inserting data into the database?
@i-be-snek, we need to do it after evaluation, only for inserting data into the database.
@liniiiiii Okay, that's good.
Thanks, good to know your schedule 👍
I noticed that in the gold, most L2 and L3 records actually have a date. If I'm not mistaken, this is often the start and end time of the event as a whole (L1). However, the LLM outputs we are getting now don't have a date. This can cause problems when trying to match events, because they would be penalized for having no dates. Perhaps the date-filling part of this should actually come before the evaluation? Or perhaps the model could be prompted to use the L1 start and end dates?
I manually checked many single evaluation results. Dates are not the main issue with the LLM output; the location fields are. The model generates more locations than the human annotators: for example, for the L2 affected category we have 10 records in the gold, but the model has 140. So adjusting the prompt for the date would not be very helpful in this case.
In the visualization process, I found some finer-scale geojsons in Administrative_Area/s that need to be filtered out. The other way around, Locations may include countries, which is not applicable. We need to filter both before inserting into the db file.
The current approach we are using is to look at the GIDs and extract the countries based on the level_0 GID. So there is a filter in place. Can you show me some examples where the geojsons contain locations instead of countries?
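For illustration only, a level_0 filter of this kind might be sketched as follows. It assumes GADM-style GIDs, where the country code is everything before the first dot (`DEU` for `DEU.1_1`); that format and all function names here are assumptions, not the project's actual code.

```python
def level_0_gid(gid: str) -> str:
    """Return the country-level part of a GID (text before the first dot)."""
    return gid.split(".", 1)[0]


def is_country_level(gid: str) -> bool:
    """True when the GID has no sub-national part (assumed GADM-style)."""
    return "." not in gid


def countries_from_gids(gids: list) -> list:
    """Collect the distinct level_0 GIDs, keeping first-seen order."""
    seen = []
    for gid in gids:
        country = level_0_gid(gid)
        if country not in seen:
            seen.append(country)
    return seen
```

With this split, `is_country_level` can flag finer-scale entries in a country field, and its negation can flag countries in a locations field.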
Consider these two examples:
Working example
BEFORE DATA GAP FIX
L1
L2
L3
AFTER DATA GAP FIX
L1
L2
L3 (unchanged)
The parts in
Problematic example
Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:
L1
L2
L3
This is exactly the same example except that one L2 record happened in Belgium and Germany (in @liniiiiii
@i-be-snek, thanks for giving the example. I checked the raw output we have now: 13719 out of 13839 records in L2 are single-country entries. For the questions you mentioned
Problematic example after fix, example by Ni
Now consider the same example, only we have the "special case" where L2 has entries with multiple countries:
L1
L2
L3
Let's just neglect these cases, which make up only 0.86% of L2.
Entries with more than one item in 'Administrative_Areas_Norm':
        Event_ID   Administrative_Areas_Norm
899     uMv9v6y    ['Portugal', 'Spain']
1175    pQPzcLu    ['United States', 'Canada']
1343    DVvfgTv    ['United States', 'Canada']
1350    vaO12VB    ['United States', 'Canada']
1458    ZfIdxiD    ['United States', 'Mexico']
...     ...        ...
13206   qWwg1p0    ['Philippines', 'China', 'Vietnam']
13320   kB2ttu1    ['Indian subcontinent', 'Middle']
13706   mYdbGR0    ['Sri Lanka', 'India']
13748   CylcsoW    ['Canada', 'Bermuda']
13831   PuhFhbD    ['United States', 'Mexico']
Perfect, then it's possible to ignore them easily :)
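Ignoring them could be as simple as the sketch below. It assumes each L2 record is a dict whose 'Administrative_Areas_Norm' value is a list of names; the function names are hypothetical.

```python
def split_by_country_count(records: list) -> tuple:
    """Split records into (single, multi) by the length of
    'Administrative_Areas_Norm'. Records with zero or one item
    count as single; a missing or None value is treated as empty."""
    single, multi = [], []
    for rec in records:
        areas = rec.get("Administrative_Areas_Norm") or []
        if len(areas) > 1:
            multi.append(rec)
        else:
            single.append(rec)
    return single, multi


def multi_share_percent(records: list) -> float:
    """Percentage of records with more than one normalized area."""
    if not records:
        return 0.0
    _, multi = split_by_country_count(records)
    return 100.0 * len(multi) / len(records)
```

Only the `single` list would then go through the data gap fix, while the `multi` list can be logged for later review.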
This is an issue related to the database, since we divide the database into 3 levels and the models are required not to sum up numbers for us. We can make an assumption to fill the data gap: sum up the information from L2 and L3 and put it in a range. Rules (need double confirmation from the team)