DRAFT: Gold from excel #71

Closed
wants to merge 10 commits into from

@i-be-snek commented Aug 7, 2024
Collaborator

This PR does two things:

  • modifies the parsing function so that it can extract data from our gold datatable in excel and store it as parquet
  • adds the latest gold data (timestamped with 07082024) separated by dev and test
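
A rough sketch of the first point, for anyone who wants to try the idea locally. It is not the code in gold_from_excel.py: the function names, the `Split` column name, and the output file naming are assumptions made for illustration. Reading and writing also assume pandas has an Excel engine (such as openpyxl) and a parquet engine (such as pyarrow) available.

```python
from pathlib import Path

import pandas as pd


def split_gold(df: pd.DataFrame, split_col: str = "Split") -> dict:
    """Return one DataFrame per split value (e.g. "dev", "test").

    `split_col` is a hypothetical column name; use whatever the sheet calls it.
    """
    return {
        str(name): group.reset_index(drop=True)
        for name, group in df.groupby(split_col)
    }


def gold_excel_to_parquet(excel_path: str, sheet_name: str, out_dir: str) -> list:
    """Read one sheet of the gold workbook and write one parquet file per split."""
    # read_excel needs an Excel engine (for .xlsx files, typically openpyxl).
    df = pd.read_excel(excel_path, sheet_name=sheet_name)
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    written = []
    for name, part in split_gold(df).items():
        # Hypothetical naming scheme: gold_dev.parquet, gold_test.parquet, ...
        target = out / f"gold_{name}.parquet"
        # to_parquet needs a parquet engine (pyarrow or fastparquet).
        part.to_parquet(target, index=False)
        written.append(target)
    return written
```

Keeping the split logic separate from the file I/O makes it possible to check the dev/test separation on a small in-memory table before touching the real workbook.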

@liniiiiii
Could you please review this branch and see if it works for you as well?
Additionally, what do you think about adding another function to the same python file (gold_from_excel.py) that takes the "Location" column, turns it to Country and Location, and then performs location normalization? If you like that idea, let me know in a comment and I will add it. Then what ends up in the parquet is very close and ready for evaluation ;)


Checklist:

  • Implement location normalization for the three levels of events (CHECK SPECS HERE)
  • Extract three event levels as parquet files
  • Change column names to match the new changes
    • when parsing excel
    • when parsing events
    • when inserting events into the db
    • Add Num_Min/Num_Max columns for subevents
    • Rename "country" to "administrative area"
  • Fixes to the data table:
    • Ni: fix Event_ID 258.00 in the data table
    • Ni: fix/check some location names that cause errors (comment in "File Changes" section)
    • Shorouq: redownload a fresh copy of the gold data and parse again
  • Database infrastructure:
    • finalize sql schema
    • start a new db (impact.v2.db) with the correct column names in the schema, ready for data insertion from the llms
    • populate impact.v2.db with data from the latest experiments on test/dev for gpt4
  • Evaluation:
    • some dev and test set events don't have sub-events filled in by a human annotator, so we need to ignore their sub-events in the evaluation. These events can be identified by having only one Event_ID for one URL. Implement this in the Excel parsing method.

@i-be-snek requested a review from liniiiiii August 7, 2024 14:18
@i-be-snek self-assigned this Aug 7, 2024
@i-be-snek added the bug and data table labels Aug 7, 2024
@i-be-snek linked an issue Aug 7, 2024 that may be closed by this pull request
@liniiiiii
Collaborator

I forgot to mention that we need to revise the sub-event gold columns to Num_Min/Num_Max instead of using the names from the main event level :)

@i-be-snek
Collaborator Author

Okay, in this case this will have to be done tomorrow, hopefully by EoD 😊

@liniiiiii
Collaborator

Notes:

  1. Event_ID 258.00 is marked as test, so there are 159 single events in the test set. I edited this in the sheet ImpactDB_manual_copy_07082024; please download it again and test.
  2. In the dev and test sets, some events have no sub-events filled in by a human annotator, so the evaluation needs to ignore them for sub-events. They can be identified by having only one Event_ID for one URL, as in the examples below from the Excel sheet ImpactDB_manual_copy_07082024:

6b2YWky | 141.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Ockhi
-- | -- | -- | -- | -- | --
Q7yoZXl | 142.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Hurricane_Bob_(1979)
ikokrVG | 143.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Idai

  3. In the Excel sheet ImpactDB_manual_copy_07082024, there are 78 events (70 single, 8 multi) in the dev set, 160 events (159 single, 1 multi) in the test set, and 12 new events (mostly multi events, or associated events within one article).

@liniiiiii
Collaborator

More TODOs:

  • Implement location normalization for the three levels of events
  • Change column names to match the new changes
  • Add Num_Min/Num_Max columns for subevents

yes, it's changed in the original Excel, pls copy the column names from the sheet ImpactDB_final_2_wiki_cp_v0806

@i-be-snek changed the title from "Gold from excel" to "DRAFT: Gold from excel" Aug 7, 2024
@i-be-snek
Collaborator Author

> yes, it's changed in the original Excel, pls copy the column names from the sheet ImpactDB_final_2_wiki_cp_v0806

The column names have "Total" in them. Do we want that prefix for main events? 🤔

@liniiiiii
Collaborator

> > yes, it's changed in the original Excel, pls copy the column names from the sheet ImpactDB_final_2_wiki_cp_v0806

> The column names have "Total" in them. Do we want that prefix for main events? 🤔

Yes, I think it would be better to keep it in the Level 1 (main event) information.

@i-be-snek
Collaborator Author

@liniiiiii great! That means we need to make fewer changes.
I have added one more thing to the checklist: changing "country" to "administrative areas".

@liniiiiii
Collaborator

> @liniiiiii great! That means we need to make fewer changes. I have added one more thing to the checklist: changing "country" to "administrative areas".

Is that for the Level 2 information, which contains only the country level?

@i-be-snek
Collaborator Author

> > @liniiiiii great! That means we need to make fewer changes. I have added one more thing to the checklist: changing "country" to "administrative areas".

> Is that for the Level 2 information, which contains only the country level?

I changed it everywhere, I think. I have not even separated level 2 from 3 yet, but I am guessing we want to change them there too?

@liniiiiii
Collaborator

> administrative areas

Yeah, I think we can change it everywhere for all 3 levels of information. Ideally, I think the gold "Country_Norm" also includes some regions, right, like Taiwan etc.? The motivation is to avoid getting into the political conflicts around these specific locations. If I have misunderstood, please correct me.

@i-be-snek
Collaborator Author

> > administrative areas

> Yeah, I think we can change it everywhere for all 3 levels of information. Ideally, I think the gold "Country_Norm" also includes some regions, right, like Taiwan etc.? The motivation is to avoid getting into the political conflicts around these specific locations. If I have misunderstood, please correct me.

I believe we are on the same page :D So yes we will change it everywhere.

My main motivation is that this is an "Administrative area" object on OpenStreetMap, GADM, and UNSD's region database so we should just use the correct technical term and explain in the documentation that this is often a country-level entity. For countries, we will just go by the UNSD since it contains the biggest number of country and region names and it's published by the UN. Whatever any country's political views are, they are probably part of the UN anyway so it's an acceptable official document.

I will add these things to the list of things we need to complete.

@i-be-snek
Collaborator Author

Hi @liniiiiii

Could you please check this schema? Can you confirm that all of these fields and their names and types (very important!) are correct?

Let me know if you have any SQL related questions! If a field is marked as ARRAY in the COMMENT section, it means that the field can only be a list (a list of lists is allowed but no deeper than that).
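
The ARRAY rule above (a list, or a list of lists, but no deeper) can be sketched in Python as follows. The helper names are hypothetical and this is not part of the schema or of the insertion code; it only illustrates the nesting limit.

```python
def list_depth(value) -> int:
    """Nesting depth: 0 for a non-list, 1 for a flat list, 2 for a list of lists."""
    if not isinstance(value, list):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((list_depth(item) for item in value), default=0)


def is_valid_array_field(value, max_depth: int = 2) -> bool:
    """True if `value` is a list nested no deeper than `max_depth` levels."""
    return isinstance(value, list) and list_depth(value) <= max_depth
```

For example, `["a", "b"]` and `[["a"], ["b"]]` would pass this check, while `[[["a"]]]` or a bare string would not.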

…y (event levels 1-3) from gold

Includes generated gold files
@i-be-snek
Collaborator Author

@liniiiiii

While attempting to parse location data from the gold files from ImpactDB_manual_copy_07082024 in this branch, I found that the location normalization has issues with the following area names. They may be misspellings that we need to fix in the gold datafile here: https://onedrive.live.com/edit?id=78D0E12AB2E8CE00!120&resid=78D0E12AB2E8CE00!120&cid=78d0e12ab2e8ce00&ithint=file%2Cxlsx&redeem=aHR0cHM6Ly8xZHJ2Lm1zL3gvYy83OGQwZTEyYWIyZThjZTAwL0VRRE82TElxNGRBZ2dIaDRBQUFBQUFBQmRtVjBhWnk5bTV5bGo3MmVraDg4QlE_ZT14SHhFRjE&migratedtospo=true&wdo=2.

# dev
normalize_locations: 2024-08-08 21:35:10 ERROR    Could not find location lake hong; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'keys'.
normalize_locations: 2024-08-08 21:36:30 ERROR    Could not find location płosk; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
normalize_locations: 2024-08-08 21:38:52 ERROR    Could not find location arumeru district; is_country: False; in_country: None. Error message: 'name'.
normalize_locations: 2024-08-08 21:39:54 ERROR    Could not find location biljeljina; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
normalize_locations: 2024-08-08 21:40:40 ERROR    Could not find location illoilo; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.

# test
normalize_locations: 2024-08-08 20:53:17 ERROR    Could not find location zheijang; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.

If you make any updates for this, could you please make them in the online Excel datatable and let me know? They would need to be fixed in multiple sheets because ImpactDB_manual_copy_07082024 is a snapshot copy taken yesterday.
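
For reference, the failing names can be collected from a log like the one above with a few lines of Python. This is only a sketch that assumes the line format shown in this comment ("Could not find location <name>; is_country: ..."); `failed_locations` is a hypothetical helper, not part of the normalization code.

```python
import re

# Matches the location name between "Could not find location" and "; is_country:".
LOG_PATTERN = re.compile(r"Could not find location (?P<name>.+?); is_country:")


def failed_locations(log_text: str) -> list:
    """Return the unique location names that failed, in order of first appearance."""
    seen = []
    for match in LOG_PATTERN.finditer(log_text):
        name = match.group("name").strip()
        if name not in seen:
            seen.append(name)
    return seen
```

The resulting list can then be checked name by name against the gold sheet to decide which entries are misspellings.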

@i-be-snek
Collaborator Author

> Notes:
>
> 1. Event_ID 258.00 is marked as test, so there are 159 single events in the test set. I edited this in the sheet ImpactDB_manual_copy_07082024; please download it again and test.
>
> 2. In the dev and test sets, some events have no sub-events filled in by a human annotator, so the evaluation needs to ignore them for sub-events. They can be identified by having only one Event_ID for one URL, as in the examples below from the Excel sheet ImpactDB_manual_copy_07082024:
>
> 6b2YWky | 141.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Ockhi
> -- | -- | -- | -- | -- | --
> Q7yoZXl | 142.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Hurricane_Bob_(1979)
> ikokrVG | 143.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Idai
>
> 3. In the Excel sheet ImpactDB_manual_copy_07082024, there are 78 events (70 single, 8 multi) in the dev set, 160 events (159 single, 1 multi) in the test set, and 12 new events (mostly multi events, or associated events within one article).

For 1, I can do that once we have fixed the location names in the list I tagged you on, so we only have to do it once.

I will add number 2 to the todo list.

@i-be-snek
Collaborator Author

> 2. In the dev and test sets, some events have no sub-events filled in by a human annotator, so the evaluation needs to ignore them for sub-events. They can be identified by having only one Event_ID for one URL, as in the examples below from the Excel sheet ImpactDB_manual_copy_07082024

@liniiiiii this part is a little confusing for me. Do you mean they are events with only one URL in the Source column and only one "Event_ID" (decimal event_id)? I checked the snapshot we took (ImpactDB_manual_copy_07082024) and there is only one URL in all of them 🤔

> 3. In the Excel sheet ImpactDB_manual_copy_07082024, there are 78 events (70 single, 8 multi) in the dev set, 160 events (159 single, 1 multi) in the test set, and 12 new events (mostly multi events, or associated events within one article)

@liniiiiii do we need to know anything about this or is this just a general note?

@liniiiiii
Collaborator

> > 2. In the dev and test sets, some events have no sub-events filled in by a human annotator, so the evaluation needs to ignore them for sub-events. They can be identified by having only one Event_ID for one URL, as in the examples below from the Excel sheet ImpactDB_manual_copy_07082024

> @liniiiiii this part is a little confusing for me. Do you mean they are events with only one URL in the Source column and only one "Event_ID" (decimal event_id)? I checked the snapshot we took (ImpactDB_manual_copy_07082024) and there is only one URL in all of them 🤔

> > 3. In the Excel sheet ImpactDB_manual_copy_07082024, there are 78 events (70 single, 8 multi) in the dev set, 160 events (159 single, 1 multi) in the test set, and 12 new events (mostly multi events, or associated events within one article)

> @liniiiiii do we need to know anything about this or is this just a general note?

@i-be-snek what I mean is that some events only have an ID ending in .00 without any sub-events; these were not filled in with sub-events by the human annotators. As far as I know, not all events in the gold data have been annotated with sub-events because of the workload, so I think we can flag them and leave them out of the sub-event evaluation; otherwise we will get a lot of penalty scores.
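
A minimal pandas sketch of this flagging idea, assuming the gold table has an `Event_ID` column and a `Source` column holding the URL (the names used earlier in this thread). The function name and the `Skip_Subevent_Eval` column are made up for illustration; the real implementation may differ.

```python
import pandas as pd


def flag_unannotated_subevents(
    df: pd.DataFrame, url_col: str = "Source", id_col: str = "Event_ID"
) -> pd.DataFrame:
    """Mark rows whose URL maps to exactly one Event_ID.

    Such rows are assumed to have no human-annotated sub-events, so the
    sub-event evaluation can skip them instead of penalizing the model.
    """
    out = df.copy()
    # Number of distinct Event_IDs sharing each URL, broadcast back to every row.
    ids_per_url = out.groupby(url_col)[id_col].transform("nunique")
    out["Skip_Subevent_Eval"] = ids_per_url == 1
    return out
```

The flag is stored as a column rather than used to drop rows, so the main-event evaluation can still use every event.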

@i-be-snek
Collaborator Author

I am closing this PR to start a new one with a fresh TODO list to match our frozen guidelines/instructions.

@i-be-snek closed this Aug 19, 2024
Labels
bug: Something isn't working
data table: Issues relating to the gold annotations (data table)
Development

Successfully merging this pull request may close these issues.

Issue of implement the gold_from_excel.py
2 participants