DRAFT: Gold from excel #71
I forgot to mention that we need to revise the gold of the sub-event to Num_Min/Max, instead of using the name of the main event level :)
Okay, in this case this will have to be done tomorrow, hopefully by EoD 😊
Notes:
6b2YWky | 141.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Ockhi
Q7yoZXl | 142.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Hurricane_Bob_(1979)
ikokrVG | 143.00 | test | Tropical Storm/Cyclone | Wind\|Flood | https://en.wikipedia.org/wiki/Cyclone_Idai
Yes, it's changed in the original Excel file. Please copy the column names from the sheet ImpactDB_final_2_wiki_cp_v0806.
The column names have "Total" in them. Do we want that prefix for main events? 🤔
Yes, I think it would be better to keep it in the Level 1 (main event) information.
@liniiiiii great! That means we need to make fewer changes.
Is it also in the Level 2 information, which contains only the country level?
…untry -> administrative area)
I changed it everywhere, I think. I have not even separated level 2 from 3 yet, but I am guessing we want to change them there too? |
Yeah, I think we can change it everywhere for all 3 levels of information. Ideally, I think the gold "Country_Norm" also includes some regions, like Taiwan etc., right? The motivation is to not have to consider the political conflicts around these specific locations. If I have misunderstood, please correct me.
I believe we are on the same page :D So yes we will change it everywhere. My main motivation is that this is an "Administrative area" object on OpenStreetMap, GADM, and UNSD's region database so we should just use the correct technical term and explain in the documentation that this is often a country-level entity. For countries, we will just go by the UNSD since it contains the biggest number of country and region names and it's published by the UN. Whatever any country's political views are, they are probably part of the UN anyway so it's an acceptable official document. I will add these things to the list of things we need to complete. |
Hi @liniiiiii
Could you please check this schema? Can you confirm that all of these fields and their names and types (very important!) are correct?
Let me know if you have any SQL related questions! If a field is marked as ARRAY in the COMMENT section, it means that the field can only be a list (a list of lists is allowed but no deeper than that).
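The ARRAY rule above (a list, or a list of lists, but nothing deeper) could be checked before insertion. This is only a minimal sketch, assuming the values arrive as plain Python lists; the helper names `nesting_depth` and `is_valid_array` are hypothetical and not part of this branch:

```python
def nesting_depth(value):
    """Return how many list levels wrap the innermost item (0 for a non-list)."""
    if not isinstance(value, list):
        return 0
    # An empty list still counts as one level of nesting.
    return 1 + max((nesting_depth(item) for item in value), default=0)


def is_valid_array(value):
    """True if value is a list or a list of lists, but no deeper."""
    return isinstance(value, list) and nesting_depth(value) <= 2
```

With this, `["a", "b"]` and `[["a"], ["b"]]` pass, while a scalar or a three-level list is rejected.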
…y (event levels 1-3) from gold Includes generated gold files
While attempting to parse location data from the gold files from ImpactDB_manual_copy_07082024 in this branch, I found that the location normalization has issues with the following area names. They may be misspellings that we need to fix in the gold datafile here: https://onedrive.live.com/edit?id=78D0E12AB2E8CE00!120&resid=78D0E12AB2E8CE00!120&cid=78d0e12ab2e8ce00&ithint=file%2Cxlsx&redeem=aHR0cHM6Ly8xZHJ2Lm1zL3gvYy83OGQwZTEyYWIyZThjZTAwL0VRRE82TElxNGRBZ2dIaDRBQUFBQUFBQmRtVjBhWnk5bTV5bGo3MmVraDg4QlE_ZT14SHhFRjE&migratedtospo=true&wdo=2.
# dev
normalize_locations: 2024-08-08 21:35:10 ERROR Could not find location lake hong; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'keys'.
normalize_locations: 2024-08-08 21:36:30 ERROR Could not find location płosk; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
normalize_locations: 2024-08-08 21:38:52 ERROR Could not find location arumeru district; is_country: False; in_country: None. Error message: 'name'.
normalize_locations: 2024-08-08 21:39:54 ERROR Could not find location biljeljina; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
normalize_locations: 2024-08-08 21:40:40 ERROR Could not find location illoilo; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
# test
normalize_locations: 2024-08-08 20:53:17 ERROR Could not find location zheijang; is_country: False; in_country: None. Error message: 'NoneType' object has no attribute 'raw'.
If you make any updates to this, could you please make them in the online Excel datatable and let me know? They would need to be fixed in multiple sheets because ImpactDB_manual_copy_07082024 is a snapshot copy taken yesterday.
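To help spot likely misspellings among the failed names, a fuzzy match against a list of known place names might be useful. This is a rough sketch using Python's standard `difflib`; the function `suggest_spelling` and the `known` list are hypothetical, and the suggested corrections are guesses that would still need manual confirmation against the gold sheet:

```python
import difflib


def suggest_spelling(name, known_names, cutoff=0.8):
    """Return the closest known name, or None if nothing is similar enough."""
    matches = difflib.get_close_matches(name.lower(), known_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None


# Hypothetical reference list; in practice this would come from a gazetteer.
known = ["zhejiang", "iloilo", "bijeljina"]

for failed in ["zheijang", "illoilo", "biljeljina", "lake hong"]:
    # Names with no close match (e.g. "lake hong") yield None and need a manual look.
    print(failed, "->", suggest_spelling(failed, known))
```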
For 1, I can do that when we fix some of the location names in the list I tagged you in, so we only have to do it once. I will add number 2 to the to-do list.
@liniiiiii this part is a little confusing for me. Do you mean they are events with only one URL in the Source column and only one "Event_ID" (decimal event_id?) I checked the snapshot we took (ImpactDB_manual_copy_07082024) and there is only one URL in all of them 🤔
@liniiiiii do we need to know anything about this or is this just a general note? |
@i-be-snek what I mean is that some events only have an ID ending in .00 and no sub-events, because human annotators did not fill them in. As far as I know, not all events in the gold have been annotated with sub-events due to the workload, so I think we can flag them in the sub-event evaluation; otherwise we will get a lot of penalty scores.
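One possible way to flag such events, sketched under the assumption that IDs are strings of the form `<main>.<sub>` where `.00` is the main event and any other suffix is a sub-event. The function name and the sample IDs below are made up for illustration:

```python
from collections import defaultdict


def events_without_sub_events(event_ids):
    """Return main-event numbers whose only ID is the '.00' entry."""
    suffixes = defaultdict(set)
    for event_id in event_ids:
        main, _, suffix = event_id.partition(".")
        # Treat an ID with no decimal part as the main event itself.
        suffixes[main].add(suffix or "00")
    return sorted(main for main, found in suffixes.items() if found == {"00"})


# Made-up sample: only event 142 has no sub-event rows.
ids = ["141.00", "141.01", "142.00", "143.00", "143.02"]
flagged = events_without_sub_events(ids)
```

The flagged main events could then be skipped in the sub-event evaluation instead of being scored as misses.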
I am closing this PR to start a new one with a fresh TODO list to match our frozen guidelines/instructions. |
This PR does two things:
@liniiiiii
Could you please review this branch and see if it works for you as well?
Additionally, what do you think about adding another function to the same python file (gold_from_excel.py) that takes the "Location" column, turns it to Country and Location, and then performs location normalization? If you like that idea, let me know in a comment and I will add it. Then what ends up in the parquet is very close and ready for evaluation ;)
Checklist:
impact.v2.db
) with the correct column names in the schema, ready for data insertion from the llms