81 level of information database #85

Merged

Conversation

i-be-snek
Collaborator

@i-be-snek i-be-snek commented Aug 21, 2024

This PR is one part of Issue #81.

The idea is to try to get everything from #81 to work in separate branches:

Each branch represents a checkpoint to complete Issue #81, so its progress will be merged gradually into 81-level-of-information-dev. Once the changes from all three sub-branches are in 81-level-of-information-dev and we know they work without problems, 81-level-of-information-dev can be merged with main.

Overview of the changes:

  • The schema has been updated and a fresh database has been spun up
  • We can now extract L1-3 from the Excel data table, as long as it follows the template specified in the Excel sheet. The README has more information on that.

Required to merge with 81-level-of-information-dev:

  • The new database schema corresponds with the drawio png image
  • There is a new empty database
  • We can extract L1-3 from the flat Excel file. Note: we don't have the final data yet, only a working copy, because the flat data table still needs more edits. As long as we have a file that exactly matches the template in the Excel sheet, we should have no problems
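
The "exactly matches the template" requirement can be checked before extraction. Below is a minimal sketch, not the repository's actual check: `diff_header` and the column names are hypothetical, and the header rows are assumed to have already been read from the Excel file into plain lists.

```python
def diff_header(actual, template):
    """Compare a sheet's header row with the template's header row."""
    return {
        # Columns the template requires but the sheet lacks.
        "missing": [c for c in template if c not in actual],
        # Columns in the sheet that the template does not define.
        "unexpected": [c for c in actual if c not in template],
        # True only when names and order are identical.
        "exact_match": list(actual) == list(template),
    }


# Hypothetical column names, for illustration only.
TEMPLATE = ["Event_ID", "Level", "Location"]
report = diff_header(["Event_ID", "Location", "Notes"], TEMPLATE)
print(report)
```

Reporting missing and unexpected columns separately makes it easier to see what to fix in a working copy that has drifted from the template.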

@liniiiiii could you please check these three things specified in the list above? If you think they make sense, please tick the box. If you find anything funny, let me know.

@i-be-snek i-be-snek linked an issue Aug 21, 2024 that may be closed by this pull request
13 tasks
@i-be-snek

This comment was marked as resolved.

@liniiiiii

This comment was marked as resolved.

@i-be-snek

This comment was marked as resolved.

@liniiiiii
Collaborator

@liniiiiii American states are stored as "Oregon, United States". These are not rendered so nicely when viewing parquet files in VSCode. Maybe this screenshot can make it more clear:

image

For example, the one you have with Oregon is actually like this:

# how it may look
["Multnomah County", "Oregon", "United States", "Oregon", "United States"]

# how it actually is
["Multnomah County, Oregon, United States", "Oregon, United States"] 

We can detach "United States" from these if you think it will make things clearer; that's not an issue.
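
A small sketch of why the flattened view is ambiguous. It assumes (this is not confirmed in the thread) that the viewer shows list items joined with ", ", which makes two stored entries read like five.

```python
# Two normalized location strings, as stored in the list column.
locations = ["Multnomah County, Oregon, United States", "Oregon, United States"]

# Assumed viewer behaviour: list items are joined with ", " for display.
rendered = ", ".join(locations)

# Splitting the rendered text on ", " gives five fragments, not two entries.
fragments = rendered.split(", ")

print(len(locations))  # number of stored entries
print(len(fragments))  # number of fragments a reader might count
```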

@i-be-snek, I see. I have a question about the evaluation: when we detect the location to map the L3 event,

will the location "Oregon" be treated as the same as the location "Oregon, United States"?

If not, will the model get a penalty score?

@i-be-snek
Collaborator Author

i-be-snek commented Aug 28, 2024

@liniiiiii
Since we use the same normalization script, any mention of Oregon (even as "OR, US", where OR is the state code of Oregon) will be normalized to the string "Oregon, United States". So when we evaluate, the values from the gold (which uses the same script) will match those of the LLM.

If you think this is confusing or inconsistent, I can strip ", United States" from the output whenever we get that result back from OSM. It's not a big problem.
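
To illustrate the idea, here is a minimal sketch, not the project's normalization script. The alias table and the function names are hypothetical stand-ins; the real script resolves names through OSM. The point is only that gold and LLM values pass through the same function before they are compared.

```python
# Hypothetical alias table; the real script resolves names through OSM.
ALIASES = {
    "oregon": "Oregon, United States",
    "or, us": "Oregon, United States",
    "oregon, united states": "Oregon, United States",
}


def normalize_location(raw: str) -> str:
    """Map a raw mention to its canonical string; unknown names pass through."""
    key = " ".join(raw.strip().lower().split())
    return ALIASES.get(key, raw.strip())


def locations_match(gold: str, predicted: str) -> bool:
    """Compare two mentions after normalizing both with the same function."""
    return normalize_location(gold) == normalize_location(predicted)


def strip_country(normalized: str, country: str = "United States") -> str:
    """Optionally drop a trailing ', <country>' from a normalized string."""
    suffix = ", " + country
    return normalized[: -len(suffix)] if normalized.endswith(suffix) else normalized


print(locations_match("Oregon", "OR, US"))
print(strip_country("Oregon, United States"))
```

Because both sides are normalized the same way, whether the country suffix is kept or stripped does not affect matching, as long as the choice is applied consistently.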

@liniiiiii
Collaborator

@i-be-snek, I see. So we normalize to the same name and GID and then match them. In that case there is no need to spend time stripping the country, and it's also fine to keep it in the final database!

@liniiiiii
Collaborator

@i-be-snek, for the 1st and 3rd tasks, I think the script is ready to proceed. What should I check for the 2nd task, the new empty database?

@i-be-snek
Collaborator Author

@liniiiiii For the 2nd task, it would be great if you could take a peek inside impact.v1.db and make sure the table and column names all look good and that all categories are in it.

I use this tool in VSCode to view it more easily
image
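
For a check without a viewer, the table and column names can also be listed from a script. This is a minimal sketch that assumes the `.db` file is a SQLite database (not stated in the thread); `list_schema` is a hypothetical helper, not part of the repository.

```python
import sqlite3


def list_schema(db_path: str) -> dict[str, list[str]]:
    """Return {table_name: [column names]} for every user table in a SQLite file."""
    conn = sqlite3.connect(db_path)
    try:
        tables = [
            row[0]
            for row in conn.execute(
                "SELECT name FROM sqlite_master "
                "WHERE type = 'table' AND name NOT LIKE 'sqlite_%' "
                "ORDER BY name"
            )
        ]
        schema = {}
        for table in tables:
            # PRAGMA table_info rows: (cid, name, type, notnull, dflt_value, pk)
            columns = conn.execute(f'PRAGMA table_info("{table}")').fetchall()
            schema[table] = [col[1] for col in columns]
        return schema
    finally:
        conn.close()
```

Calling `list_schema("impact.v1.db")` and printing the result gives a plain listing that can be compared against the agreed variable names.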

@i-be-snek i-be-snek force-pushed the 81-level-of-information-database branch from a42ceef to 1f93227 Compare August 30, 2024 13:16
@i-be-snek
Collaborator Author

@liniiiiii

I've now pushed some changes so that all table columns, whether in the gold or the database, comply with our new selected names.

The changes have not been made for the parsing yet, but they will be soon in #88! Right now the changes are only reflected in the gold data and the database schema.

@liniiiiii
Collaborator

@i-be-snek, hi, thanks. I checked impactdb.v1.db, and it matches the variable names we discussed!

@i-be-snek
Collaborator Author

@liniiiiii That's great, thanks!

I'm going to merge this with 81-level-of-information-dev. I will merge the other branches for Issue #81 there too until it all looks good.

@i-be-snek i-be-snek merged commit b450e6b into 81-level-of-information-dev Aug 30, 2024
i-be-snek added a commit that referenced this pull request Sep 8, 2024
- The new database schema corresponds with the drawio png image

- There is a new empty database

- We can extract L1-3 from the flat excel. Note: we don't have the final data for that, only a working copy since more edits are required in the flat data table, but as long as we have a file that exactly matches the template in the excel sheet, we should have no problems
i-be-snek added a commit that referenced this pull request Sep 8, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 16, 2024
i-be-snek added a commit that referenced this pull request Sep 19, 2024
@i-be-snek i-be-snek deleted the 81-level-of-information-database branch October 28, 2024 15:14
Development

Successfully merging this pull request may close these issues.

Level of information: introducing L1-L3 (3 PRs)
3 participants