Normalization issue in the raw output #26

Closed
liniiiiii opened this issue Jul 2, 2024 · 12 comments
Labels: backlog (Things to do later), enhancement (New feature or request)

Comments

@liniiiiii
Collaborator

In the GPT4o experiment, total_damage for event ID 60.0 is not normalized correctly:
the raw value is "Rs14,000 crore (US$2.18 billion)", but the normalized output is NULL.

@liniiiiii
Collaborator Author

liniiiiii commented Jul 2, 2024

In event ID 87, the article says the damage is "Moderate".
In event IDs 96, 116, and 99, the article says the damage is "minimal".
In event ID 106, the article says "Little damage".
How do we normalize these?

@liniiiiii
Collaborator Author

liniiiiii commented Jul 2, 2024

Do we normalize expressions such as "several communities" and "three towns"?
EventID 1
'Czech Republic[edit]': 'Heavy rainfall led to flooding especially in North Bohemia, and several communities near Česká Lípa had to be evacuated

EventID 82
In neighboring Belize, officials began evacuating three towns near the border with Guatemala after flood waters rose

@i-be-snek i-be-snek self-assigned this Jul 8, 2024
@i-be-snek
Collaborator

> from the GPT4o experiment, in the event ID 60.0, the total_damage is not well normalized, the raw Rs14,000 crore (US$2.18 billion), and the normalized output is NULL

We can handle this in #19

@i-be-snek
Collaborator

> in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?

"Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19

@i-be-snek
Collaborator

i-be-snek commented Jul 11, 2024

More cases:

  • 207 million córdoba -> None
  • 2 billion MXN -> None
  • Not specified -> None
  • Minor [insurance] losses -> None
  • Unknown specific amount -> None
  • Not quantified -> None
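
A minimal sketch of how amounts like the ones above could be parsed, assuming a simple "number plus scale word" pattern. The names `SCALES`, `AMOUNT_RE`, and `parse_amount` are hypothetical and are not taken from the project's normalizer; currency conversion and inflation adjustment are deliberately left out.

```python
import re

# Hypothetical scale words. "lakh" (1e5) and "crore" (1e7) are South Asian units.
SCALES = {
    "thousand": 1e3,
    "lakh": 1e5,
    "million": 1e6,
    "crore": 1e7,
    "billion": 1e9,
}

# A number with optional thousands separators and decimals,
# optionally followed by a word that may be a scale.
AMOUNT_RE = re.compile(r"(?P<number>\d[\d,]*(?:\.\d+)?)\s*(?P<scale>[A-Za-z]+)?")


def parse_amount(text):
    """Return the first numeric amount in `text` as a float, or None if there is none."""
    match = AMOUNT_RE.search(text)
    if match is None:
        return None
    number = float(match.group("number").replace(",", ""))
    scale = (match.group("scale") or "").lower()
    # Unknown words (for example a currency code) leave the number unscaled.
    return number * SCALES.get(scale, 1.0)


for raw in ["Rs14,000 crore (US$2.18 billion)", "207 million córdoba", "Not specified"]:
    print(raw, "->", parse_amount(raw))
```

Under these assumptions only the first amount in the string is read, so "Rs14,000 crore (US$2.18 billion)" is parsed as the rupee figure, and purely qualitative strings still come back as None.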

@liniiiiii
Collaborator Author

> > in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?
>
> "Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19

For these, I revised all of them to NULL in the gold data, because we don't assume a number there. Is that workable?

@i-be-snek
Collaborator

> > > in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?
> >
> > "Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19
>
> For this, in the gold data, I revised all of them to NULL, because we don't assume a number there, is it workable?

I think that's okay from my end :) we have a big list of expressions that evaluate to None, I can add these all in #19 later on
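
A minimal sketch of such a lookup, assuming exact matching after whitespace and case cleanup. The set `NULL_EXPRESSIONS` and the function `normalize_damage_text` are hypothetical; the project's real list of expressions is not shown in this thread.

```python
# Hypothetical expressions that carry no usable number, collected from this thread.
NULL_EXPRESSIONS = {
    "moderate",
    "minimal",
    "little damage",
    "not specified",
    "not quantified",
    "unknown specific amount",
}


def normalize_damage_text(raw):
    """Return None for qualitative expressions, otherwise the stripped raw string."""
    if raw is None:
        return None
    # Collapse repeated whitespace and ignore case before the lookup.
    cleaned = " ".join(raw.split()).lower()
    if cleaned in NULL_EXPRESSIONS:
        return None
    return raw.strip()


print(normalize_damage_text("Moderate"))        # None
print(normalize_damage_text("2 billion MXN"))   # 2 billion MXN
```

Returning None rather than 0 matches the decision above not to assume a number for "minimal" or "little" damage.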

@i-be-snek i-be-snek added the enhancement New feature or request label Jul 12, 2024
@liniiiiii
Collaborator Author

liniiiiii commented Jul 19, 2024

Hi @i-be-snek, when the raw value for this column is "No", it should be converted to "No" rather than 0. Could you check?

Event_ID 1.0
raw: "Total_Economic_Damage_Inflation_Adjusted": "No",
output: "Total_Damage_Inflation_Adjusted": 0

same for this column:

raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
output:"Total_Insured_Damage_Inflation_Adjusted":0,

I found 53 events with this issue for Total_Economic_Damage_Inflation_Adjusted, and 6 for Total_Insured_Damage_Inflation_Adjusted.

@i-be-snek
Collaborator

During the evaluation, No, no, NO, n, FALSE, False and even 0 (plus other variations) are converted to False

The same applies to Yes and its variants, so even if the gold says "YES" or "No", it's okay.
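
A minimal sketch of that kind of conversion, assuming a small fixed set of accepted spellings. `TRUE_TOKENS`, `FALSE_TOKENS`, and `normalize_bool` are hypothetical names and do not reproduce the evaluation code.

```python
# Hypothetical token sets; the real evaluation accepts "other variations" not listed here.
TRUE_TOKENS = {"yes", "y", "true", "1"}
FALSE_TOKENS = {"no", "n", "false", "0"}


def normalize_bool(value):
    """Map common yes/no spellings to a bool; return None if the value is not recognised."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        # Only 0 and 1 are treated as flags; other numbers are left undecided.
        return bool(value) if value in (0, 1) else None
    token = str(value).strip().lower()
    if token in TRUE_TOKENS:
        return True
    if token in FALSE_TOKENS:
        return False
    return None


print([normalize_bool(v) for v in ["No", "no", "NO", "n", "FALSE", "False", 0]])
print([normalize_bool(v) for v in ["YES", "maybe"]])
```

With this sketch every listed spelling of "No" maps to False, "YES" maps to True, and an unrecognised value such as "maybe" maps to None instead of being guessed.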

I agree, though, that this should probably be normalized properly in the gold data to "True" and "False".

The parse_events.py script should convert "No" to False, but I think what happens is that "False" is evaluated as 0 when it is turned into .parquet.

The result below:

raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
output:"Total_Insured_Damage_Inflation_Adjusted":0,

Is this what you get after running parse_events.py on that raw file?

@liniiiiii
Collaborator Author

liniiiiii commented Jul 23, 2024

> During the evaluation, No, no, NO, n, FALSE, False and even 0 (plus other variations) are converted to False
>
> Same thing for Yes, etc. So even if the gold says "YES", or "No", it's okay.
>
> Although I agree, that should probably be normalized properly in the gold data to "True" and "False"
>
> The parse_events.py script should convert "No" to False, but I think what happens is that "False" is evaluated as 0 when it is turned into .parquet.
>
> The result below:
>
> raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
> output:"Total_Insured_Damage_Inflation_Adjusted":0,
>
> Is this what you get after running parse_events.py on that raw file?

@i-be-snek Yes, I ran parse_events.py on the file below.
The raw file is

https://github.com/VUB-HYDR/Wikimpacts/blob/prompt_GPT4o/Database/raw/ESSD_2024/Wiki_dev_set_GPT4o_20240715_full_text_feeding_74single_events_main_col_name_updated_ni_0719.json

the output file is

https://github.com/VUB-HYDR/Wikimpacts/blob/prompt_GPT4o/Database/output/ESSD_2024/test_dev/Wiki_dev_set_GPT4o_20240718_full_text_feeding_74single_events_main_col_name_matched_fixed.parquet

When I search for Total_Damage_Inflation_Adjusted in the output, I find this:

"Total_Damage_Inflation_Adjusted":0,

@i-be-snek
Collaborator

This will have to have a low priority now since it won't affect the flow. We can add it to your list of "improvements" when we have the core features out.

@i-be-snek i-be-snek pinned this issue Jul 28, 2024
@i-be-snek i-be-snek unpinned this issue Jul 28, 2024
@i-be-snek i-be-snek added the backlog Things to do later label Jul 28, 2024
@i-be-snek
Collaborator

i-be-snek commented Sep 24, 2024

I think at this point we can agree that 0 is evaluated as False in Python (and vice versa) and that this has not been causing issues. Closing.
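
A short illustration of that equivalence in plain Python. The helper `to_number` is hypothetical and only stands in for whatever cast happens when a column is stored as numeric; it is not the behaviour of parse_events.py or of the .parquet writer.

```python
# bool is a subclass of int in Python, so False and 0 compare equal.
print(isinstance(False, int))  # True
print(False == 0, True == 1)   # True True


def to_number(value):
    """Cast a value the way a numeric column might (assumed behaviour)."""
    return float(value)


# After the cast, a False flag and a genuine zero are the same number.
print(to_number(False), to_number(0))  # 0.0 0.0
```

The comparison works in both directions, which is why the evaluation is unaffected, although a False flag can no longer be told apart from a real zero once it has been cast.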
