Normalization issue in the raw output #26

liniiiiii · 2024-07-02T09:27:05Z

From the GPT4o experiment: in event ID 60.0, total_damage is not normalized correctly.
The raw value is Rs14,000 crore (US$2.18 billion), but the normalized output is NULL.

liniiiiii · 2024-07-02T12:56:00Z

In event ID 87, the article says the damage is "Moderate".
In event IDs 96, 116, and 99, the article says the damage is "minimal".
In event ID 106, the article says "Little damage".
How do we normalize these?

liniiiiii · 2024-07-02T14:04:49Z

Do we normalize expressions such as "several communities" or "three towns"?
EventID 1
'Czech Republic[edit]': 'Heavy rainfall led to flooding especially in North Bohemia, and several communities near Česká Lípa had to be evacuated

EventID 82
In neighboring Belize, officials began evacuating three towns near the border with Guatemala after flood waters rose

i-be-snek · 2024-07-09T15:01:34Z

> from the GPT4o experiment, in the event ID 60.0, the total_damage is not well normalized, the raw Rs14,000 crore (US$2.18 billion), and the normalized output is NULL

We can handle this in #19

i-be-snek · 2024-07-09T15:02:33Z

> in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?

"Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19

i-be-snek · 2024-07-11T09:09:55Z

More cases:

207 million córdoba -> None
2 billion MXN -> None
Not specified -> None
Minor [insurance] losses -> None
Unknown specific amount -> None
Not quantified -> None
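
To make the expected behaviour for these cases concrete, here is a minimal sketch, not the code from #19 or from parse_events.py. The function name `normalize_amount`, the `SCALES` table, and the regular expression are all assumptions made for illustration. It scales the first number it finds by a magnitude word (including "crore", which is 10^7 in the Indian numbering system) and returns None when the text contains no number at all. Currency is ignored here.

```python
import re
from typing import Optional

# Hypothetical scale table (an assumption, not the project's list).
# "crore" is 10**7 in the Indian numbering system.
SCALES = {
    "thousand": 10**3,
    "million": 10**6,
    "crore": 10**7,
    "billion": 10**9,
}

# A number with optional thousands separators and decimals,
# optionally followed by one of the scale words above.
_AMOUNT = re.compile(
    r"(?P<num>\d[\d,]*(?:\.\d+)?)\s*(?P<scale>thousand|million|crore|billion)?",
    re.IGNORECASE,
)


def normalize_amount(raw: str) -> Optional[float]:
    """Return the first numeric amount in `raw`, or None if nothing is quantified."""
    match = _AMOUNT.search(raw)
    if match is None:
        # "Not specified", "Minor [insurance] losses", "Moderate", ...
        return None
    value = float(match.group("num").replace(",", ""))
    scale = match.group("scale")
    if scale:
        value *= SCALES[scale.lower()]
    return value


for text in ["207 million córdoba", "2 billion MXN", "Rs14,000 crore", "Not quantified"]:
    print(text, "->", normalize_amount(text))
```

With this approach, qualitative expressions fall through to None without needing to be listed one by one, while a separate list of None-expressions would still be needed for strings that contain digits but no real amount.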

liniiiiii · 2024-07-12T11:58:27Z

> > in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?
>
> "Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19

For these, I revised all of them to NULL in the gold data, because we don't assume a number there. Does that work?

i-be-snek · 2024-07-12T13:41:08Z

> > > in event ID 87, the article says the damage is Moderate, in event ID 96/116/99, the article says the damage is minimal, in event ID 106, the article says Little damage, how do we normalize it?
> >
> > "Moderate" is a bit tricky for me... but I think "minimal" and "little" could be interpreted as zero. We can also handle these fixes in #19
>
> For this, in the gold data, I revised all of them to NULL, because we don't assume a number there, is it workable?

I think that's okay from my end :) we have a big list of expressions that evaluate to None, I can add these all in #19 later on

liniiiiii · 2024-07-19T11:56:05Z

Hi @i-be-snek, when the raw value for this column is "No", it should be converted to "No" rather than 0. Could you check?

Event_ID 1.0
raw:  "Total_Economic_Damage_Inflation_Adjusted": "No",
output: "Total_Damage_Inflation_Adjusted":0

same for this column:

raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
output:"Total_Insured_Damage_Inflation_Adjusted":0,

I found 53 events with this issue for Total_Economic_Damage_Inflation_Adjusted, and 6 for Total_Insured_Damage_Inflation_Adjusted.

i-be-snek · 2024-07-19T14:11:17Z

During the evaluation, No, no, NO, n, FALSE, False and even 0 (plus other variations) are converted to False

Same thing for Yes, etc
So even if the gold says "YES", or "No", it's okay.

Although I agree, that should probably be normalized properly in the gold data to "True" and "False"
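
As a rough illustration of the conversion described above, a yes/no normalizer could look like the sketch below. This is not the evaluation code itself: `normalize_bool` and the two sets of accepted spellings are assumptions, and the real list of variations may be longer.

```python
from typing import Optional, Union

# Assumed spellings; the actual evaluation may accept more variations.
TRUTHY = {"yes", "y", "true", "1"}
FALSY = {"no", "n", "false", "0"}


def normalize_bool(value: Union[str, int, bool, None]) -> Optional[bool]:
    """Map yes/no style values to True/False; return None if unrecognized."""
    if value is None:
        return None
    # str() lets bools and ints go through the same lookup as strings:
    # str(False) -> "False", str(0) -> "0".
    text = str(value).strip().lower()
    if text in TRUTHY:
        return True
    if text in FALSY:
        return False
    return None


for raw in ["No", "no", "NO", "n", "FALSE", False, 0, "YES"]:
    print(repr(raw), "->", normalize_bool(raw))
```

Returning None for unrecognized input, rather than defaulting to False, keeps a typo in the gold data from silently counting as "No".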

The parse_events.py script should convert "No" to False, but I think what happens is that "False" is evaluated as 0 when it is turned into .parquet.

The result below:

raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
output:"Total_Insured_Damage_Inflation_Adjusted":0,

Is this what you get after running parse_events.py on that raw file?

liniiiiii · 2024-07-23T12:46:41Z

> During the evaluation, No, no, NO, n, FALSE, False and even 0 (plus other variations) are converted to False
>
> Same thing for Yes, etc
> So even if the gold says "YES", or "No", it's okay.
>
> Although I agree, that should probably be normalized properly in the gold data to "True" and "False"
>
> The parse_events.py script should convert "No" to False, but I think what happens is that "False" is evaluated as 0 when it is turned into .parquet.
>
> The result below:
> raw:"Total_Insured_Damage_Inflation_Adjusted": "No",
> output:"Total_Insured_Damage_Inflation_Adjusted":0,
> Is this what you get after running parse_events.py on that raw file?

@i-be-snek Yes, I ran parse_events.py on the file below.
The raw file is

https://github.com/VUB-HYDR/Wikimpacts/blob/prompt_GPT4o/Database/raw/ESSD_2024/Wiki_dev_set_GPT4o_20240715_full_text_feeding_74single_events_main_col_name_updated_ni_0719.json

the output file is

https://github.com/VUB-HYDR/Wikimpacts/blob/prompt_GPT4o/Database/output/ESSD_2024/test_dev/Wiki_dev_set_GPT4o_20240718_full_text_feeding_74single_events_main_col_name_matched_fixed.parquet

When I search for Total_Damage_Inflation_Adjusted in the output, I find this:

"Total_Damage_Inflation_Adjusted":0,

i-be-snek · 2024-07-28T21:12:00Z

This will have to be low priority for now, since it doesn't affect the flow. We can add it to your list of "improvements" when we have the core features out.

i-be-snek · 2024-09-24T12:11:18Z

I think at this point we can agree that 0 is evaluated as False in Python (and vice versa) and that this has not been causing issues. Closing.
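
For reference, the equivalence can be checked directly in a Python session. The helper `flags_match` below is hypothetical, written only for this illustration, and it assumes both values have already been parsed to a bool or an int.

```python
def flags_match(gold, predicted) -> bool:
    """Hypothetical helper: compare two already-parsed flags by truthiness."""
    return bool(gold) == bool(predicted)


# bool is a subclass of int, so False and 0 compare equal, as do True and 1.
print(issubclass(bool, int))   # True
print(False == 0, True == 1)   # True True
print(flags_match(False, 0))   # True

# Caveat: any non-empty string is truthy, so a raw "No" still has to be
# parsed to False before a comparison like this is meaningful.
print(bool("No"))              # True
```

The caveat in the last lines is why a 0 in the .parquet output is harmless only after the raw "No" has been parsed.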

i-be-snek self-assigned this Jul 8, 2024

i-be-snek added the enhancement label Jul 12, 2024

i-be-snek pinned this issue Jul 28, 2024

i-be-snek unpinned this issue Jul 28, 2024

i-be-snek added the backlog label Jul 28, 2024

i-be-snek closed this as completed Sep 24, 2024
