-
Thanks for summarizing all these points. :) I wanted to create a PoC for my proposal, so here is a brief overview of the idea.

First, I manually labelled three articles from Wikipedia for three categories: deaths, injuries, and number of residential buildings destroyed. I didn't copy from our existing annotation table, so some differences may exist. Here is a snapshot:

*(screenshot of the annotation sheets not included)*

Each category has its own table in a separate sheet. Essentially, each sheet is a list of every mention of an impact in the Wikipedia article (see the Excel notes for the verbatim source). The GID was determined either by simple lookup or by visiting the location's Wikipedia page to check which country, town, or district it belongs to. We could automate this process by using OpenStreetMap, modifying the existing functions that grab the GID, and having GADM in a RAG setup. If a town is not represented by a GID, the deepest GID level within which it falls is selected. For example, I live in Salabacke, a neighbourhood in Uppsala, Sweden, which has no GID of its own; I would choose the GID "SWE.16.7_1" to represent it, so not the Uppsala region but one level deeper: https://gadm.org/maps/SWE/uppsala/uppsala.html.

By treating the annotations as a single list of impact information, we can adequately measure how well the LLMs extract this information without worrying about levels.

After extracting the data, the challenge is to determine which numbers overlap so that the quantitative columns can be aggregated correctly. I wrote a quick script that takes the data from Excel and checks every row for intersections. It makes a few assumptions:

- (a) GIDs are treated like sets, so if impacts occurred on the same dates in a large area like a country (i.e. some predefined set of GIDs), then all impacts with deeper GID levels for that particular area are subsets of it; and
- (b) the impact numbers in the larger GID set must be at least as high as the aggregate of the deeper GIDs for the same location and the same dates.

Here is an example from the play data in the Excel sheet:

*(screenshot not included)*

The article mentions that 13 people died in total in four countries: Cuba, the Dominican Republic, the USA, and Martinique (row 3). The algorithm checks which of these figures overlap; here, it found that the entry hpnS (row 4) contains within it the mention of a single death in USA.10.15_1 (Duval County: F23j, row 0). The algorithm also determined that the one death reported in Duval County is probably the same death listed in row 2 for CUB (Cuba) and USA.10_1 (Florida). It also finds which impact rows are included in the 13 deaths reported in XwcP, ensures that their sum doesn't exceed 13, and lists the impact rows that are subsets of it in the "Subsets" column.

Here is an example of what happens when there is an inconsistency:

*(screenshot not included)*

In the image above, the article reports that USA.47_1 (Virginia) had 2-9 homes destroyed. But on the same date, we have a report of 110 homes destroyed in USA.47.120_1 (Suffolk, Virginia). To consolidate this conflict, a new row uJkB is created (row 43). Its "Inferred" flag is set to True, meaning that it's a row we added ourselves that didn't come directly from the article. The number 110 is thus contained within impacts wzn4 (row 37) and dL7a (row 44). So if we want to query the total number of residential buildings destroyed across the US, we can ignore any rows mentioned in the last column, "Subsets".
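To make the overlap logic concrete, here is a minimal sketch of the subset check, written from the description above rather than taken from the actual script. The `ImpactRow` fields, the prefix-based GID containment test, and the naive summation are my own simplifications; in particular, a real implementation would need to avoid double-counting subset rows that describe the same impact (like the Duval County death appearing in two rows).

```python
from dataclasses import dataclass, field


def gid_code(gid: str) -> str:
    """Strip the GADM version suffix, e.g. 'USA.10.15_1' -> 'USA.10.15'."""
    return gid.split("_")[0]


def gid_within(child: str, parent: str) -> bool:
    """True if `child` falls inside `parent` in the GADM hierarchy."""
    c, p = gid_code(child), gid_code(parent)
    return c == p or c.startswith(p + ".")


@dataclass
class ImpactRow:
    row_id: str                 # e.g. "XwcP"
    gids: set[str]              # e.g. {"CUB", "USA.10_1"}
    date: str                   # rows are only compared on matching dates
    count: int                  # the quantitative column, e.g. deaths
    inferred: bool = False      # True for rows we add ourselves
    subsets: list[str] = field(default_factory=list)


def is_subset(a: ImpactRow, b: ImpactRow) -> bool:
    """Assumption (a): row `a` is a subset of row `b` if they share a date
    and every GID in `a` lies within some GID in `b`."""
    if a.row_id == b.row_id or a.date != b.date:
        return False
    return all(any(gid_within(g, p) for p in b.gids) for g in a.gids)


def link_subsets(rows: list[ImpactRow]) -> None:
    """Fill each row's "Subsets" column and flag violations of
    assumption (b): a parent's count must cover its subsets' total."""
    for parent in rows:
        parent.subsets = [r.row_id for r in rows if is_subset(r, parent)]
        # Naive sum: the real script must also merge subset rows that
        # report the same impact (e.g. the single Duval County death).
        total = sum(r.count for r in rows if r.row_id in parent.subsets)
        if total > parent.count:
            print(f"Inconsistency at {parent.row_id}: "
                  f"subsets sum to {total} > {parent.count}")
```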
It's likely that this approach has problems, especially in the assumptions it makes about how to determine overlap across different locations. The querying function should be deterministic, always returning the same output for the same input.
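A sketch of what such a deterministic query could look like, reusing `ImpactRow` and `gid_within` from the sketch above (the function name and the single-GID signature are hypothetical):

```python
def total_impact(rows: list[ImpactRow], area_gid: str) -> int:
    """Deterministically aggregate the quantitative column over an area:
    rows listed in another kept row's "Subsets" column are skipped, so
    nothing is counted twice and repeated runs give the same total."""
    kept = [r for r in rows if any(gid_within(g, area_gid) for g in r.gids)]
    shadowed = {rid for r in kept for rid in r.subsets}
    return sum(r.count for r in kept if r.row_id not in shadowed)
```

For instance, `total_impact(rows, "USA")` would skip the Duval County and Florida rows because they appear in the 13-death row's "Subsets" column. Rows spanning several countries, like that one, show exactly where the overlap assumptions need more thought.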
-
I also want to say, in general, that it's inadvisable to try to make a decision on this by Friday, 6 December. Designing a database that contains a unique set of information is a substantial effort: there is a lot to think about, and it calls for solid data engineering and management practices. This part of designing the V2 structure needs time and requires dedicated resources for a pilot run of annotations and for modifying or improving the post-processing functions. If the design is clear, the annotations can go faster than last time. We need a schema, annotation instructions, pilot annotations to test on and think about, etc. I also don't think it's enough for a single person to annotate. If @camitrba is the only annotator, we are already making a big mistake.
-
An idea just came to mind about impact information extraction with the LLM: we could ask the model to extract the text containing the impact, then use our defined keywords to normalize the results and sort them into the related categories, instead of asking the model to give a number directly. In the annotation, we could also keep the keyword in the table. I don't know if this would work better for our database. Any ideas and comments on it are welcome, thanks!
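For example, the normalization step could look something like the sketch below; the category names and keyword lists here are just placeholders for our defined keywords:

```python
# Hypothetical keyword -> category table; the real lists would come
# from our defined keywords.
CATEGORY_KEYWORDS = {
    "deaths": ["died", "killed", "fatalities", "death toll"],
    "injuries": ["injured", "wounded", "hospitalized"],
    "buildings_destroyed": ["homes destroyed", "houses destroyed",
                            "buildings destroyed"],
}


def categorize(extracted_span: str) -> list[str]:
    """Map an LLM-extracted text span to impact categories by keyword
    matching, instead of asking the model for a number directly."""
    span = extracted_span.lower()
    return [category for category, keywords in CATEGORY_KEYWORDS.items()
            if any(kw in span for kw in keywords)]


# categorize("At least 13 people were killed across four countries")
# -> ["deaths"]
```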
-
If I understand your idea correctly, this would mean bypassing the language understanding capabilities of the LLM and instead using keyword matching techniques. The only advantage I can see with this approach is more transparency, since we can explain exactly why a particular piece of information has been extracted, but in terms of performance I think it would be vastly inferior to using the LLM for language understanding.
-
I agree. Keyword matching is an old technique with well-known limitations. It seems much better to explore more advanced LLM techniques if the current prompts do not give the desired results.
In reply to Shorouq's comment:

> I don't think you can really create guardrails for LLMs using keywords in prompting. So even though you may try to create a keyword constraint, it will not work as intended in this case. I think the keywords are good for giving the LLM examples of preferred output, but I don't think it works to use them to restrict the output. For the impact definitions and classification, you could also consider simpler classification models rather than an LLM.
>
> There are many LLM prompting techniques that we have not had the time to experiment with. Now may be a good time to explore those.
-
Hi all, we have a long email thread discussing location annotation for V2. I have summarized the points we have so far, so we can decide here.
Comments are welcome, and I hope we can reach an agreement soon. Thanks!
*(screenshot of the summary table not included)*