USAS taxonomy #202
I'm not sure we can remove the subcategory separators entirely as this would lead to duplicate IDs between some categories e.g. Regarding changing |
Good suggestions, thanks. I made a simple XSLT that implements these modifications in e7bbb29. As the original taxonomy is still there, we can still change the format of the IDs, if @perayson has better suggestions. Note that the corpus annotations will need to use (references to) IDs, rather than the labels. |
The tagset uses '.' as a separator (rather than colon) to distinguish levels in the hierarchy. The core tagset of 232 tags is defined here: https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf with descriptions of the plus/minus subcategories here: https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt (which is what @matthewcoole has converted I believe). Normally, you can have up to three '+' or '-' to indicate antonyms, comparatives and superlatives, see here for more details: https://ucrel.lancs.ac.uk/usas/usas_guide.pdf. There will be a problem if this needs to be used for validation since the tagger can in theory combine two or more of these tags together with a '/' separator to indicate that a coarse grained sense fits into two or more parts of the taxonomy. |
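To make the tag structure described above concrete, here is a minimal Python sketch that splits a tag on '/' and decomposes each part into its letter, '.'-separated levels, and up to three '+' or '-'. The regular expression, the `UsasComponent` class and the function name are my own assumptions for illustration and are not taken from the USAS tools; in particular, accepting trailing lowercase letters and the '%'/'@' markers as "extras" is a guess at the surface syntax.

```python
import re
from dataclasses import dataclass

# One '/'-separated part of a tag: a capital letter, a number, optional
# '.'-separated sub-levels, then up to three '+' or up to three '-'.
# Assumption: trailing lowercase letters and '%'/'@' are kept apart as extras.
_COMPONENT = re.compile(r"^([A-Z])(\d+(?:\.\d+)*)(\+{1,3}|-{1,3})?([a-z%@]*)$")


@dataclass(frozen=True)
class UsasComponent:
    field: str       # top-level letter, e.g. "S"
    levels: tuple    # numeric hierarchy, e.g. (7, 1) for "S7.1"
    polarity: str    # "", "+", "++", "+++", "-", "--" or "---"
    extras: str      # e.g. "mf", "c", "%"


def parse_usas_tag(tag):
    """Split a (possibly '/'-joined) tag into its components."""
    components = []
    for part in tag.split("/"):
        match = _COMPONENT.match(part)
        if match is None:
            raise ValueError(f"not a recognisable USAS tag part: {part!r}")
        field, levels, polarity, extras = match.groups()
        components.append(
            UsasComponent(
                field=field,
                levels=tuple(int(n) for n in levels.split(".")),
                polarity=polarity or "",
                extras=extras,
            )
        )
    return components


if __name__ == "__main__":
    for component in parse_usas_tag("S7.1+/S2mf"):
        print(component)
```

A check like this could only validate the shape of a tag. Whether a given combination actually exists in the tagset would still need a lookup against the list of categories.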
Thanks @perayson for the links, very useful. There could be a problem with several pluses or minuses, if we don't know in advance which categories can have them and how many. However, we could do a bottom-up approach, by first tagging, and then inserting all found tags in the taxonomy. But on p1 of https://ucrel.lancs.ac.uk/usas/usas_guide.pdf I also see: These could be a problem, if the tool actually outputs such symbols, as then the idea that each USAS tag can be represented as an ID starts to become increasingly suspect, and we might have to think about some sort of decomposition of the tags. |
I figured the additional symbols could just be added to the taxonomy. The initial taxonomy was generated purely from the semantic categories, but we could add to it. Then the additional symbols used can just be added in the |
This would work for gender, but e.g. "%" is a modifier of a particular semantic label and if you have two (say How many times this happens in practice is another question - if never, then there is no problem, and this is a great solution! |
But if we use
|
I find it unusual to have the same token targeted twice, but, indeed, why not? So this would solve all the problems. Polarity would presumably also be a "modifier" like "rare". @matyaskopp, do you agree? |
I don't see a straightforward solution. An example from the documentation (https://ucrel.lancs.ac.uk/usas/usas_guide.pdf):
A word bunker can be labelled with two tags:
Both of these tags can be split into two pointers:
<ptr ana="usas:G3 usas:H1" target="#word1"/>
<ptr ana="usas:K5.1 usas:W3" target="#word1"/>
But this implementation will fail when you want to add other symbols: % @ f m ...
modifiers So I think that the whole tag should be represented with one id - it represents one semantic meaning of a word. @matthewcoole how large is the semantic tagset?
|
Or does the USAS tagger assign only one tag based on context? |
Ok, my only thought if we really want to go down this line is that the So in the above example
I think in other projects using USAS we have simply dropped subsequent tags and only taken the first and most likely tag, i.e. |
You are now very close to the |
So would this mean adding an
|
No, I think it can be implemented like the "mte:" prefix in ParlaMint-SI: ParlaMint/Data/ParlaMint-SI/ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.ana.xml Lines 127 to 130 in bc6257e
and the prefix definition, which loads an external resource: ParlaMint/Data/ParlaMint-SI/ParlaMint-SI.ana.xml Lines 600 to 604 in bc6257e
So USAS will not be defined with a taxonomy but with a big list of feature structures, which will be stored in an external file. |
If I understand correctly, @matyaskopp proposes to use atomic USAS tags, but that they are pointers to an externally defined feature structure library:
|
The rarity markers are used internally as part of one disambiguation method. And the lower case letters are somewhat of a legacy from part of the analysis pipeline where we were trying to link anaphora. In Wmatrix, all these lower case letters and rarity markers are removed when it calculates frequencies, so I believe that they can be ignored here as well, so @matthewcoole could we remove them from the tagged output as part of your script? |
There are 232 main tags before any subcategories with positive/negative.
In theory no limit, but normally two, potentially it could be four if two slash tags are joined together by one of the MWE rules.
Up to three + and up to three -
Yes, in theory, but in practice not. |
What you're seeing in the vertical output is the list of all possible semantic tags, including those where you have a MWE tag preceding the single word semantic tags. If you select horizontal output on http://ucrel-api.lancaster.ac.uk/usas/tagger.html then it just picks the first one in the list which is the most likely. We could do the same for ParlaMint corpora. Of course, it is a precision versus recall trade off, and different users might prefer different things. In the MELC project, http://wp.lancs.ac.uk/melc/ we found that less likely tags lower down the list were good candidates for metaphor source and target domains, so Wmatrix retains them, but for frequency lists, only the first choice tag is used. |
Not sure I understand yet exactly what is required, but here https://github.com/UCREL/Multilingual-USAS might be a good place to store this or a new repo in the UCREL space would be fine. We're thinking that we'll need some tag verification potentially as part of https://pypi.org/project/pymusas/ as well but users should be allowed to extend the taxonomy for their own languages or domains if required. |
It is probably better to use the first tag if the tags are sorted, because we cannot simply sort these tags in TEI. These two encodings of the word admiral are equal in TEI:
<w ana="usas:G3/M4/S2mf usas:S7.1+/S2mf">admiral</w>
<w ana="usas:S7.1+/S2mf usas:G3/M4/S2mf">admiral</w> |
I think if we take only the first and most likely semantic tag (or combination), as suggested by @perayson in #204, and we also drop the additional markers and lowercase symbols used, we can code things up as suggested. e.g.
would simply become
Then expansion of the sub-categories would only be necessary for up to 3 + (pluses) or - (minuses). This would mean the taxonomy wouldn't need to get orders of magnitude bigger (as it would if expanding all sub-categories with variations for symbols like |
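The simplification proposed here (keep only the first tag, drop the lowercase letters and rarity markers) could be sketched roughly as follows. This is only an illustration: the function names are hypothetical, the input is assumed to be a comma-separated list as in the SEM field of the CoNLL-U excerpts further down, and the set of characters to strip (lowercase letters, '%', '@') is my reading of the thread rather than a definitive list.

```python
import re

# Assumption: everything to be dropped is a lowercase letter or one of the
# rarity markers '%' and '@'; '+', '-', '.', '/' and digits are kept.
_MARKERS = re.compile(r"[a-z%@]+")


def first_tag(sem_field):
    """Return the first (most likely) tag of a comma-separated list,
    with lowercase letters and rarity markers removed."""
    first = sem_field.split(",")[0]
    return _MARKERS.sub("", first)


def tag_components(sem_field):
    """Split the simplified first tag on '/' into its individual categories."""
    return first_tag(sem_field).split("/")


if __name__ == "__main__":
    for field in ["Z1mf,Z3c", "G1.1/S2mf,S9/S2mf", "G1.1c,S7.1+,S7.4+,X2.2+"]:
        print(field, "->", first_tag(field), tag_components(field))
```

With this, `G1.1/S2mf,S9/S2mf` reduces to the two categories `G1.1` and `S2`. One side effect of such naive stripping is that a tag like `Df` reduces to a bare `D`, which is not a category in the tagset.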
Sorry for falling asleep on this issue. But now we need to decide. I like the suggestion above very much - @perayson, do you agree that we do it like this? |
On a related note: the current taxonomy has one plus or minus for a category, as is covered by semtags_subcategories.txt. However, tags can have up to 3 pluses or minuses. The USAS Guide states:
So, would it be ok to add "comparative" and "superlative" to the taxonomy for such cases? ParlaMint/Corpora/Taxonomies/ParlaMint-taxonomy-USAS.ana.xml Lines 547 to 558 in 7785363
This would be expanded to:
Would this work, or do I misunderstand? |
As I understand I think it is ok this way. But I am not sure if we have made a final decision for other modifiers:
|
Yes, I'm fine with removing In terms of the up to three plusses or minuses, not all semantic tags will use all these combinations. Do you have to include all these options in the XML taxonomy? |
Great, glad to hear it! In the meantime I made a draft implementation of conversion to XML which produces structures such as
So, in a way, we can have our cake and eat it too.
Probably not, I guess we could filter according to which tags are actually used in the corpora. |
@perayson, I've been working with inserting missing "comparative" or "superlative" (so ++, +++, --, ---) categories into the USAS taxonomy. But I ran into a problem. You wrote:
However, there are tags such as "I1-", "N1+", which are not in the semtags_subcategories.txt and hence not in the taxonomy, so I don't know what they mean. My impression was that only two or three plusses/minuses need to be inserted, as there the interpretation is straightforward, but I have no idea how to gloss the antonym of "Money, generally" in an automatic way (or whether that would actually be helpful). That said (but I haven't done an extensive test), these pesky tags do not seem to appear in the first position, i.e. as the most likely tag, so the taxonomy might be able to do without them. Still, it would be nice for it to encompass all existing tags. Some examples of 'I1-' with only relevant columns retained:
|
@perayson, another problem: quite a few annotations have a "D" code, e.g.
However, looking at https://ucrel.lancs.ac.uk/usas/ there is no D top level code. What shall we do with this? |
@perayson, I can't fully process the already received -en corpora without having the taxonomy finalised, as it is also a part of the distribution and e.g. the vertical files for the concordancers need it in place.
|
Just getting back to this after the CLARIN conference and LREC-COLING deadline ... the tags can have up to three plusses or minuses. As explained in https://ucrel.lancs.ac.uk/usas/usas_guide.pdf "Antonymity of conceptual classifications is indicated by +/- markers on tags" and https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt is not an exhaustive list, so for a gloss for anything missing from that list, you can back off to the main category after removing any +/- extensions. My suggestion would be to stick to the 232 main categories listed on https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf and https://ucrel.lancs.ac.uk/usas/semtags.txt for the taxonomy itself. The back off is what Wmatrix does where the specific subcategory is not glossed in the additional list. |
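The back-off described above can be sketched in a few lines of Python. This is an illustration under assumptions, not what Wmatrix actually runs: the function name is hypothetical, and the `glosses` dictionary is a tiny hand-made excerpt standing in for the real lists in semtags.txt and semtags_subcategories.txt.

```python
def gloss_for(tag, glosses):
    """Look up a gloss for a tag; if the exact tag is not glossed,
    back off to the main category by removing trailing '+'/'-'."""
    if tag in glosses:
        return glosses[tag]
    main_category = tag.rstrip("+-")
    return glosses.get(main_category)  # None if even the main category is unknown


if __name__ == "__main__":
    # Illustrative excerpt only; the real lists are much longer.
    glosses = {
        "I1": "Money generally",
        "N1": "Numbers",
    }
    print(gloss_for("I1-", glosses))   # falls back to the gloss of I1
    print(gloss_for("N1+", glosses))   # falls back to the gloss of N1
```

The design choice here is to try the exact tag first, so that any subcategory that does have its own gloss in the additional list keeps it, and only the unglossed ones fall back to the main category.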
In terms of the "D" category, there isn't one but I expect this originates from the "Df" tag from which you're then removing the "f" as discussed above. This is an unimplemented feature in PyMUSAS (UCREL/pymusas#26) that affects a small number of MWEs in the English lexicon. Unless you fancy implementing the transfer of the other tag as described on the issue, then I would recommend the following fix:
|
Well, @matthewcoole made the taxonomy based on https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt , so I would prefer now to stick to that. Note that the USAS taxonomy has two more categories, as A1.7, A1.8 are missing from semtags_subcategories.txt. Now that I have all the corpora, I could make an exhaustive list of all first tags (so, before the first comma), which made it much easier to make a mapping and check that everything is covered: I've implemented your suggestion for removing / substituting D with Z9 and for backing off for the other tags unknown in the taxonomy. This mostly involves removing a plus or minus, however, there are also some weird tags, in particular: A1.2.4-, A9.1+, G1.1.1, S.1.2.3-, S2F, S4T1.1.1, X7.2+, which I also fix to whatever is ok (A1.2, A9, etc.). As for encoding USAS tags in TEI:
Here is an example:
So, with this, we are set for a trial run on one of the corpora. Stay tuned! |
A few general MWE notes/remarks:
<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
<s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
<phr type="sem" function="Z1mf,Z3c" ana="sem:Z1">
<w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
<w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
</phr>
<pc pos="Z" msd="UPosTag=PUNCT" function="Z9" ana="sem:Z9" join="right">-</pc>
MWE conflicts: In general we have 3 types of interference and only one of them is a conflict:
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10
# source = Latvijas tauta realizē savu varu ar ievēlēto deputātu starpniecību, tāpēc es aicinu vienmēr atcerēties, ka ikviena deputāta darba devējs ir Latvijas tauta, un aicinu savus pienākumus pildīt godprātīgi, ar pašcieņu, pēc labākās apziņas.
# text = The people of Latvia exercise their authority through the elected Members, so I call always to remember that each Member's employer is the people of Latvia and call on them to carry out their duties in good faith, with self-esteem, with the best consciousness.
1 The the DET DT Definite=Def|PronType=Art 0 _ _ ForwardAlignment=1|BackwardAlignment=1|NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
2 people people NOUN NNS Number=Plur 1 _ _ ForwardAlignment=2|BackwardAlignment=2|NER=O|SpacyLemma=people|SpacyUPoS=NOUN|SpacyXPoS=NNS|SEMMWE=B|SEM=Z2,Z3c
3 of of ADP IN _ 2 _ _ NER=O|SpacyLemma=of|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=I|SEM=Z2,Z3c
4 Latvia Latvia PROPN NNP Number=Sing 3 _ _ ForwardAlignment=1|NER=B-LOC|SpacyLemma=Latvia|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=Z2,Z3c
5 exercise exercise VERB VBP Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin 4 _ _ ForwardAlignment=3|BackwardAlignment=3|NER=O|SpacyLemma=exercise|SpacyUPoS=VERB|SpacyXPoS=VBP|SEMMWE=B|SEM=A5.4+
6 their they PRON PRP$ Number=Plur|Person=3|Poss=Yes|PronType=Prs 5 _ _ ForwardAlignment=4|BackwardAlignment=4|NER=O|SpacyLemma=their|SpacyUPoS=PRON|SpacyXPoS=PRP$|SEMMWE=I|SEM=A5.4+
7 authority authority NOUN NN Number=Sing 6 _ _ ForwardAlignment=5|BackwardAlignment=5|NER=O|SpacyLemma=authority|SpacyUPoS=NOUN|SpacyXPoS=NN|SEMMWE=O|SEM=G1.1c,S7.1+,S7.4+,X2.2+
current encoding (there is also a bug -
<s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10" n="10" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t1" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">The</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t2" pos="NNS" msd="UPosTag=NOUN|Number=Plur" lemma="people" function="Z2,Z3c" ana="sem:Z2">people</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t3" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z2,Z3c" ana="sem:Z2">of</w>
<name type="LOC">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t4" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Latvia" function="Z2,Z3c" ana="sem:Z2">Latvia</w>
</name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t5" pos="VBP" msd="UPosTag=VERB|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin" lemma="exercise" function="A5.4+" ana="sem:A5.4p">exercise</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t6" pos="PRP$" msd="UPosTag=PRON|Number=Plur|Person=3|Poss=Yes|PronType=Prs" lemma="they" function="A5.4+" ana="sem:A5.4p">their</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t7" pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="authority" function="G1.1c,S7.1+,S7.4+,X2.2+" ana="sem:G1.1">authority</w>
it should be (simplified):
<s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10" n="10" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t1" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">The</w>
<phr><!-- NER inside SEMMWE -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t2" pos="NNS" msd="UPosTag=NOUN|Number=Plur" lemma="people" function="Z2,Z3c" ana="sem:Z2">people</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t3" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z2,Z3c" ana="sem:Z2">of</w>
<name type="LOC">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t4" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Latvia" function="Z2,Z3c" ana="sem:Z2">Latvia</w>
</name>
</phr><!-- NER inside SEMMWE -->
<phr><!-- bugfix -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t5" pos="VBP" msd="UPosTag=VERB|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin" lemma="exercise" function="A5.4+" ana="sem:A5.4p">exercise</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t6" pos="PRP$" msd="UPosTag=PRON|Number=Plur|Person=3|Poss=Yes|PronType=Prs" lemma="they" function="A5.4+" ana="sem:A5.4p">their</w>
</phr><!-- bugfix -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t7" pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="authority" function="G1.1c,S7.1+,S7.4+,X2.2+" ana="sem:G1.1">authority</w>
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43
# source = Tāpat arī joprojām uzturu spēkā un esmu pārliecināts, ka būs jāatgriežas pie iepriekšējam Saeimas sasaukumam iesniegtajām izmaiņām Satversmē, Ministru kabineta iekārtas likumā un Saeimas kārtības rullī, kuru mērķis ir padarīt stiprāku un atbildīgāku izpildvaru.
# text = I also continue to feed in force and I am sure that I will have to return to the changes submitted to the previous parliamentary term in the Constitution, the Cabinet of Ministers’ Equipment Law and the Saeima order roll, which aim to make the executive power stronger and more responsible.
### ...
31 the the DET DT Definite=Def|PronType=Art 30 _ _ ForwardAlignment=21|NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
32 Cabinet cabinet PROPN NNP Number=Sing 31 _ _ ForwardAlignment=22|BackwardAlignment=21|NER=B-ORG|SpacyLemma=cabinet|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=O|SEM=H5,G1.1
33 of of ADP IN _ 32 _ _ ForwardAlignment=21|NER=I-ORG|SpacyLemma=of|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=O|SEM=Z5
34 Ministers Minister PROPN NNPS Number=Plur 33 _ _ ForwardAlignment=22|BackwardAlignment=22|NER=I-ORG|SpaceAfter=No|SpacyLemma=Minister|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=O|SEM=G1.1/S2mf,S9/S2mf
35 ’ 's PART POS _ 34 _ _ NER=I-ORG|SpacyLemma='s|SpacyUPoS=PUNCT|SpacyXPoS=''|SEMMWE=O|SEM=A9+,A3+,A2.2,Z5
36 Equipment equipment PROPN NNP Number=Sing 35 _ _ ForwardAlignment=23|BackwardAlignment=23|NER=I-ORG|SpacyLemma=equipment|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=B|SEM=Z1mf,Z3c
37 Law Law PROPN NNP Number=Sing 36 _ _ ForwardAlignment=24|BackwardAlignment=24|NER=I-ORG|SpacyLemma=Law|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=Z1mf,Z3c
38 and and CCONJ CC _ 37 _ _ ForwardAlignment=25|BackwardAlignment=25|NER=O|SpacyLemma=and|SpacyUPoS=CCONJ|SpacyXPoS=CC|SEMMWE=O|SEM=Z5
Current encoding:
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t31" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">the</w>
<name type="ORG">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t32" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="cabinet" function="H5,G1.1" ana="sem:H5">Cabinet</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t33" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z5" ana="sem:Z5">of</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t34" pos="NNPS" msd="UPosTag=PROPN|Number=Plur" lemma="Minister" function="G1.1/S2mf,S9/S2mf" ana="sem:G1.1 sem:S2" join="right">Ministers</w>
<pc xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t35" pos="Z" msd="UPosTag=PUNCT" function="A9+,A3+,A2.2,Z5" ana="sem:A9p">’</pc>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t36" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="equipment" function="Z1mf,Z3c" ana="sem:Z1">Equipment</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t37" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Law" function="Z1mf,Z3c" ana="sem:Z1">Law</w>
</name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t38" pos="CC" msd="UPosTag=CCONJ" lemma="and" function="Z5" ana="sem:Z5">and</w>
it should be (simplified):
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t31" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">the</w>
<name type="ORG">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t32" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="cabinet" function="H5,G1.1" ana="sem:H5">Cabinet</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t33" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z5" ana="sem:Z5">of</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t34" pos="NNPS" msd="UPosTag=PROPN|Number=Plur" lemma="Minister" function="G1.1/S2mf,S9/S2mf" ana="sem:G1.1 sem:S2" join="right">Ministers</w>
<pc xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t35" pos="Z" msd="UPosTag=PUNCT" function="A9+,A3+,A2.2,Z5" ana="sem:A9p">’</pc>
<phr><!-- SEMMWE inside NER-->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t36" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="equipment" function="Z1mf,Z3c" ana="sem:Z1">Equipment</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t37" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Law" function="Z1mf,Z3c" ana="sem:Z1">Law</w>
</phr><!-- SEMMWE inside NER-->
</name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t38" pos="CC" msd="UPosTag=CCONJ" lemma="and" function="Z5" ana="sem:Z5">and</w>
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U2-P1.24
# source = Šai Saeimai arī pēc prezidentūras būs ievērojami jāpastiprina darbs ar Eiropas Savienības institūcijām, it sevišķi laicīgi izvērtējot jauno Eiropas Savienības iniciatīvu ietekmi uz Latviju, un arī iepriekš radīto problēmjautājumu, piemēram, obligātā iepirkuma komponentes ietekmes, risinājumu meklēšana.
# text = The Saeima will also need to step up its work with the European Union institutions after the Presidency, in particular by assessing in a timely manner the impact of the new European Union initiatives on Latvia, and also by looking for solutions to the problems that have arisen in the past, such as the impact of the mandatory procurement component.
### .....
31 the the DET DT Definite=Def|PronType=Art 30 _ _ NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
32 new new ADJ JJ Degree=Pos 31 _ _ ForwardAlignment=19|BackwardAlignment=19|NER=O|SpacyLemma=new|SpacyUPoS=ADJ|SpacyXPoS=JJ|SEMMWE=B|SEM=T1.3
33 European European ADJ NNP Degree=Pos 32 _ _ ForwardAlignment=20|BackwardAlignment=20|NER=B-ORG|SpacyLemma=European|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=T1.3
34 Union Union PROPN NNP Number=Sing 33 _ _ ForwardAlignment=21|BackwardAlignment=21|NER=I-ORG|SpacyLemma=Union|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=B|SEM=T1.3
35 initiatives initiative NOUN NNS Number=Plur 34 _ _ ForwardAlignment=22|BackwardAlignment=22|NER=O|SpacyLemma=initiative|SpacyUPoS=NOUN|SpacyXPoS=NNS|SEMMWE=I|SEM=T1.3
36 on on ADP IN _ 35 _ _ ForwardAlignment=23|BackwardAlignment=24|NER=O|SpacyLemma=on|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=O|SEM=Z5
Here I am not sure about the solution, considering my suggestion above:
|
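To make the kinds of interference between SEMMWE phrases and NER names easier to discuss, here is a small Python sketch that turns the B/I/O labels from the CoNLL-U excerpts above into spans and classifies how a phrase span relates to a name span. The function names and the relation labels are my own; I am assuming that the third kind of interference (the real conflict) is a crossing overlap, which is what the last example with "new European Union initiatives" looks like.

```python
def bio_spans(labels):
    """Convert B/I/O labels (e.g. 'B', 'I-ORG', 'O') into (start, end) spans,
    with end exclusive. Only the first character of each label is used."""
    spans = []
    start = None
    for i, label in enumerate(labels):
        kind = label[:1]
        if kind == "B":
            if start is not None:      # a new B closes the previous span
                spans.append((start, i))
            start = i
        elif kind == "O":
            if start is not None:
                spans.append((start, i))
                start = None
        # 'I' continues an open span; a stray 'I' without an open span is ignored
    if start is not None:
        spans.append((start, len(labels)))
    return spans


def relation(phr, name):
    """Classify how a phrase span relates to a name span."""
    p0, p1 = phr
    n0, n1 = name
    if p1 <= n0 or n1 <= p0:
        return "disjoint"
    if p0 <= n0 and n1 <= p1:
        return "name inside phr"       # also covers identical spans
    if n0 <= p0 and p1 <= n1:
        return "phr inside name"
    return "crossing"                  # cannot be nested in XML


if __name__ == "__main__":
    # Tokens 31-36 of the last example: the new European Union initiatives on
    semmwe = ["O", "B", "I", "B", "I", "O"]
    ner = ["O", "O", "B-ORG", "I-ORG", "O", "O"]
    for phr in bio_spans(semmwe):
        for name in bio_spans(ner):
            print(phr, name, relation(phr, name))
```

The first two examples above come out as "name inside phr" and "phr inside name" respectively, both of which can be nested in TEI, while the last one yields two crossing spans that cannot.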
@matyaskopp, thanks for your input. Below my comments:
Not sure why we would do this, because:
So, I wouldn't do this now. If you feel this is sensible (presumably both for name and phr) pls. open a Future issue and explain why this would be necessary.
I was thinking about this but then decided to keep the token-level function and ana attributes even inside phrases, mostly because the main platforms we have for using the corpora are noSketch(-like) concordancers, and there structural attributes are rather useless and esp. cannot be used in tandem with positional ones. So, e.g. if you do a frequency lexicon on semantic categories (i.e. on positional attributes) you would not get any results for those tokens inside phrases. This to me is wrong, as the results would never show semantic categories of any words inside phrases. As for the "conflicts" of phr with name: yes, I am aware that I also miss out on phr/name and name/phr, which is why I was careful to write "I do not encode phr if it interferes with name". One reason I avoid this is that the schema gets messy, but the main one was that the program to convert CoNLL-U to TEI (i.e. conllu2tei.pl) was written to convert names, but not phrases, and the additions to deal with phrases are a hack, cf. To do the name/phr or phr/name nesting the script would need to be completely re-done (with look-ahead to decide what is the correct option, i.e. to either remove a phrase or to keep it, and, if so, what the nesting is) and I don't have the time to do it now (esp. as the processing of the -en corpus will take a week or so). This unfortunately also goes for phrases which are next to names, a problem I wasn't aware of. |
I don't think it is necessary. But initially, we thought that it was not necessary to have
At this point, I disagree - the TEI format should be the most brilliant format because other file formats (conllu, vert,...) are derived from it. So the design of this format shouldn't be noSketch/TEITOK-driven because this format needs to be semantically correct. The derived formats can attach the information to tokens: e.g. both the TEITOK and noSketch formats have UD syntax in tokens.
Side note (we probably do not have time to implement it now, but the idea can be raised in future):
<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
<s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
<w xml:id="tok01" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
<w xml:id="tok02" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
<pc xml:id="tok03" pos="Z" msd="UPosTag=PUNCT" join="right">-</pc>
<!-- ... -->
<spanGrp type="sem">
<span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
<span target="#tok03" type="Z9" ana="sem:Z9"/>
</spanGrp>
</s>
</seg> but the |
The points here have been mostly solved, what remains should be taken up in #827. |
The current version of Data/Taxonomies/usas-taxonomy.xml uses the attribute @id instead of @xml:id, and the values of these IDs are not valid as they contain the ID-illegal characters ':' and '+'.
I suggest that in the ID values the colon is simply removed, while plus goes to P and minus (i.e. hyphen) goes to M. @matthewcoole, would you agree? Any other mapping is of course also ok, just as long as we have a valid NCName for the ID values.
It would also be a good idea to give the whole taxonomy a description in the desc element, but this holds for all taxonomies and we need another issue for that.
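As a rough illustration of the mapping suggested here, and of the duplicate-ID concern raised in the comments, the following Python sketch applies the plus-to-P, minus-to-M rule and checks for collisions. Everything in it is an assumption for illustration: the function names are hypothetical, the raw IDs are taken to be colon-separated (e.g. `A1:1:1`), and the NCName check covers only the ASCII subset of the real definition.

```python
import re
from collections import defaultdict

# Simplified, ASCII-only approximation of an XML NCName.
_NCNAME_ASCII = re.compile(r"^[A-Za-z_][A-Za-z0-9._-]*$")


def to_ncname(raw_id, drop_separator=True):
    """Map a raw taxonomy ID to an NCName candidate: '+' -> 'P', '-' -> 'M',
    and ':' either removed (as suggested) or replaced by '.'."""
    mapped = raw_id.replace("+", "P").replace("-", "M")
    return mapped.replace(":", "" if drop_separator else ".")


def is_ncname(value):
    return _NCNAME_ASCII.match(value) is not None


def find_collisions(raw_ids, drop_separator=True):
    """Return mapped IDs that more than one raw ID ends up sharing."""
    by_mapped = defaultdict(list)
    for raw in raw_ids:
        by_mapped[to_ncname(raw, drop_separator)].append(raw)
    return {m: raws for m, raws in by_mapped.items() if len(raws) > 1}


if __name__ == "__main__":
    raw_ids = ["A1:1:1", "A11:1", "A5:1+", "A5:1-"]   # hypothetical raw IDs
    print([to_ncname(r) for r in raw_ids])
    print(find_collisions(raw_ids))                    # separator removed
    print(find_collisions(raw_ids, drop_separator=False))
```

Under these assumptions, removing the separator makes `A1:1:1` and `A11:1` collapse to the same ID, while replacing it with '.' (which is legal in an NCName) keeps them distinct.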