USAS taxonomy #202

Closed
TomazErjavec opened this issue Mar 30, 2022 · 36 comments
Assignees
Labels
MT machine translation related issues Taxonomy

Comments

@TomazErjavec
Collaborator

The current version of Data/Taxonomies/usas-taxonomy.xml uses the attribute @id instead of @xml:id and the values of these IDs are not valid as they contain ID-illegal characters : and +.

I suggest that in the ID values the colon is simply removed, while plus goes to P and minus (i.e. hyphen) goes to M. @matthewcoole, would you agree? Any other mapping is of course also ok, just as long as we have a valid NCName for the ID values.

It would also be a good idea to give the whole taxonomy a description in the desc element, but this holds for all taxonomies and we need a separate issue for that.

@TomazErjavec TomazErjavec added the bug Something isn't working label Mar 30, 2022
@matthewcoole
Collaborator

I'm not sure we can remove the subcategory separators entirely as this would lead to duplicate IDs between some categories e.g. A11:1 Importance & A1:1:1 General actions / making would both become A111. Could we use periods ., as in the USAS tag descriptions, or perhaps / slashes?

Regarding changing + & -, using letters would probably be fine, but I would suggest p & n for positive and negative. Perhaps @perayson might have a different suggestion.
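A minimal sketch of the mapping being discussed, assuming the colon becomes a period and + / - become p / n as suggested above. The function name `usas_label_to_id` and the simplified ASCII-only NCName check are illustrative assumptions, not the XSLT used in the repository.

```python
import re

# Simplified, ASCII-only approximation of an XML NCName (assumption):
# a letter or underscore, followed by letters, digits, '.', '-' or '_'.
_NCNAME_ASCII = re.compile(r"[A-Za-z_][A-Za-z0-9._-]*")


def usas_label_to_id(label: str) -> str:
    """Map a USAS label such as 'A1:1:1' or 'F2+' to an ID-safe string."""
    mapped = label.replace(":", ".").replace("+", "p").replace("-", "n")
    if not _NCNAME_ASCII.fullmatch(mapped):
        raise ValueError(f"cannot build a valid ID from {label!r}")
    return mapped


# Keeping a separator avoids the clash described above:
# 'A11:1' and 'A1:1:1' stay distinct instead of both collapsing to 'A111'.
print(usas_label_to_id("A11:1"), usas_label_to_id("A1:1:1"))
```

Keeping the level separator is what prevents the duplicate IDs; the choice of p / n versus P / M only affects readability.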

TomazErjavec added a commit that referenced this issue Mar 30, 2022

@TomazErjavec
Collaborator Author

Good suggestions, thanks. I made a simple XSLT that implements these modifications in e7bbb29. As the original taxonomy is still there, we can still change the format of the IDs, if @perayson has better suggestions.

Note that the corpus annotations will need to use (references to) IDs, rather than the labels.

@perayson
Collaborator

The tagset uses '.' as a separator (rather than colon) to distinguish levels in the hierarchy. The core tagset of 232 tags is defined here: https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf with descriptions of the plus/minus subcategories here: https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt (which is what @matthewcoole has converted I believe). Normally, you can have up to three '+' or '-' to indicate antonyms, comparatives and superlatives, see here for more details: https://ucrel.lancs.ac.uk/usas/usas_guide.pdf. There will be a problem if this needs to be used for validation since the tagger can in theory combine two or more of these tags together with a '/' separator to indicate that a coarse grained sense fits into two or more parts of the taxonomy.

@TomazErjavec
Collaborator Author

Thanks @perayson for the links, very useful.
I don't see a problem with combining several tags, as the @ana attribute that will be used to hold the IDREF to the USAS taxonomy categories can contain several values, so e.g. A/B just needs to be converted to usas:A usas:B.
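As a rough illustration of the conversion just described, the sketch below splits a slash tag and emits one prefixed pointer per part. The helper name `slash_tag_to_ana` is hypothetical, and the parts are assumed to be already valid taxonomy IDs.

```python
def slash_tag_to_ana(tag: str, prefix: str = "usas") -> str:
    """Turn a combined tag like 'A/B' into an @ana value like 'usas:A usas:B'."""
    parts = [part for part in tag.split("/") if part]
    return " ".join(f"{prefix}:{part}" for part in parts)


print(slash_tag_to_ana("A/B"))
```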

There could be a problem with several pluses or minuses, if we don't know in advance which categories can have them and how many. However, we could do a bottom-up approach, by first tagging, and then inserting all found tags in the taxonomy.

But on p1 of https://ucrel.lancs.ac.uk/usas/usas_guide.pdf I also see:
OTHER SYMBOLS UTILISED
% = rarity marker (1)
@ = rarity marker (2)
f = female
m = male
c = potential antecedents of conceptual anaphors (neutral for number)
n = neuter
i = indicates a semantic idiom

These could be a problem, if the tool actually outputs such symbols, as then the idea that each USAS tag can be represented as an ID starts to become increasingly suspect, and we might have to think about some sort of decomposition of the tags.

@matthewcoole
Collaborator

I figured the additional symbols could just be added to the taxonomy. The initial taxonomy was generated purely from the semantic categories, but we could add to it. Then the additional symbols used can just be added in the @ana similar to when there are multiple tags e.g. usas:A usas:% usas:f, perhaps some of the symbols will have to be replaced with alternatives so they are valid xml:ids.

@TomazErjavec
Collaborator Author

This would work for gender, but e.g. "%" is a modifier of a particular semantic label and if you have two (say A%/B), then you would get usas:A usas:% usas:B and if the implicit logic of such a sequence is AND, you have a problem.

How many times this happens in practice is another question - if never, then there is no problem, and this is a great solution!

@matthewcoole
Collaborator

But if we use <ptr> as suggested, we could separate out each alternative tag into a separate <ptr> with the same target and then include the modifiers that are applicable for it. e.g.

<w id="word1">

...

<ptr ana="usas:A usas:@" target="#word1"/>
<ptr ana="usas:X usas:%" target="#word1"/>

@TomazErjavec
Collaborator Author

I find it unusual to have the same token targeted twice, but, indeed, why not? So this would solve all the problems. Polarity would presumably also be a "modifier" like "rare".

@matyaskopp, do you agree?

@matyaskopp
Collaborator

I don't see a straightforward solution. An example from the documentation (https://ucrel.lancs.ac.uk/usas/usas_guide.pdf):

bunker = G3/H1 K5.1/W3

A word bunker can be labelled with two tags:

  1. army building
  2. sandy area in golf (I don't play golf, so I am not sure if the description is correct)

both these two tags can be split into two pointers:

<ptr ana="usas:G3 usas:H1" target="#word1"/>
<ptr ana="usas:K5.1 usas:W3" target="#word1"/>

But this implementation will fail when you want to add other symbols: % @ f m ...

Admiral = G3/M4/S2mf S7.1+/S2mf

modifiers m and f are related to S2 subtag but + is related to S7.1.

So I think that the whole tag should be represented with one id - it represents one semantic meaning of a word.
The problem is that the tagset is very large with all modifiers.

@matthewcoole how large is the semantic tagset?

  • is there any limitation of number of slashes in tag?
  • how many + and - have semantic scale? 4. (optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale
  • Do all /subtags/ allow all modifiers?

@matyaskopp
Collaborator

A word bunker can be labelled with two tags:

  1. army building
  2. sandy area in golf (I don't play golf, so I am not sure if the description is correct)

Or does the USAS tagger assign only one tag based on context?

@matthewcoole
Collaborator

Ok, my only thought if we really want to go down this line is that the <ptr> target could be another pointer to keep the modifiers separate and create some kind of chain, this might mean introducing a decomposition annotation (this all seems really, really messy I know).

So in the above example Admiral = G3/M4/S2mf S7.1+/S2mf would become:

<w xml:id="t1">Admiral</w>

...

<ptr ana="usas:decomposed" target="#t1" xml:id="dtag1"/>
<ptr ana="usas:G3" target="#dtag1"/>
<ptr ana="usas:M4" target="#dtag1"/>
<ptr ana="usas:S2" target="#dtag1" xml:id="usastag1"/>
<ptr ana="usas:m" target="#usastag1"/>
<ptr ana="usas:f" target="#usastag1"/>

<ptr ana="usas:decomposed" target="#t1" xml:id="dtag2"/>
<ptr ana="usas:S7.1" target="#dtag2" xml:id="usastag2"/>
<ptr ana="+" target="#usastag2"/>
<ptr ana="usas:S2" target="#dtag2" xml:id="usastag3"/>
<ptr ana="usas:m" target="#usastag3"/>
<ptr ana="usas:f" target="#usastag3"/>

I think in other projects using USAS we have simply dropped subsequent tags and only taken the first and most likely tag, i.e. Admiral = G3/M4/S2mf S7.1+/S2mf would just be Admiral = G3/M4/S2mf which might eliminate the need for an extra <ptr> containing the usas:decomposed.

@matyaskopp
Collaborator

Ok, my only thought if we really want to go down this line is that the <ptr> target could be another pointer to keep the modifiers separate and create some kind of chain, this might mean introducing a decomposition annotation (this all seems really, really messy I know).

You are now very close to the fvLib/fs and fLib/f elements' role in TEI (https://www.tei-c.org/release/doc/tei-p5-doc/en/html/FS.html#FSBI), which is probably a cleaner way to create chain features.

@matthewcoole
Collaborator

So would this mean adding an <fs> inside a word and then have a similar chain of <fs> and <f> to represent the tags?

<w>Admiral
  <fs name="semtag">
    <f ana="usas:G3"/>
    <f ana="usas:M4"/>
    <f ana="usas:S2">
      <fs>
        <f ana="usas:m"/>
        <f ana="usas:f"/>
      </fs>
    </f>
  ...
</w>

@matyaskopp
Collaborator

So would this mean adding an <fs> inside a word and then have a similar chain of <fs> and <f> to represent the tags?

no I think it can be implemented like "mte:" prefix in ParlaMint-SI:

<w ana="mte:Appfpn"
lemma="spoštovan"
msd="UPosTag=ADJ|Case=Nom|Degree=Pos|Gender=Fem|Number=Plur|VerbForm=Part"
xml:id="ParlaMint-SI_2014-08-25-SDZ7-Izredna-01.seg1.1.1">Spoštovane</w>

and prefix definition, that loads external resource:

<prefixDef ident="mte"
matchPattern="(.+)"
replacementPattern="http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1">
<p xml:lang="en">Private URIs with this prefix point to feature-structure elements defining the Slovenian MULTEXT-East Version 6 MSDs.</p>
</prefixDef>

So USAS will not be defined with taxonomy but with the big list of feature structures, that will be stored in an external file.
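To make the prefix mechanism concrete, here is a small sketch of how a private URI such as `mte:Appfpn` could be expanded using the `matchPattern` and `replacementPattern` values shown above. The function `expand_private_uri` is a hypothetical helper, and it assumes the patterns behave like ordinary regular expressions with `$1`-style group references.

```python
import re
from typing import Optional


def expand_private_uri(value: str, ident: str, match_pattern: str,
                       replacement_pattern: str) -> Optional[str]:
    """Expand 'ident:local' into a full URI, or return None if it does not apply."""
    prefix = ident + ":"
    if not value.startswith(prefix):
        return None
    match = re.fullmatch(match_pattern, value[len(prefix):])
    if match is None:
        return None
    # Substitute $1, $2, ... with the corresponding captured groups.
    return re.sub(r"\$(\d)", lambda ref: match.group(int(ref.group(1))),
                  replacement_pattern)


print(expand_private_uri(
    "mte:Appfpn", "mte", "(.+)",
    "http://nl.ijs.si/ME/V6/msd/tables/msd-fslib-sl.xml#$1"))
```

A USAS prefix defined the same way would resolve each atomic tag to an entry in an external feature-structure file; where that file would live is still open at this point in the thread.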

@TomazErjavec
Collaborator Author

If I understand correctly, @matyaskopp proposes to use atomic USAS tags, but that they are pointers to an externally defined feature structure library:

  • I like this idea, as it will keep the corpus encoding compact
  • the large number of atomic tags is, I think, not a problem
  • the question is where to externally store this library - the logical place would be UCREL or, alternatively, directly on GitHub or even in the CLARIN.SI repository

@perayson
Collaborator

But on p1 of https://ucrel.lancs.ac.uk/usas/usas_guide.pdf I also see: OTHER SYMBOLS UTILISED % = rarity marker (1) @ = rarity marker (2) f = female m = male c = potential antecedents of conceptual anaphors (neutral for number) n = neuter i = indicates a semantic idiom

The rarity markers are used internally as part of one disambiguation method. And the lower case letters are somewhat of a legacy from part of the analysis pipeline where we were trying to link anaphora. In Wmatrix, all these lower case letters and rarity markers are removed when it calculates frequencies, so I believe that they can be ignored here as well, so @matthewcoole could we remove them from the tagged output as part of your script?

@perayson
Collaborator

@matthewcoole how large is the semantic tagset?

There are 232 main tags before any subcategories with positive/negative.

  • is there any limitation of number of slashes in tag?

In theory no limit, but normally two, potentially it could be four if two slash tags are joined together by one of the MWE rules.

  • how many + and - have semantic scale? 4. (optionally) one or more ‘pluses’ or ‘minuses’ to indicate a positive or negative position on a semantic scale

Up to three + and up to three -

  • Do all /subtags/ allow all modifiers?

Yes, in theory, but in practice not.

@perayson
Collaborator

A word bunker can be labelled with two tags:

  1. army building
  2. sandy area in golf (I don't play golf, so I am not sure if the description is correct)

Or does the USAS tagger assign only one tag based on context?

What you're seeing in the vertical output is the list of all possible semantic tags, including those where you have a MWE tag preceding the single word semantic tags. If you select horizontal output on http://ucrel-api.lancaster.ac.uk/usas/tagger.html then it just picks the first one in the list which is the most likely. We could do the same for ParlaMint corpora. Of course, it is a precision versus recall trade off, and different users might prefer different things. In the MELC project, http://wp.lancs.ac.uk/melc/ we found that less likely tags lower down the list were good candidates for metaphor source and target domains, so Wmatrix retains them, but for frequency lists, only the first choice tag is used.

@perayson
Collaborator

If I understand correctly, @matyaskopp proposes to use atomic USAS tags, but that they are pointers to an externally defined feature structure library:

  • I like this idea, as it will keep the corpus encoding compact
  • the large number of atomic tags is, I think, not a problem
  • the question is where to externally store this library - the logical place would be UCREL or, alternatively, directly on GitHub or even in the CLARIN.SI repository

Not sure I understand yet exactly what is required, but here https://github.com/UCREL/Multilingual-USAS might be a good place to store this or a new repo in the UCREL space would be fine. We're thinking that we'll need some tag verification potentially as part of https://pypi.org/project/pymusas/ as well but users should be allowed to extend the taxonomy for their own languages or domains if required.

@matyaskopp
Collaborator

What you're seeing in the vertical output is the list of all possible semantic tags, including those where you have a MWE tag preceding the single word semantic tags. If you select horizontal output on http://ucrel-api.lancaster.ac.uk/usas/tagger.html then it just picks the first one in the list which is the most likely. We could do the same for ParlaMint corpora. Of course, it is a precision versus recall trade off, and different users might prefer different things. In the MELC project, http://wp.lancs.ac.uk/melc/ we found that less likely tags lower down the list were good candidates for metaphor source and target domains, so Wmatrix retains them, but for frequency lists, only the first choice tag is used.

It is probably better to use the first tag if the tags are sorted because we cannot sort these tags in TEI simply. These two encodings of the word admiral are in TEI equal:

<w ana="usas:G3/M4/S2mf usas:S7.1+/S2mf">admiral</w>
<w ana="usas:S7.1+/S2mf usas:G3/M4/S2mf">admiral</w>

@matthewcoole
Collaborator

I think if we take only the first and most likely semantic tag (or combination), as suggested by @perayson in #204, and we also drop the additional markers and lowercase symbols used, we can code things up as suggested. e.g.

Admiral = G3/M4/S2mf S7.1+/S2mf

would simply become

<w ana="usas:G3 usas:M4 usas:S2">Admiral</w>

Then expansion of the sub-categories would only be necessary for up to 3 + (pluses) or - (minuses). This would mean the taxonomy wouldn't need to get orders of magnitude bigger (as it would if expanding all sub-categories with variations for symbols like mf%@cni) and the encoding in the @ana attribute would be reasonably small in size and quite clean.
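The simplification proposed here can be sketched as follows. This is only an illustration: the function name, the regular expression for a tag, and the assumption that candidates are separated by whitespace or commas are mine, while the p / n mapping for pluses and minuses follows the earlier comments.

```python
import re

# Base category (e.g. S7.1), up to three '+' or '-', then markers that are dropped.
_TAG = re.compile(r"([A-Z]\d+(?:\.\d+)*)([+-]{0,3})[%@fmcni]*")


def first_choice_to_ana(tagger_output: str, prefix: str = "usas") -> str:
    """Keep the first candidate, drop rarity/lowercase markers, build an @ana value."""
    candidates = re.split(r"[,\s]+", tagger_output.strip())
    pointers = []
    for part in candidates[0].split("/"):
        match = _TAG.fullmatch(part)
        if match is None:
            raise ValueError(f"unrecognised USAS tag: {part!r}")
        base, polarity = match.groups()
        suffix = polarity.replace("+", "p").replace("-", "n")
        pointers.append(f"{prefix}:{base}{suffix}")
    return " ".join(pointers)


print(first_choice_to_ana("G3/M4/S2mf S7.1+/S2mf"))
```

Tags that do not fit the pattern raise an error rather than being silently rewritten, so unexpected tagger output is noticed.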

@TomazErjavec
Collaborator Author

Sorry for falling asleep on this issue. But now we need to decide. I like the suggestion above very much - @perayson, do you agree that we do it like this?

@TomazErjavec TomazErjavec added this to the ParlaMint 3.1 release milestone Sep 24, 2023
@TomazErjavec TomazErjavec added MT machine translation related issues and removed bug Something isn't working labels Sep 24, 2023
@TomazErjavec
Collaborator Author

On a related note: the current taxonomy has one plus or minus for a category, as is covered by semtags_subcategories.txt.

However, tags can have up to 3 pluses or minuses. The USAS Guide states:

Antonymity of conceptual classifications is indicated by +/- markers on tags
Comparatives and superlatives receive double and triple +/- markers respectively.

So, would it be ok to add "comparative" and "superlative" to the taxonomy for such cases?
For example, we now have:

<category xml:id="F2">
<catDesc>
<term>F2</term>: Drinks and alcohol</catDesc>
<category xml:id="F2p">
<catDesc>
<term>F2+</term>: Excessive drinking</catDesc>
</category>
<category xml:id="F2n">
<catDesc>
<term>F2-</term>: Not drinking</catDesc>
</category>
</category>

This would be expanded to:

<category xml:id="F2">
   <catDesc><term>F2</term>: Drinks and alcohol</catDesc>
   <category xml:id="F2p">
      <catDesc><term>F2+</term>: Excessive drinking</catDesc>
        <category xml:id="F2pp">
          <catDesc><term>F2++</term>: Excessive drinking, comparative</catDesc>
       </category>
        <category xml:id="F2ppp">
          <catDesc><term>F2+++</term>: Excessive drinking, superlative</catDesc>
       </category>
   </category>
   <category xml:id="F2n">
      <catDesc><term>F2-</term>: Not drinking</catDesc>
        <category xml:id="F2nn">
          <catDesc><term>F2--</term>: Not drinking, comparative</catDesc>
       </category>
        <category xml:id="F2nnn">
          <catDesc><term>F2---</term>: Not drinking, superlative</catDesc>
       </category>
   </category>
</category>

Would this work, or do I misunderstand?
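For illustration, the expansion above could be generated automatically along these lines. This is a sketch using Python's standard ElementTree; the function name `add_degree_categories` is hypothetical and the input is just the F2 fragment from this comment, not the full taxonomy file.

```python
import xml.etree.ElementTree as ET

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

SAMPLE = """<category xml:id="F2">
  <catDesc><term>F2</term>: Drinks and alcohol</catDesc>
  <category xml:id="F2p"><catDesc><term>F2+</term>: Excessive drinking</catDesc></category>
  <category xml:id="F2n"><catDesc><term>F2-</term>: Not drinking</catDesc></category>
</category>"""


def add_degree_categories(root: ET.Element) -> None:
    """Add comparative and superlative children to single-plus/minus categories."""
    for cat in list(root.iter("category")):  # snapshot: the tree is modified below
        term = cat.find("catDesc/term")
        if term is None or not term.text:
            continue
        sign = term.text[-1]
        if sign not in "+-" or term.text.endswith(sign * 2):
            continue  # not a single-sign category
        letter = "p" if sign == "+" else "n"
        gloss = (term.tail or "").strip()
        for extra, degree in ((1, "comparative"), (2, "superlative")):
            child = ET.SubElement(cat, "category",
                                  {XML_ID: cat.get(XML_ID, "") + letter * extra})
            desc = ET.SubElement(child, "catDesc")
            new_term = ET.SubElement(desc, "term")
            new_term.text = term.text + sign * extra
            new_term.tail = f"{gloss}, {degree}"


root = ET.fromstring(SAMPLE)
add_degree_categories(root)
print([cat.get(XML_ID) for cat in root.iter("category")])
```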

@matyaskopp
Collaborator

This would be expanded to:

<category xml:id="F2">
   <catDesc><term>F2</term>: Drinks and alcohol</catDesc>
   <category xml:id="F2p">
      <catDesc><term>F2+</term>: Excessive drinking</catDesc>
        <category xml:id="F2pp">
          <catDesc><term>F2++</term>: Excessive drinking, comparative</catDesc>
       </category>
        <category xml:id="F2ppp">
          <catDesc><term>F2+++</term>: Excessive drinking, superlative</catDesc>
       </category>
   </category>
   <category xml:id="F2n">
      <catDesc><term>F2-</term>: Not drinking</catDesc>
        <category xml:id="F2nn">
          <catDesc><term>F2--</term>: Not drinking, comparative</catDesc>
       </category>
        <category xml:id="F2nnn">
          <catDesc><term>F2---</term>: Not drinking, superlative</catDesc>
       </category>
   </category>
</category>

Would this work, or do I misunderstand?

As I understand it, this is fine.

But I am not sure if we have made a final decision for other modifiers: mf%@cni, as @matthewcoole suggested:

I think if we take only the first and most likely semantic tag (or combination), as suggested by @perayson in #204, and we also drop the additional markers and lowercase symbols used, we can code things up as suggested. e.g.

Admiral = G3/M4/S2mf S7.1+/S2mf

would simply become

<w ana="usas:G3 usas:M4 usas:S2">Admiral</w>

Then expansion of the sub-categories would only be necessary for up to 3 + (pluses) or - (minuses). This would mean the taxonomy wouldn't need to get orders of magnitude bigger (as it would if expanding all sub-categories with variations for symbols like mf%@cni) and the encoding in the @ana attribute would be reasonably small in size and quite clean.

@perayson
Collaborator

Yes, I'm fine with removing mfcni modifiers and %@ rarity markers as mentioned above.

In terms of the up to three plusses or minuses, not all semantic tags will use all these combinations. Do you have to include all these options in the XML taxonomy?

@TomazErjavec
Collaborator Author

Yes, I'm fine with removing mfcni modifiers and %@ rarity markers as mentioned above.

Great, glad to hear it! In the meantime I made a draft implementation of the conversion to XML, which produces structures such as

<phr type="sem" function="Df/A5.1+++mfnc" ana="sem:D sem:A5.1ppp">
<w pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="agenda" function="Df/A5.1+++mfnc" ana="sem:D sem:A5.1ppp">agenda</w>
<w pos="IN" msd="UPosTag=ADP" lemma="of" function="Df/A5.1+++mfnc" ana="sem:D sem:A5.1ppp">of</w>
<w pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Df/A5.1+++mfnc" ana="sem:D sem:A5.1ppp">the</w>
<w pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="meeting" function="Df/A5.1+++mfnc" ana="sem:D sem:A5.1ppp">meeting</w>
</phr>

So, in a way, we can have our cake and eat it too.

In terms of the up to three plusses or minuses, not all semantic tags will use all these combinations. Do you have to include all these options in the XML taxonomy?

Probably not, I guess we could filter according to which tags are actually used in the corpora.

@TomazErjavec
Collaborator Author

@perayson, I've been working with inserting missing "comparative" or "superlative" (so ++, +++, --, ---) categories into the USAS taxonomy. But I ran into a problem. You wrote:

descriptions of the plus/minus subcategories here: https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt (which is what @matthewcoole has converted I believe). Normally, you can have up to three '+' or '-' to indicate antonyms, comparatives and superlatives, see here for more details: https://ucrel.lancs.ac.uk/usas/usas_guide.pdf.

However, there are tags such as "I1-", "N1+", which are not in the semtags_subcategories.txt and hence not in the taxonomy, so I don't know what they mean. My impression was that only two or three plusses/minuses need to be inserted, as there the interpretation is straightforward, but I have no idea how to gloss the antonym of "Money, generally" in an automatic way (or whether that would actually be helpful).
Is there any other reference giving glosses also for missing categories with just one plus or minus?

That said (but I haven't done an extensive test), these pesky tags do not seem to appear in the first position, i.e. as the most likely tag, so the taxonomy might be able to do without them. Still, it seems nice for it to encompass all existing tags.

Some examples of 'I1-' with only relevant columns retained:

tax     SEMMWE=B|SEM=G1.1/I1-
free    SEMMWE=I|SEM=G1.1/I1-
volunteer       SEMMWE=B|SEM=I3.1/I1-
work    work    SEMMWE=I|SEM=I3.1/I1-
voluntary       SEMMWE=B|SEM=S8+/I1-
services        SEMMWE=I|SEM=S8+/I1-
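A small sketch of how rows like these could be grouped into multi-word expressions. It assumes SEMMWE uses a B (begin) / I (inside) / O (outside) scheme, as the examples in this thread suggest; the function names are hypothetical.

```python
def parse_misc(misc: str) -> dict:
    """Split a 'KEY=VALUE|KEY=VALUE' string into a dictionary."""
    return dict(item.split("=", 1) for item in misc.split("|") if "=" in item)


def group_mwes(rows):
    """Group (form, misc) rows into (forms, sem_tag) spans using SEMMWE flags."""
    spans = []
    current = None
    for form, misc in rows:
        feats = parse_misc(misc)
        flag = feats.get("SEMMWE", "O")
        if flag == "B":
            current = ([form], feats.get("SEM"))
            spans.append(current)
        elif flag == "I" and current is not None:
            current[0].append(form)
        else:
            current = None  # token outside any MWE closes the open span
    return spans


rows = [
    ("tax", "SEMMWE=B|SEM=G1.1/I1-"),
    ("free", "SEMMWE=I|SEM=G1.1/I1-"),
    ("voluntary", "SEMMWE=B|SEM=S8+/I1-"),
    ("services", "SEMMWE=I|SEM=S8+/I1-"),
]
print(group_mwes(rows))
```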

@TomazErjavec
Collaborator Author

@perayson, another problem: quite a few annotations have a "D" code, e.g.

$ grep -c 'SEM=D' ParlaMint-LV-en.conllu/*/*.conllu  | head
ParlaMint-LV-en.conllu/2014/ParlaMint-LV-en_2014-11-04-PT12-264.conllu:336
ParlaMint-LV-en.conllu/2014/ParlaMint-LV-en_2014-11-05-PT12-265.conllu:96
ParlaMint-LV-en.conllu/2014/ParlaMint-LV-en_2014-11-05-PT12-266.conllu:422
ParlaMint-LV-en.conllu/2014/ParlaMint-LV-en_2014-11-06-PT12-267.conllu:12

However, looking at https://ucrel.lancs.ac.uk/usas/ there is no D top level code. What shall we do with this?
Pls. note that resolving these questions is now becoming somewhat urgent.

@TomazErjavec
Collaborator Author

@perayson, I can't fully process the already received -en corpora without having the taxonomy finalised, as it is also a part of the distribution and e.g. the vertical files for the concordancers need it in place.
The outstanding questions are:

  • what to do with the "D" category? I am provisionally changing it to Z9
  • what to do with categories having just one plus or one minus missing from the current taxonomy (and your list of tags). If it helps, I can prepare the list based on the currently available corpora

@perayson
Collaborator

Just getting back to this after the CLARIN conference and LREC-COLING deadline ... the tags can have up to three plusses or minuses. As explained in https://ucrel.lancs.ac.uk/usas/usas_guide.pdf "Antonymity of conceptual classifications is indicated by +/- markers on tags" and https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt is not an exhaustive list, so for a gloss for anything missing from that list, you can back off to the main category after removing any +/- extensions. My suggestion would be to stick to the 232 main categories listed on https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf and https://ucrel.lancs.ac.uk/usas/semtags.txt for the taxonomy itself. The back off is what Wmatrix does where the specific subcategory is not glossed in the additional list.
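The back-off described here could look roughly like the sketch below. The glosses are copied from examples in this thread and stand in for the real lists; `gloss_with_backoff` is a hypothetical name, not a Wmatrix or PyMUSAS function.

```python
from typing import Optional

# Tiny stand-in for the gloss lists (values taken from examples in this thread).
GLOSSES = {
    "F2": "Drinks and alcohol",
    "F2+": "Excessive drinking",
    "F2-": "Not drinking",
    "I1": "Money, generally",
}


def gloss_with_backoff(tag: str, glosses: dict) -> Optional[str]:
    """Use the specific gloss if listed, else fall back to the main category."""
    if tag in glosses:
        return glosses[tag]
    main_category = tag.rstrip("+-")  # remove any +/- extensions
    return glosses.get(main_category)


print(gloss_with_backoff("I1-", GLOSSES))
```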

@perayson
Collaborator

In terms of the "D" category, there isn't one but I expect this originates from the "Df" tag from which you're then removing the "f" as discussed above. This is an unimplemented feature in PyMUSAS (UCREL/pymusas#26) that affects a small number of MWEs in the English lexicon. Unless you fancy implementing the transfer of the other tag as described on the issue, then I would recommend the following fix:

  • if the tag is "Df" on its own, then label it as Z9 (in order to retain the MWE set, the other possibility is to back off to the single word semantic tags)
  • If the tag is "Df" as part of a slash tag, then remove the Df and the slash to leave the other semantic tag in place
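The two rules above can be sketched as a small function applied to the raw tag before any markers are stripped. The name `resolve_df` is hypothetical.

```python
def resolve_df(tag: str) -> str:
    """Apply the recommended fix for the unimplemented 'Df' tag."""
    parts = tag.split("/")
    if "Df" not in parts:
        return tag  # nothing to fix
    remaining = [part for part in parts if part != "Df"]
    # 'Df' on its own becomes Z9; inside a slash tag it is simply removed.
    return "/".join(remaining) if remaining else "Z9"


print(resolve_df("Df"), resolve_df("Df/A5.1+++mfnc"))
```

Applied to the earlier draft output, `Df/A5.1+++mfnc` would then reduce to the A5.1 part only, rather than also producing a `D` category.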

@TomazErjavec
Collaborator Author

My suggestion would be to stick to the 232 main categories listed on https://ucrel.lancs.ac.uk/usas/USASSemanticTagset.pdf and https://ucrel.lancs.ac.uk/usas/semtags.txt for the taxonomy itself.

Well, @matthewcoole made the taxonomy based on https://ucrel.lancs.ac.uk/usas/semtags_subcategories.txt , so I would prefer now to stick to that. Note that USAS taxonomy has two more categories, as A1.7, A1.8 are missing from semtags_subcategories.txt.

Now that I have all the corpora, I could make an exhaustive list of all first tags (so, before the first comma), which made it much easier to make a mapping and check that everything is covered: I've implemented your suggestion for removing / substituting D with Z9 and for backing off for the other tags unknown in the taxonomy. This mostly involves removing a plus or minus; however, there are also some weird tags, in particular A1.2.4-, A9.1+, G1.1.1, S.1.2.3-, S2F, S4T1.1.1, X7.2+, which I also map to a valid category (A1.2, A9, etc.).

As for encoding USAS tags in TEI:

  • the original (no modifications) USAS tag is recorded in w/@function or pc/@function
  • the first USAS tag converted to taxonomy categories goes to w/@ana or pc/@ana
  • semantic MWEs go to phr; note that as we already have name tags, and we don't want conflicts with them, I do not encode phr if it interferes with name. With this we lose about 1/3 of the MWEs, quite a lot, but I don't see what can be done about this (at least not in the remaining time till release).

Here is an example:

<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
   <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
     <phr type="sem" function="Z1mf,Z3c" ana="sem:Z1">
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr." function="Z1mf,Z3c" ana="sem:Z1">Mr.</w>
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" function="Z1mf,Z3c" ana="sem:Z1" join="right">President</w>
     </phr>
     <pc pos="Z" msd="UPosTag=PUNCT" function="Z9" ana="sem:Z9" join="right">-</pc>

So, with this, we are set for a trial run on one of the corpora. Stay tuned!

@matyaskopp
Collaborator

As for encoding USAS tags in TEI:

  • the original (no modifications) USAS tag is recorded in w/@function or pc/@function
  • the first USAS tag converted to taxonomy categories goes to w/@ana or pc/@ana
  • semantic MWEs go to phr; note that as we already have name tags, and we don't want conflicts with them, I do not encode phr if it interferes with name. With this we lose about 1/3 of the MWEs, quite a lot, but I don't see what can be done about this (at least not in the remaining time till release).

Here is an example:

<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
   <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
     <phr type="sem" function="Z1mf,Z3c" ana="sem:Z1">
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr." function="Z1mf,Z3c" ana="sem:Z1">Mr.</w>
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" function="Z1mf,Z3c" ana="sem:Z1" join="right">President</w>
     </phr>
     <pc pos="Z" msd="UPosTag=PUNCT" function="Z9" ana="sem:Z9" join="right">-</pc>

A few general MWE notes/remarks:

  • add xml:id attribute to <phr>
  • I do not think it is good to put function and ana values on the descendants of <phr>, because the values annotate the phrase (they don't annotate the word or punctuation). So the suggestion is:
<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
   <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
     <phr type="sem" function="Z1mf,Z3c" ana="sem:Z1">
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
        <w pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
     </phr>
     <pc pos="Z" msd="UPosTag=PUNCT" function="Z9" ana="sem:Z9" join="right">-</pc>

MWE conflicts:

In general we have 3 types of interferences and only one of them is a conflict:

  1. (NER) inside [SEMMWE] - not a conflict: The [people of (Latvia)] [exercise their] authority
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10
# source = Latvijas tauta realizē savu varu ar ievēlēto deputātu starpniecību, tāpēc es aicinu vienmēr atcerēties, ka ikviena deputāta darba devējs ir Latvijas tauta, un aicinu savus pienākumus pildīt godprātīgi, ar pašcieņu, pēc labākās apziņas.
# text = The people of Latvia exercise their authority through the elected Members, so I call always to remember that each Member's employer is the people of Latvia and call on them to carry out their duties in good faith, with self-esteem, with the best consciousness.
1       The     the     DET     DT      Definite=Def|PronType=Art       0       _       _       ForwardAlignment=1|BackwardAlignment=1|NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
2       people  people  NOUN    NNS     Number=Plur     1       _       _       ForwardAlignment=2|BackwardAlignment=2|NER=O|SpacyLemma=people|SpacyUPoS=NOUN|SpacyXPoS=NNS|SEMMWE=B|SEM=Z2,Z3c
3       of      of      ADP     IN      _       2       _       _       NER=O|SpacyLemma=of|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=I|SEM=Z2,Z3c
4       Latvia  Latvia  PROPN   NNP     Number=Sing     3       _       _       ForwardAlignment=1|NER=B-LOC|SpacyLemma=Latvia|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=Z2,Z3c
5       exercise        exercise        VERB    VBP     Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin   4       _       _       ForwardAlignment=3|BackwardAlignment=3|NER=O|SpacyLemma=exercise|SpacyUPoS=VERB|SpacyXPoS=VBP|SEMMWE=B|SEM=A5.4+
6       their   they    PRON    PRP$    Number=Plur|Person=3|Poss=Yes|PronType=Prs      5       _       _       ForwardAlignment=4|BackwardAlignment=4|NER=O|SpacyLemma=their|SpacyUPoS=PRON|SpacyXPoS=PRP$|SEMMWE=I|SEM=A5.4+
7       authority       authority       NOUN    NN      Number=Sing     6       _       _       ForwardAlignment=5|BackwardAlignment=5|NER=O|SpacyLemma=authority|SpacyUPoS=NOUN|SpacyXPoS=NN|SEMMWE=O|SEM=G1.1c,S7.1+,S7.4+,X2.2+

current encoding (there is also a bug - [exercise their] is not in <phr> even though there is no interference with <name>)

                  <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10" n="10" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t1" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">The</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t2" pos="NNS" msd="UPosTag=NOUN|Number=Plur" lemma="people" function="Z2,Z3c" ana="sem:Z2">people</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t3" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z2,Z3c" ana="sem:Z2">of</w>
                     <name type="LOC">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t4" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Latvia" function="Z2,Z3c" ana="sem:Z2">Latvia</w>
                     </name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t5" pos="VBP" msd="UPosTag=VERB|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin" lemma="exercise" function="A5.4+" ana="sem:A5.4p">exercise</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t6" pos="PRP$" msd="UPosTag=PRON|Number=Plur|Person=3|Poss=Yes|PronType=Prs" lemma="they" function="A5.4+" ana="sem:A5.4p">their</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t7" pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="authority" function="G1.1c,S7.1+,S7.4+,X2.2+" ana="sem:G1.1">authority</w>

it should be (simplified):

                  <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10" n="10" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t1" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">The</w>
<phr><!-- NER inside SEMMWE -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t2" pos="NNS" msd="UPosTag=NOUN|Number=Plur" lemma="people" function="Z2,Z3c" ana="sem:Z2">people</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t3" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z2,Z3c" ana="sem:Z2">of</w>
                     <name type="LOC">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t4" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Latvia" function="Z2,Z3c" ana="sem:Z2">Latvia</w>
                     </name>
</phr><!-- NER inside SEMMWE -->
<phr><!-- bugfix -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t5" pos="VBP" msd="UPosTag=VERB|Mood=Ind|Number=Plur|Person=3|Tense=Pres|VerbForm=Fin" lemma="exercise" function="A5.4+" ana="sem:A5.4p">exercise</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t6" pos="PRP$" msd="UPosTag=PRON|Number=Plur|Person=3|Poss=Yes|PronType=Prs" lemma="they" function="A5.4+" ana="sem:A5.4p">their</w>
</phr><!-- bugfix -->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.10.t7" pos="NN" msd="UPosTag=NOUN|Number=Sing" lemma="authority" function="G1.1c,S7.1+,S7.4+,X2.2+" ana="sem:G1.1">authority</w>
  1. [SEMMWE] inside (NER): the (Cabinet of Ministers’ [Equipment Law]) and
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43
# source = Tāpat arī joprojām uzturu spēkā un esmu pārliecināts, ka būs jāatgriežas pie iepriekšējam Saeimas sasaukumam iesniegtajām izmaiņām Satversmē, Ministru kabineta iekārtas likumā un Saeimas kārtības rullī, kuru mērķis ir padarīt stiprāku un atbildīgāku izpildvaru.
# text = I also continue to feed in force and I am sure that I will have to return to the changes submitted to the previous parliamentary term in the Constitution, the Cabinet of Ministers’ Equipment Law and the Saeima order roll, which aim to make the executive power stronger and more responsible.
### ...
31      the     the     DET     DT      Definite=Def|PronType=Art       30      _       _       ForwardAlignment=21|NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
32      Cabinet cabinet PROPN   NNP     Number=Sing     31      _       _       ForwardAlignment=22|BackwardAlignment=21|NER=B-ORG|SpacyLemma=cabinet|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=O|SEM=H5,G1.1
33      of      of      ADP     IN      _       32      _       _       ForwardAlignment=21|NER=I-ORG|SpacyLemma=of|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=O|SEM=Z5
34      Ministers       Minister        PROPN   NNPS    Number=Plur     33      _       _       ForwardAlignment=22|BackwardAlignment=22|NER=I-ORG|SpaceAfter=No|SpacyLemma=Minister|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=O|SEM=G1.1/S2mf,S9/S2mf
35      ’       's      PART    POS     _       34      _       _       NER=I-ORG|SpacyLemma='s|SpacyUPoS=PUNCT|SpacyXPoS=''|SEMMWE=O|SEM=A9+,A3+,A2.2,Z5
36      Equipment       equipment       PROPN   NNP     Number=Sing     35      _       _       ForwardAlignment=23|BackwardAlignment=23|NER=I-ORG|SpacyLemma=equipment|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=B|SEM=Z1mf,Z3c
37      Law     Law     PROPN   NNP     Number=Sing     36      _       _       ForwardAlignment=24|BackwardAlignment=24|NER=I-ORG|SpacyLemma=Law|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=Z1mf,Z3c
38      and     and     CCONJ   CC      _       37      _       _       ForwardAlignment=25|BackwardAlignment=25|NER=O|SpacyLemma=and|SpacyUPoS=CCONJ|SpacyXPoS=CC|SEMMWE=O|SEM=Z5

Current encoding:

<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t31" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">the</w>
                     <name type="ORG">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t32" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="cabinet" function="H5,G1.1" ana="sem:H5">Cabinet</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t33" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z5" ana="sem:Z5">of</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t34" pos="NNPS" msd="UPosTag=PROPN|Number=Plur" lemma="Minister" function="G1.1/S2mf,S9/S2mf" ana="sem:G1.1 sem:S2" join="right">Ministers</w>
<pc xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t35" pos="Z" msd="UPosTag=PUNCT" function="A9+,A3+,A2.2,Z5" ana="sem:A9p">’</pc>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t36" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="equipment" function="Z1mf,Z3c" ana="sem:Z1">Equipment</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t37" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Law" function="Z1mf,Z3c" ana="sem:Z1">Law</w>
                     </name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t38" pos="CC" msd="UPosTag=CCONJ" lemma="and" function="Z5" ana="sem:Z5">and</w>

it should be (simplified):

<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t31" pos="DT" msd="UPosTag=DET|Definite=Def|PronType=Art" lemma="the" function="Z5" ana="sem:Z5">the</w>
                     <name type="ORG">
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t32" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="cabinet" function="H5,G1.1" ana="sem:H5">Cabinet</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t33" pos="IN" msd="UPosTag=ADP" lemma="of" function="Z5" ana="sem:Z5">of</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t34" pos="NNPS" msd="UPosTag=PROPN|Number=Plur" lemma="Minister" function="G1.1/S2mf,S9/S2mf" ana="sem:G1.1 sem:S2" join="right">Ministers</w>
<pc xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t35" pos="Z" msd="UPosTag=PUNCT" function="A9+,A3+,A2.2,Z5" ana="sem:A9p">’</pc>
<phr><!-- SEMMWE inside NER-->
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t36" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="equipment" function="Z1mf,Z3c" ana="sem:Z1">Equipment</w>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t37" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Law" function="Z1mf,Z3c" ana="sem:Z1">Law</w>
</phr><!-- SEMMWE inside NER-->
                     </name>
<w xml:id="ParlaMint-LV_2014-11-04-PT12-264-U2-P1.43.t38" pos="CC" msd="UPosTag=CCONJ" lemma="and" function="Z5" ana="sem:Z5">and</w>
  1. (NER) and [SEMMWE] spans are crossed - the only real conflict: the [new (European] [Union) initiatives] on
# sent_id = ParlaMint-LV_2014-11-04-PT12-264-U2-P1.24
# source = Šai Saeimai arī pēc prezidentūras būs ievērojami jāpastiprina darbs ar Eiropas Savienības institūcijām, it sevišķi laicīgi izvērtējot jauno Eiropas Savienības iniciatīvu ietekmi uz Latviju, un arī iepriekš radīto problēmjautājumu, piemēram, obligātā iepirkuma komponentes ietekmes, risinājumu meklēšana.
# text = The Saeima will also need to step up its work with the European Union institutions after the Presidency, in particular by assessing in a timely manner the impact of the new European Union initiatives on Latvia, and also by looking for solutions to the problems that have arisen in the past, such as the impact of the mandatory procurement component.
### .....
31      the     the     DET     DT      Definite=Def|PronType=Art       30      _       _       NER=O|SpacyLemma=the|SpacyUPoS=DET|SpacyXPoS=DT|SEMMWE=O|SEM=Z5
32      new     new     ADJ     JJ      Degree=Pos      31      _       _       ForwardAlignment=19|BackwardAlignment=19|NER=O|SpacyLemma=new|SpacyUPoS=ADJ|SpacyXPoS=JJ|SEMMWE=B|SEM=T1.3
33      European        European        ADJ     NNP     Degree=Pos      32      _       _       ForwardAlignment=20|BackwardAlignment=20|NER=B-ORG|SpacyLemma=European|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=I|SEM=T1.3
34      Union   Union   PROPN   NNP     Number=Sing     33      _       _       ForwardAlignment=21|BackwardAlignment=21|NER=I-ORG|SpacyLemma=Union|SpacyUPoS=PROPN|SpacyXPoS=NNP|SEMMWE=B|SEM=T1.3
35      initiatives     initiative      NOUN    NNS     Number=Plur     34      _       _       ForwardAlignment=22|BackwardAlignment=22|NER=O|SpacyLemma=initiative|SpacyUPoS=NOUN|SpacyXPoS=NNS|SEMMWE=I|SEM=T1.3
36      on      on      ADP     IN      _       35      _       _       ForwardAlignment=23|BackwardAlignment=24|NER=O|SpacyLemma=on|SpacyUPoS=ADP|SpacyXPoS=IN|SEMMWE=O|SEM=Z5

Here I am not sure about the solution, considering my suggestion above:

  • I do not think it is good to put function and ana values to descendants of <phr>, because the values annotate the phrase (they don't annotate word or punctuation).

@TomazErjavec
Collaborator Author

@matyaskopp, thanks for your input. Below my comments:

add xml:id attribute to <phr>

Not sure why we would do this, because:

  • nothing refers to <phr>, so the IDs wouldn't serve any useful function
  • it would also make the TEI less consistent because we don't have IDs on <name> either (and we can't change this now, at least for original language corpora, and the MTed corpora are supposed to mimic the original language corpora as closely as possible)

So, I wouldn't do this now. If you feel this is sensible (presumably both for name and phr) pls. open a Future issue and explain why this would be necessary.

I do not think it is good to put function and ana values to descendants of <phr>, because the values annotate the phrase (they don't annotate word or punctuation).

I was thinking about this but then decided to keep the token-level function and ana attributes even inside phrases, mostly because the main platforms we have for using the corpora are noSketch(-like) concordancers, where structural attributes are rather useless and, in particular, cannot be used in tandem with positional ones. So, e.g., if you build a frequency lexicon of semantic categories (i.e. on positional attributes), you would not get any results for tokens inside phrases. This to me is wrong, as the results would never show the semantic categories of any words inside phrases.
Also, and maybe a bit less convincing, just as a certain single word in a certain context receives a semantic tag, so should single words in a multi-word expression context.
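
The point about positional attributes can be illustrated with a minimal Python sketch. It is not part of the ParlaMint tooling: the helper name sem_frequencies and the cut-down, namespace-free sample sentence are assumptions made for illustration (real ParlaMint files use the TEI namespace). It counts semantic categories from token-level @ana values; because the tokens inside <phr> keep their own @ana, they are counted as well.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Cut-down sample based on the sentence discussed above (attributes trimmed).
SAMPLE = """<s>
<w lemma="the" ana="sem:Z5">The</w>
<phr>
<w lemma="exercise" ana="sem:A5.4p">exercise</w>
<w lemma="they" ana="sem:A5.4p">their</w>
</phr>
<w lemma="authority" ana="sem:G1.1">authority</w>
</s>"""


def sem_frequencies(xml_text):
    """Count the sem: pointers in @ana of every <w> and <pc>, at any depth."""
    root = ET.fromstring(xml_text)
    counts = Counter()
    for tok in root.iter():
        if tok.tag in ("w", "pc"):
            # @ana may hold several space-separated pointers.
            for ref in tok.get("ana", "").split():
                if ref.startswith("sem:"):
                    counts[ref] += 1
    return counts


freqs = sem_frequencies(SAMPLE)
```

If @ana were carried only by <phr>, the two tokens of [exercise their] would contribute nothing to such a token-level count.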

As for the "conflicts" of phr with name: yes, I am aware that I also miss out on phr/name and name/phr, which is why I was careful to write "I do not encode phr if it interferes with name". One reason I avoid this is that the schema gets messy, but the main one is that the program that converts CoNLL-U to TEI (i.e. conllu2tei.pl) was written to convert names, not phrases, and the addition that deals with phrases is a hack, cf. sub fix_elements there.

To do the name/phr or phr/name nesting the script would need to be completely re-done (with look-ahead to decide what is the correct option, i.e. to either remove a phrase or to keep it, and, if so, what the nesting is) and I don't have the time to do it now (esp. as the processing of the -en corpus will take a week or so).
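
As a rough idea of what such a look-ahead decision involves, here is a minimal Python sketch. It is not how conllu2tei.pl works; the function names bio_spans and relation are hypothetical. It reads the SEMMWE and NER columns of a whole sentence first, turns them into spans, and then classifies each phrase/name pair as nested, crossed or disjoint. The sample tags are those of tokens 31-36 of the "new European Union initiatives" example.

```python
def bio_spans(tags, begin, inside):
    """Return (start, end) pairs (end exclusive) for runs of B/I-style tags."""
    spans, start = [], None
    for i, tag in enumerate(tags):
        if begin(tag):
            if start is not None:
                spans.append((start, i))
            start = i
        elif not inside(tag):
            if start is not None:
                spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, len(tags)))
    return spans


def relation(phr, name):
    """Classify how a phrase span relates to a name span."""
    (ps, pe), (ns, ne) = phr, name
    if pe <= ns or ne <= ps:
        return "disjoint"
    if ns <= ps and pe <= ne:
        return "phr-in-name"
    if ps <= ns and ne <= pe:
        return "name-in-phr"
    return "crossed"


# Tokens: the new European Union initiatives on
semmwe = ["O", "B", "I", "B", "I", "O"]
ner = ["O", "O", "B-ORG", "I-ORG", "O", "O"]

phrs = bio_spans(semmwe, lambda t: t == "B", lambda t: t == "I")
names = bio_spans(ner, lambda t: t.startswith("B-"), lambda t: t.startswith("I-"))
relations = [relation(p, n) for p in phrs for n in names]
```

A converter could then emit the nesting for the two nested cases and drop the phrase (or the name) only for pairs that come out as crossed.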

This unfortunately also goes for phrases which are next to names, a problem I wasn't aware of.
But, again, it would make sense to have a Future issue recording this problem.

@matyaskopp
Collaborator

add xml:id attribute to <phr>

Not sure why we would do this, because:

  • nothing refers to <phr>, so the IDs wouldn't serve any useful function
  • it would also make the TEI less consistent because we don't have IDs on <name> either (and we can't change this now, at least for original language corpora, and the MTed corpora are supposed to mimic the original language corpora as closely as possible)

So, I wouldn't do this now. If you feel this is sensible (presumably both for name and phr) pls. open a Future issue and explain why this would be necessary.

I don't think it is necessary. But initially we also thought that it was not necessary to have xml:id on notes, and it later became useful.

I do not think it is good to put function and ana values to descendants of <phr>, because the values annotate the phrase (they don't annotate word or punctuation).

I was thinking about this but then decided to keep the token-level function and ana attributes even inside phrases, mostly because the main platforms we have for using the corpora are noSketch(-like) concordancers, where structural attributes are rather useless and, in particular, cannot be used in tandem with positional ones. So, e.g., if you build a frequency lexicon of semantic categories (i.e. on positional attributes), you would not get any results for tokens inside phrases. This to me is wrong, as the results would never show the semantic categories of any words inside phrases. Also, and maybe a bit less convincing, just as a certain single word in a certain context receives a semantic tag, so should single words in a multi-word expression context.

At this point, I disagree - the TEI format should be the most brilliant format because other file formats (conllu, vert,...) are derived from it. So the design of this format shouldn't be noSketch/TEITOK-driven because this format needs to be semantically correct.

The derived formats can carry the information on tokens: e.g. both the TEITOK and noSketch formats have UD syntax on tokens.
You can also get rid of <phr> if the structure is not to be used in noSketch.


side note (we probably do not have time to implement it now, but the idea can be raised in future)
This and all the interferences and conflicts bring me to #236 which is a kind of standOff annotation that I suggested in #204 (comment)

<seg xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1" xml:lang="en" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1">
   <s xml:id="ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1" n="1" corresp="mt-src:ParlaMint-LV_2014-11-04-PT12-264-U1-P1.1">
    <w xml:id="tok01" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="Mr.">Mr.</w>
    <w xml:id="tok02" pos="NNP" msd="UPosTag=PROPN|Number=Sing" lemma="President" join="right">President</w>
    <pc xml:id="tok03" pos="Z" msd="UPosTag=PUNCT" join="right">-</pc>
<!-- ... -->
    <spanGrp type="sem">
      <span target="#tok01 #tok02" type="Z1mf,Z3c" ana="sem:Z1"/>
      <span target="#tok03" type="Z9" ana="sem:Z9"/>
    </spanGrp>
  </s>
</seg>

but the @function is not allowed in <span>, so it should be replaced with @type
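
A minimal Python sketch of how such a <spanGrp> could be generated from token-level data follows. The function name build_span_grp, the tuple layout and the B/I/O values assigned to the three sample tokens are assumptions for illustration, not an existing ParlaMint script. It writes the USAS tag string to @type, following the remark that @function is not allowed on <span>.

```python
import xml.etree.ElementTree as ET


def build_span_grp(tokens):
    """tokens: (xml_id, semmwe, usas, ana) tuples in sentence order.

    Tokens tagged B/I are merged into one <span>; every other token
    gets a <span> of its own.
    """
    grp = ET.Element("spanGrp", {"type": "sem"})
    open_span = None
    for xml_id, semmwe, usas, ana in tokens:
        if semmwe == "I" and open_span is not None:
            # Continue the multi-word expression opened by the last B token.
            open_span.set("target", open_span.get("target") + " #" + xml_id)
            continue
        span = ET.SubElement(
            grp, "span", {"target": "#" + xml_id, "type": usas, "ana": ana}
        )
        open_span = span if semmwe == "B" else None
    return grp


# Hypothetical token data mirroring the three tokens in the example above.
tokens = [
    ("tok01", "B", "Z1mf,Z3c", "sem:Z1"),
    ("tok02", "I", "Z1mf,Z3c", "sem:Z1"),
    ("tok03", "O", "Z9", "sem:Z9"),
]
grp = build_span_grp(tokens)
xml_out = ET.tostring(grp, encoding="unicode")
```

Because the spans only point at token IDs, crossing name and phrase spans would not need to be nested in the element tree at all.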

@TomazErjavec
Collaborator Author

The points here have been mostly solved; what remains should be taken up in #827.
