Consider supporting ISO-TimeML standard #92

Open
narnold-cl opened this issue Nov 26, 2021 · 0 comments

The ISO-TimeML version of the TimeML Standard offers (at least) the following benefits:

  • Standoff annotations (Chapter 3.3; see the compromise below)
  • Preserved tokenization

Read about it here:
https://lexitron.nectec.or.th/public/LREC-2010_Malta/pdf/55_Paper.pdf

If supporting the complete standard is too much work, it would still be nice to have standoff annotations. We currently compute the offsets manually and fuzzy-match them to the token and sentence boundaries detected by our own preprocessing pipeline.
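To illustrate the matching step described above, here is a minimal sketch of mapping a character span onto token boundaries. It is not our actual pipeline: the whitespace tokenizer and the helper names `token_spans` and `tokens_overlapping` are assumptions made up for this example.

```python
import re

def token_spans(text):
    """Return (start, end) character offsets of whitespace-separated tokens.

    Hypothetical stand-in for a real tokenizer; end offsets are exclusive.
    """
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]

def tokens_overlapping(spans, start, end):
    """Return indices of tokens whose half-open span overlaps [start, end)."""
    return [i for i, (ts, te) in enumerate(spans) if ts < end and start < te]

text = "Today I feel great."
spans = token_spans(text)
# Tokens covered by a temporal expression spanning characters 0..5
hit = tokens_overlapping(spans, 0, 5)
```

With exact offsets, an overlap test like this is all that is needed. Without them, the offsets have to be guessed first, which is where the fuzzy matching comes in.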

Compromise: add standoff information to the current inline TimeML annotations

A simple fix for this specific problem would be to (optionally) add the character positions to the tagged spans, like so:

# input text:
"Today I feel great."

# currently generated TimeML output:
'<?xml version="1.0"?><!DOCTYPE TimeML SYSTEM "TimeML.dtd"><TimeML>
<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3> I feel great.
</TimeML>'

# Proposed additional tag-attributes (orig_start_char, orig_end_char):
<TIMEX3 tid="t1" type="DATE" value="2021-11-16" orig_start_char="0" orig_end_char="5">Today</TIMEX3>

This records that the TIMEX3 with tid t1 tags the span of the original text from character 0 (inclusive) to character 5 (exclusive).
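A consumer could then read the offsets directly from the tag. The sketch below assumes the proposed attributes `orig_start_char` and `orig_end_char` exist exactly as written above (they are not part of HeidelTime's current output), and `span_from_tag` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

text = "Today I feel great."
# Tag with the PROPOSED attributes; not produced by HeidelTime today.
tag = ('<TIMEX3 tid="t1" type="DATE" value="2021-11-16" '
       'orig_start_char="0" orig_end_char="5">Today</TIMEX3>')

def span_from_tag(tag_xml):
    """Return (start, end) from the proposed attributes; end is exclusive."""
    el = ET.fromstring(tag_xml)
    return int(el.get("orig_start_char")), int(el.get("orig_end_char"))

start, end = span_from_tag(tag)
# Half-open offsets map directly onto slicing of the original input
covered = text[start:end]
```

Because the end offset is exclusive, slicing the input with the two values yields the tagged text without any adjustment.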

Again, this information is necessary to synchronize the tokenization that HeidelTime uses internally, but then discards, with your own tokenization.

The information for those additional attributes should be easily accessible at runtime.

We've already implemented a first draft of a parsing algorithm that derives those character-based span indices incrementally after the fact, but it feels like a lot of duplicate work to reconstruct information that was already available at HeidelTime's runtime.
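For reference, a rough sketch of what such an after-the-fact reconstruction can look like. This is not our draft implementation. It assumes the output is well-formed XML, that the wrapper adds exactly one newline after `<TimeML>` and one before `</TimeML>`, and that tags are not nested. The function name `timex_char_spans` is hypothetical.

```python
import xml.etree.ElementTree as ET

def timex_char_spans(timeml):
    """Return (plain_text, [(tid, start, end), ...]) with exclusive end offsets."""
    root = ET.fromstring(timeml)
    pieces, spans, pos = [], [], 0

    def emit(chunk):
        # Append a chunk of plain text and advance the running offset.
        nonlocal pos
        if chunk:
            pieces.append(chunk)
            pos += len(chunk)

    lead = root.text or ""
    # Assumption: the newline right after <TimeML> is not part of the input.
    if lead.startswith("\n"):
        lead = lead[1:]
    emit(lead)
    for el in root:  # direct children only; nested tags are not handled
        start = pos
        emit("".join(el.itertext()))
        spans.append((el.get("tid"), start, pos))
        emit(el.tail)
    plain = "".join(pieces)
    # Assumption: the newline right before </TimeML> is not part of the input.
    if plain.endswith("\n"):
        plain = plain[:-1]
    return plain, spans

timeml = ('<?xml version="1.0"?><!DOCTYPE TimeML SYSTEM "TimeML.dtd"><TimeML>\n'
          '<TIMEX3 tid="t1" type="DATE" value="2021-11-16">Today</TIMEX3>'
          ' I feel great.\n</TimeML>')
plain, spans = timex_char_spans(timeml)
```

The offsets are only correct if the reconstructed plain text is character-for-character identical to the original input, which is exactly the fragile part that built-in offsets would remove.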
