
cleanxml and -tokenize.whitespace true do not work together #448

Open
@peteruhrig

Dear all,

I get an exception when trying to annotate a very simple XML file.
I'd be very grateful for any workaround ideas, since this currently prevents me from using CoreNLP on a pre-tokenized dataset.

Here is the command:
java -cp "./*:" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos -tokenize.whitespace true -tokenize.keepeol true -ssplit.eolonly true -outputFormat json -file ~/parse_2017/orga/test_input.txt
[Additional tests with clean.allowflawedxml true, clean.singlesentencetags true, etc. failed in the same way.]

Here is the output of CoreNLP:

[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].

Processing file /home/hpc/sles/sles000h/parse_2017/orga/test_input.txt ... writing to /home/woody/sles/sles000h/stanford-corenlp-full-2016-10-31/test_input.txt.json
Exception in thread "main" java.lang.IllegalArgumentException: Got a close tag s which does not match any open tag
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.process(CleanXmlAnnotator.java:624)
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.annotate(CleanXmlAnnotator.java:244)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:605)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:615)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1164)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:945)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1253)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1323)
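
One possible reading of this exception, offered as an unverified assumption and not a confirmed diagnosis: when tokens are split purely on whitespace, an opening tag that carries an attribute, such as the <s id="1"> in the input file, is cut at the space inside the tag, so no single token looks like a complete opening tag, while the closing </s> survives intact. That would be consistent with the message naming s and not corpus. The Python sketch below only illustrates plain whitespace splitting; it does not call or reproduce CoreNLP's tokenizer, and the helper looks_like_tag is hypothetical.

```python
# Illustration only: plain whitespace splitting, NOT CoreNLP's actual tokenizer.
line = '<s id="1"> The cat sat on the mat . </s>'

# Split on runs of whitespace, as a whitespace-only tokenizer would.
tokens = line.split()


def looks_like_tag(token):
    """Hypothetical check: a token counts as a tag only if it is a complete <...> unit."""
    return token.startswith("<") and token.endswith(">")


# Only tokens that form a complete tag on their own are recognised.
tags = [t for t in tokens if looks_like_tag(t)]

print(tokens[:2])  # the opening tag is broken into two pieces
print(tags)        # only the closing tag remains recognisable
```

If this assumption holds, an attribute-free tag like <corpus> would not be affected, because it contains no internal whitespace.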

Here is the content of test_input.txt:

<corpus> 
<s id="1"> The cat sat on the mat . She knows how to write papers . </s> 
<s id="2"> When will she ever learn? </s> 
</corpus> 

[I am aware this is not sensible input. I'm just using it to test that CoreNLP really does no tokenization or sentence splitting of its own.]
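
A possible stopgap, sketched here under stated assumptions and not tested against CoreNLP itself, is to strip the markup before the file reaches the pipeline and then drop cleanxml from the annotator list. The Python sketch below assumes each <s> element sits on its own line, as in the sample, and that the sentence ids are not needed downstream (they are discarded). The function name strip_tags and the regular expression are illustrative choices, not part of CoreNLP.

```python
import re

# Matches any XML-style tag, e.g. <corpus>, <s id="1">, </s>.
TAG = re.compile(r"<[^>]+>")


def strip_tags(xml_text):
    """Return the non-empty lines of xml_text with tags removed.

    Assumes one sentence element per line; whitespace inside each
    line is collapsed to single spaces so the tokens stay separated.
    """
    sentences = []
    for line in xml_text.splitlines():
        text = " ".join(TAG.sub(" ", line).split())
        if text:  # skip lines that held only tags, e.g. <corpus>
            sentences.append(text)
    return sentences


sample = (
    "<corpus> \n"
    '<s id="1"> The cat sat on the mat . She knows how to write papers . </s> \n'
    '<s id="2"> When will she ever learn? </s> \n'
    "</corpus> \n"
)

plain = strip_tags(sample)
print("\n".join(plain))
```

The resulting one-sentence-per-line text could then be written to a file and passed to the same command with -annotators tokenize,ssplit,pos, keeping -tokenize.whitespace true and -ssplit.eolonly true as before.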

Best,
Peter

Edit: This is the current (3.7.0) release version of CoreNLP.
