Description
Dear all,
I get an exception when trying to annotate a very simple XML file.
I'd be very grateful for any workaround ideas, since this is currently stopping me from using CoreNLP on a pre-tokenized dataset.
Here is the command:
java -cp "./*:" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos -tokenize.whitespace true -tokenize.keepeol true -ssplit.eolonly true -outputFormat json -file ~/parse_2017/orga/test_input.txt
[Tests with clean.allowflawedxml true, clean.singlesentencetags true, etc. did not work either.]
Here is the output of CoreNLP:
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].
Processing file /home/hpc/sles/sles000h/parse_2017/orga/test_input.txt ... writing to /home/woody/sles/sles000h/stanford-corenlp-full-2016-10-31/test_input.txt.json
Exception in thread "main" java.lang.IllegalArgumentException: Got a close tag s which does not match any open tag
at edu.stanford.nlp.pipeline.CleanXmlAnnotator.process(CleanXmlAnnotator.java:624)
at edu.stanford.nlp.pipeline.CleanXmlAnnotator.annotate(CleanXmlAnnotator.java:244)
at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:605)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:615)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1164)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:945)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1253)
at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1323)
Here is the content of test_input.txt:
<corpus>
<s id="1"> The cat sat on the mat . She knows how to write papers . </s>
<s id="2"> When will she ever learn? </s>
</corpus>
[I am aware this is not sensible input. I'm just using it to test that CoreNLP really does no tokenization or sentence splitting of its own.]
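One possible workaround, sketched here under assumptions rather than verified against CoreNLP: my unconfirmed guess is that with -tokenize.whitespace true the opening tag <s id="1"> is split at the space into two tokens, so cleanxml never sees an open tag matching </s>. If the <s> elements are only needed as sentence boundaries, the tags could be stripped before CoreNLP runs, writing one sentence per line so that -ssplit.eolonly true applies. The helper names sentences_from_xml and convert are hypothetical, and the sketch assumes the input is well-formed XML with non-nested <s> elements.

```python
import xml.etree.ElementTree as ET


def sentences_from_xml(xml_text):
    """Return the text of every <s> element, in document order.

    Whitespace inside each element is collapsed to single spaces so that
    each sentence fits on exactly one line.
    """
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iter("s"):
        # itertext() also collects text from any child elements of <s>.
        text = "".join(s.itertext())
        sentences.append(" ".join(text.split()))
    return sentences


def convert(in_path, out_path):
    """Write one pre-tokenized sentence per line to out_path."""
    with open(in_path, encoding="utf-8") as src:
        sentences = sentences_from_xml(src.read())
    with open(out_path, "w", encoding="utf-8") as dst:
        for sentence in sentences:
            dst.write(sentence + "\n")
```

The file written by convert could then be passed to the same command with cleanxml dropped from -annotators. Note that this discards the id attributes, so it only helps if they are not needed downstream.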
Best,
Peter
Edit: This is the current (3.7.0) release version of CoreNLP.