cleanxml and -tokenize.whitespace true do not work together

Dear all,

I get an exception when trying to annotate a very simple XML file.
I'd be very grateful to hear about ideas for workarounds since this is currently stopping me from working with CoreNLP on a pre-tokenized dataset.

Here is the command:
` java -cp "./*:" edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,cleanxml,ssplit,pos -tokenize.whitespace true -tokenize.keepeol true -ssplit.eolonly true -outputFormat json -file ~/parse_2017/orga/test_input.txt`
[tests with clean.allowflawedxml true, clean.singlesentencetags true, etc. did not work either]

Here is the output of CoreNLP:

```
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator tokenize
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator cleanxml
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator ssplit
[main] INFO edu.stanford.nlp.pipeline.StanfordCoreNLP - Adding annotator pos
[main] INFO edu.stanford.nlp.tagger.maxent.MaxentTagger - Loading POS tagger from edu/stanford/nlp/models/pos-tagger/english-left3words/english-left3words-distsim.tagger ... done [0.9 sec].

Processing file /home/hpc/sles/sles000h/parse_2017/orga/test_input.txt ... writing to /home/woody/sles/sles000h/stanford-corenlp-full-2016-10-31/test_input.txt.json
Exception in thread "main" java.lang.IllegalArgumentException: Got a close tag s which does not match any open tag
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.process(CleanXmlAnnotator.java:624)
        at edu.stanford.nlp.pipeline.CleanXmlAnnotator.annotate(CleanXmlAnnotator.java:244)
        at edu.stanford.nlp.pipeline.AnnotationPipeline.annotate(AnnotationPipeline.java:76)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:605)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.annotate(StanfordCoreNLP.java:615)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:1164)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.processFiles(StanfordCoreNLP.java:945)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.run(StanfordCoreNLP.java:1253)
        at edu.stanford.nlp.pipeline.StanfordCoreNLP.main(StanfordCoreNLP.java:1323)
```

Here is the content of test_input.txt:

```
<corpus> 
<s id="1"> The cat sat on the mat . She knows how to write papers . </s> 
<s id="2"> When will she ever learn? </s> 
</corpus> 

```
[I am aware this is no sensible input. I'm just using it to test that CoreNLP really does no tokenization and sentence-splitting by itself.]

Best,
Peter

Edit: This is the current (3.7.0) release version of CoreNLP.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

cleanxml and -tokenize.whitespace true do not work together #448

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

cleanxml and -tokenize.whitespace true do not work together #448

Description

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions