Non-uniform tokenization of sentences having dialogue #223

Open
@NikhilPr95

Description

A sentence that contains both quoted and non-quoted words is not split into sentences uniformly.

Given sentences such as:
"Where were you?" asked Mary angrily.

Roughly half of such sentences are returned as a single sentence:

  1. "Where were you?" asked Mary angrily.

and the other half are split into two sentences:

  1. "Where were you?"
  2. asked Mary angrily.

This occurs when the following code is executed (with the most recent version):

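    import java.util.List;
    import java.util.Properties;

    import edu.stanford.nlp.ling.CoreAnnotations.SentencesAnnotation;
    import edu.stanford.nlp.pipeline.Annotation;
    import edu.stanford.nlp.pipeline.StanfordCoreNLP;
    import edu.stanford.nlp.util.CoreMap;

    // Build the pipeline; the ssplit annotator does the sentence splitting.
    Properties props = new Properties();
    props.setProperty("annotators", "tokenize, ssplit, pos, lemma, ner, depparse");
    StanfordCoreNLP pipeline = new StanfordCoreNLP(props);

    // doc is assumed to be a String holding the input text.
    Annotation document = new Annotation(doc);
    pipeline.annotate(document);

    List<CoreMap> sentences = document.get(SentencesAnnotation.class);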
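To put a number on "roughly half", a small tally of sentence counts per input may help. The following is only a sketch and not part of CoreNLP: the class `SplitTally`, the method `tally`, and the second example sentence are hypothetical names and data introduced here for illustration. The splitter is passed in as a function, so the placeholder in `main` can be swapped for one that runs the pipeline above and returns one string per `CoreMap`.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.function.Function;

// Hypothetical helper, not part of CoreNLP.
class SplitTally {

    // For each input, run the splitter and count how many inputs
    // produced 1 sentence, how many produced 2, and so on.
    static Map<Integer, Integer> tally(List<String> inputs,
                                       Function<String, List<String>> splitter) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (String input : inputs) {
            int sentenceCount = splitter.apply(input).size();
            counts.merge(sentenceCount, 1, Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // The first sentence is from this report; the second is a made-up
        // sentence of the same shape.
        List<String> inputs = List.of(
            "\"Where were you?\" asked Mary angrily.",
            "\"Who is there?\" asked John quietly.");

        // Placeholder splitter that never splits. Replace it with a function
        // that annotates the text with the pipeline and returns the sentences.
        Function<String, List<String>> splitter = text -> List.of(text);

        // Keys are sentence counts, values are the number of inputs.
        System.out.println(tally(inputs, splitter));
    }
}
```

With a uniform splitter, all inputs of this shape should land under a single key; the behaviour described above would show up as counts spread across keys 1 and 2.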
