Handling unigram and bigram features at the same time in word2features #137

@AbhishekBose

Description

Hello,
I am running an NER experiment on a custom dataset that contains a lot of food items.
My training data has labels for certain unigrams and bigrams.

My label list contains "green chilli" = "vegetable", but it has no label for "chilli" on its own.
I am using this label list to annotate sentences for NER.

For example:

A sentence might contain a bigram such as "green chilli" with its associated label = "vegetable".

Currently, while generating the features, I mark both "green" and "chilli" as "vegetable".
My annotation pipeline is as follows:

  • Split the sentence into unigrams (tokens)
  • Check whether each unigram exists in the label list -> if it does, mark the unigram with that label
  • Build a bigram from token + sentence[idx+1] or sentence[idx-1] + token
  • Check whether the bigram exists in the label list -> if it does, mark both the token and its neighbour (sentence[idx+1] or sentence[idx-1]) with that label
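
To make the pipeline above concrete, here is a minimal sketch of how I understand it. The function name `annotate`, the `LABELS` dictionary, the default tag "O" for unlabelled tokens, and the example sentence are assumptions made for illustration; only the "green chilli" = "vegetable" entry comes from my actual label list. Scanning every adjacent pair once covers both the idx+1 and idx-1 cases.

```python
from typing import Dict, List

# Hypothetical label list; only "green chilli" is taken from the description.
LABELS: Dict[str, str] = {"green chilli": "vegetable"}


def annotate(tokens: List[str], labels: Dict[str, str], default: str = "O") -> List[str]:
    """Tag tokens using unigram and bigram lookups in a label list."""
    # Assumption: tokens without a label get the default tag "O".
    tags = [default] * len(tokens)

    # Unigram pass: mark every token that has its own entry in the label list.
    for i, token in enumerate(tokens):
        if token in labels:
            tags[i] = labels[token]

    # Bigram pass: each adjacent pair is checked once, which covers both
    # "token + next" and "previous + token". Both tokens get the bigram's label.
    for i in range(len(tokens) - 1):
        bigram = f"{tokens[i]} {tokens[i + 1]}"
        if bigram in labels:
            tags[i] = tags[i + 1] = labels[bigram]

    return tags


tokens = "add one green chilli".split()
print(list(zip(tokens, annotate(tokens, LABELS))))
```

With this sketch, "green" and "chilli" each receive the plain "vegetable" tag, so nothing in the tag sequence records that the two tokens form a single item.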

As a result of the last step, both "green" and "chilli" get marked as "vegetable".

So when I train my model and run inference on a test sentence containing "green chilli", I get "vegetable" twice, once for each token.

What would be the best way to annotate this using word2features?
