How to process a large document which has longer text length for NER? #1028

AayushSameerShah · 2023-06-07T14:03:47Z

📝 Brief

I am trying to use NER for healthcare: I want to extract key "disorders" or "diseases" from various web articles for my use case.

🧠 The model

I used a Hugging Face model and followed the procedure in the JSL Tutorial to convert the HF model to TF and use it in Spark NLP. I now have the following code:

👩🏻‍💻 Code

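# Imports assumed from the tutorial setup
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import DistilBertForTokenClassification, NerConverter, SentenceDetector, Tokenizer

# Load the saved model
tokenClassifier_loaded = DistilBertForTokenClassification.load("./{}_spark_nlp".format(MODEL_NAME))\
  .setInputCols(["document",'token'])\
  .setOutputCol("ner")\
  .setCaseSensitive(True)\
  .setMaxSentenceLength(512)  # tried the largest value possible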

document_assembler = DocumentAssembler() \
    .setInputCol('text') \
    .setOutputCol('document')

sentence = SentenceDetector()\
    .setInputCols(['document'])\
    .setOutputCol('sentence')

tokenizer = Tokenizer() \
    .setInputCols(['sentence']) \
    .setOutputCol('token')

converter = NerConverter()\
    .setInputCols(["document", "token", "ner"])\
    .setOutputCol("ner_span")

pipeline = Pipeline(stages=[
    document_assembler, 
    sentence,
    tokenizer,
    tokenClassifier_loaded,
    converter 
])

Then I have the text:

article = \
"""
Sample Type / Medical Specialty:
Hematology - Oncology
Sample Name:
Discharge Summary - Mesothelioma - 1
Description: ...
"""

The article has 30K+ characters and 3K+ words (when split on spaces). This is where it gets crazy. When I run the following:

data = spark.createDataFrame([[article]]).toDF("text")
result = pipeline.fit(data).transform(data)

row_list = [{'annotatorType': row.annotatorType, 
             'begin': row.begin, 
             'end': row.end,
             'result': row.result,
             'metadata': row.metadata} 
            for row in result.select('ner_span').take(1)[0][0]
           ]
len(row_list)

It returns only 43 detected entities.
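
To see whether those entities are spread across the whole article or only come from its beginning, one could compare the largest `end` offset with the article length. A minimal sketch, assuming rows shaped like the `row_list` built above; the helper name `span_coverage` and the sample values are hypothetical:

```python
def span_coverage(rows, text_length):
    """Return (last_end_offset, fraction_of_text_reached) for a list of span dicts."""
    if not rows or text_length <= 0:
        return 0, 0.0
    last_end = max(row["end"] for row in rows)
    # Spark NLP `end` offsets are inclusive, so add 1 to get a length.
    return last_end, min(1.0, (last_end + 1) / text_length)


# Hypothetical rows in the same shape as `row_list`
sample_rows = [
    {"begin": 10, "end": 21, "result": "mesothelioma"},
    {"begin": 2984, "end": 2999, "result": "pleural effusion"},
]
last_end, covered = span_coverage(sample_rows, text_length=30000)
print(last_end, covered)
```

A fraction far below 1.0 would suggest that only the first part of the text is being tagged, rather than the model simply finding few entities.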

🙋🏻‍♂️ The question:

I understand that the whole article can't be passed at once, but there has to be a smart way to handle this. Since I am new here, I am not sure whether I should split the article into chunks of 512 and pass them one by one, or do something else.
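
The splitting idea could be sketched in plain Python as follows. This is only an illustration, not a Spark NLP API: the helper name `chunk_text`, the whitespace-based word count, and the `max_words`/`overlap` values are assumptions. A model's 512 limit counts subword tokens rather than words, so a word budget well below 512 is probably the safer choice. Each chunk keeps its character offset so that entity positions can be mapped back to the full article.

```python
import re


def chunk_text(text, max_words=200, overlap=20):
    """Split `text` into overlapping word windows.

    Returns a list of (char_offset, chunk) tuples, where `char_offset`
    is the position of the chunk's first character in `text`.
    """
    if max_words <= 0 or not 0 <= overlap < max_words:
        raise ValueError("need max_words > 0 and 0 <= overlap < max_words")
    # Start/end positions of every whitespace-delimited word in the text
    words = [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]
    chunks = []
    step = max_words - overlap
    for i in range(0, len(words), step):
        window = words[i:i + max_words]
        start, end = window[0][0], window[-1][1]
        chunks.append((start, text[start:end]))
        if i + max_words >= len(words):
            break  # this window already reaches the end of the text
    return chunks


demo = "one two three four five six seven eight nine ten"
for offset, chunk in chunk_text(demo, max_words=4, overlap=1):
    print(offset, repr(chunk))
```

An entity found at `begin` inside a chunk would then sit at `char_offset + begin` in the original article. Entities that fall in an overlapping region may be reported twice and would need deduplication.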

Will anyone please help me here?

Thank you,
Aayush 🤗

Replies: 0 comments
