visual nlp 5.5.0 release notes (#1699)
albertoandreottiATgmail authored Jan 23, 2025
1 parent 8086c59 commit 365378b
Showing 3 changed files with 192 additions and 55 deletions.
Binary file added docs/assets/images/obfuscation_impainting.png
124 changes: 69 additions & 55 deletions docs/en/spark_ocr_versions/ocr_release_notes.md
seotitle: Spark OCR | John Snow Labs
title: Spark OCR release notes
permalink: /docs/en/spark_ocr_versions/ocr_release_notes
key: docs-ocr-release-notes
modify_date: "2024-09-26"
modify_date: "2024-01-23"
show_nav: true
sidebar:
nav: sparknlp-healthcare
---

<div class="h3-box" markdown="1">

## 5.5.0

Release date: 23-01-2025

## Visual NLP 5.5.0 Release Notes 🕶️

**We are glad to announce that Visual NLP 5.5.0 has been released! This release comes with new Dicom pretrained pipelines, new features, and bug fixes. 📢📢📢**

</div><div class="h3-box" markdown="1">

## Highlights 🔴

* New Obfuscation Features in ImageDrawRegions.
* New Obfuscation Features in DicomMetadataDeidentifier.
* New Dicom Pretrained Pipelines.
* New VisualDocumentProcessor.

</div><div class="h3-box" markdown="1">

## New Obfuscation Features in ImageDrawRegions
ImageDrawRegions' main purpose is to draw solid rectangles on top of regions that typically come from NER or a similar model. Often, though, it is useful to draw not just solid rectangles over the detected entities, but other values in their place, such as obfuscated values. For example, to protect patient privacy, you may want to replace a name with another name, or a date with a modified date.

This feature can be combined with the Deidentification transformer from Spark NLP for Healthcare to create a 'rendering aware' obfuscation pipeline, capable of rendering obfuscated values back into the source locations where the original entities were present. The replacement must be 'rendering aware' because not every instance of an entity requires the same space on the page to be rendered: 'Bob Smith' would be a good replacement for 'Rod Adams', but not for 'Alessandro Rocatagliata', simply because they render differently on the page. Let's take a look at a quick example.

</div><div class="h3-box" markdown="1">
![image](/assets/images/ocr/obfuscation_impainting.png)

To the left, we see a portion of a document to which we want to apply obfuscation, focusing on the entities that represent PHI, such as the patient name or phone number. On the right, after applying the transformation, we have an image containing fake values: the PHI in the source document has been replaced with entities that are not only of a similar category, but also of a similar length.
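
As a rough sketch, such a pipeline can be wired together as follows. The stages shown are the usual Visual NLP building blocks, but the obfuscation-specific setters on ImageDrawRegions are not spelled out in these notes, so treat this as an outline under those assumptions rather than the definitive API:

```
from sparkocr.transformers import BinaryToImage, ImageToText, ImageDrawRegions

# Decode raw files into images, then OCR them to obtain text with coordinates.
binary_to_image = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image")

ocr = ImageToText() \
    .setInputCol("image") \
    .setOutputCol("text")

# ... Spark NLP for Healthcare stages (NER + Deidentification) would go here,
# producing the obfuscated replacement values and the regions where the
# original entities were detected ...

# Draw over the original PHI regions; the new release adds options to render
# the obfuscated values back into these regions (those setters are not shown
# in these notes, so they are omitted here).
draw_regions = ImageDrawRegions() \
    .setInputCol("image") \
    .setInputRegionsCol("regions") \
    .setOutputCol("image_obfuscated") \
    .setFilledRect(True)
```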

## New Obfuscation Features in DicomMetadataDeidentifier
Now you can customize the way metadata is de-identified in DicomMetadataDeidentifier. Customization happens through a number of different actions that can be applied to each tag, for example, replacing a specific tag with a literal, or shifting a date by a random number of days.
To feed the configuration for these actions, you pass a CSV file to DicomMetadataDeidentifier, like this:


```
DicomMetadataDeidentifier() \
    .setStrategyFile(path_to_your_csv_file)
```


The CSV you need to provide looks like this:
```
Tags,VR,Name,Status,Action,Options
"(0002,0100)",UI,Private Information Creator UID,,hashId
"(0002,0102)",OB,Private Information,,hashId
```

For example, the first row refers to a UID tag: we are asking to hash the UID and replace the value in the output Dicom file with the new hash value.
Here is a more exhaustive list of the actions, datatypes, and parameters you can use:

</div><div class="h3-box" markdown="1">

Key | Datatypes | Parameter examples
-- | -- | --
remove | DA, OB, SH, PN, LT, DT, UI, AS, LO, CS, ST, SQ |
replaceWithLiteral | CS, PN | Susanita Smith, Chest
hashId | OB, SH, UI, LO, CS, SQ | coherent
shiftDateByRandomNbOfDays | DA, LO, AS, DT | coherent
ShiftTimeByRandomNbOfSecs | DT | coherent
replaceWithRandomName | PN, LO | coherent
shiftDateByFixedNbOfDays | DA | 112
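
For instance, a strategy file combining several of these actions could look like the snippet below. The rows are illustrative only: the tags are standard Dicom attributes, not taken from a shipped configuration.

```
Tags,VR,Name,Status,Action,Options
"(0010,0010)",PN,Patient's Name,,replaceWithRandomName,coherent
"(0008,0020)",DA,Study Date,,shiftDateByFixedNbOfDays,112
"(0010,1000)",LO,Other Patient IDs,,remove,
```

Here the patient name is swapped for a coherent random name, the study date is shifted by a fixed number of days, and the other-patient-IDs tag is removed entirely.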


</div><div class="h3-box" markdown="1">
## New Dicom Pretrained Pipelines
We are releasing three new Dicom pretrained pipelines:
* `dicom_deid_generic_augmented_minimal`: removes only PHI from images, plus a minimal set of metadata tags.
* `dicom_deid_full_anonymization`: removes all text from images (not only PHI) and most metadata tags. This is the most aggressive de-identification pipeline.
* `dicom_deid_generic_augmented_pseudonym`: tries to remove PHI from images and obfuscates most metadata tags.

Check the notebook [here](https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/Dicom/SparkOcrDicomPretrainedPipelines.ipynb) for examples on how to use these pipelines.
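
As a minimal sketch, these pipelines should load like any other pretrained pipeline from the `clinical/ocr` bucket used elsewhere in these notes; the exact call below is an assumption, so refer to the linked notebook for the canonical usage:

```
from sparknlp.pretrained import PretrainedPipeline

# Hypothetical loading call for one of the new pipelines.
pipeline = PretrainedPipeline("dicom_deid_generic_augmented_minimal", "en", "clinical/ocr")

# dicom_df is assumed to be a DataFrame of Dicom files read as binary content.
result = pipeline.transform(dicom_df)
```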

## New Visual Document Processor
The new VisualDocumentProcessor produces OCR text and tables in a single pass!
It plugs and plays into any Visual NLP pipeline: it receives images and returns texts and tables following the same schemas that already exist for these datatypes:
```
proc = VisualDocumentProcessor() \
.setInputCol("image") \
.setOutputCols(["text", "tables"]) \
.setFreeTextOnly(True) \
.setOcrEngine(VisualDocumentProcessorOcrEngines.EASYOCR)
result = proc.transform(df)
```

Check this [sample notebook](https://github.com/JohnSnowLabs/visual-nlp-workshop/blob/master/jupyter/SparkOcrVisualDocumentProcessor.ipynb) for an example on how to use it.
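
To run the snippet above end to end, the input DataFrame needs an `image` column. A typical way to produce one, assuming the standard BinaryToImage transformer, is:

```
from sparkocr.transformers import BinaryToImage

# Read image files as binary content and decode them into an 'image' column.
df = spark.read.format("binaryFile").load("path/to/images/*.png")
images = BinaryToImage() \
    .setInputCol("content") \
    .setOutputCol("image") \
    .transform(df)

result = proc.transform(images)
result.select("text").show(truncate=False)
```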

## Other Dicom Changes
* DicomDrawRegions now supports setting the compression quality: you can pick a different quality for each of the supported compression algorithms. The API receives an array in which each element specifies the compression type and value as a key/value pair, for example:
```
DicomDrawRegions()\
.setCompressionQuality(["8Bit=90","LSNearLossless=2"])
```

### Enhancements & Bug Fixes
* New parameter in SVS tool that specifies whether to rename output file or not,
```
from sparkocr.utils.svs.deidentifier import remove_phi
remove_phi(input_path, output_path, rename=True)
```
* Improved memory management in ImageTextDetectorCraft.
* Fixed a memory leak in ImageToTextV2.
* Fixed a bug in VisualDocumentNerLilt that occurred when saving the model after fine-tuning.


This release is compatible with Spark NLP 5.5.2 and Spark NLP for Healthcare 5.5.2.

</div><div class="h3-box" markdown="1">

## Previous versions

</div>

{%- include docs-sparckocr-pagination.html -%}
123 changes: 123 additions & 0 deletions docs/en/spark_ocr_versions/release_notes_5_5_0.md