
Commit 4311508

Merge pull request #28 from DataFog/feature/v3.2.1

v3.2.1

2 parents ffe5f8d + cf36dd6

778 files changed: +274838 −107 lines


.env (+4)

@@ -0,0 +1,4 @@
+APPLICATIONINSIGHTS_CONNECTION_STRING="InstrumentationKey=00bea047-1836-46fa-9652-26d43d63a3fa;IngestionEndpoint=https://eastus-8.in.applicationinsights.azure.com/;LiveEndpoint=https://eastus.livediagnostics.monitor.azure.com/;ApplicationId=959cc365-c112-491b-af69-b196d0943ca4"
+
+
+# note this is an Azure specific implementation of the OpenTelemetry distro. for more information please see https://github.com/Azure/azure-sdk-for-python/tree/main/sdk/monitor/azure-monitor-opentelemetry
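This connection string is read at import time by the telemetry bootstrap added to `datafog/main.py` later in this commit. A minimal sketch of that consumption path, assuming python-dotenv and azure-monitor-opentelemetry are installed:

```
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from dotenv import load_dotenv
from opentelemetry import trace

load_dotenv()  # reads APPLICATIONINSIGHTS_CONNECTION_STRING from .env
connection_string = os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")
if connection_string:
    # route OpenTelemetry traces to Azure Application Insights
    configure_azure_monitor(connection_string=connection_string)
tracer = trace.get_tracer(__name__)
```

Committing a live instrumentation key is risky; note that this same commit also adds `.env` to `.gitignore`.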

.github/workflows/dev-cicd-tests.yml (+8 −2)

@@ -21,7 +21,7 @@ jobs:
     runs-on: ubuntu-latest
     strategy:
       matrix:
-        python-version: ["3.10", "3.11"]
+        python-version: ["3.10"]
     steps:
       - uses: actions/setup-python@v4
         with:
@@ -33,10 +33,16 @@ jobs:
           tox -- --cov datafog --cov-report xml --cov-report term
       - name: Submit to codecov
         uses: codecov/codecov-action@v3
-        if: ${{ matrix.python-version == '3.11' }}
+        if: ${{ matrix.python-version == '3.10' }}

       - name: Upload coverage reports to Codecov
         uses: codecov/[email protected]
         env:
           token: ${{ secrets.CODECOV_TOKEN }}
           slug: DataFog/datafog-python
+
+      - name: Run script
+        env:
+          APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.APPLICATIONINSIGHTS_CONNECTION_STRING }}
+        run: |
+          python datafog/telemetry/open_telemetry.py
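The new "Run script" step exercises the telemetry bootstrap in CI, with the connection string injected from repository secrets rather than the committed .env. The script itself, datafog/telemetry/open_telemetry.py, is not shown in this excerpt; a hypothetical sketch of what such a smoke test might look like (file contents, tracer name, and span name are all assumptions):

```
# Hypothetical smoke-test script; the real datafog/telemetry/open_telemetry.py
# is not shown in this diff excerpt.
import os

from azure.monitor.opentelemetry import configure_azure_monitor
from opentelemetry import trace

configure_azure_monitor(
    connection_string=os.environ["APPLICATIONINSIGHTS_CONNECTION_STRING"]
)
tracer = trace.get_tracer("datafog.ci")  # assumed tracer name
with tracer.start_as_current_span("ci-smoke-test"):  # assumed span name
    print("telemetry export path reachable")
```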

.github/workflows/feature-ci-cd.yml (+48)

@@ -0,0 +1,48 @@
+name: feature-cicd-tests
+
+on:
+  push:
+    branches:
+      - feature/*
+  pull_request:
+    branches:
+      - feature/*
+
+jobs:
+  lint:
+    runs-on: ubuntu-latest
+    steps:
+      - name: Check out repo
+        uses: actions/checkout@v3
+      - name: Run pre-commit
+        uses: pre-commit/[email protected]
+
+  build:
+    runs-on: ubuntu-latest
+    strategy:
+      matrix:
+        python-version: ["3.10"]
+    steps:
+      - uses: actions/setup-python@v4
+        with:
+          python-version: ${{ matrix.python-version }}
+      - uses: actions/checkout@v3
+      - name: Test with tox
+        run: |
+          pip install tox
+          tox -- --cov datafog --cov-report xml --cov-report term
+      - name: Submit to codecov
+        uses: codecov/codecov-action@v3
+        if: ${{ matrix.python-version == '3.10' }}
+
+      - name: Upload coverage reports to Codecov
+        uses: codecov/[email protected]
+        env:
+          token: ${{ secrets.CODECOV_TOKEN }}
+          slug: DataFog/datafog-python
+
+      - name: Run script
+        env:
+          APPLICATIONINSIGHTS_CONNECTION_STRING: ${{ secrets.APPLICATIONINSIGHTS_CONNECTION_STRING }}
+        run: |
+          python datafog/telemetry/open_telemetry.py

.gitignore (+2)

@@ -13,6 +13,7 @@ build/
 /src/datafog/pii_tools/__pycache__/
 /tests/__pycache__/
 /tests/scratch.py
+/tests/.datafog_env/
 node_modules/
 datafog_debug.log
 sotu_2023.txt
@@ -23,4 +24,5 @@ datafog-python/datafog/processing/text_processing/__pycache__/
 datafog-python/datafog/services/__pycache__/
 datafog-python/datafog/processing/__pycache__/
 datafog-python/datafog/__pycache__/
+.env

README.md (+4 −5)

@@ -39,7 +39,6 @@ DataFog can be installed via pip:
 pip install datafog
 ```

-
 ## Getting Started

 The DataFog library provides functionality for text and image processing, including PII (Personally Identifiable Information) annotation and OCR (Optical Character Recognition) capabilities.
@@ -54,8 +53,7 @@ pip install datafog

 ### Usage

-The [Getting Started notebook](/datafog-python/examples/getting_started.ipynb) features a standalone Colab notebook.
-
+The [Getting Started notebook](/datafog-python/examples/getting_started.ipynb) features a standalone Colab notebook.

 #### Text PII Annotation

@@ -75,7 +73,9 @@ with open(os.path.join(folder_path, text_files[0]), 'r') as file:

 display(Markdown(clinical_note))
 ```
+
 which looks like this:
+
 ```

 **Date:** April 10, 2024
@@ -124,7 +124,6 @@ loop = asyncio.get_event_loop()
 results = loop.run_until_complete(run_text_pipeline_demo())
 ```

-
 Note: The DataFog library uses asynchronous programming, so make sure to use the `async`/`await` syntax when calling the appropriate methods.

 #### OCR PII Annotation
@@ -146,7 +145,7 @@ loop.run_until_complete(run_ocr_pipeline_demo())

 ```

-You'll notice that we use async functions liberally throughout the SDK - given the nature of the functions we're providing and the extension of DataFog into API/other formats, this allows the functions to be more easily adapted for those uses.
+You'll notice that we use async functions liberally throughout the SDK - given the nature of the functions we're providing and the extension of DataFog into API/other formats, this allows the functions to be more easily adapted for those uses.

 ## Contributing

datafog/__about__.py (+1 −1)

@@ -1 +1 @@
-__version__ = "3.2.0"
+__version__ = "3.2.1"

datafog/__init__.py (+4 −3)

@@ -1,3 +1,4 @@
+from .__about__ import __version__
 from .config import OperationType
 from .main import DataFog, OCRPIIAnnotator, TextPIIAnnotator
 from .processing.image_processing.donut_processor import DonutProcessor
@@ -7,8 +8,7 @@
 from .services.image_service import ImageService
 from .services.spark_service import SparkService
 from .services.text_service import TextService
-
-# from .__about__ import __version__
+from .telemetry import Telemetry

 __all__ = [
     "DonutProcessor",
@@ -22,5 +22,6 @@
     "SpacyPIIAnnotator",
     "ImageDownloader",
     "PytesseractProcessor",
-    # "__version__",
+    "__version__",
+    "Telemetry",
 ]
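With the `__about__` import and the exports un-commented, the package version and the new `Telemetry` class are importable from the package root. A quick check, assuming datafog 3.2.1 is installed:

```
from datafog import Telemetry, __version__

assert __version__ == "3.2.1"
print(Telemetry)  # re-exported from datafog.telemetry
```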

datafog/config.py (−2)

@@ -1,7 +1,5 @@
 from enum import Enum

-from pydantic import BaseModel
-

 class OperationType(str, Enum):
     ANNOTATE_PII = "annotate_pii"
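The dropped `pydantic` import was unused, so nothing else changes. Since `OperationType` subclasses `str`, its members are strings and compare equal to their raw values, which keeps membership checks like `OperationType.ANNOTATE_PII in self.operations` working:

```
from datafog import OperationType

# str-backed Enum: members are strings and compare equal to their values
assert OperationType.ANNOTATE_PII == "annotate_pii"
assert isinstance(OperationType.ANNOTATE_PII, str)
```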

datafog/main.py (+69 −13)

@@ -13,7 +13,34 @@
 from .services.image_service import ImageService
 from .services.spark_service import SparkService
 from .services.text_service import TextService
-
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+import os
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from azure.monitor.opentelemetry.exporter import AzureMonitorTraceExporter
+from azure.monitor.opentelemetry import configure_azure_monitor
+import platform
+from opentelemetry.trace import Status, StatusCode
+
+# Use environment variable if available, otherwise fall back to hardcoded value
+from opentelemetry import trace
+from opentelemetry.sdk.trace import TracerProvider
+from opentelemetry.sdk.trace.export import BatchSpanProcessor
+from logging import INFO, getLogger
+from dotenv import load_dotenv
+import logging
+
+load_dotenv()  # Load environment variables from .env file
+APPLICATIONINSIGHTS_CONNECTION_STRING = os.getenv("APPLICATIONINSIGHTS_CONNECTION_STRING")
+configure_azure_monitor(connection_string=APPLICATIONINSIGHTS_CONNECTION_STRING)
+trace.set_tracer_provider(TracerProvider())
+exporter = AzureMonitorTraceExporter(connection_string=APPLICATIONINSIGHTS_CONNECTION_STRING)
+trace.get_tracer_provider().add_span_processor(BatchSpanProcessor(exporter))
+logger = logging.getLogger("datafog_logger")
+logger.setLevel(INFO)


 class DataFog:
     def __init__(
@@ -27,23 +54,52 @@ def __init__(
         self.text_service = text_service
         self.spark_service: SparkService = spark_service
         self.operations: List[OperationType] = operations
+        self.logger = logging.getLogger(__name__)
+        self.logger.info("Initializing DataFog class with the following services and operations:")
+        self.logger.info(f"Image Service: {type(image_service)}")
+        self.logger.info(f"Text Service: {type(text_service)}")
+        self.logger.info(f"Spark Service: {type(spark_service) if spark_service else 'None'}")
+        self.logger.info(f"Operations: {operations}")
+        self.tracer = trace.get_tracer(__name__)

     async def run_ocr_pipeline(self, image_urls: List[str]):
         """Run the OCR pipeline asynchronously."""
-        extracted_text = await self.image_service.ocr_extract(image_urls)
-        if OperationType.ANNOTATE_PII in self.operations:
-            annotated_text = await self.text_service.batch_annotate_texts(
-                extracted_text
-            )
-            return annotated_text
-        return extracted_text
-
+        with self.tracer.start_as_current_span("run_ocr_pipeline") as span:
+            try:
+                extracted_text = await self.image_service.ocr_extract(image_urls)
+                self.logger.info(f"OCR extraction completed for {len(image_urls)} images.")
+                self.logger.debug(f"Total length of extracted text: {sum(len(text) for text in extracted_text)}")
+
+                if OperationType.ANNOTATE_PII in self.operations:
+                    annotated_text = await self.text_service.batch_annotate_texts(extracted_text)
+                    self.logger.info(f"Text annotation completed with {len(annotated_text)} annotations.")
+                    return annotated_text
+
+                return extracted_text
+            except Exception as e:
+                self.logger.error(f"Error in run_ocr_pipeline: {str(e)}")
+                span.set_status(Status(StatusCode.ERROR, str(e)))
+                raise
     async def run_text_pipeline(self, texts: List[str]):
         """Run the text pipeline asynchronously."""
-        if OperationType.ANNOTATE_PII in self.operations:
-            annotated_text = await self.text_service.batch_annotate_texts(texts)
-            return annotated_text
-        return texts
+        with self.tracer.start_as_current_span("run_text_pipeline") as span:
+            try:
+                self.logger.info(f"Starting text pipeline with {len(texts)} texts.")
+                if OperationType.ANNOTATE_PII in self.operations:
+                    annotated_text = await self.text_service.batch_annotate_texts(texts)
+                    self.logger.info(f"Text annotation completed with {len(annotated_text)} annotations.")
+                    return annotated_text
+
+                self.logger.info("No annotation operation found; returning original texts.")
+                return texts
+            except Exception as e:
+                self.logger.error(f"Error in run_text_pipeline: {str(e)}")
+                span.set_status(Status(StatusCode.ERROR, str(e)))
+                raise
+    def _add_attributes(self, span, attributes: dict):
+        """Add multiple attributes to a span."""
+        for key, value in attributes.items():
+            span.set_attribute(key, value)


 class OCRPIIAnnotator:
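Both pipeline methods are now instrumented: each call opens an OpenTelemetry span, logs progress, and records errors on the span before re-raising. (As committed, the import block repeats the opentelemetry imports three times, and the Azure Monitor exporter is configured at module import, so importing `datafog.main` without `APPLICATIONINSIGHTS_CONNECTION_STRING` set may fail at import time; both are worth cleaning up.) A minimal usage sketch, assuming `DataFog()` can be constructed with its default services and that a connection string is available in the environment:

```
import asyncio

from datafog import DataFog

async def main():
    # __init__ logs the configured services; each pipeline call opens a span
    df = DataFog()
    results = await df.run_text_pipeline(
        ["John Doe visited Berlin on April 10, 2024."]
    )
    print(results)

asyncio.run(main())
```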

datafog/processing/__init__.py (+3 −1)

@@ -1,5 +1,7 @@
 from .image_processing.donut_processor import DonutProcessor
 from .image_processing.image_downloader import ImageDownloader
 from .image_processing.pytesseract_processor import PytesseractProcessor
-from .spark_processing.pyspark_udfs import broadcast_pii_annotator_udf, pii_annotator
+
+# from .spark_processing.pyspark_udfs import broadcast_pii_annotator_udf, pii_annotator
+from .spark_processing import get_pyspark_udfs
 from .text_processing.spacy_pii_annotator import SpacyPIIAnnotator
datafog/processing/spark_processing/__init__.py (+7 −1)

@@ -1 +1,7 @@
-from .pyspark_udfs import broadcast_pii_annotator_udf, pii_annotator
+# from .pyspark_udfs import broadcast_pii_annotator_udf, pii_annotator
+
+
+def get_pyspark_udfs():
+    from .pyspark_udfs import broadcast_pii_annotator_udf, pii_annotator
+
+    return broadcast_pii_annotator_udf, pii_annotator
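Replacing the eager re-export with `get_pyspark_udfs()` defers the pyspark import until the UDFs are actually requested, so `import datafog.processing` no longer requires pyspark to be installed. Usage sketch:

```
from datafog.processing.spark_processing import get_pyspark_udfs

# pyspark is resolved here, not at package import time
broadcast_pii_annotator_udf, pii_annotator = get_pyspark_udfs()
```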

datafog/processing/spark_processing/pyspark_udfs.py (+24 −4)

@@ -1,8 +1,9 @@
+import importlib
+import subprocess
+import sys
+
 import requests
 import spacy
-from pyspark.sql import SparkSession
-from pyspark.sql.functions import udf
-from pyspark.sql.types import ArrayType, StringType, StructField, StructType

 PII_ANNOTATION_LABELS = ["DATE_TIME", "LOC", "NRP", "ORG", "PER"]
 MAXIMAL_STRING_SIZE = 1000000
@@ -14,6 +15,11 @@ def pii_annotator(text: str, broadcasted_nlp) -> list[list[str]]:
     Returns:
         list[list[str]]: Values as arrays in order defined in the PII_ANNOTATION_LABELS.
     """
+    ensure_installed("pyspark")
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import udf
+    from pyspark.sql.types import ArrayType, StringType, StructField, StructType
+
     if text:
         if len(text) > MAXIMAL_STRING_SIZE:
             # Cut the strings for required sizes
@@ -35,13 +41,27 @@ def pii_annotator(text: str, broadcasted_nlp) -> list[list[str]]:


 def broadcast_pii_annotator_udf(
-    spark_session: SparkSession, spacy_model: str = "en_spacy_pii_fast"
+    spark_session=None, spacy_model: str = "en_spacy_pii_fast"
 ):
     """Broadcast PII annotator across Spark cluster and create UDF"""
+    ensure_installed("pyspark")
+    from pyspark.sql import SparkSession
+    from pyspark.sql.functions import udf
+    from pyspark.sql.types import ArrayType, StringType, StructField, StructType
+
+    if not spark_session:
+        spark_session = SparkSession.builder.getOrCreate()
     broadcasted_nlp = spark_session.sparkContext.broadcast(spacy.load(spacy_model))

     pii_annotation_udf = udf(
         lambda text: pii_annotator(text, broadcasted_nlp),
         ArrayType(ArrayType(StringType())),
     )
     return pii_annotation_udf
+
+
+def ensure_installed(self, package_name):
+    try:
+        importlib.import_module(package_name)
+    except ImportError:
+        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
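One bug to note: `ensure_installed` is a module-level function but is declared with a stray `self` parameter, while both call sites invoke it as `ensure_installed("pyspark")`; the string binds to `self` and `package_name` is left unbound, so the call raises a `TypeError`. A corrected sketch of the same install-on-demand pattern:

```
import importlib
import subprocess
import sys


def ensure_installed(package_name: str) -> None:
    """Import package_name, pip-installing it into the current interpreter if missing."""
    try:
        importlib.import_module(package_name)
    except ImportError:
        subprocess.check_call([sys.executable, "-m", "pip", "install", package_name])
```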
