
Refactor database and add entity search #51


Open · wants to merge 14 commits into base: dev

Conversation

@jedzill4 (Contributor) commented May 25, 2025

Introduce CRUD operations for documents and models, improve data validation and caching, and streamline build commands. Enhance entity resolution and dynamic grouping for the anonymization model.

Summary by Sourcery

Introduce full database-backed persistence and CRUD support for documents, paragraphs, models, and predictions, enhance anonymizer and document extraction endpoints with caching, validation, and model versioning, and streamline build commands.

New Features:

  • Implement CRUD for documents, paragraphs, models, and predictions with database schemas and migration
  • Persist anonymizer predictions and validations in the database with caching and retrieval
  • Enable document extraction endpoint to store and retrieve extracted paragraphs from the database

Enhancements:

  • Introduce model registration dependency and APP_VERSION for consistent versioning
  • Use functools.cache and blake2b hashing for UUID generation and prediction caching (see the sketch after this list)
  • Add ERROR_HANDLER setting to control validation error behavior and improve data validation
  • Update API startup to reflect dynamic versioning
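
A minimal sketch of what the hashing helpers referenced above could look like, assuming the names text_to_uuid and data_to_uuid used elsewhere in this review (the actual implementations live in aymurai/database/utils.py and may differ):

from functools import cache
from hashlib import blake2b
from uuid import UUID


@cache  # safe to memoize: a pure function of its string argument
def text_to_uuid(text: str) -> UUID:
    # a 16-byte blake2b digest supplies exactly the 128 bits a UUID needs
    digest = blake2b(text.encode("utf-8"), digest_size=16).digest()
    return UUID(bytes=digest)


def data_to_uuid(data: bytes) -> UUID:
    # same idea for raw document bytes; left uncached here because large
    # binary payloads would bloat an in-process cache
    return UUID(bytes=blake2b(data, digest_size=16).digest())

Deriving IDs from content hashes makes the caching idempotent: re-submitting the same text or file always maps to the same row.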

Build:

  • Streamline build by adding uv build target in the Makefile

Documentation:

  • Add Jupyter notebook for entity grouping experiments

Solves #45

jedzill4 added 9 commits April 6, 2025 01:16
…dling and validation

Enhances document and model data handling

Adds CRUD operations for documents and models to improve data management.
Improves data validation and caching mechanisms for predictions.
Registers models in the database and updates document extraction logic.
…or documents, paragraphs, and predictions

refactor(api): replace DocLabel definition and update DocumentInformation to allow optional labels

feat(entities): add DocLabel class for document labeling functionality

feat(xml): introduce XML document handling with classes for text fragments and metadata

chore(settings): add new settings for Swagger UI and development mode, enhance environment loading with logging

feat(anonymization): implement core anonymization logic with alignment and XML handling for DOCX files

feat(xml_docx): add functions to unzip, normalize, and replace text in XML documents for anonymization
…function

🔧 fix: Clean up file opening syntax in load_pickle function
… and update aymurai package version in lock file

sourcery-ai bot commented May 25, 2025

Reviewer's Guide

This PR refactors anonymization and document-extract endpoints to leverage persistent storage and caching via new database schemas and CRUD layers, enhances model registration and validation flows, adds a full Alembic migration for the updated schema, and streamlines configuration and build commands.

Sequence Diagram: Anonymizer Prediction with Caching

sequenceDiagram
    actor Client
    participant API as AnonymizerEndpoint
    participant Database
    participant FlairAnonymizerModel

    Client->>API: POST /predict (text_request, use_cache)
    API->>Database: get_model("flair-anonymizer", app_version)
    Database-->>API: Model object
    API->>API: prediction_id = text_to_uuid(text + model.id)

    alt use_cache is true
        API->>Database: read_prediction(text, model.id)
        alt Cached prediction exists
            Database-->>API: Cached PredictionData
            API-->>Client: DocumentInformation (from cache)
        else Cache miss
            API->>FlairAnonymizerModel: predict_single(text)
            FlairAnonymizerModel-->>API: Predicted Labels
            API->>API: Create PredictionCreate object
            API->>Database: Save Prediction (input, labels, model.id)
            Database-->>API: Saved PredictionData
            API-->>Client: DocumentInformation (newly predicted)
        end
    else use_cache is false (predict but don't load or save to cache)
        API->>FlairAnonymizerModel: predict_single(text)
        FlairAnonymizerModel-->>API: Predicted Labels
        API-->>Client: DocumentInformation (newly predicted, not cached)
    end
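
A condensed, hypothetical sketch of the branching above; TextRequest, read_prediction, PredictionCreate, and flair_model stand in for the PR's actual symbols (see aymurai/api/endpoints/routers/anonymizer/anonymizer.py for the real code):

@router.post("/predict", response_model=DocumentInformation)
def predict(
    text_request: TextRequest,
    use_cache: bool = True,
    session: Session = Depends(get_session),
    model: Model = Depends(get_model),
) -> DocumentInformation:
    if use_cache:
        # cache hit: return the stored prediction without running the model
        cached = read_prediction(session, text_request.text, model.id)
        if cached is not None:
            return DocumentInformation(document=text_request.text, labels=cached.prediction)

    labels = flair_model.predict_single(text_request.text)

    if use_cache:
        # cache miss: persist the fresh prediction for the next request
        pred = PredictionCreate(input=text_request.text, prediction=labels, fk_model=model.id)
        session.add(pred.compile())
        session.commit()

    return DocumentInformation(document=text_request.text, labels=labels)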

Sequence Diagram: Retrieving Paragraph Validations

sequenceDiagram
    actor Client
    participant API as AnonymizerValidateEndpoint
    participant Database

    Client->>API: POST /validate (text_request)
    API->>Database: get_model(...)
    Database-->>API: Model object
    API->>Database: read_validation(text, model.name)
    Note right of Database: SELECT Prediction JOIN Model WHERE input_hash=hash(text) AND Model.name=model.name AND Prediction.validation IS NOT NULL
    alt Validation exists
        Database-->>API: Prediction object (with validation)
        API-->>Client: prediction.validation (list of DocLabel)
    else Validation does not exist
        Database-->>API: None
        API-->>Client: None
    end
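
The lookup described in the note above could be expressed with SQLModel roughly as follows (assumed signature; the real function lives in aymurai/database/crud/prediction.py):

from sqlmodel import Session, select


def read_validation(session: Session, text: str, model_name: str) -> Prediction | None:
    input_hash = str(text_to_uuid(text))
    statement = (
        select(Prediction)
        .join(Model)
        .where(Prediction.input_hash == input_hash)
        .where(Model.name == model_name)
        .where(Prediction.validation.is_not(None))  # only rows with a JSON validation payload
    )
    return session.exec(statement).first()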

Sequence Diagram: Document Anonymization and Validation Update

sequenceDiagram
    actor Client
    participant API as AnonymizerEndpoint
    participant Database
    participant DocAnonymizer

    Client->>API: POST /anonymize-document (document_id, annotations)
    API->>Database: get_model(...)
    Database-->>API: Model object
    API->>Database: Get Document by document_id
    Database-->>API: Document data (document.data, document.paragraphs)
    Note over API: Perform Sanity Checks (doc exists, paragraph counts match, etc.)

    loop for each paragraph_annotation in annotations
        API->>Database: read_prediction(paragraph_annotation.document, model.id)
        alt Prediction exists
            Database-->>API: Existing Prediction
            API->>API: Update prediction.validation = paragraph_annotation.labels
        else Prediction does not exist
            API->>API: Create new Prediction (input, fk_model, validation=labels)
            Note over API: Behavior might vary based on settings.ERROR_HANDLER
        end
        API->>Database: Add/Update Prediction in session (session.add(prediction))
    end
    API->>Database: Commit session (session.commit())

    API->>API: Retrieve document binary data from Document object
    API->>API: Create temporary file with document binary data
    API->>DocAnonymizer: anonymize(temp_file_path, annotations)
    DocAnonymizer-->>API: Anonymized document path/data
    API->>API: Create FileResponse (e.g., ODT format)
    API-->>Client: FileResponse (anonymized document)

Sequence Diagram: Document Text Extraction with Caching

sequenceDiagram
    actor Client
    participant API as DocumentExtractEndpoint
    participant Database
    participant TextExtractor as AymurAIPipeline

    Client->>API: POST /document-extract (file, use_cache)
    API->>API: data = file.read()
    API->>API: document_id = data_to_uuid(data)

    alt use_cache is true
        API->>Database: Get Document by document_id
        alt Document exists in DB
            Database-->>API: Cached Document object (with Paragraphs)
            API-->>Client: DocumentPublic (from DB)
        else Document not in DB (cache miss)
            API->>TextExtractor: process_document(file_data_in_temp_file)
            TextExtractor-->>API: Extracted text (doc_text)
            API->>API: Create Paragraph objects from doc_text
            API->>API: Create Document object (id, data, name, paragraphs)
            API->>Database: Save Document and Paragraphs (session.add_all, session.commit)
            Database-->>API: Saved Document object
            API-->>Client: DocumentPublic (newly extracted and saved)
        end
    else use_cache is false (force re-extraction and save/update)
        API->>Database: Get Document by document_id (this load is for potential update, not early return)
        API->>TextExtractor: process_document(file_data_in_temp_file)
        TextExtractor-->>API: Extracted text (doc_text)
        API->>API: Create Paragraph objects from doc_text
        API->>API: Create/Prepare Document object for saving (id, data, name, paragraphs)
        API->>Database: Save/Update Document and Paragraphs (session.add_all, session.commit)
        Database-->>API: Saved/Updated Document object
        API-->>Client: DocumentPublic (re-extracted and saved/updated)
    end

Entity Relationship Diagram for New Database Schema

erDiagram
    Model {
        UUID id PK
        String name
        String version
        DateTime created_at
    }
    Document {
        UUID id PK
        LargeBinary data
        String name
        DateTime created_at
        DateTime updated_at
    }
    Paragraph {
        UUID id PK
        String text
        UUID fk_document FK
        DateTime created_at
        DateTime updated_at
    }
    Prediction {
        UUID id PK
        String input
        String input_hash
        JSON prediction_data "prediction"
        JSON validation_data "validation"
        UUID fk_model FK
        UUID fk_paragraph FK "nullable"
        DateTime created_at
        DateTime updated_at
    }

    Model ||--o{ Prediction : "generates"
    Document ||--o{ Paragraph : "contains"
    Paragraph ||--o{ Prediction : "can be associated with"
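
Translated into SQLModel, the schema above maps onto table classes roughly like this trimmed sketch (created_at/updated_at defaults and the JSON prediction/validation columns are elided; the authoritative definitions are in the PR's database modules):

from uuid import UUID, uuid4

from sqlmodel import Field, SQLModel


class Model(SQLModel, table=True):
    id: UUID = Field(default_factory=uuid4, primary_key=True)
    name: str
    version: str


class Document(SQLModel, table=True):
    id: UUID = Field(primary_key=True)  # derived from data_to_uuid(data)
    data: bytes
    name: str


class Paragraph(SQLModel, table=True):
    id: UUID = Field(default_factory=uuid4, primary_key=True)
    text: str
    fk_document: UUID = Field(foreign_key="document.id")


class Prediction(SQLModel, table=True):
    id: UUID = Field(default_factory=uuid4, primary_key=True)
    input: str
    input_hash: str = Field(index=True)
    fk_model: UUID = Field(foreign_key="model.id")
    fk_paragraph: UUID | None = Field(default=None, foreign_key="paragraph.id")  # nullable, as in the diagram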

Class Diagram for Database Schema and Public Models

classDiagram
    class ModelBase {
        +UUID id
        +String name
        +String version
        +validate_name_version()
    }
    class Model {
        +DateTime created_at
        +List~Prediction~ predictions
    }
    ModelBase <|-- Model
    class ModelCreate {
        +compile() Model
    }
    ModelBase <|-- ModelCreate
    class ModelPublic {
        +UUID id
        +String name
        +String version
        +DateTime created_at
    }

    class Document {
        +UUID id
        +LargeBinary data
        +String name
        +DateTime created_at
        +DateTime updated_at
        +List~Paragraph~ paragraphs
    }
    class DocumentPublic {
        +UUID id
        +String name
        +List~ParagraphPublic~ paragraphs
    }

    class ParagraphBase {
        +UUID id
        +String text
        +validate_text()
        #UUID hash
    }
    class Paragraph {
        +DateTime created_at
        +DateTime updated_at
        +UUID fk_document
        +Document document
        +List~Prediction~ predictions
    }
    ParagraphBase <|-- Paragraph
    class ParagraphPublic {
        +UUID id
        +String text
        +UUID hash
        +DateTime created_at
        +DateTime updated_at
    }

    class PredictionBase {
        +UUID id
        +String input
        +String input_hash
        +List~DocLabel~ prediction
        +List~DocLabel~ validation
        +UUID fk_model
        +UUID fk_paragraph
        +validate_input()
    }
    class Prediction {
        +String input_hash
        +DateTime created_at
        +DateTime updated_at
        +Model model
        +Paragraph paragraph
    }
    PredictionBase <|-- Prediction
    class PredictionCreate {
        +compile() Prediction
    }
    PredictionBase <|-- PredictionCreate
    class PredictionPublic {
        +UUID id
        +DateTime created_at
        +DateTime updated_at
        +String input
        +List~DocLabel~ prediction
        +List~DocLabel~ validation
        +ModelPublic model
    }
    class DocLabel{
        <<DataType>>
        +String text
        +String label
        +int start
        +int end
        +float score
        +dict attrs
    }
    PredictionBase ..> DocLabel : uses
    PredictionPublic ..> DocLabel : uses

    Model "1" -- "*" Prediction : predictions
    Prediction "*" -- "1" Model : model

    Document "1" -- "*" Paragraph : paragraphs
    Paragraph "*" -- "1" Document : document

    Paragraph "1" -- "*" Prediction : predictions
    Prediction "*" -- "0..1" Paragraph : paragraph

File-Level Changes

Refactor anonymizer endpoints with model dependency, caching, and structured DB CRUD
  • Injected the get_model cache dependency and register_model usage
  • Replaced the in-memory paragraph cache with read_prediction lookups and Prediction ORM writes
  • Added sanity checks (input/output text mismatch) and label validation
  • Updated request signatures and response flows to use PredictionCreate and DocumentInformation
  Files: aymurai/api/endpoints/routers/anonymizer/anonymizer.py
Persist documents and paragraphs with CRUD and caching in the document-extract endpoint
  • Compute document_id via the cached data_to_uuid helper and reuse an existing Document on a cache hit
  • Store extracted paragraphs as Paragraph entities with their order and cleaned whitespace
  • Update the response model to DocumentPublic and remove inline JSON parsing
  Files: aymurai/api/endpoints/routers/misc/document_extract.py, aymurai/database/utils.py
Implemented persistent model registry and prediction CRUD layers
  • Added database/meta and crud modules for Model, Document, Prediction, and Paragraph
  • Introduced read_prediction and read_validation functions using text hashing
  • Created the register_model CRUD to auto-register and reuse Model records
  Files: aymurai/database/crud/prediction.py, aymurai/database/crud/document.py, aymurai/database/crud/model.py, aymurai/database/meta
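
The auto-register behaviour could look like the following get-or-create sketch (assumed signature; the real CRUD lives in aymurai/database/crud/model.py):

from sqlmodel import Session, select


def register_model(session: Session, name: str, version: str) -> Model:
    statement = select(Model).where(Model.name == name, Model.version == version)
    if existing := session.exec(statement).first():
        return existing  # reuse the already-registered record

    model = Model(name=name, version=version)
    session.add(model)
    session.commit()
    session.refresh(model)  # populate server-side defaults such as id/created_at
    return model
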
Defined the new database schema and migration script
  • Added an Alembic revision to create tables for document, paragraph, model, prediction, and related join tables
  • Specified primary keys, foreign keys, JSON fields, and constraints
  Files: aymurai/database/versions/6a418ffd84da_create_database.py
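
For orientation, a revision of this shape typically reads like the abbreviated sketch below (only the model table shown; column details follow the ER diagram above, and sa.Uuid assumes SQLAlchemy 2.x):

import sqlalchemy as sa
from alembic import op

revision = "6a418ffd84da"
down_revision = None


def upgrade() -> None:
    op.create_table(
        "model",
        sa.Column("id", sa.Uuid(), primary_key=True),
        sa.Column("name", sa.String(), nullable=False),
        sa.Column("version", sa.String(), nullable=False),
        sa.Column("created_at", sa.DateTime(), nullable=False),
    )
    # document, paragraph, and prediction tables are created analogously


def downgrade() -> None:
    op.drop_table("model")
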
Streamlined build and configuration management
  • Integrated APP_VERSION and ERROR_HANDLER into Settings and replaced the hard-coded version
  • Updated the FastAPI app version and startup logs to use settings.APP_VERSION
  • Simplified the Makefile build step with uv build
  Files: aymurai/settings.py, aymurai/api/main.py, Makefile
Added an entity grouping notebook for dynamic grouping experiments
  • Introduced a Jupyter notebook with experiments on entity clustering and grouping
  • Demonstrates the DB-driven extract-and-predict loop and the clustering logic
  Files: notebooks/experiments/anonymization/05-entity-grouping/entitty-groups.ipynb


sourcery-ai bot left a comment

Hey @jedzill4 - I've reviewed your changes - here's some feedback:

Blocking issues:

  • Using 'file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'file.name' is used. Use '.flush()' or close the file before using 'file.name'. (link)

General comments:

  • Consider using FastAPI's HTTPException instead of raising ValueError for predictable client-facing error responses with proper status codes.
  • The @cache decorator on get_model uses the session parameter (a new object on every request), so it won't actually cache; either remove the session from the cache key or cache only the model id/version (see the sketch after this list).
  • The long Jupyter notebook under notebooks/experiments doesn't belong in the API codebase; move or archive it elsewhere to keep production code lean.
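
One way to address the second point (a sketch, not necessarily the fix the PR will adopt; engine and register_model are assumed names): key the cache on hashable values only and open a short-lived session inside.

from functools import cache


@cache  # cache key is just (name, version); no per-request session involved
def get_model_id(name: str, version: str) -> UUID:
    with Session(engine) as session:  # engine: the app's SQLAlchemy engine
        return register_model(session, name, version).id
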
Here's what I looked at during the review
  • 🟡 General issues: 7 issues found
  • 🔴 Security: 1 blocking issue
  • 🟢 Testing: all looks good
  • 🟢 Documentation: all looks good



cached_prediction = session.get(AnonymizationParagraph, paragraph_id)
input_text = text_request.text
prediction_id = text_to_uuid(f"{input_text}-{model.id}")

issue (bug_risk): Unused prediction_id for cache writes

Assign prediction_id to the new Prediction object, or update the lookup to use input_hash instead, as session.get(Prediction, prediction_id) will not find the object otherwise.

# Save to cache
#################################################################################
logger.info(f"saving in cache: {prediction_id}")
pred = session.get(Prediction, prediction_id)

issue (bug_risk): session.get lookup mismatches model key

session.get will return None because Prediction.id is not set to prediction_id. Use input_hash for the query or set Prediction.id to prediction_id to ensure correct updates.
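
A sketch of the fix both comments point at, assuming the field names from the schema in this guide: assign the deterministic id up front so that session.get(Prediction, prediction_id) round-trips.

input_text = text_request.text
prediction_id = text_to_uuid(f"{input_text}-{model.id}")
pred = session.get(Prediction, prediction_id)
if pred is None:
    # create the row under the deterministic id so later lookups find it
    pred = Prediction(id=prediction_id, input=input_text, fk_model=model.id)
session.add(pred)
session.commit()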

Comment on lines 27 to 33
@router.post("/document-extract", response_model=DocumentPublic)
def plain_text_extractor(
file: UploadFile,
pipeline: AymurAIPipeline = Depends(get_pipeline_doc_extract),
) -> Document:
session: Session = Depends(get_session),
use_cache: bool = True,
) -> DocumentPublic:

suggestion: use_cache should be declared as a Query parameter

Wrap use_cache with Query(...) to make it a query parameter instead of a default body parameter.

Suggested change
-@router.post("/document-extract", response_model=DocumentPublic)
-def plain_text_extractor(
-    file: UploadFile,
-    pipeline: AymurAIPipeline = Depends(get_pipeline_doc_extract),
-    session: Session = Depends(get_session),
-    use_cache: bool = True,
-) -> DocumentPublic:
+from fastapi import Query
+
+@router.post("/document-extract", response_model=DocumentPublic)
+def plain_text_extractor(
+    file: UploadFile,
+    pipeline: AymurAIPipeline = Depends(get_pipeline_doc_extract),
+    session: Session = Depends(get_session),
+    use_cache: bool = Query(True),
+) -> DocumentPublic:


return Document(document=document, document_id=document_id)
return paragraph_text

issue (bug_risk): Returning raw SQLModel instead of DocumentPublic

Convert the returned SQLModel instance to DocumentPublic using from_orm to match the declared response_model.
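
For example (model_validate is the Pydantic v2 spelling; older SQLModel versions use from_orm):

return DocumentPublic.model_validate(document)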

from aymurai.database.schema import Model


def read_validation(

nitpick: Docstring references incorrect parameter name

Update the docstring to reference 'text' and 'model_name' instead of 'input_hash'.

-) as tmp_file:
-    tmp_filename = tmp_file.name
+with tempfile.NamedTemporaryFile(suffix=suffix, delete=False, dir=tmp_dir) as file:
+    tmp_filename = file.name

security (tempfile-without-flush): Using 'file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'file.name' is used. Use '.flush()' or close the file before using 'file.name'.

Source: opengrep
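
The flagged pattern with the fix applied might look like this (data and suffix are assumed to come from the uploaded file):

with tempfile.NamedTemporaryFile(suffix=suffix, delete=False, dir=tmp_dir) as file:
    file.write(data)
    file.flush()  # ensure the bytes are on disk before the path is handed off
    tmp_filename = file.name
# exiting the block closes the file; tmp_filename stays valid because delete=False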

Comment on lines 190 to 193
if not pred:
    return None

-return paragraph.validation
+return pred.validation

suggestion (code-quality): We've found these issues:

Suggested change
-if not pred:
-    return None
-
-return pred.validation
+return None if not pred else pred.validation


# MARK: Document Compilation
@router.post("/anonymize-document")
async def anonymizer_compile_document(

issue (code-quality): We've found these issues:


Explanation

The quality score for this function is below the quality threshold of 25%.
This score is a combination of the method length, cognitive complexity and working memory.

How can you solve this?

It might be worth refactoring this function to make it shorter and more readable.

  • Reduce the function length by extracting pieces of functionality out into
    their own functions. This is the most important thing you can do - ideally a
    function should be less than 10 lines.
  • Reduce nesting, perhaps by introducing guard clauses to return early.
  • Ensure that variables are tightly scoped, so that code using related concepts
    sits together within the function rather than being scattered.

Comment on lines +23 to +24
exists = session.get(Document, id)
if exists:

suggestion (code-quality): Use named expression to simplify assignment and conditional (use-named-expression)

Suggested change
-exists = session.get(Document, id)
-if exists:
+if exists := session.get(Document, id):

Comment on lines +40 to +41
data = session.exec(statement).first()
return data

suggestion (code-quality): Inline variable that is immediately returned (inline-immediately-returned-variable)

Suggested change
-data = session.exec(statement).first()
-return data
+return session.exec(statement).first()

jedzill4 added 5 commits May 25, 2025 05:52
- Added concurrent processing for text extraction to enhance performance.
- Introduced a timeout mechanism to handle long-running extraction tasks.
- Removed unnecessary threading lock for cleaner code.
- Removed `libmagic1` from the Dockerfile to reduce image size.
- Cleaned up environment variable configurations in docker-compose files.
- Updated build process in GitHub Actions for better tag handling.
- Corrected the file path for the Dockerfile in the build-and-push steps.
- Ensures the workflow uses the correct Dockerfile location for building images.
Labels: none yet · Projects: none yet · 1 participant