Refactor database and added entity search #51
base: dev
Conversation
…dling and validation
Enhances document and model data handling. Adds CRUD operations for documents and models to improve data management. Improves data validation and caching mechanisms for predictions. Registers models in the database and updates document extraction logic.

…aragraph and Prediction models

…or documents, paragraphs, and predictions
refactor(api): replace DocLabel definition and update DocumentInformation to allow optional labels
feat(entities): add DocLabel class for document labeling functionality
feat(xml): introduce XML document handling with classes for text fragments and metadata
chore(settings): add new settings for Swagger UI and development mode, enhance environment loading with logging
feat(anonymization): implement core anonymization logic with alignment and XML handling for DOCX files
feat(xml_docx): add functions to unzip, normalize, and replace text in XML documents for anonymization

…function
🔧 fix: Clean up file opening syntax in load_pickle function

… and update aymurai package version in lock file
Reviewer's Guide

This PR refactors the anonymization and document-extract endpoints to leverage persistent storage and caching via new database schemas and CRUD layers, enhances model registration and validation flows, adds a full Alembic migration for the updated schema, and streamlines configuration and build commands.

Sequence Diagram: Anonymizer Prediction with Caching

```mermaid
sequenceDiagram
    actor Client
    participant API as AnonymizerEndpoint
    participant Database
    participant FlairAnonymizerModel
    Client->>API: POST /predict (text_request, use_cache)
    API->>Database: get_model("flair-anonymizer", app_version)
    Database-->>API: Model object
    API->>API: prediction_id = text_to_uuid(text + model.id)
    alt use_cache is true
        API->>Database: read_prediction(text, model.id)
        alt Cached prediction exists
            Database-->>API: Cached PredictionData
            API-->>Client: DocumentInformation (from cache)
        else Cache miss
            API->>FlairAnonymizerModel: predict_single(text)
            FlairAnonymizerModel-->>API: Predicted Labels
            API->>API: Create PredictionCreate object
            API->>Database: Save Prediction (input, labels, model.id)
            Database-->>API: Saved PredictionData
            API-->>Client: DocumentInformation (newly predicted)
        end
    else use_cache is false (predict but don't load or save to cache)
        API->>FlairAnonymizerModel: predict_single(text)
        FlairAnonymizerModel-->>API: Predicted Labels
        API-->>Client: DocumentInformation (newly predicted, not cached)
    end
```
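To make the control flow concrete, here is a minimal sketch of the cached predict path exactly as the diagram describes it. Names such as `TextRequest`, `read_prediction`, `flair_model`, and the `DocumentInformation` fields are assumptions drawn from the diagram, not the PR's actual code:

```python
# Sketch only: mirrors the sequence diagram above; helper names are assumed.
@router.post("/predict", response_model=DocumentInformation)
def predict(
    text_request: TextRequest,
    use_cache: bool = True,
    session: Session = Depends(get_session),
) -> DocumentInformation:
    model = get_model(session, name="flair-anonymizer", version=settings.APP_VERSION)
    # Deterministic cache key derived from the input text and the model id.
    prediction_id = text_to_uuid(f"{text_request.text}-{model.id}")

    if use_cache:
        cached = read_prediction(session, text_request.text, model.id)
        if cached is not None:
            return DocumentInformation(document=text_request.text, labels=cached.prediction)

    labels = flair_model.predict_single(text_request.text)

    if use_cache:
        # Persist the fresh prediction so the next identical request is a cache hit.
        prediction = PredictionCreate(
            id=prediction_id,
            input=text_request.text,
            prediction=labels,
            fk_model=model.id,
        ).compile()
        session.add(prediction)
        session.commit()

    return DocumentInformation(document=text_request.text, labels=labels)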
Sequence Diagram: Retrieving Paragraph Validations

```mermaid
sequenceDiagram
    actor Client
    participant API as AnonymizerValidateEndpoint
    participant Database
    Client->>API: POST /validate (text_request)
    API->>Database: get_model(...)
    Database-->>API: Model object
    API->>Database: read_validation(text, model.name)
    Note right of Database: SELECT Prediction JOIN Model WHERE input_hash=hash(text) AND Model.name=model.name AND Prediction.validation IS NOT NULL
    alt Validation exists
        Database-->>API: Prediction object (with validation)
        API-->>Client: prediction.validation (list of DocLabel)
    else Validation does not exist
        Database-->>API: None
        API-->>Client: None
    end
```
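A rough SQLModel rendering of the lookup in the note above; the hashing helper `text_to_hash` and the exact column attributes are assumptions:

```python
# Sketch of the validation lookup; follows the SELECT described in the diagram note.
from sqlmodel import Session, select

def read_validation(session: Session, text: str, model_name: str) -> list[DocLabel] | None:
    statement = (
        select(Prediction)
        .join(Model, Prediction.fk_model == Model.id)
        .where(
            Prediction.input_hash == text_to_hash(text),  # assumed hashing helper
            Model.name == model_name,
            Prediction.validation.is_not(None),  # only rows with a human validation
        )
    )
    pred = session.exec(statement).first()
    return None if not pred else pred.validation
```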
Sequence Diagram: Document Anonymization and Validation Update

```mermaid
sequenceDiagram
    actor Client
    participant API as AnonymizerEndpoint
    participant Database
    participant DocAnonymizer
    Client->>API: POST /anonymize-document (document_id, annotations)
    API->>Database: get_model(...)
    Database-->>API: Model object
    API->>Database: Get Document by document_id
    Database-->>API: Document data (document.data, document.paragraphs)
    Note over API: Perform sanity checks (doc exists, paragraph counts match, etc.)
    loop for each paragraph_annotation in annotations
        API->>Database: read_prediction(paragraph_annotation.document, model.id)
        alt Prediction exists
            Database-->>API: Existing Prediction
            API->>API: Update prediction.validation = paragraph_annotation.labels
        else Prediction does not exist
            API->>API: Create new Prediction (input, fk_model, validation=labels)
            Note over API: Behavior might vary based on settings.ERROR_HANDLER
        end
        API->>Database: Add/Update Prediction in session (session.add(prediction))
    end
    API->>Database: Commit session (session.commit())
    API->>API: Retrieve document binary data from Document object
    API->>API: Create temporary file with document binary data
    API->>DocAnonymizer: anonymize(temp_file_path, annotations)
    DocAnonymizer-->>API: Anonymized document path/data
    API->>API: Create FileResponse (e.g., ODT format)
    API-->>Client: FileResponse (anonymized document)
```
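In code, the per-paragraph update loop might look roughly like this (field names follow the diagrams; the annotation shape is an assumption):

```python
# Sketch of the validation-update loop from the diagram above.
for annotation in annotations:
    pred = read_prediction(session, annotation.document, model.id)
    if pred is not None:
        # Existing prediction: attach the human validation.
        pred.validation = annotation.labels
    else:
        # No prior prediction: create one carrying only the validation.
        pred = Prediction(
            input=annotation.document,
            fk_model=model.id,
            validation=annotation.labels,
        )
    session.add(pred)
session.commit()
```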
Sequence Diagram: Document Text Extraction with Caching

```mermaid
sequenceDiagram
    actor Client
    participant API as DocumentExtractEndpoint
    participant Database
    participant TextExtractor as AymurAIPipeline
    Client->>API: POST /document-extract (file, use_cache)
    API->>API: data = file.read()
    API->>API: document_id = data_to_uuid(data)
    alt use_cache is true
        API->>Database: Get Document by document_id
        alt Document exists in DB
            Database-->>API: Cached Document object (with Paragraphs)
            API-->>Client: DocumentPublic (from DB)
        else Document not in DB (cache miss)
            API->>TextExtractor: process_document(file_data_in_temp_file)
            TextExtractor-->>API: Extracted text (doc_text)
            API->>API: Create Paragraph objects from doc_text
            API->>API: Create Document object (id, data, name, paragraphs)
            API->>Database: Save Document and Paragraphs (session.add_all, session.commit)
            Database-->>API: Saved Document object
            API-->>Client: DocumentPublic (newly extracted and saved)
        end
    else use_cache is false (force re-extraction and save/update)
        API->>Database: Get Document by document_id (loaded for a potential update, not an early return)
        API->>TextExtractor: process_document(file_data_in_temp_file)
        TextExtractor-->>API: Extracted text (doc_text)
        API->>API: Create Paragraph objects from doc_text
        API->>API: Create/Prepare Document object for saving (id, data, name, paragraphs)
        API->>Database: Save/Update Document and Paragraphs (session.add_all, session.commit)
        Database-->>API: Saved/Updated Document object
        API-->>Client: DocumentPublic (re-extracted and saved/updated)
    end
```
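The cache key here is content-addressed: the document id is derived from the uploaded bytes, so re-uploading the same file hits the cache. A hypothetical `data_to_uuid` could be as simple as:

```python
# Illustrative stand-in for data_to_uuid; the PR's actual implementation may differ.
import hashlib
from uuid import UUID

def data_to_uuid(data: bytes) -> UUID:
    """Derive a stable UUID from raw file contents."""
    digest = hashlib.sha256(data).digest()
    return UUID(bytes=digest[:16])  # fold the hash into UUID-sized bytes
```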
Entity Relationship Diagram for New Database Schema

```mermaid
erDiagram
    Model {
        UUID id PK
        String name
        String version
        DateTime created_at
    }
    Document {
        UUID id PK
        LargeBinary data
        String name
        DateTime created_at
        DateTime updated_at
    }
    Paragraph {
        UUID id PK
        String text
        UUID fk_document FK
        DateTime created_at
        DateTime updated_at
    }
    Prediction {
        UUID id PK
        String input
        String input_hash
        JSON prediction_data "prediction"
        JSON validation_data "validation"
        UUID fk_model FK
        UUID fk_paragraph FK "nullable"
        DateTime created_at
        DateTime updated_at
    }
    Model ||--o{ Prediction : "generates"
    Document ||--o{ Paragraph : "contains"
    Paragraph ||--o{ Prediction : "can be associated with"
```
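As a reference point, the `Prediction` table above might map to SQLModel roughly as follows; column types and defaults are inferred from the diagram, not copied from the migration:

```python
# Sketch inferred from the ER diagram; not the PR's actual schema module.
from datetime import datetime
from uuid import UUID, uuid4
from sqlmodel import JSON, Column, Field, SQLModel

class Prediction(SQLModel, table=True):
    id: UUID = Field(default_factory=uuid4, primary_key=True)
    input: str
    input_hash: str = Field(index=True)  # lookup key for cached predictions
    prediction: list[dict] | None = Field(default=None, sa_column=Column("prediction", JSON))
    validation: list[dict] | None = Field(default=None, sa_column=Column("validation", JSON))
    fk_model: UUID = Field(foreign_key="model.id")
    fk_paragraph: UUID | None = Field(default=None, foreign_key="paragraph.id")
    created_at: datetime = Field(default_factory=datetime.utcnow)
    updated_at: datetime | None = None
```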
Class Diagram for Database Schema and Public Models

```mermaid
classDiagram
    class ModelBase {
        +UUID id
        +String name
        +String version
        +validate_name_version()
    }
    class Model {
        +DateTime created_at
        +List~Prediction~ predictions
    }
    ModelBase <|-- Model
    class ModelCreate {
        +compile() Model
    }
    ModelBase <|-- ModelCreate
    class ModelPublic {
        +UUID id
        +String name
        +String version
        +DateTime created_at
    }
    class Document {
        +UUID id
        +LargeBinary data
        +String name
        +DateTime created_at
        +DateTime updated_at
        +List~Paragraph~ paragraphs
    }
    class DocumentPublic {
        +UUID id
        +String name
        +List~ParagraphPublic~ paragraphs
    }
    class ParagraphBase {
        +UUID id
        +String text
        +validate_text()
        #UUID hash
    }
    class Paragraph {
        +DateTime created_at
        +DateTime updated_at
        +UUID fk_document
        +Document document
        +List~Prediction~ predictions
    }
    ParagraphBase <|-- Paragraph
    class ParagraphPublic {
        +UUID id
        +String text
        +UUID hash
        +DateTime created_at
        +DateTime updated_at
    }
    class PredictionBase {
        +UUID id
        +String input
        +String input_hash
        +List~DocLabel~ prediction
        +List~DocLabel~ validation
        +UUID fk_model
        +UUID fk_paragraph
        +validate_input()
    }
    class Prediction {
        +String input_hash
        +DateTime created_at
        +DateTime updated_at
        +Model model
        +Paragraph paragraph
    }
    PredictionBase <|-- Prediction
    class PredictionCreate {
        +compile() Prediction
    }
    PredictionBase <|-- PredictionCreate
    class PredictionPublic {
        +UUID id
        +DateTime created_at
        +DateTime updated_at
        +String input
        +List~DocLabel~ prediction
        +List~DocLabel~ validation
        +ModelPublic model
    }
    class DocLabel {
        <<DataType>>
        +String text
        +String label
        +int start
        +int end
        +float score
        +dict attrs
    }
    PredictionBase ..> DocLabel : uses
    PredictionPublic ..> DocLabel : uses
    Model "1" -- "*" Prediction : predictions
    Prediction "*" -- "1" Model : model
    Document "1" -- "*" Paragraph : paragraphs
    Paragraph "*" -- "1" Document : document
    Paragraph "1" -- "*" Prediction : predictions
    Prediction "*" -- "0..1" Paragraph : paragraph
```
Hey @jedzill4 - I've reviewed your changes - here's some feedback:
Blocking issues:
- Using 'file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'file.name' is used. Use '.flush()' or close the file before using 'file.name'.
General comments:
- Consider using FastAPI's HTTPException instead of raising ValueError for predictable client-facing error responses with proper status codes.
- The `@cache` decorator on `get_model` uses the `session` parameter (which is a new object each request), so it won't actually cache: either remove the session from the cache key or cache only the model id/version (a sketch follows below).
- The long Jupyter notebook under `notebooks/experiments` doesn't belong in the API codebase; move or archive it elsewhere to keep production code lean.
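One way to act on that caching comment, sketched under the assumption of a module-level `engine`: key the cache on hashable, stable arguments only and open the session inside the function.

```python
# Sketch: cache by (name, version) instead of by an unhashable per-request session.
from functools import lru_cache
from uuid import UUID
from sqlmodel import Session, select

@lru_cache(maxsize=32)
def get_model_id(name: str, version: str) -> UUID:
    with Session(engine) as session:  # assumed module-level engine
        statement = select(Model).where(Model.name == name, Model.version == version)
        return session.exec(statement).one().id
```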
Here's what I looked at during the review
- 🟡 General issues: 7 issues found
- 🔴 Security: 1 blocking issue
- 🟢 Testing: all looks good
- 🟢 Documentation: all looks good
```diff
-cached_prediction = session.get(AnonymizationParagraph, paragraph_id)
+input_text = text_request.text
+prediction_id = text_to_uuid(f"{input_text}-{model.id}")
```
issue (bug_risk): Unused `prediction_id` for cache writes
Assign `prediction_id` to the new `Prediction` object, or update the lookup to use `input_hash` instead, as `session.get(Prediction, prediction_id)` will not find the object otherwise.
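A minimal sketch of the first suggested fix, assigning the already-computed id when creating the row (surrounding names are assumed from the hunk above):

```python
# Give the new row the same id the cache lookup uses, so
# session.get(Prediction, prediction_id) finds it on the next request.
prediction = Prediction(
    id=prediction_id,
    input=input_text,
    fk_model=model.id,
    prediction=labels,  # assumed variable holding the model output
)
session.add(prediction)
session.commit()
```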
```python
# Save to cache
#################################################################################
logger.info(f"saving in cache: {prediction_id}")
pred = session.get(Prediction, prediction_id)
```
issue (bug_risk): session.get lookup mismatches model key
`session.get` will return None because `Prediction.id` is not set to `prediction_id`. Use `input_hash` for the query or set `Prediction.id` to `prediction_id` to ensure correct updates.
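Alternatively, the lookup itself can key on `input_hash` rather than the primary key; `text_to_hash` below is an assumed helper:

```python
# Query by input hash and model instead of by primary key.
statement = select(Prediction).where(
    Prediction.input_hash == text_to_hash(input_text),
    Prediction.fk_model == model.id,
)
pred = session.exec(statement).first()
```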
@router.post("/document-extract", response_model=DocumentPublic) | ||
def plain_text_extractor( | ||
file: UploadFile, | ||
pipeline: AymurAIPipeline = Depends(get_pipeline_doc_extract), | ||
) -> Document: | ||
session: Session = Depends(get_session), | ||
use_cache: bool = True, | ||
) -> DocumentPublic: |
suggestion: use_cache should be declared as a Query parameter
Wrap use_cache with Query(...) to make it a query parameter instead of a default body parameter.
Suggested change:

```python
from fastapi import Query

@router.post("/document-extract", response_model=DocumentPublic)
def plain_text_extractor(
    file: UploadFile,
    pipeline: AymurAIPipeline = Depends(get_pipeline_doc_extract),
    session: Session = Depends(get_session),
    use_cache: bool = Query(True),
) -> DocumentPublic:
```
```diff
-return Document(document=document, document_id=document_id)
+return paragraph_text
```
issue (bug_risk): Returning raw SQLModel instead of DocumentPublic
Convert the returned SQLModel instance to DocumentPublic using from_orm to match the declared response_model.
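In Pydantic v2 / current SQLModel the conversion is spelled `model_validate` (`from_orm` is the v1 name), so the fix is roughly:

```python
# Return the public view instead of the raw table model.
return DocumentPublic.model_validate(document)
```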
```python
from aymurai.database.schema import Model

def read_validation(
```
nitpick: Docstring references incorrect parameter name
Update the docstring to reference 'text' and 'model_name' instead of 'input_hash'.
```diff
-) as tmp_file:
-    tmp_filename = tmp_file.name
+with tempfile.NamedTemporaryFile(suffix=suffix, delete=False, dir=tmp_dir) as file:
+    tmp_filename = file.name
```
security (tempfile-without-flush): Using 'file.name' without '.flush()' or '.close()' may cause an error because the file may not exist when 'file.name' is used. Use '.flush()' or close the file before using 'file.name'.
Source: opengrep
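A sketch of the flagged pattern made safe; `data`, `suffix`, `tmp_dir`, and the downstream `process` call are placeholders:

```python
import os
import tempfile

with tempfile.NamedTemporaryFile(suffix=suffix, delete=False, dir=tmp_dir) as file:
    file.write(data)
    file.flush()          # make sure bytes are on disk before the name is reused
    tmp_filename = file.name
# The with-block has closed the handle here, so the path is safe to hand off.
try:
    process(tmp_filename)  # hypothetical consumer of the temp file
finally:
    os.unlink(tmp_filename)  # delete=False means we clean up ourselves
```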
```diff
 if not pred:
     return None

-return paragraph.validation
+return pred.validation
```
suggestion (code-quality): We've found these issues:
- Lift code into else after jump in control flow (`reintroduce-else`)
- Replace if statement with if expression (`assign-if-exp`)

Suggested change:

```diff
-if not pred:
-    return None
-
-return pred.validation
+return None if not pred else pred.validation
```
```python
# MARK: Document Compilation
@router.post("/anonymize-document")
async def anonymizer_compile_document(
```
issue (code-quality): We've found these issues:
- Remove unnecessary calls to `enumerate` when the index is not used (`remove-unused-enumerate`)
- Hoist repeated code outside conditional statement (`hoist-statement-from-if`)
- Explicitly raise from a previous error (`raise-from-previous-error`)
- Low code quality found in anonymizer_compile_document - 25% (`low-code-quality`)

Explanation
The quality score for this function is below the quality threshold of 25%. This score is a combination of the method length, cognitive complexity and working memory.
How can you solve this?
It might be worth refactoring this function to make it shorter and more readable.
- Reduce the function length by extracting pieces of functionality out into their own functions. This is the most important thing you can do - ideally a function should be less than 10 lines.
- Reduce nesting, perhaps by introducing guard clauses to return early (see the sketch after this list).
- Ensure that variables are tightly scoped, so that code using related concepts sits together within the function rather than being scattered.
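For instance, the guard-clause advice could look like this in miniature (hypothetical helper, not code from this PR):

```python
# Guard clauses flatten nesting by exiting early.
def labels_for(pred: Prediction | None) -> list[DocLabel]:
    if pred is None:
        return []                      # guard: nothing predicted yet
    if not pred.validation:
        return pred.prediction or []   # guard: fall back to raw prediction
    return pred.validation             # main case reads without nesting
```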
```python
exists = session.get(Document, id)
if exists:
```
suggestion (code-quality): Use named expression to simplify assignment and conditional (`use-named-expression`)
Suggested change:

```diff
-exists = session.get(Document, id)
-if exists:
+if exists := session.get(Document, id):
```
```python
data = session.exec(statement).first()
return data
```
suggestion (code-quality): Inline variable that is immediately returned (`inline-immediately-returned-variable`)
Suggested change:

```diff
-data = session.exec(statement).first()
-return data
+return session.exec(statement).first()
```
- Added concurrent processing for text extraction to enhance performance.
- Introduced a timeout mechanism to handle long-running extraction tasks.
- Removed unnecessary threading lock for cleaner code.
- Removed `libmagic1` from the Dockerfile to reduce image size.
- Cleaned up environment variable configurations in docker-compose files.
- Updated build process in GitHub Actions for better tag handling.
- Corrected the file path for the Dockerfile in the build-and-push steps.
- Ensures the workflow uses the correct Dockerfile location for building images.
Introduce CRUD operations for documents and models, improve data validation and caching, and streamline build commands. Enhance entity resolution and dynamic grouping for the anonymization model.
Summary by Sourcery

Introduce full database-backed persistence and CRUD support for documents, paragraphs, models, and predictions, enhance anonymizer and document extraction endpoints with caching, validation, and model versioning, and streamline build commands.

New Features:

Enhancements:

Build:
- `uv build` target in the Makefile

Documentation:

Solve #45