-
Notifications
You must be signed in to change notification settings - Fork 31
Optional Classification and Extraction Steps #63
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Optional Classification and Extraction Steps #63
Conversation
I like it @HatmanStack !
Is this comment accurate? From the code it looks like you went with the simpler approach I had already adopted for enabling/disabling assessment and summarization.. ie leave the state machine alone, but just make the step lambda a no-op if the feature is not enabled. I like that implementation, only this comment referencing 'Choice state' is confusing. |
Thoughts on disabling classification..
But leveraging these behaviors is admittedly a less obvious way to disable classification.. I'm guess it was not obvious to you, and likely not obvious to others.. so we can still introduce an enable/disable toggle
|
Also, consider ripple effects, and try to provide guidance to user in the UI description of the toggle field.. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments above.. Many Tx!
had no clue :) |
What the status on this @HatmanStack - did you resolve / answer all the questions - do you feel it's ready? If so I can try to make some time today to inspect and possibly merge if there are no issues. |
It feels close, I haven't checked on assessment interactions. Would also like to add a few more notes on exactly what I was trying to accomplish on some edits. Always worry about edge cases with this sort of thing. Maybe this afternoon? |
Thanks.. no rush.. Might be best, since tomorrow is release day, to target merging it early next week to release next friday, rather than rush and risk breaking. Anyway, let me know clearly here when you feel you've done due diligence (i.e. tried to anticipate all the ways it might break :) ) and yoy feel it's ready.. Tx! |
I think this thing is ready or at least needs a different set of eyes on it. One thing I noticed, we aren't preserving our assessment data anywhere? Maybe I'm just not seeing it in the output bucket? |
Great stuff! Will look on Monday.
Did you notice in yesterday's release I made classification and extraction optional in the context of the new 'edit sections' feature. Consider if that condition check could simply be expanded to check your new enabled flag as well.. That could be quite elegant I think.
Assessment data is added to the same result.json add extracted data, in a nested object called explainability_info.
Cheers
Bob
Get Outlook for Android<https://aka.ms/AAb9ysg>
…________________________________
From: CJ ***@***.***>
Sent: Saturday, September 27, 2025 8:35:30 AM
To: aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws ***@***.***>
Cc: Strahan, Bob ***@***.***>; Comment ***@***.***>
Subject: Re: [aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws] Optional Classification and Extraction Steps (PR #63)
[https://avatars.githubusercontent.com/u/82614182?s=20&v=4]HatmanStack left a comment (aws-solutions-library-samples/accelerated-intelligent-document-processing-on-aws#63)<#63 (comment)>
I think this thing is ready or at least needs a different set of eyes on it. One thing I noticed, we aren't preserving our assessment data anywhere? Maybe I'm just not seeing it in the output bucket?
—
Reply to this email directly, view it on GitHub<#63 (comment)>, or unsubscribe<https://github.com/notifications/unsubscribe-auth/ACTSFHQ4JRVG6L5ENBQMTN33U2ABFAVCNFSM6AAAAACHHSVVI6VHI2DSMVQWIX3LMV43OSLTON2WKQ3PNVWWK3TUHMZTGNBRGYZDMNBTGE>.
You are receiving this because you commented.Message ID: ***@***.***>
|
I folded everything in to the updated approach pulling all of the checks out of the services into patterns. This does cause unecessary processing in some cases. For instance when a user wants something like an OCR -> Summarize or OCR with a knowledge-base. We're still processing classification and extraction steps at least once. f3bd5cbcreated isolated steps but the changes didn't fit well with your update. I can take another run at isolating the steps for these use cases in a future PR. For now this gets us to a tune able state after the first run and preserves the intelligent processing logic. I also added a check for a summary in the summarization service so if present it's always displayed. Dig the npm lint check added to publish. All the best. |
de3558a
to
37ceba3
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See comments.. Please review all comments carefully, review all the code changes in the PR, ensure logic is correct, and clean up all changes unrelated to the scope of the PR, so it's easier to review / test the changes. Tx.
|
||
from idp_common import bedrock, image, metrics, s3, utils | ||
from idp_common.models import Document | ||
from idp_common.models import Document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks like a duplicate line added??
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may have been duplicates from a interactive rebase. The only service that should be changed now comes is summarization. With the added check for an existing summary. All others are from main.
# Assessment module dependencies | ||
assessment = [ | ||
"Pillow==11.2.1", # For image handling | ||
"PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure I understand this addition.. can you clarify?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Line 1220 of granular service is a fallback for creating an OCR service for textConfidence from raw text uri. We don't have access to PyMuPDF for the OCR service call.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand it.. Or what it has to do with the scope of this PR.
"section_id": section_id, | ||
"document": full_document.serialize_document(working_bucket, f"extraction_skip_{section_id}", logger) | ||
} | ||
if not normalize_boolean_value(extraction_config.get('enabled', False)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Again - this logic looks wrong.. Please reexamine.. It should skip if either condition holds, not both.
if section.extraction_result_uri and section.extraction_result_uri.strip():
if not normalize_boolean_value(extraction_config.get('enabled', False)):
logger.info(f"Skipping extraction for section {section_id} - already has extraction data: {section.extraction_result_uri}")
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for finding this!
|
||
logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}") | ||
return response | ||
if not normalize_boolean_value(ocr_config.get('tuning', False)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What this tuning
config properly again.. Should be enabled
?
And as above, please check this logic.. Needs to be an OR condition. This logic seems very confusing to me..
If AI is helping you (as I suspect) please Do Not Trust it.. carefully review the code and ensure it makes sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
enabled didn't seem like the right fit for describing this feature as by default it's always enabled and then checked for existence on new runs. The PR allows us to decouple from the checks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.
required: | ||
- features | ||
properties: | ||
tuning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confused by 'tuning' - seems disconnected from enable true | false which is supposed to be the focus of this PR. Please remove all this (looks like hallucination or scope creep) and ensure PR is laser focused on just the classification/extraction enable/disable toggle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not resolved - reopening.
patterns/pattern-3/template.yaml
Outdated
GUARDRAIL_ID_AND_VERSION: !If [HasGuardrailConfig, !Sub "${BedrockGuardrailId}:${BedrockGuardrailVersion}", ""] | ||
LOG_LEVEL: !Ref LogLevel | ||
WORKING_BUCKET: !Ref WorkingBucket | ||
OUTPUT_BUCKET: !Ref OutputBucket |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you adding OUTPUT_BUCKET to each function here? I see no related changes that require it, nor does it seem related to the scope of the PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only needed in summarization changed
if document.status != Status.FAILED: | ||
document.status = Status.COMPLETED | ||
try: | ||
# Perserving Document Summary for Frontend Across DeCoupled Architecture |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please clarify the purpose of this new logic.. the comment doesn't explain it clearly to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
reworked comment
@rstrahan Ready for review. This should be a fast review. Planning to revisit and turn off these features after your changes have settled. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can't move this forward as is, mostly because the scope seems to be very different that I had initially expected.. See my comments.
I cannot prioritize the time to fully understand and test the concepts, logic, risk, benefits of everything you are attempting in this expanded PR.
Sorry @HatmanStack .. If you can pare it right back to the simple enable/disable behavior, it can be a low risk, quick win..
|
||
**Stack Deployment Parameters:** | ||
- `ClassificationMethod`: Classification methodology to use (options: 'multimodalPageLevelClassification' or 'textbasedHolisticClassification') | ||
- **OCR**: Control the ability to tune ocr via configuration file `ocr.tuning` property. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm afraid I am confused now by the scope of this PR, including the introduction of this 'tuning' concept. I had thought this would be a simple PR, a quick win, to introduce only an 'enable' property for classification and extraction, with the simple logic in the associated lambdas to skip the associated processing if enabled is False. (Like we have already for Summarization).
And so I am confused by why this would require introduction of new stack deployment parameters and these .tuning properties.
) | ||
raise | ||
|
||
return document |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Try to get rid of all the 'noisy' diffs.. makes it much harder to review the important changes.
if document.status != Status.FAILED: | ||
document.status = Status.COMPLETED | ||
try: | ||
# When we Reprocess a document and don't enable Summarization we lose the summary in the UI |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how 'reprocessing a document' has anything to do with the scope of the PR, which was simply (as the title suggests) to make Classification and Extraction optional.. Simple, safe, uncomplicated..
This seems to have evolved into something much more, which, while it may be great (if I understood it :) ) , is riskier and not the 'quick win' I'd hoped for.
# Assessment module dependencies | ||
assessment = [ | ||
"Pillow==11.2.1", # For image handling | ||
"PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I still don't understand it.. Or what it has to do with the scope of this PR.
|
||
logger.info(f"Classification skipped - Response: {json.dumps(response, default=str)}") | ||
return response | ||
if not normalize_boolean_value(classification_config.get('tuning', False)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unresolved.. Logic needs fixed, and let's get rid of 'tuning' and revert to the simple 'enabled' flag (same as we have for assessment and summarization)
|
||
logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}") | ||
return response | ||
if not normalize_boolean_value(ocr_config.get('tuning', False)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.
required: | ||
- features | ||
properties: | ||
tuning: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not resolved - reopening.
Summary
Introduces conditional execution of classification and extraction steps for pattern2 and pattern3, enabling cost and performance optimizations for workflows that don't require full processing.
Changes
Use Case
Invoice Processing: Process large volumes for archival with OCR-only, then selectively enable extraction ( reprocess ) for payment processing. Result: costs only incurred when detailed data extraction is needed.
Benefits
{"section_id": ..., "document": ...}
maintains the ProcessResultsFunction's data contract (only extracts "document" key). When classification is skipped, returning{"document": ...OCR document...}
with empty sections array allows ProcessSections Map state to iterate over empty array and produce empty ExtractionResults, preserving workflow integrity.Tasks