Optional Classification and Extraction Steps #63

HatmanStack · 2025-09-23T07:11:13Z

Summary

Introduces conditional execution of classification and extraction steps for pattern2 and pattern3, enabling cost and performance optimizations for workflows that don't require full processing.

Changes

Implemented conditional routing in pattern2 and pattern3
Updated UI to display toggles for easy enable/disable
Updated documents to reflect changes

Use Case

Invoice Processing: Process large volumes for archival with OCR only, then selectively enable extraction (reprocess) for payment processing. Result: costs are incurred only when detailed data extraction is needed.

Benefits

Backward compatible: Existing workflows unchanged. Assessed downstream compatibility - when extraction is skipped, returning {"section_id": ..., "document": ...} maintains the ProcessResultsFunction's data contract (only extracts "document" key). When classification is skipped, returning {"document": ...OCR document...} with empty sections array allows ProcessSections Map state to iterate over empty array and produce empty ExtractionResults, preserving workflow integrity.
Cost savings: Skip expensive ML services when not needed
Performance: Reduce processing time from minutes to seconds for OCR-only workflows
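
The backward-compatibility argument above can be illustrated with a small sketch. Only the payload keys (`section_id`, `document`, and an empty sections array) come from the description; the function names, the placement of `sections` inside the document, the `id` key on a section, and the plain-Python stand-ins for the Map state and ProcessResultsFunction are hypothetical.

```python
def extraction_skip_response(section_id, document):
    """Payload returned when extraction is skipped: same keys as a normal result."""
    return {"section_id": section_id, "document": document}


def classification_skip_response(ocr_document):
    """Payload returned when classification is skipped: OCR document, no sections."""
    # Assumption: the sections array lives inside the document payload.
    return {"document": {**ocr_document, "sections": []}}


def process_sections(classification_output):
    """Stand-in for the ProcessSections Map state: one result per section."""
    document = classification_output["document"]
    return [
        extraction_skip_response(section["id"], document)
        for section in document["sections"]
    ]


def process_results(extraction_results):
    """Stand-in for ProcessResultsFunction: reads only the 'document' key."""
    return [result["document"] for result in extraction_results]
```

With an empty sections array the map produces an empty result list, and the downstream consumer never touches a missing key.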

Tasks

Decouple summarization from main processing flow
Decouple extraction from main processing flow
Decouple classification from main processing flow
Decouple ocr from main processing flow
Decouple assessment from main processing flow

rstrahan · 2025-09-23T14:04:02Z

I like it @HatmanStack !

Implemented conditional routing in pattern2 and pattern3 state machines using Choice states

Is this comment accurate? From the code it looks like you went with the simpler approach I had already adopted for enabling/disabling assessment and summarization.. ie leave the state machine alone, but just make the step lambda a no-op if the feature is not enabled. I like that implementation, only this comment referencing 'Choice state' is confusing.
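
A minimal sketch of the no-op approach described here, assuming a per-step `enabled` flag in the config. The handler shape, the `run_classification` placeholder, and the `as_bool` helper are hypothetical and are not the PR's actual code; treating a missing flag as enabled is also an assumption.

```python
def as_bool(value):
    """Hypothetical boolean normalizer: accepts bools or strings like 'true'."""
    if isinstance(value, str):
        return value.strip().lower() in ("true", "1", "yes")
    return bool(value)


def run_classification(document):
    """Placeholder for the real (expensive) classification work."""
    return {**document, "classified": True}


def handler(event, config):
    """The state machine always invokes the step; the step decides whether to work."""
    document = event["document"]
    step_config = config.get("classification", {})
    if not as_bool(step_config.get("enabled", True)):
        # No-op: pass the document through unchanged so later states still get it.
        return {"document": document}
    return {"document": run_classification(document)}
```

The state machine definition stays untouched in this design; no Choice state is involved.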

rstrahan · 2025-09-23T14:06:54Z

We also have the option to disable OCR.. by setting OCR backend to None..
Now I'm thinking, purely for symmetry, that we might want to refactor that to also have the same enable/disable toggle that we now have with the other steps.. The effect would be the same as setting backend to none, but user experience would be more consistent. What do you think?

rstrahan · 2025-09-23T14:17:01Z

Thoughts on disabling classification..

We already disable LLM based classification automatically if:
a. there is just one class defined in the 'classes' part of the config... in this case we just assign all pages to that one class.
b. there are matching filename or text regex values defined.. in this case, if regex matches (it can be easily forced to match by making it a '.*') then LLM is bypassed.
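
The two bypass rules above could be expressed roughly as follows. This is a sketch only: the `filename_regex` and `text_regex` keys and the use of `re.search` are assumptions for illustration, not the project's actual config schema or matching semantics.

```python
import re


def llm_classification_needed(classes, filename, page_text):
    """Return False when one of the two bypass rules described above applies."""
    # Rule (a): a single configured class means every page gets that class.
    if len(classes) == 1:
        return False
    # Rule (b): a configured filename or text regex that matches bypasses the LLM.
    for cls in classes:
        name_rx = cls.get("filename_regex")  # hypothetical key
        text_rx = cls.get("text_regex")      # hypothetical key
        if name_rx and re.search(name_rx, filename):
            return False
        if text_rx and re.search(text_rx, page_text):
            return False
    return True
```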

But leveraging these behaviors is admittedly a less obvious way to disable classification.. I'm guessing it was not obvious to you, and likely not obvious to others.. so we can still introduce an enable/disable toggle

When we disable classification, I think you need to assign a default class label to all pages, and create exactly one 'section' for that default class that spans all pages.. If there is just one class defined (per 1.a above), use it, otherwise assign a default label like 'CLASSIFICATION DISABLED'. This is important to preserve the document structure and avoid breaking any downstream processes that depend on that structure.
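
One way to read this suggestion, as a sketch: the label string comes from the comment above, while the function name and the dictionary shapes for pages and sections are hypothetical and do not reflect the project's real Document model.

```python
DEFAULT_LABEL = "CLASSIFICATION DISABLED"


def build_default_section(page_ids, classes):
    """Assign one class to every page and wrap all pages in a single section."""
    # Use the sole configured class if there is exactly one, else the default label.
    label = classes[0]["name"] if len(classes) == 1 else DEFAULT_LABEL
    pages = {page_id: {"classification": label} for page_id in page_ids}
    section = {
        "section_id": "1",
        "classification": label,
        "page_ids": list(page_ids),
    }
    return pages, [section]
```

Downstream steps then always see exactly one section, so anything that iterates over sections keeps working.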

rstrahan · 2025-09-23T14:23:36Z

Also, consider ripple effects, and try to provide guidance to user in the UI description of the toggle field..
For example, if you disable extraction, does that affect assessment, or will assessment function independently? I've not thought this all through, but hoping you can :) Anticipate any pitfalls that might come from enabling one thing and disabling another.

rstrahan

See comments above.. Many Tx!

HatmanStack · 2025-09-23T16:43:02Z

We also have the option to disable OCR.. by setting OCR backend to None.. Now I'm thinking, purely for symmetry, that we might want to refactor that to also have the same enable/disable toggle that we now have with the other steps.. The effect would be the same as setting backend to none, but user experience would be more consistent. What do you think?

This seems logical. My only hesitation is how BKB is going to use the artifacts for ingestion. I'll do some testing and figure out if it's smart enough to work with the different formats.

HatmanStack · 2025-09-23T16:46:33Z

But leveraging these behaviors is admittedly a less obvious way to disable classification.. I'm guessing it was not obvious to you, and likely not obvious to others.. so we can still introduce an enable/disable toggle

had no clue :)


rstrahan · 2025-09-25T13:02:24Z

What's the status on this @HatmanStack - did you resolve / answer all the questions - do you feel it's ready? If so I can try to make some time today to inspect and possibly merge if there are no issues.

HatmanStack · 2025-09-25T13:13:59Z

It feels close, I haven't checked on assessment interactions. Would also like to add a few more notes on exactly what I was trying to accomplish on some edits. Always worry about edge cases with this sort of thing. Maybe this afternoon?

rstrahan · 2025-09-25T13:23:03Z

Thanks.. no rush.. Might be best, since tomorrow is release day, to target merging it early next week to release next Friday, rather than rush and risk breaking. Anyway, let me know clearly here when you feel you've done due diligence (i.e. tried to anticipate all the ways it might break :) ) and you feel it's ready.. Tx!

HatmanStack · 2025-09-27T12:35:08Z

I think this thing is ready or at least needs a different set of eyes on it. One thing I noticed, we aren't preserving our assessment data anywhere? Maybe I'm just not seeing it in the output bucket?

rstrahan · 2025-09-27T14:44:52Z

Great stuff! Will look on Monday. Did you notice that in yesterday's release I made classification and extraction optional in the context of the new 'edit sections' feature? Consider whether that condition check could simply be expanded to check your new enabled flag as well.. That could be quite elegant I think. Assessment data is added to the same result.json as extracted data, in a nested object called explainability_info. Cheers, Bob



HatmanStack · 2025-09-29T10:58:28Z

I folded everything into the updated approach, pulling all of the checks out of the services into patterns. This does cause unnecessary processing in some cases, for instance when a user wants something like OCR -> Summarize or OCR with a knowledge base. We're still processing classification and extraction steps at least once.

f3bd5cb created isolated steps but the changes didn't fit well with your update.

I can take another run at isolating the steps for these use cases in a future PR. For now this gets us to a tunable state after the first run and preserves the intelligent processing logic. I also added a check for a summary in the summarization service, so if one is present it's always displayed. Dig the npm lint check added to publish. All the best.


rstrahan

See comments.. Please review all comments carefully, review all the code changes in the PR, ensure logic is correct, and clean up all changes unrelated to the scope of the PR, so it's easier to review / test the changes. Tx.

rstrahan · 2025-09-29T19:12:05Z

lib/idp_common_pkg/idp_common/extraction/service.py


 from idp_common import bedrock, image, metrics, s3, utils
 from idp_common.models import Document
+from idp_common.models import Document


looks like a duplicate line added??

May have been duplicates from an interactive rebase. The only service that should be changed now is summarization, with the added check for an existing summary. All others are from main.

lib/idp_common_pkg/idp_common/assessment/granular_service.py

rstrahan · 2025-09-29T19:18:21Z

lib/idp_common_pkg/pyproject.toml

 # Assessment module dependencies
 assessment = [
    "Pillow==11.2.1",  # For image handling
+    "PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri


Not sure I understand this addition.. can you clarify?

Line 1220 of granular service is a fallback for creating an OCR service for textConfidence from raw text uri. We don't have access to PyMuPDF for the OCR service call.

I still don't understand it.. Or what it has to do with the scope of this PR.

lib/idp_common_pkg/setup.py

patterns/pattern-2/src/classification_function/index.py

rstrahan · 2025-09-29T19:38:57Z

patterns/pattern-2/src/extraction_function/index.py

-            "section_id": section_id,
-            "document": full_document.serialize_document(working_bucket, f"extraction_skip_{section_id}", logger)
-        }
+        if not normalize_boolean_value(extraction_config.get('enabled', False)):


Again - this logic looks wrong.. Please reexamine.. It should skip if either condition holds, not both.

        if section.extraction_result_uri and section.extraction_result_uri.strip():
            if not normalize_boolean_value(extraction_config.get('enabled', False)):
                logger.info(f"Skipping extraction for section {section_id} - already has extraction data: {section.extraction_result_uri}")

Thanks for finding this!
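
The reviewer's point about the skip condition can be sketched as a single predicate. The function name and parameters are hypothetical; the nested `if` statements quoted above only skip when both conditions hold, whereas this version skips when either one does.

```python
def should_skip_extraction(extraction_result_uri, extraction_enabled):
    """Skip when the step is disabled OR the section already has extraction data."""
    already_extracted = bool(extraction_result_uri and extraction_result_uri.strip())
    return (not extraction_enabled) or already_extracted
```

Keeping the decision in one small function also makes each of the four combinations easy to unit test.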

rstrahan · 2025-09-29T19:41:59Z

patterns/pattern-2/src/ocr_function/index.py

-
-        logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}")
-        return response
+        if not normalize_boolean_value(ocr_config.get('tuning', False)):


What is this 'tuning' config property again? Should it be 'enabled'?
And as above, please check this logic.. Needs to be an OR condition. This logic seems very confusing to me..
If AI is helping you (as I suspect) please Do Not Trust it.. carefully review the code and ensure it makes sense.

enabled didn't seem like the right fit for describing this feature as by default it's always enabled and then checked for existence on new runs. The PR allows us to decouple from the checks.

See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.

rstrahan · 2025-09-29T19:43:38Z

patterns/pattern-2/template.yaml

            required:
              - features
            properties:
+              tuning:


Confused by 'tuning' - seems disconnected from enable true | false which is supposed to be the focus of this PR. Please remove all this (looks like hallucination or scope creep) and ensure PR is laser focused on just the classification/extraction enable/disable toggle.

Not resolved - reopening.

rstrahan · 2025-09-29T19:46:24Z

patterns/pattern-3/template.yaml

          GUARDRAIL_ID_AND_VERSION: !If [HasGuardrailConfig, !Sub "${BedrockGuardrailId}:${BedrockGuardrailVersion}", ""]
          LOG_LEVEL: !Ref LogLevel
          WORKING_BUCKET: !Ref WorkingBucket
+          OUTPUT_BUCKET: !Ref OutputBucket


Why are you adding OUTPUT_BUCKET to each function here? I see no related changes that require it, nor does it seem related to the scope of the PR.

Only needed in summarization. Changed.

rstrahan · 2025-09-29T19:50:59Z

lib/idp_common_pkg/idp_common/summarization/service.py

            if document.status != Status.FAILED:
                document.status = Status.COMPLETED
+            try:
+                # Perserving Document Summary for Frontend Across DeCoupled Architecture


Please clarify the purpose of this new logic.. the comment doesn't explain it clearly to me.

reworked comment


HatmanStack · 2025-10-02T12:29:56Z

@rstrahan Ready for review. This should be a fast review. Planning to revisit and turn off these features after your changes have settled.

rstrahan

I can't move this forward as is, mostly because the scope seems to be very different than I had initially expected.. See my comments.
I cannot prioritize the time to fully understand and test the concepts, logic, risk, benefits of everything you are attempting in this expanded PR.
Sorry @HatmanStack .. If you can pare it right back to the simple enable/disable behavior, it can be a low risk, quick win..

rstrahan · 2025-10-02T20:35:24Z

docs/pattern-2.md


 **Stack Deployment Parameters:**
 - `ClassificationMethod`: Classification methodology to use (options: 'multimodalPageLevelClassification' or 'textbasedHolisticClassification')
+- **OCR**: Control the ability to tune ocr via configuration file `ocr.tuning` property.


I'm afraid I am confused now by the scope of this PR, including the introduction of this 'tuning' concept. I had thought this would be a simple PR, a quick win, to introduce only an 'enable' property for classification and extraction, with the simple logic in the associated lambdas to skip the associated processing if enabled is False. (Like we have already for Summarization).
And so I am confused by why this would require introduction of new stack deployment parameters and these .tuning properties.

rstrahan · 2025-10-02T20:37:58Z

lib/idp_common_pkg/idp_common/classification/service.py

            )
            raise

-        return document


Try to get rid of all the 'noisy' diffs.. makes it much harder to review the important changes.

rstrahan · 2025-10-02T20:40:02Z

lib/idp_common_pkg/idp_common/summarization/service.py

            if document.status != Status.FAILED:
                document.status = Status.COMPLETED
+            try:
+                # When we Reprocess a document and don't enable Summarization we lose the summary in the UI


I do not know how 'reprocessing a document' has anything to do with the scope of the PR, which was simply (as the title suggests) to make Classification and Extraction optional.. Simple, safe, uncomplicated..
This seems to have evolved into something much more, which, while it may be great (if I understood it :) ) , is riskier and not the 'quick win' I'd hoped for.

rstrahan · 2025-10-02T20:41:10Z

lib/idp_common_pkg/pyproject.toml

 # Assessment module dependencies
 assessment = [
    "Pillow==11.2.1",  # For image handling
+    "PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri


I still don't understand it.. Or what it has to do with the scope of this PR.

rstrahan · 2025-10-02T20:43:23Z

patterns/pattern-2/src/classification_function/index.py

-
-        logger.info(f"Classification skipped - Response: {json.dumps(response, default=str)}")
-        return response
+        if not normalize_boolean_value(classification_config.get('tuning', False)):


Unresolved.. Logic needs fixed, and let's get rid of 'tuning' and revert to the simple 'enabled' flag (same as we have for assessment and summarization)

rstrahan · 2025-10-02T20:45:25Z

patterns/pattern-2/src/ocr_function/index.py

-
-        logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}")
-        return response
+        if not normalize_boolean_value(ocr_config.get('tuning', False)):


See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.

rstrahan · 2025-10-02T20:46:28Z

patterns/pattern-2/template.yaml

            required:
              - features
            properties:
+              tuning:


Not resolved - reopening.

Corrected git | branch created from existing feature branch

a24400e

rstrahan requested changes Sep 23, 2025


HatmanStack added 3 commits September 23, 2025 22:09

Moved gating logic from patterns to services

7575f07

Decoupled Services from Main Processing Flow | OCR | CLASSIFICATION |…

4fcb327

… EXTRACTION | SUMMARIZATION

Surfacing errors to Processing Flow Ui for Extraction Step

1d09259

Decoupled Assessment

ef94dab

HatmanStack added 2 commits September 29, 2025 04:04

Merge remote-tracking branch 'upstream/develop' into feature/optional…

0d74b98

…-classification-extraction

reverted to original Flow with simple flag check

2e91f28

HatmanStack added 4 commits September 29, 2025 06:38

Decoupled Services from Main Processing Flow | OCR | CLASSIFICATION |…

41a4dad

… EXTRACTION | SUMMARIZATION

Surfacing errors to Processing Flow Ui for Extraction Step

f20862a

Decoupled Assessment

f3bd5cb

reverted to original Flow with simple flag check

37ceba3

HatmanStack force-pushed the feature/optional-classification-extraction branch 3 times, most recently from de3558a to 37ceba3 Compare September 29, 2025 12:09

rstrahan requested changes Sep 29, 2025


HatmanStack added 2 commits September 29, 2025 15:48

request changes made

c1281a1

removed interactive rebase leftovers

4982fbb

HatmanStack and others added 3 commits September 29, 2025 16:13

added in configs

22ace72

Removed excess Error collection in Summ Func | Updated docs | correct…

1b2ffc1

…ed missing bucket in Summ service

Merge branch 'main' into feature/optional-classification-extraction

9238d97

rstrahan requested changes Oct 2, 2025
