Skip to content

Conversation

HatmanStack
Copy link

@HatmanStack HatmanStack commented Sep 23, 2025

Summary

Introduces conditional execution of classification and extraction steps for pattern2 and pattern3, enabling cost and performance optimizations for workflows that don't require full processing.

Changes

  • Implemented conditional routing in pattern2 and pattern3
  • Updated UI to display toggles for easy enable/disable
  • Updated documents to reflect changes

Use Case

Invoice Processing: Process large volumes for archival with OCR-only, then selectively enable extraction ( reprocess ) for payment processing. Result: costs only incurred when detailed data extraction is needed.

Benefits

  • Backward compatible: Existing workflows unchanged. Assessed downstream compatibility - when extraction is skipped, returning {"section_id": ..., "document": ...} maintains the ProcessResultsFunction's data contract (only extracts "document" key). When classification is skipped, returning {"document": ...OCR document...} with empty sections array allows ProcessSections Map state to iterate over empty array and produce empty ExtractionResults, preserving workflow integrity.
  • Cost savings: Skip expensive ML services when not needed
  • Performance: Reduce processing time from minutes to seconds for OCR-only workflows

Tasks

  • Decouple summarization from main processing flow
  • Decouple extraction from main processing flow
  • Decouple classification from main processing flow
  • Decouple ocr from main processing flow
  • Decouple assessment from main processing flow

@rstrahan
Copy link
Contributor

I like it @HatmanStack !

Implemented conditional routing in pattern2 and pattern3 state machines using Choice states

Is this comment accurate? From the code it looks like you went with the simpler approach I had already adopted for enabling/disabling assessment and summarization.. ie leave the state machine alone, but just make the step lambda a no-op if the feature is not enabled. I like that implementation, only this comment referencing 'Choice state' is confusing.

@rstrahan
Copy link
Contributor

We also have the option to disable OCR.. by setting OCR backend to None..
Now I'm thinking, purely for symmetry, that we might want to refact that to also have the same enable/disable toggle that we now have with the other steps.. The effect would be the same as setting backend to none, but user experience would be more consistent. What do you think?

image

@rstrahan
Copy link
Contributor

Thoughts on disabling classification..

  1. We already disable LLM based classification automatically if:
    a. there is just one class defined in the 'classes' part of the config... in this case we just assign all pages to that one class.
    b. there are matching filename or text regex values defined.. in this case, if regex matches (it can be easily forced to match by making it a '.*') then LLM is bypassed.

But leveraging these behaviors is admittedly a less obvious way to disable classification.. I'm guess it was not obvious to you, and likely not obvious to others.. so we can still introduce an enable/disable toggle

  1. When we disable classification, I think you need to assign a default class label to all pages, and create exactly one 'section' for that default class that spans all pages.. If there is just one class defined (per 1.a above), use it, otherwise assign a default label like 'CLASSIFICATION DISABLED'. This is important to preserve the document structure and avoid breaking any downstream processes that depend on that structure.

@rstrahan
Copy link
Contributor

Also, consider ripple effects, and try to provide guidance to user in the UI description of the toggle field..
For example, if you disable extraction, does that affect assessment, or will assessment function independently? I've not thought this all thought, but hoping you can :) Anticipate any pitfalls that might come from enable one thing and disabling another.

Copy link
Contributor

@rstrahan rstrahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments above.. Many Tx!

@HatmanStack
Copy link
Author

We also have the option to disable OCR.. by setting OCR backend to None.. Now I'm thinking, purely for symmetry, that we might want to refact that to also have the same enable/disable toggle that we now have with the other steps.. The effect would be the same as setting backend to none, but user experience would be more consistent. What do you think?

image

This seems logical my only hesitation is how BKB is going to use the artifacts for Ingestion. I'll do some testing and figure out if it's smart enough to work with the different formats.

@HatmanStack
Copy link
Author

But leveraging these behaviors is admittedly a less obvious way to disable classification.. I'm guess it was not obvious to you, and likely not obvious to others.. so we can still introduce an enable/disable toggle

had no clue :)

@rstrahan
Copy link
Contributor

What the status on this @HatmanStack - did you resolve / answer all the questions - do you feel it's ready? If so I can try to make some time today to inspect and possibly merge if there are no issues.

@HatmanStack
Copy link
Author

It feels close, I haven't checked on assessment interactions. Would also like to add a few more notes on exactly what I was trying to accomplish on some edits. Always worry about edge cases with this sort of thing. Maybe this afternoon?

@rstrahan
Copy link
Contributor

rstrahan commented Sep 25, 2025

Thanks.. no rush.. Might be best, since tomorrow is release day, to target merging it early next week to release next friday, rather than rush and risk breaking. Anyway, let me know clearly here when you feel you've done due diligence (i.e. tried to anticipate all the ways it might break :) ) and yoy feel it's ready.. Tx!

@HatmanStack
Copy link
Author

I think this thing is ready or at least needs a different set of eyes on it. One thing I noticed, we aren't preserving our assessment data anywhere? Maybe I'm just not seeing it in the output bucket?

@rstrahan
Copy link
Contributor

rstrahan commented Sep 27, 2025 via email

@HatmanStack
Copy link
Author

HatmanStack commented Sep 29, 2025

I folded everything in to the updated approach pulling all of the checks out of the services into patterns. This does cause unecessary processing in some cases. For instance when a user wants something like an OCR -> Summarize or OCR with a knowledge-base. We're still processing classification and extraction steps at least once.

f3bd5cbcreated isolated steps but the changes didn't fit well with your update.

I can take another run at isolating the steps for these use cases in a future PR. For now this gets us to a tune able state after the first run and preserves the intelligent processing logic. I also added a check for a summary in the summarization service so if present it's always displayed. Dig the npm lint check added to publish. All the best.

@HatmanStack HatmanStack force-pushed the feature/optional-classification-extraction branch 3 times, most recently from de3558a to 37ceba3 Compare September 29, 2025 12:09
Copy link
Contributor

@rstrahan rstrahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See comments.. Please review all comments carefully, review all the code changes in the PR, ensure logic is correct, and clean up all changes unrelated to the scope of the PR, so it's easier to review / test the changes. Tx.


from idp_common import bedrock, image, metrics, s3, utils
from idp_common.models import Document
from idp_common.models import Document
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks like a duplicate line added??

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may have been duplicates from a interactive rebase. The only service that should be changed now comes is summarization. With the added check for an existing summary. All others are from main.

# Assessment module dependencies
assessment = [
"Pillow==11.2.1", # For image handling
"PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure I understand this addition.. can you clarify?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 1220 of granular service is a fallback for creating an OCR service for textConfidence from raw text uri. We don't have access to PyMuPDF for the OCR service call.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand it.. Or what it has to do with the scope of this PR.

"section_id": section_id,
"document": full_document.serialize_document(working_bucket, f"extraction_skip_{section_id}", logger)
}
if not normalize_boolean_value(extraction_config.get('enabled', False)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again - this logic looks wrong.. Please reexamine.. It should skip if either condition holds, not both.

    if section.extraction_result_uri and section.extraction_result_uri.strip():
        if not normalize_boolean_value(extraction_config.get('enabled', False)):
            logger.info(f"Skipping extraction for section {section_id} - already has extraction data: {section.extraction_result_uri}")         

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for finding this!


logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}")
return response
if not normalize_boolean_value(ocr_config.get('tuning', False)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What this tuning config properly again.. Should be enabled?
And as above, please check this logic.. Needs to be an OR condition. This logic seems very confusing to me..
If AI is helping you (as I suspect) please Do Not Trust it.. carefully review the code and ensure it makes sense.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

enabled didn't seem like the right fit for describing this feature as by default it's always enabled and then checked for existence on new runs. The PR allows us to decouple from the checks.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.

required:
- features
properties:
tuning:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Confused by 'tuning' - seems disconnected from enable true | false which is supposed to be the focus of this PR. Please remove all this (looks like hallucination or scope creep) and ensure PR is laser focused on just the classification/extraction enable/disable toggle.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved - reopening.

GUARDRAIL_ID_AND_VERSION: !If [HasGuardrailConfig, !Sub "${BedrockGuardrailId}:${BedrockGuardrailVersion}", ""]
LOG_LEVEL: !Ref LogLevel
WORKING_BUCKET: !Ref WorkingBucket
OUTPUT_BUCKET: !Ref OutputBucket
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are you adding OUTPUT_BUCKET to each function here? I see no related changes that require it, nor does it seem related to the scope of the PR.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only needed in summarization changed

if document.status != Status.FAILED:
document.status = Status.COMPLETED
try:
# Perserving Document Summary for Frontend Across DeCoupled Architecture
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please clarify the purpose of this new logic.. the comment doesn't explain it clearly to me.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

reworked comment

@HatmanStack
Copy link
Author

@rstrahan Ready for review. This should be a fast review. Planning to revisit and turn off these features after your changes have settled.

Copy link
Contributor

@rstrahan rstrahan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't move this forward as is, mostly because the scope seems to be very different that I had initially expected.. See my comments.
I cannot prioritize the time to fully understand and test the concepts, logic, risk, benefits of everything you are attempting in this expanded PR.
Sorry @HatmanStack .. If you can pare it right back to the simple enable/disable behavior, it can be a low risk, quick win..


**Stack Deployment Parameters:**
- `ClassificationMethod`: Classification methodology to use (options: 'multimodalPageLevelClassification' or 'textbasedHolisticClassification')
- **OCR**: Control the ability to tune ocr via configuration file `ocr.tuning` property.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I am confused now by the scope of this PR, including the introduction of this 'tuning' concept. I had thought this would be a simple PR, a quick win, to introduce only an 'enable' property for classification and extraction, with the simple logic in the associated lambdas to skip the associated processing if enabled is False. (Like we have already for Summarization).
And so I am confused by why this would require introduction of new stack deployment parameters and these .tuning properties.

)
raise

return document
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Try to get rid of all the 'noisy' diffs.. makes it much harder to review the important changes.

if document.status != Status.FAILED:
document.status = Status.COMPLETED
try:
# When we Reprocess a document and don't enable Summarization we lose the summary in the UI
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do not know how 'reprocessing a document' has anything to do with the scope of the PR, which was simply (as the title suggests) to make Classification and Extraction optional.. Simple, safe, uncomplicated..
This seems to have evolved into something much more, which, while it may be great (if I understood it :) ) , is riskier and not the 'quick win' I'd hoped for.

# Assessment module dependencies
assessment = [
"Pillow==11.2.1", # For image handling
"PyMuPDF==1.25.5", # Need fitz for creating text confidence from raw text uri
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I still don't understand it.. Or what it has to do with the scope of this PR.


logger.info(f"Classification skipped - Response: {json.dumps(response, default=str)}")
return response
if not normalize_boolean_value(classification_config.get('tuning', False)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unresolved.. Logic needs fixed, and let's get rid of 'tuning' and revert to the simple 'enabled' flag (same as we have for assessment and summarization)


logger.info(f"OCR skipped - Response: {json.dumps(response, default=str)}")
return response
if not normalize_boolean_value(ocr_config.get('tuning', False)):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my earlier comments.. I think we have unfortunately diverged in understanding the scope of this PR.. Can we get back to scoping it to simply allow classification and extraction to be optionally disabled through an 'enable' flag in the config.

required:
- features
properties:
tuning:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not resolved - reopening.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants