typing: graduate opencontractserver.tasks.* from mypy baseline#1541
typing: graduate opencontractserver.tasks.* from mypy baseline#1541
Conversation
…#1482) Continues draining the mypy baseline established in #1331. Graduates all sixteen task modules out of the baseline. Removed from mypy.ini - opencontractserver.tasks.{agent_tasks, analyzer_tasks, badge_tasks, corpus_tasks, data_extract_tasks, doc_analysis_tasks, doc_tasks, embeddings_task, export_tasks, export_tasks_v2, extract_orchestrator_tasks, fork_tasks, import_tasks, import_tasks_v2, lookup_tasks, memory_tasks} Pruned from docs/typing/mypy_baseline.txt - 227 lines (6813 -> 6586) Highlights of the per-file fixes - badge_tasks: dropped the `User = get_user_model()` runtime-call-as-type pattern in favour of a direct ``from opencontractserver.users.models import User`` (mirrors the conftest pattern used elsewhere). - doc_analysis_tasks: filter Anthropic ``response.content`` blocks via ``hasattr(resp, "text")`` so list comprehensions on the union type are legal; renamed the accumulator to avoid the ``str | list`` shadowing. - analyzer_tasks: tightened ``request_gremlin_manifest`` signature to ``int`` and coerce ``add(*doc_ids)`` to ints since ``analyzed_documents`` rejects str. - data_extract_tasks: scoped ``# type: ignore[valid-type]`` on the ``output_type = list[output_type]`` runtime-parametric pattern; scoped ``# type: ignore[typeddict-item]`` on two ``TypedDict(**dict)`` expansions that mypy cannot prove cover all required keys. - doc_tasks: validate FieldFile.name before passing to Storage.open; threaded the ``parent_id`` field through the FUNSD annotation dict so it satisfies the TypedDict; cast ``page_data.bounds`` to BoundingBoxPythonType where the dataclass field is widened to ``dict[str, int | float]``; widened ``annotation_map`` keys to ``int | str`` to match the function's declared return type; coerced ``file_type`` to FileTypeEnum before calling ``get_components_by_mimetype``. - embeddings_task: cast ``_create_embedding_for_annotation`` to the ``HasEmbeddingMixin``-typed callable signature expected by ``_apply_dual_embedding_strategy``; explicit annotation ``result: dict[str, Any]`` so the mixed-shape result dict cooperates with subscripted update / append. - export_tasks(_v2): handled None returns from ``package_corpus_for_export`` / ``package_label_set_for_export`` by raising RuntimeError before constructing the TypedDict; replaced ``output_bytes`` (BytesIO) with ContentFile when calling FieldFile.save; renamed the inner FUNSD page-data variable to keep its ``tuple[int, str, str]`` type distinct from the bytes payload; fixed the ``UserExport.error`` typo (the actual field is ``errors``). - import_tasks(_v2): swapped ``AbstractBaseUser`` for the project's concrete ``User`` (same pattern as corpuses/folder_service); tightened the V2-only branch so ``data_json.get(...)`` returns the V2-typed values instead of the V1∩V2 ``object`` lower bound; widened aggregator dict to ``dict[str | int, int]`` to match ``import_doc_annotations``'s declared return type. - fork_tasks: replaced implicit ``Optional[list[str]] = None`` parameter defaults; guarded FieldFile.name accesses; annotated the empty ``folder_map`` / ``structural_set_map`` accumulators. - corpus_tasks: dropped a stale ``return summary`` from the ``-> None`` task; coerced the ``str | int`` celery args to ``int`` once at function entry; renamed the inner ``execution`` loop variable to avoid shadowing the outer ``for doc_id, execution in execution_map.items()`` binding; scoped ``# type: ignore[attr-defined]`` on Celery's dynamically-attached ``Signature.s`` / ``.si`` attributes. - extract_orchestrator_tasks: early-return when ``extract_id`` is None; scoped ``# type: ignore[attr-defined]`` on ``task_func.si``. Cross-cutting type tightening - ``LabelLookupPythonType.{text_labels, doc_labels}`` narrowed from ``dict[str | int, AnnotationLabelPythonType]`` to ``dict[str, ...]`` (the producer in ``utils.etl.build_label_lookups`` already keys by ``str(pk)``). Updated the ``corpus_id`` parameter on ``build_label_lookups`` from ``str`` to ``int`` to match every call site. - ``utils.importing.load_or_create_labels`` and ``utils.importing.import_doc_annotations`` now accept ``Mapping`` instead of ``dict``, so callers can pass ``OpenContractDocExport`` / ``AnnotationLabelPythonType`` TypedDicts directly without ``cast()``. - ``import_tasks._validate_sidecar_schema`` and ``_apply_sidecar_annotations`` accept ``str | None`` for the ``sidecar_path`` argument since the caller passes a fallback name that can be None. Verification - python -m mypy --config-file mypy.ini opencontractserver config -> "Success: no issues found in 1032 source files" - All 16 graduated modules import cleanly under config.settings.mypy.
Code Review: PR #1541 — Graduate
|
Codecov Report❌ Patch coverage is 📢 Thoughts on this report? Let us know! |
Fixes: - Apply pyupgrade rewrites to import_tasks_v2.py - Fix FUNSD annotation_map key type regression: doc_tasks.py now uses int keys consistently, export_tasks.py looks up by int (not str(index)), and all return/parameter type annotations updated accordingly - Narrow dict[str|int,...] to dict[str,...] in etl.build_label_lookups and build_document_export (keys are always str(pk)) - Cast get_component_by_name() results in embeddings.py, embeddings_task.py and doc_tasks.py to the specific subtype (BaseEmbedder / BaseParser / BaseThumbnailGenerator) to satisfy mypy - Append FunsdAnnotationType directly rather than via dict() to preserve the TypedDict type
Tests added:
- test_convert_doc_to_funsd_annotation_map_uses_int_keys: regression guard
that annotation_map keys are int, not str (FUNSD key-type fix)
- test_convert_doc_to_funsd_empty_doc_returns_empty_map: empty-annotation path
- test_export_tasks_typing_fixes.py (new file):
- FinalizeExportTestCase: ContentFile save round-trip, seekable-buffer behaviour
- OnDemandPostProcessorsTestCase: RuntimeError on missing export file,
no-op when post_processors is empty
- PackageFunsdExportsIntKeyTestCase: int-key annotation lookup (regression),
empty-annotation pages, multi-page multi-annotation packaging
- PackageCorpusExportV2ErrorsFieldTestCase: export.errors field populated
on failure (was non-existent export.error before the fix)
Code Review — PR #1541: Graduate
|
- export_tasks_v2.py: reorder so the None check on package_corpus_for_export happens before the cast, allowing mypy to actually narrow the type (review #1541 nit). Cast is then a runtime no-op for the narrowed value. - test_export_tasks_typing_fixes.py: black reformat picked up by CI.
Code Review: PR #1541 — Graduate
|
Code Review — PR #1541: Graduate
|
…-8mA5R # Conflicts: # CHANGELOG.md
…ignore, rename UserModel alias - import_tasks_v2.py: walrus comprehension drops the redundant 'str(old_id) in annot_id_map' guard. dict.get returns None for unknown keys, so the 'is not None' check on the walrus result is sufficient (one dict lookup per element instead of two). - data_extract_tasks.py: explain why temperature=extract_temperature carries '# type: ignore[arg-type]'. The downstream stub declares temperature: float; we intentionally pass None for Claude models so the temperature=0 override engages. Tracked in #1381. - import_tasks.py: rename TYPE_CHECKING 'User as AbstractBaseUser' to 'User as UserModel' for consistency with import_tasks_v2.py. The alias is required because the runtime 'User = get_user_model()' would otherwise collide with the type-only import.
Code Review: PR #1541 — Typing graduation for
|
…GELOG Three follow-ups from the Claude review on PR #1541: - Replace assert s3 is not None in export_tasks.py with a real RuntimeError guard. Asserts are stripped under python -O and Celery tasks may run with optimisations enabled in production, so the assert effectively no-ops in the AWS branch. The new guard fails loud with a descriptive message — defence-in-depth for a branch that should be unreachable but covers any future ordering changes to the s3 initialisation above. - Add # TODO(#1482) references to the three # type: ignore suppressions in import_tasks_v2.py. The dict<->TypedDict bridging is not a permanent state — it goes away once unpack_label_set_from_export / unpack_corpus_from_export / prepare_import_labels widen their signatures to Mapping[str, Any] in the next typing-graduation chunk. The TODO refs ensure these ignores can't quietly outlive the issue they belong to. - Drop the stray double blank line between the new tasks-graduation changelog entry and the Docker hardening entry. The project norm is one blank line between entries.
Code ReviewOverviewThis PR graduates all 16 Real Bugs Fixed (worth calling out explicitly)These are not just annotation cleanup — they fix actual runtime or data-correctness problems:
FUNSD key type mismatch —
Type Narrowing Quality
Minor Issues / Suggestions
if not doc.pawls_parse_file.name or not doc.pdf_file.name:
raise ValueError(f"Document {doc_id} is missing pawls_parse_file or pdf_file")This is a behavioral improvement (fail-fast with a clear message vs. a confusing
Test CoverageThe new tests are solid:
One gap: there's no test for the SummaryHigh-quality typing graduation PR with real bug fixes bundled in. The three |
Closes #1482. Continues draining the mypy baseline established in #1331 (tracker: #1447).
Summary
Graduates all sixteen
opencontractserver.tasks.*modules out of the mypy baseline. The package is now type-clean.Removed from
mypy.ini— sixteen[mypy-…]blocks:agent_tasks,analyzer_tasks,badge_tasks,corpus_tasks,data_extract_tasks,doc_analysis_tasks,doc_tasks,embeddings_task,export_tasks,export_tasks_v2,extract_orchestrator_tasks,fork_tasks,import_tasks,import_tasks_v2,lookup_tasks,memory_tasks.Pruned from
docs/typing/mypy_baseline.txt— 227 lines (6813 → 6586).Notable per-file fixes
badge_tasks: dropped theUser = get_user_model()runtime-call-as-type pattern in favour of a directfrom opencontractserver.users.models import User(mirrors theconftest.pypattern used elsewhere in the project).doc_analysis_tasks: filter Anthropicresponse.contentblocks viahasattr(resp, "text")so list comprehensions over the union type are legal; renamed the accumulator to avoidstr | listshadowing.analyzer_tasks: tightenedrequest_gremlin_manifestsignature tointand coercedadd(*doc_ids)to ints sinceanalyzed_documentsrejectsstr.data_extract_tasks: scoped# type: ignore[valid-type]on theoutput_type = list[output_type]runtime-parametric pattern; scoped# type: ignore[typeddict-item]on twoTypedDict(**dict)expansions that mypy cannot prove cover all required keys.doc_tasks: validateFieldFile.namebeforeStorage.open; threadparent_idthrough the FUNSD annotation dict so it satisfies the TypedDict; castpage_data.boundstoBoundingBoxPythonType; widenannotation_mapkeys toint | strto match the declared return type; coercefile_typetoFileTypeEnumbeforeget_components_by_mimetype.embeddings_task: cast_create_embedding_for_annotationto theHasEmbeddingMixin-typed callable signature expected by_apply_dual_embedding_strategy; explicitresult: dict[str, Any]annotation so subscripted update / append cooperate with the mixed-shape result dict.export_tasks/export_tasks_v2: handleNonereturns frompackage_corpus_for_export/package_label_set_for_exportwithRuntimeErrorbefore constructing the TypedDict; replaceoutput_bytes(BytesIO) withContentFilewhen callingFieldFile.save; rename inner FUNSD page-data variable to keep itstuple[int, str, str]type distinct from the bytes payload; fixUserExport.errortypo (the actual field iserrors).import_tasks/import_tasks_v2: swapAbstractBaseUserfor the project's concreteUser(same pattern ascorpuses/folder_service); narrow the V2-only branch sodata_json.get(...)returns V2-typed values instead of the V1∩V2objectlower bound; widen aggregator dict todict[str | int, int]to matchimport_doc_annotations's return type.fork_tasks: replace implicitOptional[list[str]] = Nonedefaults; guardFieldFile.nameaccesses; annotate emptyfolder_map/structural_set_mapaccumulators.corpus_tasks: drop a stalereturn summaryfrom a-> Nonetask; coerce thestr | intCelery args tointonce at function entry; rename innerexecutionvariable to avoid shadowing the outer loop binding; scoped# type: ignore[attr-defined]on Celery's dynamically-attachedSignature.s/.siattributes.extract_orchestrator_tasks: early-return whenextract_idisNone; scoped# type: ignore[attr-defined]ontask_func.si.Cross-cutting type tightening
LabelLookupPythonType.{text_labels, doc_labels}narrowed fromdict[str | int, AnnotationLabelPythonType]todict[str, ...](the producer inutils.etl.build_label_lookupsalready keys bystr(pk)).build_label_lookups(corpus_id: ...)changed fromstrtointto match every call site.utils.importing.load_or_create_labelsandutils.importing.import_doc_annotationsnow acceptMappinginstead ofdict, so callers can passOpenContractDocExport/AnnotationLabelPythonTypeTypedDicts directly withoutcast().import_tasks._validate_sidecar_schemaand_apply_sidecar_annotationsacceptstr | Noneforsidecar_pathsince callers may pass a fallback that isNone.Verification
python -m mypy --config-file mypy.ini opencontractserver config→Success: no issues found in 1032 source files.config.settings.mypy.Test plan
opencontractserver/tasks/lines:grep -c "opencontractserver/tasks/" docs/typing/mypy_baseline.txt→0.mypy.inino longer contains any[mypy-opencontractserver.tasks.*]blocks.https://claude.ai/code/session_01Mze1ELoLJu4feUTYWFq7Wt
Generated by Claude Code