Feature/178 improve performance of fetching missing issues #187
Conversation
- Added multiple threads to fetching of missing issues.
- Added multiple threads to record creation.
Walkthrough

Introduces parallelism in two areas: fetching missing parent issues in miner.py using ThreadPoolExecutor with repository pre-caching and robust error handling, and record construction in default_record_factory.py via a parallel builder that classifies and builds records, updating the generation flow and logging. New helper methods and updated signatures are added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Miner
    participant Data
    participant RepoCache as Repo Cache
    participant Pool as ThreadPoolExecutor
    participant API as GitHub API

    Miner->>Data: Read parents_sub_issues, origin_issue_ids
    Miner->>RepoCache: _fetch_all_repositories_in_cache(data)
    Miner->>Pool: Submit workers for parents not in origin_issue_ids
    loop For each parent
        Pool->>RepoCache: Resolve org/repo (validate exists)
        alt Repo missing/error
            Pool-->>Miner: Signal error / skip
        else Repo ok
            Pool->>API: get_issue(org/repo, num)
            alt Not found / error
                Pool-->>Miner: Signal error / skip
            else Issue fetched
                Pool-->>Miner: Return (Issue, Repository) if criteria met<br/>(closed_at present, since bounds ok)
                Pool-->>Miner: Mark for removal if not meeting criteria
            end
        end
    end
    Miner->>Data: Prune non-qualifying parent IDs
    Miner-->>Miner: Collect dict[Issue, Repository] and return
```
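To make the diagram concrete, here is a minimal sketch of the fetch loop it describes, assuming the four-tuple `worker(parent_id) -> (parent_id, issue, repo, error)` discussed in the nitpick comments below; collection names are hypothetical, and this is not the PR's exact code.

```python
# Minimal sketch of the parallel fetch flow; `worker` is the four-tuple
# worker described in the review comments below, passed in for clarity.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_missing_parents(parent_ids, worker, max_workers=8):
    fetched = {}      # Issue -> Repository, returned to the caller
    to_remove = []    # parents that did not meet the closed_at/since criteria
    errors = []       # (parent_id, message) pairs, logged and skipped

    with ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="fetch-parent") as pool:
        futures = [pool.submit(worker, pid) for pid in parent_ids]
        for future in as_completed(futures):
            parent_id, issue, repo, error = future.result()
            if error is not None:
                errors.append((parent_id, error))
            elif issue is None:
                to_remove.append(parent_id)   # mark for removal, prune later
            else:
                fetched[issue] = repo

    # mutations (pruning) happen only after the pool has drained
    return fetched, to_remove, errors
```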
```mermaid
sequenceDiagram
    autonumber
    participant Factory as DefaultRecordFactory
    participant Data
    participant Pool as ThreadPoolExecutor
    participant Records as Records Map

    Factory->>Data: Enumerate issues
    Factory->>Pool: build_issue_records_parallel(...)
    par Parallel classify/build
        Pool-->>Records: Build hierarchy records
        Pool-->>Records: Build sub-issue records (mark cross-repo when applicable)
        Pool-->>Records: Build plain issue records
    end
    Factory->>Factory: Update self._records and registered issues
    Factory-->>Factory: Log progress (info)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (6)
release_notes_generator/data/miner.py (4)
155-159: Consider making `max_workers` configurable.

The default value of 8 workers is reasonable, but consider exposing this as a configuration parameter (e.g., via `ActionInputs`) to allow users to tune parallelism based on their GitHub API rate limits and system resources.
174-182: Redundant check in `should_fetch` function.

Line 180 checks `if issue.closed_at and data.since > issue.closed_at`, but line 176 already returns `False` if `not issue.closed_at`. The second check of `issue.closed_at` is unnecessary. Apply this diff to simplify the logic:

```diff
 def should_fetch(issue: Issue) -> bool:
     # Mirrors original logic
     if not issue.closed_at:
         return False
     if data.since:
-        # if since > closed_at => skip
-        if issue.closed_at and data.since > issue.closed_at:
+        if data.since > issue.closed_at:
             return False
     return True
```
184-213: Consider using a named tuple or dataclass for worker return.

The worker function returns a 4-element tuple `(parent_id, issue, repo, error)`. A named tuple or dataclass would improve readability and make the code more maintainable. Example using a dataclass:

```python
from dataclasses import dataclass

@dataclass
class FetchResult:
    parent_id: str
    issue: Optional[Issue] = None
    repo: Optional[Repository] = None
    error: Optional[str] = None
```

Then update the worker signature:

```diff
-def worker(parent_id: str) -> tuple[str, Optional[Issue], Optional[Repository], Optional[str]]:
+def worker(parent_id: str) -> FetchResult:
     """
-    Returns (parent_id, issue|None, repo|None, error|None)
+    Returns FetchResult with appropriate fields set
     - issue=None & error=None => mark for remove (didn't meet criteria)
     - issue=None & error!=None => log error
     """
     try:
         org, repo, num = parse_issue_id(parent_id)
     except Exception as e:  # defensive
-        return (parent_id, None, None, f"parse_issue_id failed: {e}")
+        return FetchResult(parent_id, error=f"parse_issue_id failed: {e}")
```
246-250: Defensive type check may be unnecessary.

The `isinstance(sub_issues, list)` check assumes `parents_sub_issues` values could be non-list, but according to `MinedData` type hints, it's defined as `dict[str, list[str]]`. This defensive check adds safety but may be redundant. If the type is guaranteed to be `list[str]`, you could simplify:

```diff
 for sub_issues in data.parents_sub_issues.values():
-    # parents_sub_issues can be dict[str, list[str]] or now dict[str, str] per your later change;
-    # if it's list[str], this removal is ok; if changed to str, guard it.
-    if isinstance(sub_issues, list) and iid in sub_issues:
+    if iid in sub_issues:
         sub_issues.remove(iid)
```

release_notes_generator/record/factory/default_record_factory.py (2)
74-78: Consider making `max_workers` configurable.

The hardcoded `max_workers=8` should be exposed as a configuration parameter to allow tuning based on system resources and API constraints, similar to the recommendation for miner.py. Consider adding to `ActionInputs`:

```python
# In action_inputs.py
def get_max_workers() -> int:
    """Get the maximum number of workers for parallel processing."""
    user_input = get_action_input("MAX_WORKERS", "8")
    return int(user_input)
```

Then use it:

```diff
-built = build_issue_records_parallel(self, data, max_workers=8)
+built = build_issue_records_parallel(self, data, max_workers=ActionInputs.get_max_workers())
```
300-332: Improve parameter naming for clarity.

The parameter `gen` is not descriptive. Consider renaming it to `factory` to better indicate its role as a `DefaultRecordFactory` instance. Apply this diff:

```diff
-def build_issue_records_parallel(gen, data, max_workers: int = 8) -> dict[str, "Record"]:
+def build_issue_records_parallel(factory: "DefaultRecordFactory", data: MinedData, max_workers: int = 8) -> dict[str, "Record"]:
     """
-    Build issue records in parallel with no side effects on `gen`.
+    Build issue records in parallel with no side effects on `factory`.
     Returns: {iid: Record}
     """
     parents_sub_issues = data.parents_sub_issues  # read-only snapshot for this phase
     all_sub_issue_ids = {iid for subs in parents_sub_issues.values() for iid in subs}
     issues_items = list(data.issues.items())  # snapshot

     def _classify_and_build(issue, repo) -> tuple[str, "Record"]:
         iid = get_id(issue, repo)

         # classification
         if len(parents_sub_issues.get(iid, [])) > 0:
             # hierarchy node (has sub-issues)
-            rec = gen.build_record_for_hierarchy_issue(issue)
+            rec = factory.build_record_for_hierarchy_issue(issue)
         elif iid in all_sub_issue_ids:
             # leaf sub-issue
-            rec = gen.build_record_for_sub_issue(issue, iid)
+            rec = factory.build_record_for_sub_issue(issue, iid)
         else:
             # plain issue
-            rec = gen.build_record_for_issue(issue)
+            rec = factory.build_record_for_issue(issue)
         return iid, rec
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- release_notes_generator/data/miner.py (3 hunks)
- release_notes_generator/record/factory/default_record_factory.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
release_notes_generator/data/miner.py (2)
release_notes_generator/model/mined_data.py (3)
- MinedData (38-81)
- get_repository (67-72)
- add_repository (61-65)

release_notes_generator/utils/record_utils.py (2)
- parse_issue_id (54-62)
- get_id (24-45)
release_notes_generator/record/factory/default_record_factory.py (6)
release_notes_generator/utils/record_utils.py (1)
- get_id (24-45)

release_notes_generator/model/issue_record.py (2)
- issue (63-68)
- IssueRecord (16-335)

release_notes_generator/model/record.py (4)
- Record (29-218)
- skip (73-75)
- is_cross_repo (55-61)
- is_cross_repo (64-70)

release_notes_generator/action_inputs.py (1)
- get_skip_release_notes_labels (209-217)

release_notes_generator/model/hierarchy_issue_record.py (1)
- HierarchyIssueRecord (16-169)

release_notes_generator/model/sub_issue_record.py (1)
- SubIssueRecord (15-26)
🪛 Ruff (0.13.3)
release_notes_generator/data/miner.py
192-192: Do not catch blind exception: Exception
(BLE001)
203-203: Do not catch blind exception: Exception
(BLE001)
226-226: Do not catch blind exception: Exception
(BLE001)
231-231: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
🔇 Additional comments (11)
release_notes_generator/data/miner.py (5)
24-24: LGTM! Appropriate import for parallelization.

The addition of `ThreadPoolExecutor` and `as_completed` from `concurrent.futures` aligns with the PR objective to introduce parallel processing for fetching missing issues.
100-101: LGTM! Smart pre-caching approach.

Pre-fetching all repositories before parallel processing is a good strategy to avoid potential race conditions and redundant fetches during concurrent execution.
133-154: LGTM! Robust repository pre-caching implementation.

The method correctly:
- Parses issue IDs from both keys and values of `parents_sub_issues`
- Checks cache before fetching
- Handles fetch failures gracefully with logging
- Avoids duplicate fetches

A minimal sketch of this pattern is shown below.
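The sketch assumes the `MinedData` helpers listed in the code graph (`get_repository`, `add_repository`) and the `parse_issue_id` utility; the `gh_client` parameter and exact helper signatures are assumptions, not the PR's code.

```python
# Sketch of repository pre-caching before the parallel phase.
import logging

logger = logging.getLogger(__name__)

def fetch_all_repositories_in_cache(data, gh_client) -> None:
    issue_ids = list(data.parents_sub_issues.keys())
    for subs in data.parents_sub_issues.values():
        issue_ids.extend(subs)                       # keys *and* values

    for iid in issue_ids:
        org, repo, _num = parse_issue_id(iid)
        full_name = f"{org}/{repo}"
        if data.get_repository(full_name) is not None:
            continue                                 # already cached
        try:
            data.add_repository(gh_client.get_repo(full_name))
        except Exception as exc:                     # log failure and continue
            logger.warning("Failed to fetch repository %s: %s", full_name, exc)
```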
192-193: Bare exception catches are appropriate here.

Static analysis flagged the bare `Exception` catches at lines 192, 203, and 226. However, these are justified:
- Line 192-193: Defensive parsing of potentially malformed issue IDs
- Line 203: Robust handling of GitHub API call failures (network, rate limits, etc.)
- Line 226: Catching any exception from future result retrieval
These broad catches ensure the parallel processing doesn't crash from individual item failures, which is the correct behavior for a data mining pipeline.
Based on static analysis hints.
Also applies to: 203-203, 226-226
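The per-item isolation these catches provide can be sketched as follows; the `# noqa: BLE001` markers would also silence the Ruff warnings above. Names here are hypothetical, and the PyGithub-style `get_repo`/`get_issue` calls stand in for the real fetch path.

```python
# Per-item failure isolation: each worker converts its own exceptions into
# an error value instead of letting them escape the pool.
def worker(parent_id: str):
    try:
        org, repo, num = parse_issue_id(parent_id)
    except Exception as exc:  # noqa: BLE001 - defensive: malformed IDs must not kill the run
        return (parent_id, None, None, f"parse_issue_id failed: {exc}")
    try:
        issue = gh_client.get_repo(f"{org}/{repo}").get_issue(number=num)
    except Exception as exc:  # noqa: BLE001 - network/rate-limit failures are per-item
        return (parent_id, None, None, f"get_issue failed: {exc}")
    return (parent_id, issue, None, None)
```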
220-257: LGTM! Well-structured parallel processing with proper error handling.

The implementation correctly:
- Uses ThreadPoolExecutor with descriptive thread names
- Collects results and errors separately
- Applies mutations (removals) after parallel processing to avoid race conditions
- Logs comprehensive statistics
- Handles both successful fetches and failures gracefully
The deferred removal strategy (lines 243-250) is particularly good practice.
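That deferred-removal strategy boils down to the following shape (a sketch; `to_remove` and the pruning targets are assumed names based on the review, not the PR's exact code):

```python
# Deferred removal: workers only *report* non-qualifying IDs; the shared
# structures are mutated after the parallel phase has completed.
for iid in to_remove:
    data.parents_sub_issues.pop(iid, None)          # drop the parent entry
    for sub_issues in data.parents_sub_issues.values():
        if iid in sub_issues:
            sub_issues.remove(iid)                  # unlink from any parent
```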
release_notes_generator/record/factory/default_record_factory.py (6)
22-22: LGTM! Appropriate import for parallelization.

Adding `ThreadPoolExecutor` supports the parallel record building implementation, aligning with the PR objective to improve performance of record creation.
83-83: LGTM! Improved observability with info-level logging.

Promoting these log statements from debug to info level provides better visibility into the record generation pipeline's progress, which is especially valuable for parallel processing.
Also applies to: 88-88, 92-92, 98-98
249-262: LGTM! Well-structured builder method for hierarchy issues.

The method:
- Properly handles optional labels with fallback to type-based labels
- Correctly evaluates skip logic using `ActionInputs.get_skip_release_notes_labels()`
- Returns the appropriate `HierarchyIssueRecord` type
264-282: LGTM! Proper cross-repo detection logic.

The method correctly:
- Builds a `SubIssueRecord` with appropriate skip logic
- Detects cross-repo issues by comparing the repository full name from the issue ID (line 280)
- Sets the `is_cross_repo` flag when applicable

The cross-repo detection preserves the original behavior as noted in the AI summary. A hedged sketch of that detection step follows.
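The sketch assumes `parse_issue_id` from record_utils; `home_repository` and `_should_skip` are assumed names introduced for illustration, not necessarily the PR's.

```python
# Cross-repo detection: compare the repo encoded in the issue ID with the
# repository the release notes are generated for.
def build_record_for_sub_issue(self, issue, iid):
    rec = SubIssueRecord(issue, skip=self._should_skip(issue))  # _should_skip: assumed helper
    org, repo, _num = parse_issue_id(iid)
    if f"{org}/{repo}" != self.home_repository.full_name:       # home_repository: assumed attribute
        rec.is_cross_repo = True   # sub-issue lives in another repository
    return rec
```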
284-297: LGTM! Clean builder method for plain issues.

The method follows the same pattern as the other builders, maintaining consistency across the codebase.
309-331: LGTM! Efficient parallel record building implementation.

The parallel builder:
- Creates read-only snapshots to avoid race conditions
- Correctly classifies issues into hierarchy/sub-issue/plain categories
- Uses `ThreadPoolExecutor.map` for clean parallel execution
- Preserves all original logic while improving performance

The classification logic (lines 313-321) accurately replicates the sequential branching that was replaced.
Release Notes:

Summary by CodeRabbit
- New Features
- Performance
- Reliability
- Bug Fixes

Fixes #178
Fixes #179