
Conversation

@miroslavpojer
Collaborator

@miroslavpojer miroslavpojer commented Oct 5, 2025

Release Notes:

  • Added multi-threaded fetching of missing issues.
  • Added multi-threaded record creation.

Summary by CodeRabbit

  • New Features

    • Parallel processing for fetching sub-issues and building records.
    • Cross-repo sub-issues are recognized and linked.
    • More informative progress logs at info level.
  • Performance

    • Faster syncing and record generation through concurrency and smarter caching of repositories.
  • Reliability

    • Robust error handling prevents failures from stopping processing.
    • Invalid or non-qualifying parent links are pruned to keep data clean.
    • Ordering adjustments reduce race conditions.
  • Bug Fixes

    • Avoids repeated fetches and stale references.
    • Handles missing repositories or issues without crashing.

Fixes #178
Fixes #179

- Added multi-threaded fetching of missing issues.
- Added multi-threaded record creation.
@miroslavpojer miroslavpojer self-assigned this Oct 5, 2025
@coderabbitai

coderabbitai bot commented Oct 5, 2025

Walkthrough

Introduces parallelism in two areas: fetching missing parent issues in miner.py using ThreadPoolExecutor with repository pre-caching and robust error handling; and record construction in default_record_factory.py via a parallel builder that classifies and builds records, updating generation flow and logging. New helper methods and updated signatures are added.
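
As a rough illustration of the fetch pipeline described above (a minimal sketch, not the project's actual miner.py code; the get_issue callable and the exact worker shape are assumptions modeled on the review excerpts below):

from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_missing_issues(parent_ids, get_issue, max_workers=8):
    """Sketch: one worker per parent id; results are collected as they
    complete; a failure in one item never aborts the batch."""
    results, errors = {}, []

    def worker(parent_id):
        try:
            return parent_id, get_issue(parent_id), None
        except Exception as exc:  # broad by design: isolate per-item failures
            return parent_id, None, str(exc)

    with ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="issue-fetch") as pool:
        futures = [pool.submit(worker, pid) for pid in parent_ids]
        for future in as_completed(futures):
            parent_id, issue, error = future.result()
            if error:
                errors.append((parent_id, error))
            else:
                results[parent_id] = issue
    return results, errors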

Changes

Parallel missing-issue fetching in miner (release_notes_generator/data/miner.py)
  Replaces the linear _fetch_missing_issues with concurrent fetching via ThreadPoolExecutor; adds the repository pre-cache _fetch_all_repositories_in_cache(data); introduces a worker pipeline with validation, error handling, filtering by criteria, and deferred removals. The method signature becomes _fetch_missing_issues(self, data: MinedData, max_workers: int = 8) -> dict[Issue, Repository].

Parallel record building in default record factory (release_notes_generator/record/factory/default_record_factory.py)
  Adds a parallel builder build_issue_records_parallel(gen, data, max_workers: int = 8) -> dict[str, "Record"]; refactors generation to use it; introduces build_record_for_hierarchy_issue, build_record_for_sub_issue, and build_record_for_issue; promotes progress logging to info level; removes or redirects the legacy specialized creators.

Sequence Diagram(s)

sequenceDiagram
  autonumber
  participant Miner
  participant Data
  participant RepoCache as Repo Cache
  participant Pool as ThreadPoolExecutor
  participant API as GitHub API

  Miner->>Data: Read parents_sub_issues, origin_issue_ids
  Miner->>RepoCache: _fetch_all_repositories_in_cache(data)
  Miner->>Pool: Submit workers for parents not in origin_issue_ids
  loop For each parent
    Pool->>RepoCache: Resolve org/repo (validate exists)
    alt Repo missing/error
      Pool-->>Miner: Signal error / skip
    else Repo ok
      Pool->>API: get_issue(org/repo, num)
      alt Not found / error
        Pool-->>Miner: Signal error / skip
      else Issue fetched
        Pool-->>Miner: Return (Issue, Repository) if criteria met<br/>(closed_at present, since bounds ok)
        Pool-->>Miner: Mark for removal if not meeting criteria
      end
    end
  end
  Miner->>Data: Prune non-qualifying parent IDs
  Miner-->>Miner: Collect dict[Issue, Repository] and return

sequenceDiagram
  autonumber
  participant Factory as DefaultRecordFactory
  participant Data
  participant Pool as ThreadPoolExecutor
  participant Records as Records Map

  Factory->>Data: Enumerate issues
  Factory->>Pool: build_issue_records_parallel(...)
  par Parallel classify/build
    Pool-->>Records: Build hierarchy records
    Pool-->>Records: Build sub-issue records (mark cross-repo when applicable)
    Pool-->>Records: Build plain issue records
  end
  Factory->>Factory: Update self._records and registered issues
  Factory-->>Factory: Log progress (info)

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested labels

enhancement

Suggested reviewers

  • Zejnilovic
  • benedeki

Poem

I thump my paws—threads race in a row,
Fetching parents swiftly, off we go!
Records sprout like clover in the sun,
Parallel paths, the warren’s run.
With caches packed and logs that sing,
Release-note baskets—spring! 🥕✨

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings)

Description Check: ⚠️ Warning
  Explanation: The pull request description does not follow the repository's required template: it lacks the entire checklist section and the Additional Context heading, and still contains the unused "{PR Summary}" placeholder.
  Resolution: Update the description to use the provided template by adding the checklist, including the Additional Context section, and replacing the "{PR Summary}" placeholder with an actual summary of changes.

Docstring Coverage: ⚠️ Warning
  Explanation: Docstring coverage is 61.11%, below the required threshold of 80.00%.
  Resolution: Run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (3 passed)

Title Check: ✅ Passed. The title succinctly highlights the primary change by stating the performance improvement for fetching missing issues, which directly aligns with the core objective of issue #178 without introducing unrelated details or unnecessary noise.
Linked Issues Check: ✅ Passed. The changes introduce ThreadPoolExecutor-based parallel fetching of missing issues and parallel record creation, directly satisfying the objectives of linked issues #178 and #179 by replacing sequential flows with concurrent processing.
Out of Scope Changes Check: ✅ Passed. All modifications pertain exclusively to introducing parallelism in missing-issue retrieval and record building, with no unrelated or out-of-scope changes detected.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment
  • Commit unit tests in branch feature/178-Improve-performance-of-fetching-missing-issues

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (6)
release_notes_generator/data/miner.py (4)

155-159: Consider making max_workers configurable.

The default value of 8 workers is reasonable, but consider exposing this as a configuration parameter (e.g., via ActionInputs) to allow users to tune parallelism based on their GitHub API rate limits and system resources.
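
A hypothetical wiring, assuming an ActionInputs.get_max_workers accessor like the one suggested for the factory later in this review:

# hypothetical accessor; see the ActionInputs suggestion further down
issues = self._fetch_missing_issues(data, max_workers=ActionInputs.get_max_workers())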


174-182: Redundant check in should_fetch function.

Line 180 checks if issue.closed_at and data.since > issue.closed_at, but line 176 already returns False if not issue.closed_at. The second check of issue.closed_at is unnecessary.

Apply this diff to simplify the logic:

 def should_fetch(issue: Issue) -> bool:
     # Mirrors original logic
     if not issue.closed_at:
         return False
     if data.since:
-        # if since > closed_at => skip
-        if issue.closed_at and data.since > issue.closed_at:
+        if data.since > issue.closed_at:
             return False
     return True

184-213: Consider using a named tuple or dataclass for worker return.

The worker function returns a 4-element tuple (parent_id, issue, repo, error). A named tuple or dataclass would improve readability and make the code more maintainable.

Example using a dataclass:

from dataclasses import dataclass
from typing import Optional

@dataclass
class FetchResult:
    parent_id: str
    issue: Optional[Issue] = None
    repo: Optional[Repository] = None
    error: Optional[str] = None

Then update the worker signature:

-def worker(parent_id: str) -> tuple[str, Optional[Issue], Optional[Repository], Optional[str]]:
+def worker(parent_id: str) -> FetchResult:
     """
-    Returns (parent_id, issue|None, repo|None, error|None)
+    Returns FetchResult with appropriate fields set
     - issue=None & error=None  => mark for remove (didn't meet criteria)
     - issue=None & error!=None => log error
     """
     try:
         org, repo, num = parse_issue_id(parent_id)
     except Exception as e:  # defensive
-        return (parent_id, None, None, f"parse_issue_id failed: {e}")
+        return FetchResult(parent_id, error=f"parse_issue_id failed: {e}")

246-250: Defensive type check may be unnecessary.

The isinstance(sub_issues, list) check assumes parents_sub_issues values could be non-list, but according to MinedData type hints, it's defined as dict[str, list[str]]. This defensive check adds safety but may be redundant.

If the type is guaranteed to be list[str], you could simplify:

 for sub_issues in data.parents_sub_issues.values():
-    # parents_sub_issues can be dict[str, list[str]] or now dict[str, str] per your later change;
-    # if it's list[str], this removal is ok; if changed to str, guard it.
-    if isinstance(sub_issues, list) and iid in sub_issues:
+    if iid in sub_issues:
         sub_issues.remove(iid)
release_notes_generator/record/factory/default_record_factory.py (2)

74-78: Consider making max_workers configurable.

The hardcoded max_workers=8 should be exposed as a configuration parameter to allow tuning based on system resources and API constraints, similar to the recommendation for miner.py.

Consider adding to ActionInputs:

# In action_inputs.py
def get_max_workers() -> int:
    """Get the maximum number of workers for parallel processing."""
    user_input = get_action_input("MAX_WORKERS", "8")
    return int(user_input)

Then use it:

-built = build_issue_records_parallel(self, data, max_workers=8)
+built = build_issue_records_parallel(self, data, max_workers=ActionInputs.get_max_workers())

300-332: Improve parameter naming for clarity.

The parameter gen is not descriptive. Consider renaming it to factory to better indicate its role as a DefaultRecordFactory instance.

Apply this diff:

-def build_issue_records_parallel(gen, data, max_workers: int = 8) -> dict[str, "Record"]:
+def build_issue_records_parallel(factory: "DefaultRecordFactory", data: MinedData, max_workers: int = 8) -> dict[str, "Record"]:
     """
-    Build issue records in parallel with no side effects on `gen`.
+    Build issue records in parallel with no side effects on `factory`.
     Returns: {iid: Record}
     """
     parents_sub_issues = data.parents_sub_issues  # read-only snapshot for this phase
     all_sub_issue_ids = {iid for subs in parents_sub_issues.values() for iid in subs}
     issues_items = list(data.issues.items())  # snapshot

     def _classify_and_build(issue, repo) -> tuple[str, "Record"]:
         iid = get_id(issue, repo)

         # classification
         if len(parents_sub_issues.get(iid, [])) > 0:
             # hierarchy node (has sub-issues)
-            rec = gen.build_record_for_hierarchy_issue(issue)
+            rec = factory.build_record_for_hierarchy_issue(issue)
         elif iid in all_sub_issue_ids:
             # leaf sub-issue
-            rec = gen.build_record_for_sub_issue(issue, iid)
+            rec = factory.build_record_for_sub_issue(issue, iid)
         else:
             # plain issue
-            rec = gen.build_record_for_issue(issue)
+            rec = factory.build_record_for_issue(issue)
         return iid, rec
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2fd72de and 5d8756a.

📒 Files selected for processing (2)
  • release_notes_generator/data/miner.py (3 hunks)
  • release_notes_generator/record/factory/default_record_factory.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
release_notes_generator/data/miner.py (2)
release_notes_generator/model/mined_data.py (3)
  • MinedData (38-81)
  • get_repository (67-72)
  • add_repository (61-65)
release_notes_generator/utils/record_utils.py (2)
  • parse_issue_id (54-62)
  • get_id (24-45)
release_notes_generator/record/factory/default_record_factory.py (6)
release_notes_generator/utils/record_utils.py (1)
  • get_id (24-45)
release_notes_generator/model/issue_record.py (2)
  • issue (63-68)
  • IssueRecord (16-335)
release_notes_generator/model/record.py (4)
  • Record (29-218)
  • skip (73-75)
  • is_cross_repo (55-61)
  • is_cross_repo (64-70)
release_notes_generator/action_inputs.py (1)
  • get_skip_release_notes_labels (209-217)
release_notes_generator/model/hierarchy_issue_record.py (1)
  • HierarchyIssueRecord (16-169)
release_notes_generator/model/sub_issue_record.py (1)
  • SubIssueRecord (15-26)
🪛 Ruff (0.13.3)
release_notes_generator/data/miner.py

192-192: Do not catch blind exception: Exception

(BLE001)


203-203: Do not catch blind exception: Exception

(BLE001)


226-226: Do not catch blind exception: Exception

(BLE001)


231-231: Comment contains ambiguous ' (RIGHT SINGLE QUOTATION MARK). Did you mean ` (GRAVE ACCENT)?

(RUF003)

🔇 Additional comments (11)
release_notes_generator/data/miner.py (5)

24-24: LGTM! Appropriate import for parallelization.

The addition of ThreadPoolExecutor and as_completed from concurrent.futures aligns with the PR objective to introduce parallel processing for fetching missing issues.


100-101: LGTM! Smart pre-caching approach.

Pre-fetching all repositories before parallel processing is a good strategy to avoid potential race conditions and redundant fetches during concurrent execution.


133-154: LGTM! Robust repository pre-caching implementation.

The method correctly (see the sketch after this list):

  • Parses issue IDs from both keys and values of parents_sub_issues
  • Checks cache before fetching
  • Handles fetch failures gracefully with logging
  • Avoids duplicate fetches
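
A minimal sketch of such a pre-cache pass, assuming a fetch_repository callable and assuming MinedData.get_repository/add_repository can be keyed by full name (both methods appear in the code graph below, but their exact signatures are not shown in this excerpt); parse_issue_id is the project's own helper:

import logging

logger = logging.getLogger(__name__)

def fetch_all_repositories_in_cache(data, fetch_repository):
    """Sketch: warm the repository cache from the keys and values of
    parents_sub_issues before the parallel phase, so workers only read it."""
    seen: set[str] = set()
    for parent_id, sub_ids in data.parents_sub_issues.items():
        for issue_id in (parent_id, *sub_ids):
            org, repo, _num = parse_issue_id(issue_id)
            full_name = f"{org}/{repo}"
            if full_name in seen:
                continue
            seen.add(full_name)
            if data.get_repository(full_name) is None:  # assumed lookup by name
                try:
                    data.add_repository(fetch_repository(full_name))
                except Exception as exc:  # a miss is logged, not fatal
                    logger.error("Failed to fetch repository %s: %s", full_name, exc)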

192-193: Bare exception catches are appropriate here.

Static analysis flagged the bare Exception catches at lines 192, 203, and 226. However, these are justified:

  • Line 192-193: Defensive parsing of potentially malformed issue IDs
  • Line 203: Robust handling of GitHub API call failures (network, rate limits, etc.)
  • Line 226: Catching any exception from future result retrieval

These broad catches ensure the parallel processing doesn't crash from individual item failures, which is the correct behavior for a data mining pipeline.

Based on static analysis hints.

Also applies to: 203-203, 226-226
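
Reduced to a standalone example, the result-collection step looks like this (a sketch; the noqa marker is one way to acknowledge the Ruff finding while keeping the deliberate broad catch):

import logging
from concurrent.futures import as_completed

logger = logging.getLogger(__name__)

def drain(futures):
    """Sketch: collect worker results; one raising future must not stop the rest."""
    results = []
    for future in as_completed(futures):
        try:
            results.append(future.result())
        except Exception as exc:  # noqa: BLE001 - isolate per-item failures
            logger.error("Worker raised unexpectedly: %s", exc)
    return results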


220-257: LGTM! Well-structured parallel processing with proper error handling.

The implementation correctly:

  • Uses ThreadPoolExecutor with descriptive thread names
  • Collects results and errors separately
  • Applies mutations (removals) after parallel processing to avoid race conditions
  • Logs comprehensive statistics
  • Handles both successful fetches and failures gracefully

The deferred removal strategy (lines 243-250) is particularly good practice.
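
The collect-then-mutate pattern, reduced to a standalone sketch (mirroring the diff shown earlier; data is the MinedData instance):

# Sketch: gather ids during the parallel phase, mutate shared state only
# after the pool has drained, so no worker races the dict iteration.
to_remove: set[str] = set()
# ... inside the result loop: to_remove.add(parent_id) for non-qualifying items ...

for iid in to_remove:
    data.parents_sub_issues.pop(iid, None)
    for sub_issues in data.parents_sub_issues.values():
        if iid in sub_issues:
            sub_issues.remove(iid)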

release_notes_generator/record/factory/default_record_factory.py (6)

22-22: LGTM! Appropriate import for parallelization.

Adding ThreadPoolExecutor supports the parallel record building implementation, aligning with the PR objective to improve performance of record creation.


83-83: LGTM! Improved observability with info-level logging.

Promoting these log statements from debug to info level provides better visibility into the record generation pipeline's progress, which is especially valuable for parallel processing.

Also applies to: 88-88, 92-92, 98-98


249-262: LGTM! Well-structured builder method for hierarchy issues.

The method:

  • Properly handles optional labels with fallback to type-based labels
  • Correctly evaluates skip logic using ActionInputs.get_skip_release_notes_labels()
  • Returns the appropriate HierarchyIssueRecord type

264-282: LGTM! Proper cross-repo detection logic.

The method correctly:

  • Builds a SubIssueRecord with appropriate skip logic
  • Detects cross-repo issues by comparing the repository full name from the issue ID (line 280)
  • Sets the is_cross_repo flag when applicable

The cross-repo detection preserves the original behavior as noted in the AI summary.
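
Boiled down, the detection is a name comparison (a sketch; home_repository is a hypothetical stand-in for whatever repository the factory treats as primary, which this excerpt does not show):

# Sketch: an issue is cross-repo when the org/repo encoded in its id
# differs from the repository the release notes are generated for.
org, repo, _num = parse_issue_id(issue_id)
if f"{org}/{repo}" != home_repository.full_name:
    record.is_cross_repo = True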


284-297: LGTM! Clean builder method for plain issues.

The method follows the same pattern as the other builders, maintaining consistency across the codebase.


309-331: LGTM! Efficient parallel record building implementation.

The parallel builder:

  • Creates read-only snapshots to avoid race conditions
  • Correctly classifies issues into hierarchy/sub-issue/plain categories
  • Uses ThreadPoolExecutor.map for clean parallel execution
  • Preserves all original logic while improving performance

The classification logic (lines 313-321) accurately replicates the sequential branching that was replaced.
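
The executor wiring implied here, reduced to its core (a sketch; _classify_and_build and issues_items come from the excerpt above, and the thread_name_prefix is illustrative):

from concurrent.futures import ThreadPoolExecutor

# Sketch: map() keeps a simple call-per-item model; each worker returns
# (iid, record) and the merge into a dict happens on the caller's thread.
with ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="record-build") as pool:
    built = dict(pool.map(lambda item: _classify_and_build(*item), issues_items))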

@miroslavpojer miroslavpojer merged commit 6e7c10c into master Oct 5, 2025
6 of 7 checks passed
@miroslavpojer miroslavpojer deleted the feature/178-Improve-performance-of-fetching-missing-issues branch October 5, 2025 12:51

Development

Successfully merging this pull request may close these issues.

  • Improve performance of inner records creation
  • Improve performance of fetching missing issues
