Feature/178 improve performance of fetching missing issues #187
Conversation
- Added multiple threads to fetching of missing issues.
- Added multiple threads to record creation.
Walkthrough

Introduces parallelism in two areas: fetching missing parent issues in miner.py using ThreadPoolExecutor with repository pre-caching and robust error handling, and record construction in default_record_factory.py via a parallel builder that classifies and builds records, updating the generation flow and logging. New helper methods and updated signatures are added.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    autonumber
    participant Miner
    participant Data
    participant RepoCache as Repo Cache
    participant Pool as ThreadPoolExecutor
    participant API as GitHub API

    Miner->>Data: Read parents_sub_issues, origin_issue_ids
    Miner->>RepoCache: _fetch_all_repositories_in_cache(data)
    Miner->>Pool: Submit workers for parents not in origin_issue_ids
    loop For each parent
        Pool->>RepoCache: Resolve org/repo (validate exists)
        alt Repo missing/error
            Pool-->>Miner: Signal error / skip
        else Repo ok
            Pool->>API: get_issue(org/repo, num)
            alt Not found / error
                Pool-->>Miner: Signal error / skip
            else Issue fetched
                Pool-->>Miner: Return (Issue, Repository) if criteria met<br/>(closed_at present, since bounds ok)
                Pool-->>Miner: Mark for removal if not meeting criteria
            end
        end
    end
    Miner->>Data: Prune non-qualifying parent IDs
    Miner-->>Miner: Collect dict[Issue, Repository] and return
```
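To make the diagram concrete, here is a minimal sketch of the fetch loop it describes, assuming the four-tuple `worker(parent_id) -> (parent_id, issue, repo, error)` discussed in the nitpick comments below; collection names are hypothetical, and this is not the PR's exact code.

```python
# Minimal sketch of the parallel fetch flow; `worker` is the four-tuple
# worker described in the review comments below, passed in for clarity.
from concurrent.futures import ThreadPoolExecutor, as_completed

def fetch_missing_parents(parent_ids, worker, max_workers=8):
    fetched = {}      # Issue -> Repository, returned to the caller
    to_remove = []    # parents that did not meet the closed_at/since criteria
    errors = []       # (parent_id, message) pairs, logged and skipped

    with ThreadPoolExecutor(max_workers=max_workers, thread_name_prefix="fetch-parent") as pool:
        futures = [pool.submit(worker, pid) for pid in parent_ids]
        for future in as_completed(futures):
            parent_id, issue, repo, error = future.result()
            if error is not None:
                errors.append((parent_id, error))
            elif issue is None:
                to_remove.append(parent_id)   # mark for removal, prune later
            else:
                fetched[issue] = repo

    # mutations (pruning) happen only after the pool has drained
    return fetched, to_remove, errors
```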
```mermaid
sequenceDiagram
    autonumber
    participant Factory as DefaultRecordFactory
    participant Data
    participant Pool as ThreadPoolExecutor
    participant Records as Records Map

    Factory->>Data: Enumerate issues
    Factory->>Pool: build_issue_records_parallel(...)
    par Parallel classify/build
        Pool-->>Records: Build hierarchy records
        Pool-->>Records: Build sub-issue records (mark cross-repo when applicable)
        Pool-->>Records: Build plain issue records
    end
    Factory->>Factory: Update self._records and registered issues
    Factory-->>Factory: Log progress (info)
```
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ❌ Failed checks (2 warnings) | ✅ Passed checks (3 passed)
Actionable comments posted: 0
🧹 Nitpick comments (6)
release_notes_generator/data/miner.py (4)
155-159: Consider making `max_workers` configurable.

The default value of 8 workers is reasonable, but consider exposing this as a configuration parameter (e.g., via `ActionInputs`) to allow users to tune parallelism based on their GitHub API rate limits and system resources.
174-182: Redundant check in `should_fetch` function.

Line 180 checks `if issue.closed_at and data.since > issue.closed_at`, but line 176 already returns `False` if `not issue.closed_at`. The second check of `issue.closed_at` is unnecessary. Apply this diff to simplify the logic:

```diff
 def should_fetch(issue: Issue) -> bool:
     # Mirrors original logic
     if not issue.closed_at:
         return False
     if data.since:
-        # if since > closed_at => skip
-        if issue.closed_at and data.since > issue.closed_at:
+        if data.since > issue.closed_at:
             return False
     return True
```
184-213: Consider using a named tuple or dataclass for worker return.

The worker function returns a 4-element tuple `(parent_id, issue, repo, error)`. A named tuple or dataclass would improve readability and make the code more maintainable. Example using a dataclass:

```python
from dataclasses import dataclass

@dataclass
class FetchResult:
    parent_id: str
    issue: Optional[Issue] = None
    repo: Optional[Repository] = None
    error: Optional[str] = None
```

Then update the worker signature:

```diff
-def worker(parent_id: str) -> tuple[str, Optional[Issue], Optional[Repository], Optional[str]]:
+def worker(parent_id: str) -> FetchResult:
     """
-    Returns (parent_id, issue|None, repo|None, error|None)
+    Returns FetchResult with appropriate fields set
     - issue=None & error=None => mark for remove (didn't meet criteria)
     - issue=None & error!=None => log error
     """
     try:
         org, repo, num = parse_issue_id(parent_id)
     except Exception as e:  # defensive
-        return (parent_id, None, None, f"parse_issue_id failed: {e}")
+        return FetchResult(parent_id, error=f"parse_issue_id failed: {e}")
```
246-250: Defensive type check may be unnecessary.

The `isinstance(sub_issues, list)` check assumes `parents_sub_issues` values could be non-list, but according to `MinedData` type hints, it's defined as `dict[str, list[str]]`. This defensive check adds safety but may be redundant. If the type is guaranteed to be `list[str]`, you could simplify:

```diff
 for sub_issues in data.parents_sub_issues.values():
-    # parents_sub_issues can be dict[str, list[str]] or now dict[str, str] per your later change;
-    # if it's list[str], this removal is ok; if changed to str, guard it.
-    if isinstance(sub_issues, list) and iid in sub_issues:
+    if iid in sub_issues:
         sub_issues.remove(iid)
```

release_notes_generator/record/factory/default_record_factory.py (2)
74-78: Consider making `max_workers` configurable.

The hardcoded `max_workers=8` should be exposed as a configuration parameter to allow tuning based on system resources and API constraints, similar to the recommendation for miner.py. Consider adding to `ActionInputs`:

```python
# In action_inputs.py
def get_max_workers() -> int:
    """Get the maximum number of workers for parallel processing."""
    user_input = get_action_input("MAX_WORKERS", "8")
    return int(user_input)
```

Then use it:

```diff
-built = build_issue_records_parallel(self, data, max_workers=8)
+built = build_issue_records_parallel(self, data, max_workers=ActionInputs.get_max_workers())
```
300-332: Improve parameter naming for clarity.

The parameter `gen` is not descriptive. Consider renaming it to `factory` to better indicate its role as a `DefaultRecordFactory` instance. Apply this diff:

```diff
-def build_issue_records_parallel(gen, data, max_workers: int = 8) -> dict[str, "Record"]:
+def build_issue_records_parallel(factory: "DefaultRecordFactory", data: MinedData, max_workers: int = 8) -> dict[str, "Record"]:
     """
-    Build issue records in parallel with no side effects on `gen`.
+    Build issue records in parallel with no side effects on `factory`.
     Returns: {iid: Record}
     """
     parents_sub_issues = data.parents_sub_issues  # read-only snapshot for this phase
     all_sub_issue_ids = {iid for subs in parents_sub_issues.values() for iid in subs}
     issues_items = list(data.issues.items())  # snapshot

     def _classify_and_build(issue, repo) -> tuple[str, "Record"]:
         iid = get_id(issue, repo)

         # classification
         if len(parents_sub_issues.get(iid, [])) > 0:
             # hierarchy node (has sub-issues)
-            rec = gen.build_record_for_hierarchy_issue(issue)
+            rec = factory.build_record_for_hierarchy_issue(issue)
         elif iid in all_sub_issue_ids:
             # leaf sub-issue
-            rec = gen.build_record_for_sub_issue(issue, iid)
+            rec = factory.build_record_for_sub_issue(issue, iid)
         else:
             # plain issue
-            rec = gen.build_record_for_issue(issue)
+            rec = factory.build_record_for_issue(issue)
         return iid, rec
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- release_notes_generator/data/miner.py (3 hunks)
- release_notes_generator/record/factory/default_record_factory.py (3 hunks)
🧰 Additional context used
🧬 Code graph analysis (2)
release_notes_generator/data/miner.py (2)
release_notes_generator/model/mined_data.py (3)
- MinedData (38-81)
- get_repository (67-72)
- add_repository (61-65)

release_notes_generator/utils/record_utils.py (2)
- parse_issue_id (54-62)
- get_id (24-45)
release_notes_generator/record/factory/default_record_factory.py (6)
release_notes_generator/utils/record_utils.py (1)
- get_id (24-45)

release_notes_generator/model/issue_record.py (2)
- issue (63-68)
- IssueRecord (16-335)

release_notes_generator/model/record.py (4)
- Record (29-218)
- skip (73-75)
- is_cross_repo (55-61)
- is_cross_repo (64-70)

release_notes_generator/action_inputs.py (1)
- get_skip_release_notes_labels (209-217)

release_notes_generator/model/hierarchy_issue_record.py (1)
- HierarchyIssueRecord (16-169)

release_notes_generator/model/sub_issue_record.py (1)
- SubIssueRecord (15-26)
🪛 Ruff (0.13.3)
release_notes_generator/data/miner.py
192-192: Do not catch blind exception: Exception
(BLE001)
203-203: Do not catch blind exception: Exception
(BLE001)
226-226: Do not catch blind exception: Exception
(BLE001)
231-231: Comment contains ambiguous ’ (RIGHT SINGLE QUOTATION MARK). Did you mean ``` (GRAVE ACCENT)?
(RUF003)
🔇 Additional comments (11)
release_notes_generator/data/miner.py (5)
24-24: LGTM! Appropriate import for parallelization.

The addition of `ThreadPoolExecutor` and `as_completed` from `concurrent.futures` aligns with the PR objective to introduce parallel processing for fetching missing issues.
100-101: LGTM! Smart pre-caching approach.

Pre-fetching all repositories before parallel processing is a good strategy to avoid potential race conditions and redundant fetches during concurrent execution.
133-154: LGTM! Robust repository pre-caching implementation.

The method correctly:
- Parses issue IDs from both keys and values of `parents_sub_issues`
- Checks cache before fetching
- Handles fetch failures gracefully with logging
- Avoids duplicate fetches

A minimal sketch of this pattern is shown below.
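The sketch assumes the `MinedData` helpers listed in the code graph (`get_repository`, `add_repository`) and the `parse_issue_id` utility; the `gh_client` parameter and exact helper signatures are assumptions, not the PR's code.

```python
# Sketch of repository pre-caching before the parallel phase.
import logging

logger = logging.getLogger(__name__)

def fetch_all_repositories_in_cache(data, gh_client) -> None:
    issue_ids = list(data.parents_sub_issues.keys())
    for subs in data.parents_sub_issues.values():
        issue_ids.extend(subs)                       # keys *and* values

    for iid in issue_ids:
        org, repo, _num = parse_issue_id(iid)
        full_name = f"{org}/{repo}"
        if data.get_repository(full_name) is not None:
            continue                                 # already cached
        try:
            data.add_repository(gh_client.get_repo(full_name))
        except Exception as exc:                     # log failure and continue
            logger.warning("Failed to fetch repository %s: %s", full_name, exc)
```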
192-193: Bare exception catches are appropriate here.

Static analysis flagged the bare `Exception` catches at lines 192, 203, and 226. However, these are justified:
- Line 192-193: Defensive parsing of potentially malformed issue IDs
- Line 203: Robust handling of GitHub API call failures (network, rate limits, etc.)
- Line 226: Catching any exception from future result retrieval
These broad catches ensure the parallel processing doesn't crash from individual item failures, which is the correct behavior for a data mining pipeline.
Based on static analysis hints.
Also applies to: 203-203, 226-226
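The per-item isolation these catches provide can be sketched as follows; the `# noqa: BLE001` markers would also silence the Ruff warnings above. Names here are hypothetical, and the PyGithub-style `get_repo`/`get_issue` calls stand in for the real fetch path.

```python
# Per-item failure isolation: each worker converts its own exceptions into
# an error value instead of letting them escape the pool.
def worker(parent_id: str):
    try:
        org, repo, num = parse_issue_id(parent_id)
    except Exception as exc:  # noqa: BLE001 - defensive: malformed IDs must not kill the run
        return (parent_id, None, None, f"parse_issue_id failed: {exc}")
    try:
        issue = gh_client.get_repo(f"{org}/{repo}").get_issue(number=num)
    except Exception as exc:  # noqa: BLE001 - network/rate-limit failures are per-item
        return (parent_id, None, None, f"get_issue failed: {exc}")
    return (parent_id, issue, None, None)
```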
220-257: LGTM! Well-structured parallel processing with proper error handling.

The implementation correctly:
- Uses ThreadPoolExecutor with descriptive thread names
- Collects results and errors separately
- Applies mutations (removals) after parallel processing to avoid race conditions
- Logs comprehensive statistics
- Handles both successful fetches and failures gracefully
The deferred removal strategy (lines 243-250) is particularly good practice.
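That deferred-removal strategy boils down to the following shape (a sketch; `to_remove` and the pruning targets are assumed names based on the review, not the PR's exact code):

```python
# Deferred removal: workers only *report* non-qualifying IDs; the shared
# structures are mutated after the parallel phase has completed.
for iid in to_remove:
    data.parents_sub_issues.pop(iid, None)          # drop the parent entry
    for sub_issues in data.parents_sub_issues.values():
        if iid in sub_issues:
            sub_issues.remove(iid)                  # unlink from any parent
```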
release_notes_generator/record/factory/default_record_factory.py (6)
22-22: LGTM! Appropriate import for parallelization.

Adding `ThreadPoolExecutor` supports the parallel record building implementation, aligning with the PR objective to improve performance of record creation.
83-83: LGTM! Improved observability with info-level logging.

Promoting these log statements from debug to info level provides better visibility into the record generation pipeline's progress, which is especially valuable for parallel processing.
Also applies to: 88-88, 92-92, 98-98
249-262: LGTM! Well-structured builder method for hierarchy issues.

The method:
- Properly handles optional labels with fallback to type-based labels
- Correctly evaluates skip logic using `ActionInputs.get_skip_release_notes_labels()`
- Returns the appropriate `HierarchyIssueRecord` type
264-282: LGTM! Proper cross-repo detection logic.

The method correctly:
- Builds a `SubIssueRecord` with appropriate skip logic
- Detects cross-repo issues by comparing the repository full name from the issue ID (line 280)
- Sets the `is_cross_repo` flag when applicable

The cross-repo detection preserves the original behavior as noted in the AI summary. A hedged sketch of that detection step follows.
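The sketch assumes `parse_issue_id` from record_utils; `home_repository` and `_should_skip` are assumed names introduced for illustration, not necessarily the PR's.

```python
# Cross-repo detection: compare the repo encoded in the issue ID with the
# repository the release notes are generated for.
def build_record_for_sub_issue(self, issue, iid):
    rec = SubIssueRecord(issue, skip=self._should_skip(issue))  # _should_skip: assumed helper
    org, repo, _num = parse_issue_id(iid)
    if f"{org}/{repo}" != self.home_repository.full_name:       # home_repository: assumed attribute
        rec.is_cross_repo = True   # sub-issue lives in another repository
    return rec
```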
284-297: LGTM! Clean builder method for plain issues.

The method follows the same pattern as the other builders, maintaining consistency across the codebase.
309-331: LGTM! Efficient parallel record building implementation.

The parallel builder:
- Creates read-only snapshots to avoid race conditions
- Correctly classifies issues into hierarchy/sub-issue/plain categories
- Uses `ThreadPoolExecutor.map` for clean parallel execution
- Preserves all original logic while improving performance

The classification logic (lines 313-321) accurately replicates the sequential branching that was replaced.
Release Notes:

Summary by CodeRabbit
- New Features
- Performance
- Reliability
- Bug Fixes

Fixes #178
Fixes #179