Matas/fix/fix actions #43
base: main
Conversation
…stration in Acknowledgment and Engine Agents
📝 Walkthrough
Refactors repository analysis into a LangGraph-driven pipeline with AI-based rule generation, adds anonymous/authenticated API access and in-memory rate limiting, expands GitHub integrations (REST + GraphQL + JWT), redesigns webhook dispatching with a task queue, and modernizes tooling and tests.
Sequence Diagram(s)
sequenceDiagram
participant Client
participant API as FastAPI /api/v1/recommend
participant Auth as get_current_user_optional
participant RateLimit
participant Agent as RepositoryAnalysisAgent
participant LangGraph
participant GitHubAPI
participant LLM as OpenAI
Client->>API: POST /api/v1/rules/recommend (repo_url, token?)
API->>Auth: parse token (optional)
Auth-->>API: User | None
API->>RateLimit: rate_limiter(request, user)
RateLimit-->>API: Allowed / 429
alt Allowed
API->>Agent: execute(repo_full_name, user/token)
Agent->>LangGraph: run graph (fetch_metadata -> fetch_pr_signals -> generate_rules)
LangGraph->>GitHubAPI: list files / fetch PRs / metadata
GitHubAPI-->>LangGraph: metadata, PR signals
LangGraph->>LLM: generate rule recommendations
LLM-->>LangGraph: recommendations
LangGraph-->>Agent: AnalysisState
Agent-->>API: AnalysisResponse
API-->>Client: 200 YAML + PR plan + report
else Rate limited
API-->>Client: 429 Retry-After
end
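The three analysis nodes named above (fetch_metadata, fetch_pr_signals, generate_rules) map naturally onto a linear LangGraph graph. The sketch below is only an illustration of that wiring, with placeholder node bodies and assumed state fields rather than the project's real implementations:

```python
from typing import Any, TypedDict

from langgraph.graph import END, StateGraph


class AnalysisState(TypedDict, total=False):
    # Assumed state fields; the real AnalysisState may differ.
    repo_full_name: str
    metadata: dict[str, Any]
    pr_signals: list[dict[str, Any]]
    recommendations: list[dict[str, Any]]


async def fetch_metadata(state: AnalysisState) -> AnalysisState:
    # Placeholder: would call the GitHub REST client for repository metadata.
    return {"metadata": {"full_name": state["repo_full_name"]}}


async def fetch_pr_signals(state: AnalysisState) -> AnalysisState:
    # Placeholder: would query recent PRs for hygiene signals.
    return {"pr_signals": []}


async def generate_rules(state: AnalysisState) -> AnalysisState:
    # Placeholder: would prompt the LLM with metadata + signals.
    return {"recommendations": []}


graph = StateGraph(AnalysisState)
graph.add_node("fetch_metadata", fetch_metadata)
graph.add_node("fetch_pr_signals", fetch_pr_signals)
graph.add_node("generate_rules", generate_rules)
graph.set_entry_point("fetch_metadata")
graph.add_edge("fetch_metadata", "fetch_pr_signals")
graph.add_edge("fetch_pr_signals", "generate_rules")
graph.add_edge("generate_rules", END)
pipeline = graph.compile()

# Usage (inside an async context):
# final_state = await pipeline.ainvoke({"repo_full_name": "owner/repo"})
```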
sequenceDiagram
participant GitHub as Webhook sender
participant Router as Webhook Router
participant Dispatcher
participant TaskQueue
participant Worker
participant Handler
GitHub->>Router: POST /webhooks (signed payload)
Router->>Router: verify signature & validate payload
Router->>Dispatcher: dispatch(WebhookEvent)
Dispatcher->>Dispatcher: find handler
Dispatcher->>TaskQueue: enqueue(handler, event)
TaskQueue->>TaskQueue: compute task_id (SHA-256) & deduplicate
alt Duplicate
TaskQueue-->>Dispatcher: enqueue returns false
else New
TaskQueue-->>Dispatcher: enqueue returns true
end
Router-->>GitHub: 202 Accepted / WebhookResponse
Worker->>TaskQueue: get task
TaskQueue-->>Worker: Task
Worker->>Handler: invoke handler(event)
Handler-->>Worker: WebhookResponse / result
Worker->>Worker: log completion
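The "verify signature" step shown here is GitHub's standard X-Hub-Signature-256 check: an HMAC-SHA256 of the raw request body, keyed with the webhook secret, compared in constant time. A minimal sketch (the helper name is illustrative, not the project's actual function):

```python
import hashlib
import hmac


def signature_is_valid(secret: str, raw_body: bytes, signature_header: str | None) -> bool:
    """Compare the X-Hub-Signature-256 header against the payload's HMAC-SHA256."""
    if not signature_header or not signature_header.startswith("sha256="):
        return False
    expected = "sha256=" + hmac.new(secret.encode(), raw_body, hashlib.sha256).hexdigest()
    # hmac.compare_digest avoids leaking information through timing differences.
    return hmac.compare_digest(expected, signature_header)
```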
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 1 | ❌ 2
❌ Failed checks (1 warning, 1 inconclusive)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
Codecov Report
❌ Patch coverage is 65.1%.
❌ Your patch status has failed because the patch coverage (65.1%) is below the target coverage (80.0%). You can increase the patch coverage or adjust the target coverage.

@@ Coverage Diff @@
##            main     #43      +/- ##
========================================
+ Coverage   32.8%   46.7%   +13.8%
========================================
  Files         85     112      +27
  Lines       5162    6713    +1551
========================================
+ Hits        1698    3138    +1440
- Misses      3464    3575     +111

Continue to review full report in Codecov by Sentry.
Actionable comments posted: 12
Note
Due to the large number of review comments, Critical, Major severity comments were prioritized as inline comments.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (5)
src/rules/utils/contributors.py (2)
55-55: Timezone-naive datetime will cause TypeError when compared.
`cutoff_date` is timezone-naive, but in `_has_recent_activity` (lines 196-197, 206-207) it's compared against timezone-aware datetimes parsed from GitHub's ISO timestamps. Python 3 raises `TypeError: can't compare offset-naive and offset-aware datetimes`.
🐛 Proposed fix
-    cutoff_date = datetime.now() - timedelta(days=days_back)
+    cutoff_date = datetime.now(tz=timezone.utc) - timedelta(days=days_back)
Also update the import at line 9:
-from datetime import datetime, timedelta
+from datetime import datetime, timedelta, timezone
247-252: Singleton ignores `github_client` parameter after first initialization.
Once `_contributor_analyzer` is created, subsequent calls with a different `github_client` will silently return the analyzer bound to the original client. This could cause authentication or tenant isolation issues.
♻️ Suggested fix: validate or key by client
Option 1 - Always use the provided client:
 def get_contributor_analyzer(github_client) -> ContributorAnalyzer:
     """Get or create the global contributor analyzer instance."""
     global _contributor_analyzer
-    if _contributor_analyzer is None:
-        _contributor_analyzer = ContributorAnalyzer(github_client)
+    if _contributor_analyzer is None or _contributor_analyzer.github_client is not github_client:
+        _contributor_analyzer = ContributorAnalyzer(github_client)
     return _contributor_analyzer
Option 2 - Remove global singleton and always create fresh instance (simpler, cache still works per-instance):
 def get_contributor_analyzer(github_client) -> ContributorAnalyzer:
-    """Get or create the global contributor analyzer instance."""
-    global _contributor_analyzer
-    if _contributor_analyzer is None:
-        _contributor_analyzer = ContributorAnalyzer(github_client)
-    return _contributor_analyzer
+    """Create a contributor analyzer instance."""
+    return ContributorAnalyzer(github_client)
src/webhooks/handlers/deployment_status.py (1)
36-42: `enqueue` call no longer matches TaskQueue API (missing handler + return type).
`TaskQueue.enqueue` now requires a handler function and returns `bool`, but this call passes no `func` and treats the return as `task_id`. This will raise at runtime and the response will be wrong even after fixing the call. Align the handler to the new API (or change TaskQueue to return a task_id).
src/main.py (2)
61-101: Fix handler registration key type (Enum vs str).
`WebhookDispatcher.dispatch` looks up handlers by string event type. Registering with `EventType` enums will not match and will drop all events as "handler_not_found".
Proposed fix (normalize to string keys)
- dispatcher.register_handler(EventType.PULL_REQUEST, pull_request_handler.handle)
- dispatcher.register_handler(EventType.PUSH, push_handler.handle)
- dispatcher.register_handler(EventType.CHECK_RUN, check_run_handler.handle)
- dispatcher.register_handler(EventType.ISSUE_COMMENT, issue_comment_handler.handle)
- dispatcher.register_handler(EventType.DEPLOYMENT, deployment_handler.handle)
- dispatcher.register_handler(EventType.DEPLOYMENT_STATUS, deployment_status_handler.handle)
- dispatcher.register_handler(EventType.DEPLOYMENT_REVIEW, deployment_review_handler.handle)
- dispatcher.register_handler(EventType.DEPLOYMENT_PROTECTION_RULE, deployment_protection_rule_handler.handle)
+ dispatcher.register_handler(EventType.PULL_REQUEST.value, pull_request_handler.handle)
+ dispatcher.register_handler(EventType.PUSH.value, push_handler.handle)
+ dispatcher.register_handler(EventType.CHECK_RUN.value, check_run_handler.handle)
+ dispatcher.register_handler(EventType.ISSUE_COMMENT.value, issue_comment_handler.handle)
+ dispatcher.register_handler(EventType.DEPLOYMENT.value, deployment_handler.handle)
+ dispatcher.register_handler(EventType.DEPLOYMENT_STATUS.value, deployment_status_handler.handle)
+ dispatcher.register_handler(EventType.DEPLOYMENT_REVIEW.value, deployment_review_handler.handle)
+ dispatcher.register_handler(EventType.DEPLOYMENT_PROTECTION_RULE.value, deployment_protection_rule_handler.handle)
144-158: TaskQueue does not expose a `tasks` attribute; endpoint will crash with AttributeError.
The `health_tasks()` function accesses `task_queue.tasks.values()` (line 144), but `TaskQueue` only exposes `queue`, `processed_hashes`, and `workers`. Additionally, the `Task` model lacks a `status` field, so even if `tasks` existed, the filtering logic would fail.
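One way to make the health endpoint workable is to have the queue keep a registry of tasks with an explicit status field; the sketch below only illustrates that shape (class, attribute, and status names are assumptions, not the existing TaskQueue/Task API):

```python
import asyncio
from dataclasses import dataclass
from enum import Enum


class TaskStatus(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED = "failed"


@dataclass
class TrackedTask:
    task_id: str
    event_type: str
    status: TaskStatus = TaskStatus.QUEUED


class TrackingTaskQueue:
    """Keeps a task registry so a health endpoint can report per-status counts."""

    def __init__(self) -> None:
        self.queue: asyncio.Queue[TrackedTask] = asyncio.Queue()
        self.tasks: dict[str, TrackedTask] = {}  # what health_tasks() would iterate

    async def enqueue(self, task: TrackedTask) -> None:
        self.tasks[task.task_id] = task
        await self.queue.put(task)

    def set_status(self, task_id: str, status: TaskStatus) -> None:
        if task_id in self.tasks:
            self.tasks[task_id].status = status
```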
🤖 Fix all issues with AI agents
In @.pre-commit-config.yaml:
- Around line 17-18: The pre-commit entry for the Ruff hook is pinned to an
outdated revision (repo: https://github.com/astral-sh/ruff-pre-commit with rev:
v0.3.0); update the rev to a current release (preferably v0.14.14, or at minimum
v0.14.13) in the .pre-commit-config.yaml and then run pre-commit autoupdate or
reinstall hooks to ensure the newer Ruff version is used.
In `@src/agents/repository_analysis_agent/nodes.py`:
- Around line 282-295: In the loop over pr_nodes in nodes.py (the block
computing codeowner_bypass_rate), avoid directly indexing
review["author"]["login"] since review["author"] may be None; instead retrieve
the reviewer login defensively (e.g., reviewer = review.get("author") and
reviewer_login = reviewer.get("login") if reviewer else None) and use that
reviewer_login in the approval check (compare against the existing author
variable which may also be None). Update the approved condition in the reviews
iteration to first handle a missing author object and only consider reviews with
state "APPROVED" and a non-null reviewer_login that differs from pr author.
In `@src/api/dependencies.py`:
- Around line 23-41: The get_current_user_optional function currently accepts
any Bearer token and returns a fake User, enabling auth bypass; replace this by
validating the token before constructing a User: call a token verification
helper (e.g., verify_token or integrate with your IdP client) from
get_current_user_optional to validate signature and claims, reject
invalid/malformed tokens by returning None and logging the failure, and only
create and return a User (or map claims to User) when verification succeeds;
ensure exceptions from verification are caught, logged via logger.warning, and
result in None instead of a fake authenticated user.
In `@src/integrations/github/api.py`:
- Around line 1020-1024: fetch_recent_pull_requests calls a non-existent helper
self._detect_issue_references(...) which will raise AttributeError; add a
private method _detect_issue_references(self, body, title) to the GitHubClient
class that returns a boolean by scanning the PR body and title for issue
reference patterns (e.g. plain `#123`, owner/repo#123, or keywords like
"fixes|closes|resolves" followed by an issue ref) using a safe regex and ensure
it is used by fetch_recent_pull_requests; keep the method name exactly
_detect_issue_references so the existing call resolves.
In `@src/integrations/github/graphql_client.py`:
- Around line 89-94: The list comprehension building Commit instances can crash
when GraphQL returns author: null because node["commit"].get("author", {}) will
be None; update the Commit construction (where Commit(...) is created) to first
coerce the author dict to a safe mapping (e.g., author_info =
node["commit"].get("author") or {}) and then use author_info.get("name",
"Unknown") for the author field so it never raises an AttributeError.
In `@src/main.py`:
- Around line 111-119: The CORS middleware is configured with
allow_origins=["*"] and allow_credentials=True which is invalid for wildcard
origins; update the app.add_middleware(CORSMiddleware, ...) configuration by
removing or setting allow_credentials to False when using allow_origins=["*"],
or alternatively replace allow_origins=["*"] with an explicit list of allowed
origins and keep allow_credentials=True; modify the parameters on the
CORSMiddleware invocation (app.add_middleware) accordingly.
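For reference, a minimal sketch of the explicit-origins variant described in this prompt; the origin values are placeholders that would need to come from configuration:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Explicit origins instead of "*": browsers reject a wildcard origin
# when allow_credentials=True, so list the frontends that need cookies/auth.
allowed_origins = [
    "https://app.example.com",  # placeholder production frontend
    "http://localhost:3000",    # placeholder local dev frontend
]

app.add_middleware(
    CORSMiddleware,
    allow_origins=allowed_origins,
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```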
In `@src/rules/validators.py`:
- Around line 238-243: The validate method in has_min_approals currently returns
True unconditionally; change it to read min_approvals =
parameters.get("min_approvals", 1), obtain the PR identifier from event (e.g.
event["pull_request"]["number"] or event["issue"]["number"] depending on
payload), fetch the PR reviews using the repository/GitHub client available in
the validator (or use event["reviews"] if the payload includes them), count
reviews with state "APPROVED" (ignoring duplicates by reviewer if desired) and
return True only if approved_count >= min_approvals; update the async validate
method to perform this logic and surface errors rather than always returning
True.
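A minimal sketch of the approval-counting logic described in this prompt, under the assumption that the reviews are already available in the event payload (the class name and payload shape are illustrative):

```python
class HasMinApprovalsValidator:
    """Illustrative validator: passes when unique approving reviewers meet the threshold."""

    async def validate(self, event: dict, parameters: dict) -> bool:
        min_approvals = parameters.get("min_approvals", 1)
        reviews = event.get("reviews", [])  # assumes reviews were fetched into the payload
        approvers = {
            (review.get("user") or {}).get("login")
            for review in reviews
            if review.get("state") == "APPROVED"
        }
        approvers.discard(None)  # drop reviews with no resolvable reviewer
        return len(approvers) >= min_approvals
```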
In `@src/tasks/task_queue.py`:
- Around line 43-61: The current enqueue method adds task_id to processed_hashes
immediately which both leaks memory and prevents retries on failure; change the
behavior so enqueue (in enqueue and using _generate_task_id, Task, queue.put)
only checks processed_hashes but does NOT add the id—move the addition to the
successful completion path in the worker/executor that runs the Task (i.e.,
after the task.func finishes without error) so failed tasks can be retried. Also
replace or augment processed_hashes with a TTL/evicting structure (e.g., an
async-safe TTL cache or an OrderedDict with timestamp eviction) to bound memory
growth and periodically prune expired entries; ensure all access to
processed_hashes is concurrency-safe in the async executor and enqueue.
In `@src/webhooks/auth.py`:
- Around line 28-30: The current call logger.info("request_headers_received",
headers=dict(request.headers)) exposes sensitive headers; either remove this
debug logging entirely or change it to logger.debug and sanitize headers first
by filtering out Authorization, Cookie, Set-Cookie, Proxy-Authorization, and any
custom API key headers from request.headers before logging; update the code
around the logger.info call (the request_headers_received logging) to build a
safe headers dict that masks or omits those sensitive keys and log that instead.
In `@src/webhooks/handlers/deployment.py`:
- Around line 36-39: The current call to TaskQueue.enqueue incorrectly omits the
required handler argument and misinterprets its boolean return as task_id;
update the call to pass the appropriate handler function (e.g.,
deployment_handler or the actual handler used for deployment tasks) as func and
treat the return value as a success flag (bool) instead of an ID: call await
task_queue.enqueue(func=deployment_handler, event_type="deployment",
repo_full_name=repo_full_name, installation_id=installation_id,
payload=payload), then check the boolean result and if true proceed to
build/return the correct response (generate or fetch a real task_id separately
if your API requires one) and if false log/return an error (e.g., 503) —
reference TaskQueue.enqueue and the local variable task_id (replace its usage
with a success check or a real id lookup).
In `@src/webhooks/handlers/issue_comment.py`:
- Line 153: The logger call uses structlog-style keyword args which the standard
logging module doesn't support; update the logger.info invocation that
references logger and comment_body (the line currently
"logger.info('extracting_acknowledgment_reason',
body_length=len(comment_body))") to use standard logging formatting and
positional args (e.g. include body_length in the format string and pass
len(comment_body) as an argument) so the message logs correctly at runtime.
- Line 33: The log call using logger.info("comment_processed",
commenter=commenter, body_length=len(comment_body)) is invalid for the standard
logging.Logger; update the logger usage in the issue comment handler to either
(a) convert the module logger to a structlog logger (align with PR) and ensure
the logger at module scope is created via structlog.get_logger() before calling
info with kwargs, or (b) keep the existing logging.getLogger(__name__) and
change the call in the handler to a formatted message using positional/format
placeholders or keyword interpolation (e.g., include commenter and body_length
in the message string) so no kwargs are passed to logging.Logger.info; locate
the logger initialization and the info call in the issue comment handler to make
the consistent change.
🟡 Minor comments (14)
docs/benchmarks.md-9-9 (1)
9-9: Fix the “70 + enterprise” typo for consistency.
The spacing reads like arithmetic and conflicts with "70+" later in the document.
✏️ Suggested fix
-Our analysis of 70 + enterprise policies from major tech companies revealed a critical insight: **85% of real-world governance policies require context** and cannot be effectively enforced with traditional static rules.
+Our analysis of 70+ enterprise policies from major tech companies revealed a critical insight: **85% of real-world governance policies require context** and cannot be effectively enforced with traditional static rules.
src/api/rules.py-11-11 (1)
11-11: `event_data` field is defined but unused.
The `event_data` field is declared with a comment indicating "Advanced: pass extra event data for edge cases" but it's never passed to `agent.execute()`. Either implement the feature or remove the unused field to avoid confusion.
🔧 If intended to be used, pass it to the agent
-    result = await agent.execute(rule_description=request.rule_text)
+    result = await agent.execute(rule_description=request.rule_text, event_data=request.event_data)
src/core/utils/caching.py-161-164 (1)
161-164: Use `is not None` to properly handle explicit `ttl=0`.
The truthiness check treats `ttl=0` as falsy, causing it to default to 3600 instead of creating a cache with zero TTL. While no code currently passes `ttl=0`, this design is surprising and prevents that use case. Change the condition to `if ttl is not None` to preserve explicit zero values.
Proposed fix
-    cache = AsyncCache(maxsize=maxsize, ttl=ttl) if ttl else TTLCache(maxsize=maxsize, ttl=ttl or 3600)
+    cache = (
+        AsyncCache(maxsize=maxsize, ttl=ttl)
+        if ttl is not None
+        else TTLCache(maxsize=maxsize, ttl=3600)
+    )
pyproject.toml-111-119 (1)
111-119: Inconsistency: `respx` missing from `[tool.uv]` dev-dependencies.
`respx>=0.20.0` is added to `[project.optional-dependencies] dev` (line 38) but not to `[tool.uv] dev-dependencies`. This may cause issues depending on which installation method is used.
Proposed fix
 [tool.uv]
 dev-dependencies = [
     "pytest>=7.4.0",
     "pytest-asyncio>=0.21.0",
     "pytest-cov>=4.1.0",
+    "respx>=0.20.0",
     "mypy>=1.7.0",
     "pre-commit>=3.5.0",
     "ruff>=0.1.0",
 ]
pyproject.toml-28-28 (1)
28-28: Update `giturlparse` version constraint to be more specific.
The constraint `>=0.1.0` is very permissive and allows versions spanning from 0.1.0 to the current latest (0.14.0). Consider pinning to a more specific range (e.g., `>=0.10.0,<0.15.0` or similar) to avoid unexpected behavior from incompatible future releases.
tests/unit/tasks/test_queue.py-91-129 (1)
91-129: Stop workers at test end to avoid leaked background tasks.
Tests that call `start_workers()` should `stop_workers()` in a `finally` block to prevent task leakage across the suite.
🛠️ Suggested fix
 async def test_worker_processes_tasks(self, queue: TaskQueue, sample_payload: dict[str, object]) -> None:
     """Test that worker processes enqueued tasks."""
     handler = AsyncMock()
     # Start the worker
     await queue.start_workers()
-    # Enqueue a task
-    await queue.enqueue(handler, "pull_request", sample_payload)
-
-    # Wait for worker to process
-    await asyncio.sleep(0.1)
-    await queue.queue.join()
-
-    # Verify handler was called
-    assert handler.called
+    try:
+        # Enqueue a task
+        await queue.enqueue(handler, "pull_request", sample_payload)
+
+        # Wait for worker to process
+        await asyncio.sleep(0.1)
+        await queue.queue.join()
+
+        # Verify handler was called
+        assert handler.called
+    finally:
+        await queue.stop_workers()
@@
 async def test_worker_handles_exceptions(self, queue: TaskQueue, sample_payload: dict[str, object]) -> None:
     """Test that worker continues after handler raises exception."""
     # Create a handler that raises an exception
     failing_handler = AsyncMock(side_effect=ValueError("Test error"))
     success_handler = AsyncMock()
     # Start the worker
     await queue.start_workers()
-    # Enqueue failing task
-    await queue.enqueue(failing_handler, "pull_request", sample_payload)
-
-    # Enqueue successful task with different payload
-    different_payload = {**sample_payload, "action": "closed"}
-    await queue.enqueue(success_handler, "pull_request", different_payload)
-
-    # Wait for worker to process both
-    await asyncio.sleep(0.2)
-    await queue.queue.join()
-
-    # Verify both handlers were called despite first one failing
-    assert failing_handler.called
-    assert success_handler.called
+    try:
+        # Enqueue failing task
+        await queue.enqueue(failing_handler, "pull_request", sample_payload)
+
+        # Enqueue successful task with different payload
+        different_payload = {**sample_payload, "action": "closed"}
+        await queue.enqueue(success_handler, "pull_request", different_payload)
+
+        # Wait for worker to process both
+        await asyncio.sleep(0.2)
+        await queue.queue.join()
+
+        # Verify both handlers were called despite first one failing
+        assert failing_handler.called
+        assert success_handler.called
+    finally:
+        await queue.stop_workers()
@@
 async def test_enqueue_with_args_and_kwargs(self, queue: TaskQueue, sample_payload: dict[str, object]) -> None:
     """Test enqueue passes args and kwargs to handler."""
     handler = AsyncMock()
     # Start worker
     await queue.start_workers()
-    # Enqueue with additional args and kwargs
-    event_mock = {"test": "data"}
-    await queue.enqueue(handler, "pull_request", sample_payload, event_mock, timeout=30)
-
-    # Wait for processing
-    await asyncio.sleep(0.1)
-    await queue.queue.join()
-
-    # Verify handler was called with correct args and kwargs
-    assert handler.called
-    call_args, call_kwargs = handler.call_args
-    assert call_args[0] == event_mock
-    assert call_kwargs["timeout"] == 30
+    try:
+        # Enqueue with additional args and kwargs
+        event_mock = {"test": "data"}
+        await queue.enqueue(handler, "pull_request", sample_payload, event_mock, timeout=30)
+
+        # Wait for processing
+        await asyncio.sleep(0.1)
+        await queue.queue.join()
+
+        # Verify handler was called with correct args and kwargs
+        assert handler.called
+        call_args, call_kwargs = handler.call_args
+        assert call_args[0] == event_mock
+        assert call_kwargs["timeout"] == 30
+    finally:
+        await queue.stop_workers()
Also applies to: 155-169
src/api/rate_limit.py-21-28 (1)
21-28: Guard against missing client info when building the rate‑limit key.
`request.client` can be `None` in proxy scenarios and certain test setups, which would raise an `AttributeError` when accessing `.host`. Add a safe fallback.
🛠️ Suggested fix
-    else:
-        key = f"ip:{request.client.host}"
+    else:
+        client_host = request.client.host if request.client else "unknown"
+        key = f"ip:{client_host}"
src/webhooks/handlers/push.py-30-42 (1)
30-42: Unreachable `ImportError` handler is dead code.
The `except ImportError` block (lines 36-39) is unreachable because the `try` block only contains `log.info()` and a `return` statement—neither of which can raise `ImportError`. This appears to be leftover code from a previous implementation that dynamically imported a processor.
Remove the dead code or, if dynamic processor loading is planned, add a TODO comment explaining the intent.
🧹 Proposed fix: Remove unreachable exception handler
 try:
     # Handler is thin—just logs and confirms readiness
     log.info("push_ready_for_processing")
     return WebhookResponse(status="success", detail="Push handler executed", event_type="push")
-except ImportError:
-    # Deployment processor may not exist yet
-    log.warning("deployment_processor_not_found")
-    return WebhookResponse(status="success", detail="Push acknowledged (no processor)", event_type="push")
 except Exception as e:
     log.error("push_processing_failed", error=str(e), exc_info=True)
     return WebhookResponse(status="error", detail=f"Push processing failed: {str(e)}", event_type="push")
tests/integration/test_repo_analysis.py-60-66 (1)
60-66: Align `ai_generated_rate` expectation with the model default.
If `HygieneMetrics.ai_generated_rate` defaults to `None` (as in the core model), this assertion will fail; either set it to `0.0` during hygiene computation or update the test expectation.
💡 Possible test adjustment (if default is None)
-    assert metrics.ai_generated_rate == 0.0
+    assert metrics.ai_generated_rate is None
src/agents/acknowledgment_agent/agent.py-212-233 (1)
212-233: Avoid truthiness checks for required kwargs.
Empty but valid payloads (e.g., `{}` or `[]`) will be rejected. Prefer `is not None` checks and let `evaluate_acknowledgment` handle content validation.
💡 Suggested fix
-    if acknowledgment_reason and violations and pr_data and commenter and rules:
+    if (
+        acknowledgment_reason is not None
+        and violations is not None
+        and pr_data is not None
+        and commenter is not None
+        and rules is not None
+    ):
tests/unit/webhooks/test_router.py-108-123 (1)
108-123: Fix inconsistent comment: states 202 but asserts 200.
The comment on line 120 says "Should return 202 for unsupported events per router logic" but the assertion on line 121 checks for status code 200. Update the comment to match the actual expected behavior.
📝 Suggested fix
-    # Should return 202 for unsupported events per router logic
-    assert response.status_code == 200
+    # Router returns 200 with "received" status for unsupported events
+    assert response.status_code == 200
src/agents/repository_analysis_agent/nodes.py-91-92 (1)
91-92: Potential KeyError if API response item lacks 'name' key.
If the GitHub API returns an item in `files` without a `name` field, this list comprehension will raise a `KeyError`.
Proposed defensive fix
-    file_names = [f["name"] for f in files] if files else []
+    file_names = [f.get("name") for f in files if f.get("name")] if files else []
src/agents/repository_analysis_agent/nodes.py-51-57 (1)
51-57: Inconsistent key access pattern may cause KeyError.
`pr_data["number"]` uses direct key access while other fields use `.get()` with defaults. If the API response is malformed or missing the `number` field, this will raise a `KeyError`.
Proposed fix
 return PRSignal(
-    pr_number=pr_data["number"],
+    pr_number=pr_data.get("number", 0),
     has_linked_issue=pr_data.get("has_issue_ref", False),
     author_association=pr_data.get("author_association", "NONE"),
     is_ai_generated_hint=is_ai_generated,
     lines_changed=pr_data.get("lines_changed", 0),
 )
src/integrations/github/api.py-154-161 (1)
154-161: Unreachable code after `raise_for_status()`.
Line 161 (`return []`) is unreachable. When `response.status != 200`, `raise_for_status()` raises an exception, so the subsequent `return []` never executes.
Proposed fix
 async with session.get(url, headers=headers) as response:
     if response.status == 200:
         data = await response.json()
         return data if isinstance(data, list) else [data]
     # Raise exception for error statuses to avoid silent failures
     response.raise_for_status()
-    return []
🧹 Nitpick comments (24)
src/rules/models.py (1)
20-28: LGTM - Clean enum definition following established patterns.
The `RuleCategory` enum correctly inherits from `str, Enum` (consistent with `RuleSeverity`), and the inline comments provide useful context for each category.
One observation: the `Rule` model has a `severity: RuleSeverity` field but no corresponding `category: RuleCategory` field. Since related agent models (`RuleRecommendation`) already use category, consider adding an optional field to the `Rule` model for consistency: `category: RuleCategory | None = None`
src/rules/utils/contributors.py (1)
143-156: Minor duplication in date extraction logic.
The same date parsing pattern is repeated for commits, PRs, and issues. This is functional but could be extracted to a helper if this pattern expands.
src/rules/validators.py (1)
168-187: Use the shared glob matcher to support `**` and caching.
Right now this path uses `_glob_to_regex` + `re.match`, which ignores the newer glob matcher and misses `**` semantics. Reusing `_matches_any` keeps behavior consistent and avoids re‑implementing glob logic.
♻️ Proposed refactor
-    # Glob→regex—simple, not robust. TODO: improve for edge cases.
-    regex_pattern = FilePatternCondition._glob_to_regex(pattern)
-
-    # Pattern match—performance: optimize if file count high.
-    matching_files = [file for file in changed_files if re.match(regex_pattern, file)]
+    # Pattern match—reuse glob matcher to support ** and cached regexes.
+    has_match = any(_matches_any(file, [pattern]) for file in changed_files)
@@
-    if condition_type == "files_not_match_pattern":
-        return len(matching_files) == 0
-    else:
-        return len(matching_files) > 0
+    if condition_type == "files_not_match_pattern":
+        return not has_match
+    return has_match
src/webhooks/handlers/deployment_protection_rule.py (1)
18-27: Use structured key-value logging with structlog.
F-strings embed data into the message, which loses structured fields in JSON output. With the JSONRenderer configured in structlog, pass fields as explicit kwargs for better filtering, aggregation, and observability.
♻️ Suggested refactor
-    logger.info(f"🔄 Enqueuing deployment protection rule event for {event.repo_full_name}")
+    logger.info(
+        "🔄 Enqueuing deployment protection rule event",
+        repo_full_name=event.repo_full_name,
+    )
 ...
-    logger.info(f"✅ Deployment protection rule event enqueued with task ID: {task_id}")
+    logger.info(
+        "✅ Deployment protection rule event enqueued",
+        task_id=task_id,
+        repo_full_name=event.repo_full_name,
+    )
src/webhooks/handlers/check_run.py (2)
18-18: Use structured logging parameters instead of f-strings.
When using structlog, prefer keyword arguments over f-strings to get proper structured log output that can be parsed and queried by log aggregation systems.
♻️ Proposed fix
-    logger.info(f"🔄 Enqueuing check run event for {event.repo_full_name}")
+    logger.info("Enqueuing check run event", repo=event.repo_full_name)
27-27: Same structured logging issue.
Apply the same refactor here for consistent structured logging.
♻️ Proposed fix
-    logger.info(f"✅ Check run event enqueued with task ID: {task_id}")
+    logger.info("Check run event enqueued", task_id=task_id)
README.md (1)
239-244: Consider clarifying rate limit scope and section placement.
Two minor suggestions:
The section is under "Contributing & Development" but describes API usage behavior. Consider moving it closer to the "Usage" or "Configuration" sections for better discoverability.
Line 242: "Authenticated users are limited to 100 requests per hour" - consider clarifying whether this is per-user, per-token, or per-IP for consistency with the anonymous limit description.
src/webhooks/auth.py (1)
33-33: Use structured logging parameters for consistency.
For consistency with structlog best practices across the codebase:
♻️ Proposed fix
-    logger.warning("Received a request without the X-Hub-Signature-256 header.")
+    logger.warning("Missing webhook signature header", header="X-Hub-Signature-256")
src/api/rules.py (1)
20-26: Consider checking `result.success` before accessing data.
The comment at line 19 mentions the agent may throw, but per the agent's implementation, it catches exceptions and returns `AgentResult` with `success=False`. The current code doesn't check `result.success`, which could return misleading values (e.g., `supported: False` when the actual issue was an error, not infeasibility).
♻️ Proposed fix
     result = await agent.execute(rule_description=request.rule_text)
-    # Output: keep format stable for frontend. Brittle if agent changes keys.
+    if not result.success:
+        return {
+            "supported": False,
+            "snippet": "",
+            "feedback": result.message,
+            "error": True,
+        }
+
     return {
         "supported": result.data.get("is_feasible", False),
         "snippet": result.data.get("yaml_content", ""),
         "feedback": result.message,
     }
src/integrations/github/schemas.py (1)
11-26: Prefer datetime types for timestamps.
GitHub API timestamps are ISO8601 strings (e.g., "2024-01-15T10:30:45Z"). Using `datetime` lets Pydantic automatically parse them and avoids the manual string handling pattern currently seen in downstream code like `src/rules/utils/contributors.py`, which manually calls `datetime.fromisoformat(date_str.replace("Z", "+00:00"))`.
♻️ Proposed update
-from typing import Any
+from datetime import datetime
+from typing import Any
@@
-    created_at: str | None = None
-    updated_at: str | None = None
-    pushed_at: str | None = None
+    created_at: datetime | None = None
+    updated_at: datetime | None = None
+    pushed_at: datetime | None = None
src/core/config/settings.py (1)
117-119: Missing error handling for malformed environment variables.
If `ANONYMOUS_RATE_LIMIT`, `AUTHENTICATED_RATE_LIMIT`, or `ANALYSIS_TIMEOUT` contain non-numeric values, `int()` will raise an unhandled `ValueError` at startup. Consider wrapping these in try/except with fallback to defaults, similar to how `cors` config handles `JSONDecodeError`.
Proposed defensive parsing
-    self.anonymous_rate_limit = int(os.getenv("ANONYMOUS_RATE_LIMIT", "5"))  # Per hour
-    self.authenticated_rate_limit = int(os.getenv("AUTHENTICATED_RATE_LIMIT", "100"))  # Per hour
-    self.analysis_timeout = int(os.getenv("ANALYSIS_TIMEOUT", "60"))  # Seconds
+    self.anonymous_rate_limit = self._safe_int("ANONYMOUS_RATE_LIMIT", 5)  # Per hour
+    self.authenticated_rate_limit = self._safe_int("AUTHENTICATED_RATE_LIMIT", 100)  # Per hour
+    self.analysis_timeout = self._safe_int("ANALYSIS_TIMEOUT", 60)  # Seconds
Add a helper method:
 def _safe_int(self, env_var: str, default: int) -> int:
     """Parse integer from env var with fallback to default."""
     try:
         return int(os.getenv(env_var, str(default)))
     except ValueError:
         return default
src/integrations/github/service.py (1)
50-60: Consider adding GitHub API recommended headers.
GitHub recommends setting `Accept: application/vnd.github+json` and optionally `X-GitHub-Api-Version` for API stability. This helps ensure consistent response formats.
Proposed improvement
 try:
-    async with httpx.AsyncClient(timeout=60.0) as client:
-        response = await client.get(api_url)
+    headers = {"Accept": "application/vnd.github+json"}
+    async with httpx.AsyncClient(timeout=60.0, headers=headers) as client:
+        response = await client.get(api_url)
src/api/rate_limit.py (1)
30-41: Consider evicting inactive keys to avoid unbounded in‑memory growth.
Keys never expire, so long‑running processes can grow memory with high‑cardinality IP/user traffic. Consider periodic cleanup or a bounded/TTL store.
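One bounded option is to keep the per-key request timestamps in a TTL cache so idle keys simply expire; a sketch using cachetools.TTLCache (the limits and key format are illustrative, not the existing rate limiter):

```python
import time

from cachetools import TTLCache

# At most 10k distinct keys kept at once; entries expire an hour after the last write.
_requests = TTLCache(maxsize=10_000, ttl=3600)


def allow_request(key: str, limit: int, window_seconds: int = 3600) -> bool:
    """Sliding-window check: True if the caller identified by `key` is under `limit`."""
    now = time.monotonic()
    # Keep only timestamps that are still inside the window.
    timestamps = [t for t in _requests.get(key, []) if now - t < window_seconds]
    if len(timestamps) >= limit:
        _requests[key] = timestamps
        return False
    timestamps.append(now)
    _requests[key] = timestamps
    return True
```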
src/webhooks/models.py (1)
1-36: Well-structured webhook models with good documentation.
The models are clean, well-documented with docstrings and field descriptions, and follow Pydantic best practices.
One consideration: the `status` field on `WebhookResponse` (line 34) is a free-form `str`. If the valid statuses are known and fixed (e.g., "success", "received", "error", "queued" as seen in tests and handlers), consider using `Literal` to provide type safety and self-documenting constraints.
♻️ Optional: Constrain status values with Literal
-from pydantic import BaseModel, Field
+from typing import Literal
+from pydantic import BaseModel, Field
+
+WebhookStatus = Literal["success", "received", "error", "queued"]

 class WebhookResponse(BaseModel):
     """Standardized response model for all webhook handlers."""
-    status: str = Field(..., description="Processing status: success, received, error")
+    status: WebhookStatus = Field(..., description="Processing status: success, received, error, queued")
     detail: str | None = Field(None, description="Additional context or error message")
     event_type: str | None = Field(None, description="Normalized GitHub event type")
tests/integration/webhooks/test_webhook_flow.py (1)
15-35: Ensure TaskQueue workers are stopped after tests.
Workers are started but never stopped, which can leave background tasks running and cause warnings or cross-test interference.
One option: async fixture teardown
-import pytest
+import pytest
+import pytest_asyncio
@@
-@pytest.fixture
-def fresh_queue() -> TaskQueue:
-    """Create a fresh task queue for testing."""
-    return TaskQueue()
+@pytest_asyncio.fixture
+async def fresh_queue() -> TaskQueue:
+    """Create a fresh task queue for testing."""
+    queue = TaskQueue()
+    yield queue
+    await queue.stop_workers()
You can also rely on `queue.join()` and drop fixed sleeps to reduce flakiness.
tests/unit/webhooks/test_router.py (2)
10-12: Remove unused function `mock_verify_signature`.
This function is defined but never used. The `app` fixture on line 23 uses `lambda: True` instead. Consider removing this dead code.
🧹 Suggested removal
-async def mock_verify_signature() -> bool:
-    """Mock verification dependency that always returns True."""
-    return True
-
-
157-177: Consider more specific assertion for structured logging validation.
The current assertion `len(calls) > 0` only checks that at least one call contains "webhook_validated". A more specific assertion would verify the exact log event name.
💡 Suggested improvement
 # Verify structured logging was called
 assert mock_logger.info.called
-# Check that webhook_validated was logged
-calls = [call for call in mock_logger.info.call_args_list if "webhook_validated" in str(call)]
-assert len(calls) > 0
+# Verify webhook_validated event was logged
+call_args = [str(call) for call in mock_logger.info.call_args_list]
+assert any("webhook_validated" in arg for arg in call_args), f"Expected 'webhook_validated' log event, got: {call_args}"
src/api/recommendations.py (2)
156-161: Move `yaml` import to module level.
Importing inside the function adds overhead on each request. Move to the top of the file with other imports.
♻️ Suggested fix
At the top of the file (around line 6):
 from pydantic import BaseModel, Field, HttpUrl
+import yaml
Then remove line 157:
     # Generate rules_yaml from recommendations
-    import yaml
-
     rules_output = {"rules": [rec.model_dump(exclude_none=True) for rec in final_state.get("recommendations", [])]}
188-208: Consider using a Pydantic model for `payload` instead of raw `dict`.
Using `payload: dict` requires manual validation (lines 200-208). A Pydantic model would provide automatic validation, type hints, and better API documentation.
♻️ Suggested model
 class ProceedWithPRRequest(BaseModel):
     """Payload for creating a PR with recommended rules."""
     repository_full_name: str = Field(..., description="Target repository in 'owner/repo' format")
     rules_yaml: str = Field(..., description="YAML content for the rules file")
     installation_id: int = Field(..., description="GitHub App installation ID")
     branch_name: str = Field("watchflow/rules", description="Branch name for the PR")

 `@router.post`("/recommend/proceed-with-pr", ...)
 async def proceed_with_pr(
     payload: ProceedWithPRRequest, user: User | None = Depends(get_current_user_optional)
 ):
     # Manual validation no longer needed
     ...
src/agents/repository_analysis_agent/models.py (1)
9-24: Broad exception handling may hide errors.
The `except Exception` on line 22 catches all exceptions including programming errors. Consider catching more specific exceptions or logging the error.
💡 Suggested improvement
 try:
     p = parse(identifier)
     if p.valid and p.owner and p.repo:
         return f"{p.owner}/{p.repo}"
     # Fallback for simple "owner/repo" strings that might fail strict URL parsing
     if "/" in identifier and not identifier.startswith(("http", "git@")):
         return identifier.strip().strip("/")
     return identifier
-except Exception:
+except (ValueError, AttributeError, TypeError) as e:
     # If parsing fails completely, return original to let validator catch it later
+    import structlog
+    structlog.get_logger().debug("parse_github_repo_identifier_fallback", identifier=identifier, error=str(e))
     return identifier
src/integrations/github/api.py (4)
19-41: Remove unused module-level GraphQL query.
This `_PR_HYGIENE_QUERY` definition is never used. The `fetch_pr_hygiene_stats` method at line 1157 defines its own local version with additional fields (`mergedAt`, `additions`, `deletions`, `author`, detailed `reviews`, `files`). The module-level definition is dead code.
Proposed fix - remove dead code
-_PR_HYGIENE_QUERY = """
-query PRHygiene($owner: String!, $repo: String!) {
-  repository(owner: $owner, name: $repo) {
-    pullRequests(last: 20, states: [MERGED, CLOSED]) {
-      nodes {
-        number
-        title
-        body
-        changedFiles
-        comments {
-          totalCount
-        }
-        closingIssuesReferences(first: 1) {
-          totalCount
-        }
-        reviews(first: 1) {
-          totalCount
-        }
-      }
-    }
-  }
-}
-"""
-
276-285: Accessing internal `_loop` attribute is fragile.
`self._session._loop` is an internal/private attribute of `aiohttp.ClientSession`. While the comment acknowledges avoiding the deprecated `.loop` property, accessing `_loop` directly may break in future aiohttp versions without notice.
Consider using a safer approach that doesn't rely on internal attributes:
Alternative approach
-    try:
-        current_loop = asyncio.get_running_loop()
-        # Try to access session's internal loop to check if it's the same
-        # If the session's loop is closed, this will fail
-        if self._session._loop != current_loop or self._session._loop.is_closed():
-            await self._session.close()
-            self._session = aiohttp.ClientSession()
-    except RuntimeError:
-        # No running loop or loop is closed, recreate session
-        self._session = aiohttp.ClientSession()
+    # Safest approach: recreate session if event loop changed
+    # This is common in test environments
+    try:
+        # Attempt a trivial operation to verify the session is usable
+        _ = asyncio.get_running_loop()
+    except RuntimeError:
+        # No running loop, recreate session
+        self._session = aiohttp.ClientSession()
1090-1117: GraphQL API typically requires authentication.
The GitHub GraphQL API requires authentication for most queries. Using `allow_anonymous=True` at line 1110 may cause unexpected failures. Additionally, line 1117 raises a generic `Exception` rather than a more specific type.
Consider:
- Requiring authentication for GraphQL queries (most queries need it)
- Using a custom exception type for clarity
Proposed improvements
-    headers = await self._get_auth_headers(allow_anonymous=True)
+    # GraphQL API requires authentication for most queries
+    headers = await self._get_auth_headers(allow_anonymous=False)
     if not headers:
-        # Fallback or error? GraphQL usually demands auth.
-        # If we have no headers, we likely can't query GraphQL successfully for many fields.
-        # We'll try with empty headers if that's what _get_auth_headers returns (it returns None on failure).
-        # If None, we can't proceed.
         logger.error("GraphQL execution failed: No authentication headers available.")
-        raise Exception("Authentication required for GraphQL query.")
+        raise GitHubGraphQLError([{"message": "Authentication required for GraphQL query."}])
10-17: Inconsistent logging: mixing `logging` and `structlog`.
The module uses standard `logging` (line 17) but also imports and uses `structlog` (line 10, line 983). Creating a new structlog logger inside `fetch_recent_pull_requests` leads to inconsistent log formatting across the module.
Proposed fix - use structlog consistently
 import structlog
 from cachetools import TTLCache
 from tenacity import retry, stop_after_attempt, wait_exponential

 from src.core.config import config
 from src.core.errors import GitHubGraphQLError

-logger = logging.getLogger(__name__)
+logger = structlog.get_logger(__name__)
And remove the local logger creation in `fetch_recent_pull_requests`:
 async def fetch_recent_pull_requests(
     ...
 ) -> list[dict[str, Any]]:
     ...
-
-    logger = structlog.get_logger()
-
     try:
Also applies to: 983-984
 - repo: https://github.com/astral-sh/ruff-pre-commit
-  rev: v0.9.9
+  rev: v0.3.0
🧩 Analysis chain
🌐 Web query:
latest version of ruff linter 2025 2026
💡 Result:
As of January 24, 2026:
- Latest Ruff release (GitHub): 0.14.14, released January 22, 2026. [1]
- Latest Ruff version on PyPI: 0.14.13, uploaded January 15, 2026. [2]
Sources: [1] [2]
Update Ruff to a current version—v0.3.0 is outdated.
The pinned version v0.3.0 is approximately two years old (from early 2024). The current latest version is v0.14.14 (released January 22, 2026). This appears to be a downgrade or typo. Update to at least v0.14.13 or v0.14.14 to benefit from bug fixes, security updates, and improvements across 11+ minor versions.
async def get_current_user_optional(request: Request) -> User | None:
    """
    Auth check—don't fail if missing. Critical for public repo support (Phase 1).
    """
    auth_header = request.headers.get("Authorization")

    if not auth_header:
        return None

    try:
        # Token extraction—fragile if header format changes.
        scheme, token = auth_header.split()
        if scheme.lower() != "bearer":
            return None

        # TODO: Wire to real IdP (Supabase/Auth0). For now, fake user if token present. WARNING: Must verify signature in prod.
        return User(id=123, username="authenticated_user", email="user@example.com", github_token=token)
    except Exception as e:
        logger.warning(f"Failed to parse auth header: {e}")
Bearer token is accepted without verification (auth bypass).
Any request with a Bearer header becomes authenticated, which allows unauthorized access to protected endpoints. This must validate signatures/claims (or reject tokens) before creating a User.
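A minimal sketch of the verification step using PyJWT; the shared-secret/HS256 setup, audience, and claim names are assumptions that would need to match the real IdP (Supabase/Auth0) configuration:

```python
import logging

import jwt  # PyJWT

logger = logging.getLogger(__name__)


def verify_token(token: str, secret: str, audience: str) -> dict | None:
    """Return the validated claims, or None when the token is invalid."""
    try:
        # Checks signature, expiry, and audience; raises on any failure.
        return jwt.decode(token, secret, algorithms=["HS256"], audience=audience)
    except jwt.PyJWTError as exc:
        logger.warning("Token verification failed: %s", exc)
        return None


# Inside get_current_user_optional, a None result would mean "treat as anonymous":
# claims = verify_token(token, secret=settings.jwt_secret, audience="watchflow-api")
# if claims is None:
#     return None
```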
# Check for linked issues (heuristic: look for issue references in body)
body = pr.get("body") or ""
title = pr.get("title") or ""
has_issue_ref = self._detect_issue_references(body, title)
🧩 Analysis chain
🏁 Script executed:
# First, let's search for _detect_issue_references in the entire codebase
echo "=== Searching for _detect_issue_references ==="
rg "_detect_issue_references" -n --type py
echo ""
echo "=== Searching for the method definition specifically ==="
ast-grep --pattern 'def _detect_issue_references($$$)'
echo ""
echo "=== Checking file existence and size ==="
if [ -f "src/integrations/github/api.py" ]; then
wc -l "src/integrations/github/api.py"
echo ""
echo "=== Context around line 1023 ==="
sed -n '1015,1035p' "src/integrations/github/api.py"
fi
Repository: warestack/watchflow
Length of output: 1491
🏁 Script executed:
# Check for method in entire repository
echo "=== Searching entire repo for method definition ==="
rg "def _detect_issue_references" -n --type py
echo ""
echo "=== Check the class definition at start of file ==="
head -100 src/integrations/github/api.py | grep -A 5 "^class"
echo ""
echo "=== Check if method is defined anywhere later in file ==="
grep -n "def _detect" src/integrations/github/api.pyRepository: warestack/watchflow
Length of output: 454
🏁 Script executed:
# List all methods in GitHubClient class
echo "=== All methods in GitHubClient class ==="
rg "^\s{0,4}def " src/integrations/github/api.py -n
echo ""
echo "=== Check for any _detect methods in file ==="
rg "_detect" src/integrations/github/api.py -nRepository: warestack/watchflow
Length of output: 348
Critical: Undefined method _detect_issue_references will cause AttributeError.
Line 1023 calls self._detect_issue_references(body, title), but this method is not defined anywhere in the GitHubClient class. This will raise an AttributeError at runtime when fetch_recent_pull_requests is invoked.
Proposed fix - add the missing method
+ def _detect_issue_references(self, body: str, title: str) -> bool:
+ """
+ Detect if PR body or title contains issue references.
+
+ Looks for common patterns like:
+ - `#123`, fixes `#123`, closes `#123`
+ - owner/repo#123
+ """
+ import re
+ # Pattern for issue references: `#number` or owner/repo#number
+ pattern = r'(?:(?:fix(?:es|ed)?|clos(?:es|ed)?|resolv(?:es|ed)?)\s+)?(?:[\w-]+/[\w-]+)?#\d+'
+ combined_text = f"{title} {body}".lower()
+ return bool(re.search(pattern, combined_text, re.IGNORECASE))
+
 async def fetch_recent_pull_requests(
commits = [
    Commit(
        oid=node["commit"]["oid"],
        message=node["commit"]["message"],
        author=node["commit"].get("author", {}).get("name", "Unknown"),
    )
Guard against null commit author to avoid AttributeError.
If GraphQL returns author: null, .get("author", {}) returns None and .get("name") crashes.
Proposed fix
- author=node["commit"].get("author", {}).get("name", "Unknown"),
+ author=(node["commit"].get("author") or {}).get("name", "Unknown"),
async def enqueue(
    self,
    func: Callable[..., Coroutine[Any, Any, Any]],
    event_type: str,
    payload: dict[str, Any],
    *args: Any,
    **kwargs: Any,
) -> bool:
    """Adds a task to the queue if it is not a duplicate."""
    task_id = self._generate_task_id(event_type, payload)

    if task_id in self.processed_hashes:
        logger.info("task_skipped_duplicate", task_id=task_id, event_type=event_type)
        return False

    task = Task(task_id=task_id, event_type=event_type, payload=payload, func=func, args=args, kwargs=kwargs)
    await self.queue.put(task)
    self.processed_hashes.add(task_id)
Dedup entries never expire and are marked before execution.
This can (a) grow memory unbounded with unique events and (b) permanently drop retries if a task fails because the hash is already marked processed. Consider deferring “processed” marking until success and adding TTL/eviction for old hashes.
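A sketch of the suggested shape: duplicates are checked on enqueue, but the hash is recorded only after the handler succeeds, and the dedup store is a TTLCache so entries age out. Names and parameters are illustrative, not a drop-in replacement for the existing TaskQueue:

```python
import asyncio
import hashlib
import json
from typing import Any, Callable, Coroutine

from cachetools import TTLCache


class DedupTaskQueue:
    """Illustrative queue: duplicates are checked on enqueue, recorded only on success."""

    def __init__(self, dedup_ttl_seconds: int = 3600) -> None:
        self.queue: asyncio.Queue = asyncio.Queue()
        # Bounded, expiring dedup store instead of an ever-growing set.
        self.processed: TTLCache = TTLCache(maxsize=10_000, ttl=dedup_ttl_seconds)
        self._lock = asyncio.Lock()

    @staticmethod
    def _task_id(event_type: str, payload: dict[str, Any]) -> str:
        raw = f"{event_type}:{json.dumps(payload, sort_keys=True)}"
        return hashlib.sha256(raw.encode()).hexdigest()

    async def enqueue(
        self, func: Callable[..., Coroutine[Any, Any, Any]], event_type: str, payload: dict[str, Any]
    ) -> bool:
        task_id = self._task_id(event_type, payload)
        async with self._lock:
            if task_id in self.processed:
                return False
        await self.queue.put((task_id, func, payload))
        return True

    async def worker(self) -> None:
        while True:
            task_id, func, payload = await self.queue.get()
            try:
                await func(payload)
            except Exception:
                # Failed tasks are NOT marked processed, so a re-delivered event can retry.
                pass
            else:
                async with self._lock:
                    self.processed[task_id] = True
            finally:
                self.queue.task_done()
```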
# DEBUG: Log all headers to debug missing signature issue
logger.info("request_headers_received", headers=dict(request.headers))
Security: Remove or reduce header logging - potential secrets exposure.
Logging all request headers at INFO level can expose sensitive data in production logs:
- `Authorization` headers with tokens
- `Cookie` headers with session data
- Custom API keys in headers
The comment says "DEBUG" but uses logger.info(). Either remove this debug code before merging, or at minimum:
- Change to `logger.debug()`
- Filter out sensitive headers (`Authorization`, `Cookie`, etc.)
🔒 Proposed fix - remove debug logging
- # DEBUG: Log all headers to debug missing signature issue
- logger.info("request_headers_received", headers=dict(request.headers))
-
if not signature:🔒 Alternative - filter sensitive headers
- # DEBUG: Log all headers to debug missing signature issue
- logger.info("request_headers_received", headers=dict(request.headers))
+ # Log non-sensitive headers for debugging
+ safe_headers = {k: v for k, v in request.headers.items()
+ if k.lower() not in ("authorization", "cookie", "x-api-key")}
+ logger.debug("request_headers_received", headers=safe_headers)
🤖 Prompt for AI Agents
In `@src/webhooks/auth.py` around lines 28 - 30, The current call
logger.info("request_headers_received", headers=dict(request.headers)) exposes
sensitive headers; either remove this debug logging entirely or change it to
logger.debug and sanitize headers first by filtering out Authorization, Cookie,
Set-Cookie, Proxy-Authorization, and any custom API key headers from
request.headers before logging; update the code around the logger.info call (the
request_headers_received logging) to build a safe headers dict that masks or
omits those sensitive keys and log that instead.
# Enqueue: async, may fail if queue overloaded.
task_id = await task_queue.enqueue(
    event_type="deployment", repo_full_name=repo_full_name, installation_id=installation_id, payload=payload
)
enqueue call no longer matches TaskQueue API (missing handler + return type).
TaskQueue.enqueue now requires a handler function and returns bool, but this call provides no func and treats the return as task_id. This will fail at runtime and returns the wrong response shape.
🤖 Prompt for AI Agents
In `@src/webhooks/handlers/deployment.py` around lines 36 - 39, The current call
to TaskQueue.enqueue incorrectly omits the required handler argument and
misinterprets its boolean return as task_id; update the call to pass the
appropriate handler function (e.g., deployment_handler or the actual handler
used for deployment tasks) as func and treat the return value as a success flag
(bool) instead of an ID: call await task_queue.enqueue(func=deployment_handler,
event_type="deployment", repo_full_name=repo_full_name,
installation_id=installation_id, payload=payload), then check the boolean result
and if true proceed to build/return the correct response (generate or fetch a
real task_id separately if your API requires one) and if false log/return an
error (e.g., 503) — reference TaskQueue.enqueue and the local variable task_id
(replace its usage with a success check or a real id lookup).
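A sketch of the corrected call shape, mirroring the prompt above; `handle_deployment_task` is an assumed handler coroutine and the 503 response is one possible failure policy, not the current behavior:

```python
from fastapi import HTTPException


async def enqueue_deployment(task_queue, handle_deployment_task, repo_full_name: str,
                             installation_id: int, payload: dict) -> None:
    """Illustrative wrapper: enqueue returns a success flag under the new API, not a task id."""
    enqueued = await task_queue.enqueue(
        func=handle_deployment_task,
        event_type="deployment",
        repo_full_name=repo_full_name,
        installation_id=installation_id,
        payload=payload,
    )
    if not enqueued:
        # False means the event was a duplicate or the queue rejected it.
        raise HTTPException(status_code=503, detail="Deployment task could not be queued.")
```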
installation_id = event.installation_id

logger.info(f"🔄 Processing comment from {commenter}: {comment_body[:50]}...")
logger.info("comment_processed", commenter=commenter, body_length=len(comment_body))
Runtime error: Standard logging does not support keyword arguments.
Line 33 uses logger.info("comment_processed", commenter=commenter, body_length=len(comment_body)), but the logger on line 12 is initialized via logging.getLogger(__name__). Standard Python logging does not accept arbitrary keyword arguments—this will raise a TypeError at runtime.
Either switch to structlog (as used in other parts of this PR) or use standard logging format strings.
Option 1: Use standard logging
- logger.info("comment_processed", commenter=commenter, body_length=len(comment_body))
+ logger.info("comment_processed commenter=%s body_length=%d", commenter, len(comment_body))Option 2: Switch to structlog (consistent with PR direction)
-import logging
+import structlog
-logger = logging.getLogger(__name__)
+logger = structlog.get_logger(__name__)

📝 Committable suggestion
| logger.info("comment_processed", commenter=commenter, body_length=len(comment_body)) | |
| logger.info("comment_processed commenter=%s body_length=%d", commenter, len(comment_body)) |
🤖 Prompt for AI Agents
In `@src/webhooks/handlers/issue_comment.py` at line 33, The log call using
logger.info("comment_processed", commenter=commenter,
body_length=len(comment_body)) is invalid for the standard logging.Logger;
update the logger usage in the issue comment handler to either (a) convert the
module logger to a structlog logger (align with PR) and ensure the logger at
module scope is created via structlog.get_logger() before calling info with
kwargs, or (b) keep the existing logging.getLogger(__name__) and change the call
in the handler to a formatted message using positional/format placeholders or
keyword interpolation (e.g., include commenter and body_length in the message
string) so no kwargs are passed to logging.Logger.info; locate the logger
initialization and the info call in the issue comment handler to make the
consistent change.
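For reference, a small standalone comparison of the two options (assuming `structlog` is installed): stdlib `logging` rejects arbitrary keyword arguments, while `structlog` treats them as event fields.

```python
import logging

import structlog

std_logger = logging.getLogger(__name__)
struct_logger = structlog.get_logger(__name__)

# Option 1: stdlib logging with printf-style interpolation -- works.
std_logger.info("comment_processed commenter=%s body_length=%d", "octocat", 42)

# Option 2: structlog accepts arbitrary key/value event fields -- works.
struct_logger.info("comment_processed", commenter="octocat", body_length=42)

# The original call would raise TypeError on a stdlib Logger, which only accepts
# exc_info / stack_info / stacklevel / extra as keyword arguments:
# std_logger.info("comment_processed", commenter="octocat", body_length=42)
```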
comment_body = comment_body.strip()

logger.info(f"🔍 Extracting acknowledgment reason from: '{comment_body}'")
logger.info("extracting_acknowledgment_reason", body_length=len(comment_body))
Same logging issue: keyword arguments with standard logger.
This line also uses structlog-style keyword arguments with the standard logging module, which will fail at runtime.
Fix to use standard logging format
- logger.info("extracting_acknowledgment_reason", body_length=len(comment_body))
+ logger.info("extracting_acknowledgment_reason body_length=%d", len(comment_body))🤖 Prompt for AI Agents
In `@src/webhooks/handlers/issue_comment.py` at line 153, The logger call uses
structlog-style keyword args which the standard logging module doesn't support;
update the logger.info invocation that references logger and comment_body (the
line currently "logger.info('extracting_acknowledgment_reason',
body_length=len(comment_body))") to use standard logging formatting and
positional args (e.g. include body_length in the format string and pass
len(comment_body) as an argument) so the message logs correctly at runtime.
…improvements
- Codeowner bypass rate calculation
- Issue-diff mismatch detection
- require_linked_issue validator
- Rate limit handling with warnings
- New API endpoints: auth/validate-token, repos/installation
- Standardize responses (rule_yaml)
- Refactor prompts (concise, schema-driven)
Actionable comments posted: 10
🤖 Fix all issues with AI agents
In `@src/agents/repository_analysis_agent/nodes.py`:
- Around line 356-383: The regex in variable file_pattern and the handling of
matches is returning only the captured extension (due to the capturing group) so
mentioned_files contains extensions instead of full paths; update file_pattern
to use a non-capturing group (e.g., replace "(py|ts|...)" with "(?:py|ts|...)"
or otherwise ensure the full match is returned), call re.findall to get
full-path matches, and simplify the population of mentioned_files by extending
with matches directly (remove the m[0] tuple handling); keep the subsequent
comparison logic that uses mentioned_files and changed_file_names and leave
issue_diff_mismatch_count increments unchanged.
In `@src/api/auth.py`:
- Around line 91-96: The except block currently returns an HTTPException with
the raw exception string (detail=f"Failed to validate token: {str(e)}"), which
can leak sensitive info; change the HTTPException detail to a generic message
(e.g., "Failed to validate token") while keeping the full exception logged via
logger.exception (reference symbols: logger.exception, HTTPException,
status.HTTP_500_INTERNAL_SERVER_ERROR) so internal logs preserve details but
clients receive no sensitive information.
- Around line 64-67: The HTTPException currently includes raw error_text from
GitHub in the response; change it to return a generic client-facing message
(e.g., "Failed to validate token") and move the raw error_text to server-side
logs instead: call the existing logger (e.g., logger.error or app_logger.error)
to record error_text and any context before raising the HTTPException without
the sensitive details, keeping the
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and referencing the same raise
HTTPException(...) site and the error_text variable for logging.
In `@src/api/rate_limit.py`:
- Around line 27-32: The current rate-limit key logic can raise AttributeError
when request.client is None and incorrectly uses "user:None" when user.email is
None; update the branch in the rate limiting code to only treat requests as
authenticated when user and user.email are truthy (use user.email explicitly),
and compute client_host defensively (e.g., client_host = request.client.host if
request.client and getattr(request.client, "host", None) else
request.headers.get("x-forwarded-for", "unknown")), then set key =
f"user:{user.email}" and limit = AUTH_LIMIT only when user.email is present,
otherwise set key = f"ip:{client_host}" and limit = ANON_LIMIT; reference the
variables/user checks and AUTH_LIMIT/ANON_LIMIT in the change.
In `@src/api/recommendations.py`:
- Around line 448-451: The log currently records raw client IP via
request.client.host into client_ip and passes it to
logger.info("analysis_requested", ...), which may be PII; change this by either
omitting the raw IP or replacing it with a non-reversible identifier (e.g., a
salted hash/HMAC using an application secret) before logging. Locate the
client_ip assignment and the logger.info call in recommendations.py (references:
client_ip, request.client.host, user.email, logger.info("analysis_requested"))
and update the code to compute and log a hashed_ip or simply exclude the IP
field based on a config flag (e.g., LOG_RAW_IPS=false), ensuring the hashing
uses a stable secret so the identifier can be correlated without exposing the
plain IP.
In `@src/api/repos.py`:
- Around line 40-45: The handler currently exposes internal exception text in
the API response; keep detailed info in logs but return a generic message to
clients—remove str(e) from the HTTPException.detail and replace with a
non-sensitive string like "Failed to check installation" (or similar), while
retaining logger.exception("installation_check_failed", repo=repo_full_name,
error=str(e)) or better: keep logger.exception("installation_check_failed",
repo=repo_full_name) so the full exception is recorded server-side; update the
raise HTTPException call (status.HTTP_500_INTERNAL_SERVER_ERROR) to use the
generic detail and do not include the exception string.
In `@src/integrations/github/api.py`:
- Around line 1098-1109: The except block is catching httpx.HTTPStatusError in
aiohttp-based code; change the exception type to aiohttp.ClientResponseError and
log the correct status attribute (use e.status instead of
e.response.status_code) so the logger call in this handler (where
logger.error("pr_fetch_http_error", repo=repo_full_name, status_code=...,
error_type="network_error", error=str(e))) reports the proper aiohttp error
details; keep the existing fallback except Exception block unchanged.
- Around line 153-160: The code after response.raise_for_status() is
unreachable; update the async GET block (the "async with session.get(...)" scope
that calls response.raise_for_status()) to either remove the unreachable "return
[]" or catch the HTTP error and return [] for non-200 responses to match other
methods in this class — e.g., wrap response.raise_for_status() in a try/except
for aiohttp.ClientResponseError (or a broad Exception), logging or swallowing
the error and returning [] from the function instead of leaving a dead return
statement.
In `@src/rules/validators.py`:
- Around line 988-1049: In validate (async def validate), guard and normalize
the commits variable before iterating: when obtaining commits from
event.get("commits", []) or pull_request.get("commits", []), check
isinstance(commits, list) (or iterable of dicts) and if not, set commits = []
(or skip commit scanning) to avoid iterating an integer commit count from
pull_request; ensure the commit_message loop only runs when commits is a list of
dicts with "message" keys.
In `@src/tasks/task_queue.py`:
- Around line 68-79: The worker currently logs failures and drops tasks; modify
_worker to support bounded retries and a dead-letter queue: add a retry counter
on the Task (e.g., task.retries and task.max_retries) or maintain a separate
retry map, and on exception increment task.retries and if retries < max_retries
re-enqueue the task (optionally with a small backoff) otherwise push the task
into a dead_letter_queue (e.g., self.dead_letter_queue) for inspection; ensure
queue.task_done() is still called once per dequeue and include task identifiers
(task.task_id) when moving to dead-letter so failed work can be recovered or
inspected later.
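For the task-queue item above, a minimal sketch of the retry/dead-letter shape, meant to slot into the existing `TaskQueue` class; it assumes `retries`/`max_retries` counters are added to `Task` and that a plain list serves as the dead-letter store, neither of which exists in the current code, and it relies on the module's existing `asyncio` import and structlog logger:

```python
async def _worker(self) -> None:
    """Illustrative worker loop: bounded retries with backoff, then dead-letter."""
    while True:
        task = await self.queue.get()
        try:
            await task.func(*task.args, **task.kwargs)
            # Mark processed only after success so failures remain retryable.
            self.processed_hashes.add(task.task_id)
        except Exception:
            logger.exception("task_failed", task_id=task.task_id, retries=task.retries)
            if task.retries < task.max_retries:
                task.retries += 1
                await asyncio.sleep(2 ** task.retries)  # crude backoff; blocks only this worker
                await self.queue.put(task)
            else:
                self.dead_letter_queue.append(task)  # assumed list, kept for inspection
        finally:
            self.queue.task_done()
```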
♻️ Duplicate comments (5)
src/tasks/task_queue.py (1)

46-66: Deduplication concerns already flagged.
The issues with marking processed_hashes before task execution (blocking retries on failure) and unbounded memory growth have been raised in a previous review. Please address the earlier feedback.

src/api/dependencies.py (1)

38-40: Auth bypass: Any Bearer token creates an authenticated user without validation.
This returns a hardcoded User(id=123, username="authenticated_user") for any Bearer token, which allows unauthorized access. While the comment notes this is intentional for the "open-source version," the fabricated user identity could cause issues in audit logs and any user-specific logic downstream.
Consider at minimum validating the token against GitHub's /user endpoint (similar to validate_token in src/api/auth.py) to populate real user data, or clearly documenting this limitation in API documentation.

src/integrations/github/api.py (1)

1057-1060: Critical: _detect_issue_references method is called but not defined.
Line 1060 calls self._detect_issue_references(body, title), but this method does not exist in the GitHubClient class. This will raise an AttributeError at runtime when fetch_recent_pull_requests processes merged PRs.

🐛 Proposed fix - add the missing method

+    def _detect_issue_references(self, body: str, title: str) -> bool:
+        """
+        Detect if PR body or title contains issue references.
+
+        Looks for common patterns like:
+        - `#123`, fixes `#123`, closes `#123`
+        - owner/repo#123
+        """
+        import re
+        combined_text = f"{title} {body}".lower()
+        # Pattern matches: `#number` or owner/repo#number, optionally preceded by keywords
+        pattern = r'(?:(?:fix(?:es|ed)?|clos(?:es|ed)?|resolv(?:es|ed)?)\s+)?(?:[\w-]+/[\w-]+)?#\d+'
+        return bool(re.search(pattern, combined_text, re.IGNORECASE))
+
     async def fetch_recent_pull_requests(

src/main.py (1)

116-122: CORS configuration with wildcard origins and credentials is invalid.
This was flagged in a previous review. Per the Fetch specification, Access-Control-Allow-Origin: * cannot be used with Access-Control-Allow-Credentials: true. Browsers will block credentialed requests with this configuration.

src/rules/validators.py (1)

238-243: has_min_approvals always passes.
This validator is registered and currently returns True unconditionally, so approval gating is bypassed. Implement the approval count (or delegate to MinApprovalsCondition).

🐛 Proposed fix

-        # Unused: min_approvals—left for future logic.
-        # min_approvals = parameters.get("min_approvals", 1)
-
-        # TODO: Check actual PR reviews. Always returns True—demo only.
-        return True
+        min_approvals = parameters.get("min_approvals", 1)
+        reviews = event.get("reviews", [])
+        approved_count = sum(1 for review in reviews if review.get("state") == "APPROVED")
+        logger.debug(
+            "ApprovalCountCondition: PR has %d approvals, requires %d",
+            approved_count,
+            min_approvals,
+        )
+        return approved_count >= min_approvals
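For the CORS duplicate above (src/main.py), a minimal configuration sketch that keeps credentials while listing origins explicitly; the origin values are placeholders, not the project's actual deployment hosts:

```python
from fastapi import FastAPI
from fastapi.middleware.cors import CORSMiddleware

app = FastAPI()

# Browsers reject "Access-Control-Allow-Origin: *" together with credentials,
# so the allowed origins must be enumerated when allow_credentials=True.
app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://watchflow.dev", "http://localhost:3000"],  # placeholder origins
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"],
)
```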
🧹 Nitpick comments (18)
src/event_processors/pull_request.py (1)
319-327: URL‑encode query params when building the landing URL.Encoding avoids malformed links and keeps the generated URLs robust as inputs evolve.
♻️ Proposed fix
@@ -import time +import time +import urllib.parse @@ - landing_url = "https://watchflow.dev" - if repo_full_name and installation_id: - landing_url = ( - f"https://watchflow.dev/analyze?installation_id={installation_id}&repo={repo_full_name}" - ) - elif repo_full_name: - landing_url = f"https://watchflow.dev/analyze?repo={repo_full_name}" + landing_url = "https://watchflow.dev" + if repo_full_name and installation_id: + query = urllib.parse.urlencode( + {"installation_id": installation_id, "repo": repo_full_name} + ) + landing_url = f"https://watchflow.dev/analyze?{query}" + elif repo_full_name: + query = urllib.parse.urlencode({"repo": repo_full_name}) + landing_url = f"https://watchflow.dev/analyze?{query}"src/tasks/task_queue.py (3)
19-19: The| Anyunion defeats type safety.The type
Callable[..., Coroutine[Any, Any, Any]] | Anyaccepts any value, making the annotation meaningless. If the intent is to accept async callables, remove| Any.Suggested fix
- func: Callable[..., Coroutine[Any, Any, Any]] | Any = Field(..., description="Handler function to execute") + func: Callable[..., Coroutine[Any, Any, Any]] = Field(..., description="Handler function to execute")
40-44:json.dumpsmay fail on non-serializable payload values.If
payloadcontains non-JSON-serializable types (e.g.,datetime,bytes, custom objects),json.dumpswill raiseTypeError. Consider wrapping in a try/except with a fallback (e.g.,repr(payload)) or use a custom encoder.Suggested defensive approach
def _generate_task_id(self, event_type: str, payload: dict[str, Any]) -> str: """Creates a unique hash for deduplication.""" - payload_str = json.dumps(payload, sort_keys=True) + try: + payload_str = json.dumps(payload, sort_keys=True, default=str) + except (TypeError, ValueError): + payload_str = repr(payload) raw_string = f"{event_type}:{payload_str}" return hashlib.sha256(raw_string.encode()).hexdigest()
89-96: Pending tasks are orphaned when workers stop.
stop_workerscancels workers immediately without draining the queue. Any tasks still inself.queueare lost. For graceful shutdown, consider awaitingself.queue.join()with a timeout before cancelling workers.Suggested graceful shutdown
async def stop_workers(self) -> None: - """Stops the background workers.""" + """Stops the background workers after draining the queue.""" if self.workers: + # Wait for queue to drain (with timeout) + try: + await asyncio.wait_for(self.queue.join(), timeout=30.0) + except asyncio.TimeoutError: + logger.warning("task_queue_drain_timeout", pending=self.queue.qsize()) for task in self.workers: task.cancel() await asyncio.gather(*self.workers, return_exceptions=True) self.workers.clear() logger.info("task_queue_workers_stopped")src/api/rate_limit.py (1)
34-45: Memory leak: Empty rate limit entries are never removed.After filtering expired timestamps, if the list is empty, the key remains in
_RATE_LIMIT_STORE. For high-traffic anonymous endpoints with many unique IPs, this can lead to unbounded memory growth until restart.♻️ Proposed fix
timestamps = _RATE_LIMIT_STORE.get(key, []) # Remove timestamps outside the window timestamps = [ts for ts in timestamps if now - ts < WINDOW] + + # Clean up empty entries to prevent memory leak + if not timestamps: + _RATE_LIMIT_STORE.pop(key, None) + if len(timestamps) >= limit: retry_after = int(WINDOW - (now - min(timestamps))) raise HTTPException( status_code=status.HTTP_429_TOO_MANY_REQUESTS, detail=f"Rate limit exceeded. Try again in {retry_after} seconds.", headers={"Retry-After": str(retry_after)}, ) timestamps.append(now) _RATE_LIMIT_STORE[key] = timestampssrc/api/auth.py (1)
51-51: Accessing private method_get_session()creates fragile coupling.Using
github_client._get_session()directly bypasses the public API and couples this code to internal implementation details. IfGitHubClientrefactors session management, this code will break.♻️ Suggested alternatives
Consider either:
- Exposing a public method on
GitHubClientfor token validation- Using
httpx.AsyncClientdirectly here since this is a one-off validation call# Option 2: Direct httpx usage async with httpx.AsyncClient(timeout=30.0) as client: async with client.get(url, headers=headers) as response: ...src/integrations/github/api.py (2)
274-284: Accessing private_loopattribute is fragile.Directly accessing
self._session._looprelies on aiohttp's internal implementation. While the try/except provides a fallback, this could break silently with aiohttp updates.Consider a simpler approach that doesn't rely on internals:
# Simpler: always recreate session if we can't verify it's usable try: if self._session is None or self._session.closed: self._session = aiohttp.ClientSession() except Exception: self._session = aiohttp.ClientSession()
1199-1289: Significant code duplication between_PR_HYGIENE_QUERY_20and_PR_HYGIENE_QUERY_10.These two GraphQL queries are nearly identical, differing only in
last: 20vslast: 10. Consider parameterizing the query or using a template.♻️ Proposed refactor
def _build_pr_hygiene_query(limit: int) -> str: return f""" query PRHygiene($owner: String!, $repo: String!) {{ repository(owner: $owner, name: $repo) {{ pullRequests(last: {limit}, states: [MERGED, CLOSED]) {{ nodes {{ number title body changedFiles mergedAt additions deletions author {{ login }} authorAssociation comments {{ totalCount }} closingIssuesReferences(first: 1) {{ totalCount nodes {{ title }} }} reviews(first: 10) {{ nodes {{ state, author {{ login }} }} }} files(first: 10) {{ edges {{ node {{ path }} }} }} }} }} }} }} """Then call
_build_pr_hygiene_query(20)or_build_pr_hygiene_query(10).src/api/repos.py (1)
34-36: Stub implementation always returnsinstalled: False.This endpoint currently returns a hardcoded response without actually checking installation status. The TODO indicates this needs implementation via the GitHub App API.
Would you like me to help implement the actual installation check using the
/app/installationsendpoint, or should I open an issue to track this work?src/agents/repository_analysis_agent/agent.py (1)
27-57: Docstring describes 3 steps but graph registers 5 nodes.The docstring mentions a 3-step flow (
fetch_metadata,fetch_pr_signals,generate_rules) but the actual graph has 5 nodes includinggenerate_reportandgenerate_reasonings. Consider updating the docstring to reflect the actual flow.Suggested docstring update
def _build_graph(self) -> Any: """ Constructs the state graph for the analysis workflow. Flow: - 1. `fetch_metadata`: Gathers static repo facts (languages, file structure). - 2. `fetch_pr_signals`: Analyzes dynamic history (PR hygiene, AI usage). - 3. `generate_rules`: Synthesizes data into governance recommendations. + 1. `fetch_metadata`: Gathers static repo facts (languages, file structure). + 2. `fetch_pr_signals`: Analyzes dynamic history (PR hygiene, AI usage). + 3. `generate_report`: Diagnoses problems based on hygiene metrics. + 4. `generate_rules`: Prescribes governance rules to address problems. + 5. `generate_reasonings`: Explains why each rule was recommended. Returns: Compiled StateGraph ready for execution. """src/main.py (1)
148-162: Multiple iterations over task collection can be optimized.The current implementation iterates over
tasksfour times (once per status count). For large task queues, a single pass would be more efficient.Suggested optimization
`@app.get`("/health/tasks", tags=["Health Check"]) async def health_tasks(): """Check the status of background tasks.""" - tasks = task_queue.tasks.values() - pending_count = sum(1 for t in tasks if t.status.value == "pending") - running_count = sum(1 for t in tasks if t.status.value == "running") - completed_count = sum(1 for t in tasks if t.status.value == "completed") - failed_count = sum(1 for t in tasks if t.status.value == "failed") + tasks = list(task_queue.tasks.values()) + counts = {"pending": 0, "running": 0, "completed": 0, "failed": 0} + for t in tasks: + status = t.status.value + if status in counts: + counts[status] += 1 return { "task_queue_status": "running", "workers": len(task_queue.workers), "tasks": { - "pending": pending_count, - "running": running_count, - "completed": completed_count, - "failed": failed_count, + **counts, "total": len(tasks), }, }src/agents/repository_analysis_agent/models.py (3)
11-26: Silent exception handling may mask parsing issues.The
except Exceptionat Line 24 silently catches all exceptions and returns the original identifier. While the comment explains this is intentional, consider logging a warning to aid debugging when parsing fails unexpectedly.Suggested improvement
+import structlog + +logger = structlog.get_logger() + def parse_github_repo_identifier(identifier: str) -> str: """ Normalizes various GitHub identifiers into 'owner/repo' format using giturlparse. Used by tests to verify repository strings. """ try: p = parse(identifier) if p.valid and p.owner and p.repo: return f"{p.owner}/{p.repo}" # Fallback for simple "owner/repo" strings that might fail strict URL parsing if "/" in identifier and not identifier.startswith(("http", "git@")): return identifier.strip().strip("/") return identifier except Exception: - # If parsing fails completely, return original to let validator catch it later + # If parsing fails completely, return original to let validator catch it later + logger.debug("github_identifier_parse_failed", identifier=identifier) return identifier
90-100: Consider adding validation for severity field.The
severityfield accepts any string but should be constrained to valid values:"low","medium","high", or"critical".Suggested validation
+from typing import Literal + class RuleRecommendation(BaseModel): """ Represents a single rule suggested by the AI. Contains only fields that go into rules.yaml file. """ description: str = Field(..., description="What the rule does") enabled: bool = Field(True, description="Whether the rule is enabled") - severity: str = Field("medium", description="low, medium, high, or critical") + severity: Literal["low", "medium", "high", "critical"] = Field( + "medium", description="Rule severity level" + ) event_types: list[str] = Field(..., description="Event types this rule applies to (e.g., ['pull_request'])") parameters: dict[str, Any] = Field(default_factory=dict, description="Rule parameters for validators")
136-138: Consider excludinguser_tokenfrom serialization to prevent accidental exposure.The
user_tokenfield contains a sensitive GitHub Personal Access Token. While it's needed during execution, it could be accidentally logged or serialized. Consider using Pydantic'sSecretStrorexclude=True.Suggested improvement
+from pydantic import SecretStr + class AnalysisState(BaseModel): # --- Inputs --- repo_full_name: str is_public: bool = False - user_token: str | None = Field( - None, description="Optional GitHub Personal Access Token for authenticated API requests" + user_token: str | None = Field( + None, + description="Optional GitHub Personal Access Token for authenticated API requests", + exclude=True, # Prevent accidental serialization/logging )src/api/recommendations.py (2)
275-289: Potential issue with AI Generated Rate metric handling.The
explanationlambda at Line 286 attempts to formatveven whenvmight beNone. While the condition at Line 287 checks forNone, theelseclause string doesn't usev, but the structure is error-prone. The current logic works but is confusing.Suggested clarification
{ "name": "AI Generated Rate", "key": "ai_generated_rate", "category": "Quality", "thresholds": {"high": 0.2, "medium": 0.1}, - "explanation": lambda v: ( + "explanation": lambda v: ( + "AI detection not available." + if v is None + else ( f"High AI-generated content ({v:.0%}) may indicate low-effort contributions." if v >= 0.2 else f"Some AI-generated content detected ({v:.0%}). Consider review processes." if v >= 0.1 else "Low AI-generated content rate. Contributions appear human-authored." + ) ) - if v is not None - else "AI detection not available.", },
510-510: Moveyamlimport to module level.Importing
yamlinside the function on every request is inefficient. Move it to the module-level imports.Suggested fix
# At module level (around line 1-7) +import yaml from typing import Any import structlog ... # Remove from inside recommend_rules function - # Step 6: Success—map agent state to the API response model. - final_state = result.data # The agent's execute method returns the final state - - # Generate rules_yaml from recommendations - # RuleRecommendation now includes all required fields directly - import yaml + # Step 6: Success—map agent state to the API response model. + final_state = result.data # The agent's execute method returns the final statesrc/agents/repository_analysis_agent/nodes.py (1)
164-172: Remove or reduce debug logging statements.Lines 165-167 and 172 contain debug logging that may be too verbose for production. Consider removing or changing to
logger.debug().Suggested cleanup
owner, repo_name = repo.split("/") - logger.info("debug_split_success", owner=owner, repo_name=repo_name) - logger.info("debug_client_type", client_type=str(type(github_client))) # Use user_token if provided for authenticated requests (higher rate limits). pr_nodes, pr_warning = await github_client.fetch_pr_hygiene_stats( owner=owner, repo=repo_name, user_token=state.user_token, installation_id=None ) - logger.info("debug_pr_nodes_fetched", count=len(pr_nodes)) + logger.debug("pr_nodes_fetched", count=len(pr_nodes))src/rules/validators.py (1)
168-206: Prefer_matches_anyto avoid over‑matching across directories.
_glob_to_regexmaps*to.*, which matches/and can make patterns likesrc/*.pymatch nested paths. You already have_matches_anywith proper*vs**semantics and caching—using it here improves correctness and consistency.♻️ Proposed refactor
- regex_pattern = FilePatternCondition._glob_to_regex(pattern) - - # Pattern match—performance: optimize if file count high. - matching_files = [file for file in changed_files if re.match(regex_pattern, file)] + # Pattern match—reuse shared glob matcher for correct ** semantics. + matching_files = [file for file in changed_files if _matches_any(file, [pattern])]
📜 Review details
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (16)
src/agents/repository_analysis_agent/agent.pysrc/agents/repository_analysis_agent/models.pysrc/agents/repository_analysis_agent/nodes.pysrc/agents/repository_analysis_agent/prompts.pysrc/api/auth.pysrc/api/dependencies.pysrc/api/errors.pysrc/api/rate_limit.pysrc/api/recommendations.pysrc/api/repos.pysrc/api/rules.pysrc/event_processors/pull_request.pysrc/integrations/github/api.pysrc/main.pysrc/rules/validators.pysrc/tasks/task_queue.py
🚧 Files skipped from review as they are similar to previous changes (1)
- src/api/rules.py
🧰 Additional context used
🧬 Code graph analysis (10)
src/agents/repository_analysis_agent/models.py (1)
src/core/models.py (2)
HygieneMetrics(66-123)repo_full_name(56-58)
src/api/repos.py (3)
src/core/utils/caching.py (1)
get(44-66)src/core/models.py (1)
repo_full_name(56-58)src/integrations/github/api.py (1)
get_repository(126-140)
src/main.py (4)
src/webhooks/handlers/deployment_protection_rule.py (2)
DeploymentProtectionRuleEventHandler(10-33)handle(16-33)src/webhooks/dispatcher.py (1)
register_handler(24-27)src/core/models.py (1)
EventType(24-38)src/tasks/scheduler/deployment_scheduler.py (1)
get_deployment_scheduler(365-370)
src/agents/repository_analysis_agent/agent.py (3)
src/agents/base.py (5)
AgentResult(16-29)BaseAgent(32-119)_build_graph(55-60)execute(114-119)_execute_with_timeout(93-111)src/agents/repository_analysis_agent/models.py (1)
AnalysisState(128-164)src/agents/repository_analysis_agent/nodes.py (1)
fetch_repository_metadata(28-150)
src/api/auth.py (2)
src/integrations/github/api.py (1)
_get_session(262-288)src/core/utils/caching.py (1)
get(44-66)
src/tasks/task_queue.py (2)
src/webhooks/handlers/issue_comment.py (1)
event_type(19-20)tests/unit/tasks/test_queue.py (1)
queue(13-15)
src/api/dependencies.py (2)
src/core/models.py (1)
User(7-18)src/integrations/github/service.py (1)
GitHubService(15-185)
src/event_processors/pull_request.py (2)
src/event_processors/push.py (1)
_format_check_run_output(184-228)src/core/models.py (1)
repo_full_name(56-58)
src/api/recommendations.py (4)
src/api/dependencies.py (1)
get_current_user_optional(23-43)src/api/rate_limit.py (1)
rate_limiter(25-45)src/core/models.py (2)
User(7-18)repo_full_name(56-58)src/integrations/github/api.py (5)
get_repository(126-140)get_git_ref_sha(831-845)create_git_ref(847-883)create_or_update_file(913-961)create_pull_request(963-995)
src/integrations/github/api.py (2)
src/core/errors.py (1)
GitHubGraphQLError(6-11)src/core/models.py (1)
repo_full_name(56-58)
🪛 Gitleaks (8.30.0)
src/integrations/github/api.py
[high] 131-131: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
[high] 147-147: Detected a Generic API Key, potentially exposing access to various services and sensitive operations.
(generic-api-key)
🔇 Additional comments (30)
src/event_processors/pull_request.py (1)
288-288: Context-aware check output is a good enhancement.Passing repo and installation context into the formatter makes the check output more actionable and consistent with the new guidance flow.
src/tasks/task_queue.py (2)
1-10: LGTM!Imports and logger initialization are appropriate for an async task queue with Pydantic models and structured logging.
99-100: LGTM!Module-level singleton is appropriate for application-wide task queue access.
src/api/errors.py (1)
1-22: LGTM!Clean implementation of a standardized error response model. The
retry_afterfield aligns well with the rate limiting functionality being added in this PR, and the optionaldetailsdict provides flexibility for additional error context.src/api/dependencies.py (2)
13-17: LGTM!Simple dependency injection pattern that enables easy mocking in tests. Creating a new instance per request is acceptable given
GitHubServicehas minimal initialization overhead.
46-54: LGTM!Correct implementation of strict authentication enforcement with appropriate 401 status code and clear error message.
src/integrations/github/api.py (1)
1136-1146: GitHub GraphQL API requires authentication for private/protected data; anonymous access to public data is allowed with rate limits.GitHub's GraphQL API supports unauthenticated requests for public repositories and data at a reduced rate limit of 60 requests per hour per IP. The code correctly implements this by returning appropriate headers for anonymous access without an
Authorizationheader. Anonymous requests will not receive a 401 error for public data; they will only fail if accessing private or protected resources.Likely an incorrect or invalid review comment.
src/agents/repository_analysis_agent/agent.py (3)
1-11: LGTM!Clean imports with structured logging via
structlog. The module organization follows the established patterns for LangGraph-based agents.
104-110: LGTM!Good exception handling with structured logging. The comment on Line 108 appropriately documents the intentional broad exception catch for top-level orchestration safety.
87-88: This concern does not apply to the current codebase.AnalysisStateuses Pydantic BaseModel with no explicitConfigDictor extra field configuration, so it defaults to Pydantic v2'sextra='ignore'. Any extra keys from LangGraph will be silently ignored, not raise aValidationError.Likely an incorrect or invalid review comment.
src/agents/repository_analysis_agent/prompts.py (1)
3-45: LGTM!Well-structured prompts with clear issue-to-validator mapping. The system prompt provides actionable guidance for the LLM, and the user prompt template includes all necessary context fields for rule generation.
src/main.py (2)
32-56: LGTM!Good structured logging configuration with JSON rendering for production observability. Silencing noisy HTTP libraries at WARNING level is appropriate.
59-103: LGTM!Well-structured lifespan manager with proper startup (workers, scheduler, handlers) and shutdown (scheduler stop, worker stop) sequences.
src/agents/repository_analysis_agent/models.py (2)
29-46: LGTM!Good use of
model_validatorfor automatic URL normalization. The flexible input handling (eitherrepository_full_nameorrepository_url) is user-friendly.
103-125: LGTM!Excellent documentation in the
PRSignaldocstring explaining the AI Immune System feature and how signals feed intoHygieneMetrics.src/api/recommendations.py (4)
22-63: LGTM!Well-documented Pydantic models with clear attribute descriptions. The
AnalysisResponsemodel provides a clean API contract for the frontend.
394-410: LGTM!Good URL validation with host verification. The explicit check for
github.comandwww.github.comprevents accidental use with other Git hosting providers.
592-640: LGTM!Well-documented authentication priority and comprehensive error handling. The auth validation at Lines 631-640 correctly requires at least one form of authentication with a clear error message.
665-772: LGTM!Robust PR creation flow with proper error handling at each step (repo fetch, SHA lookup, branch creation, file commit, PR creation). The branch existence check at Lines 704-718 handles race conditions gracefully.
src/agents/repository_analysis_agent/nodes.py (6)
17-25: LGTM!Good list of AI detection keywords for heuristic-based detection. The keywords cover common AI tool signatures.
28-150: LGTM!Comprehensive metadata gathering with good error handling around individual file fetches. The language detection map and workflow pattern heuristics are appropriate for this use case.
304-319: LGTM!Good defensive coding for review author extraction at Line 309. The use of
.get("author", {}).get("login", "")safely handles cases where the author may beNone(e.g., deleted GitHub users). This addresses the concern from the previous review.
442-552: LGTM!Well-structured LLM invocation with proper error handling. The fallback rule generation on failure (Lines 543-552) ensures graceful degradation. Good separation of concerns between system and user prompts.
555-609: LGTM!Good implementation of per-rule reasoning generation. The sequential LLM calls for each recommendation provide detailed explanations while the error handling ensures the workflow continues even if reasoning generation fails.
612-677: LGTM!Well-structured analysis report generation with appropriate LLM temperature (0.2) for consistent formatting. The fallback message on failure ensures the state always has a report.
src/rules/validators.py (5)
85-90: Comment-only change—no action needed.
132-147: Comment-only change—no action needed.
223-226: Comment-only change—no action needed.
246-260: Comment-only change—no action needed.
1078-1081: Registry update looks good.
New validator is wired into the registry as expected.
✏️ Tip: You can disable this entire section by setting review_details to false in your review settings.
mentioned_files = []
text = body + " " + title
# Look for file patterns (e.g., "src/file.py", "file.ts", etc.)
file_pattern = r"\b[\w/]+\.(py|ts|js|go|rs|java|rb|php|cpp|c|h|swift|kt|scala)\b"
matches = re.findall(file_pattern, text)
mentioned_files.extend([m[0] if isinstance(m, tuple) else m for m in matches])

# Heuristic 2: Check if PR has linked issue but description is generic/unrelated
closing_issues = pr_node.get("closingIssuesReferences", {})
has_linked_issue = closing_issues.get("totalCount", 0) > 0

# If PR has linked issue but description is very short/generic, potential mismatch
if has_linked_issue and len(body.strip()) < 50:
    issue_diff_mismatch_count += 1
    continue

# If specific files mentioned but don't match changed files, potential mismatch
if mentioned_files and changed_file_names:
    # Check if any mentioned file matches changed files
    matches_changed = any(
        any(
            mf in cf or cf in mf or mf.split("/")[-1] == cf.split("/")[-1]
            for cf in changed_file_names
        )
        for mf in mentioned_files
    )
    if not matches_changed:
        issue_diff_mismatch_count += 1
Regex pattern captures only file extension, not full file path.
The regex at Line 359 uses a capturing group (py|ts|...) which causes re.findall to return only the captured extensions, not the full file paths. The code at Line 361 then treats these extensions as file paths, making the comparison at Lines 375-381 ineffective.
Suggested fix
# Extract file mentions from description
mentioned_files = []
text = body + " " + title
# Look for file patterns (e.g., "src/file.py", "file.ts", etc.)
- file_pattern = r"\b[\w/]+\.(py|ts|js|go|rs|java|rb|php|cpp|c|h|swift|kt|scala)\b"
+ file_pattern = r"\b([\w/]+\.(?:py|ts|js|go|rs|java|rb|php|cpp|c|h|swift|kt|scala))\b"
matches = re.findall(file_pattern, text)
- mentioned_files.extend([m[0] if isinstance(m, tuple) else m for m in matches])
+ mentioned_files.extend(matches)

🤖 Prompt for AI Agents
In `@src/agents/repository_analysis_agent/nodes.py` around lines 356 - 383, The
regex in variable file_pattern and the handling of matches is returning only the
captured extension (due to the capturing group) so mentioned_files contains
extensions instead of full paths; update file_pattern to use a non-capturing
group (e.g., replace "(py|ts|...)" with "(?:py|ts|...)" or otherwise ensure the
full match is returned), call re.findall to get full-path matches, and simplify
the population of mentioned_files by extending with matches directly (remove the
m[0] tuple handling); keep the subsequent comparison logic that uses
mentioned_files and changed_file_names and leave issue_diff_mismatch_count
increments unchanged.
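A quick standalone check of the behavior described above: with a capturing group, `re.findall` returns only the group, while a non-capturing group yields the full match.

```python
import re

text = "Updated src/api/auth.py and added tests in tests/test_auth.py"

capturing = r"\b[\w/]+\.(py|ts|js)\b"
non_capturing = r"\b[\w/]+\.(?:py|ts|js)\b"

print(re.findall(capturing, text))      # ['py', 'py'] -- only the captured extensions
print(re.findall(non_capturing, text))  # ['src/api/auth.py', 'tests/test_auth.py']
```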
raise HTTPException(
    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
    detail=f"Failed to validate token: {error_text}",
)
Error response may leak sensitive internal details.
Including raw error_text from GitHub's response in the client-facing error could expose internal information. Consider using a generic message while logging the details server-side.
🐛 Proposed fix
if response.status != 200:
error_text = await response.text()
logger.error("token_validation_failed", status=response.status, error=error_text)
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
- detail=f"Failed to validate token: {error_text}",
+ detail="Failed to validate token with GitHub API.",
)

📝 Committable suggestion
raise HTTPException(
    status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
    detail=f"Failed to validate token: {error_text}",
)

if response.status != 200:
    error_text = await response.text()
    logger.error("token_validation_failed", status=response.status, error=error_text)
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail="Failed to validate token with GitHub API.",
    )
🤖 Prompt for AI Agents
In `@src/api/auth.py` around lines 64 - 67, The HTTPException currently includes
raw error_text from GitHub in the response; change it to return a generic
client-facing message (e.g., "Failed to validate token") and move the raw
error_text to server-side logs instead: call the existing logger (e.g.,
logger.error or app_logger.error) to record error_text and any context before
raising the HTTPException without the sensitive details, keeping the
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR and referencing the same raise
HTTPException(...) site and the error_text variable for logging.
except Exception as e:
    logger.exception("token_validation_error", error=str(e))
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail=f"Failed to validate token: {str(e)}",
    ) from e
Exception details exposed in error response.
Similar to above, str(e) may contain sensitive information. Use a generic message for the client response.
🐛 Proposed fix
except Exception as e:
logger.exception("token_validation_error", error=str(e))
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
- detail=f"Failed to validate token: {str(e)}",
+ detail="An unexpected error occurred during token validation.",
) from e

🤖 Prompt for AI Agents
In `@src/api/auth.py` around lines 91 - 96, The except block currently returns an
HTTPException with the raw exception string (detail=f"Failed to validate token:
{str(e)}"), which can leak sensitive info; change the HTTPException detail to a
generic message (e.g., "Failed to validate token") while keeping the full
exception logged via logger.exception (reference symbols: logger.exception,
HTTPException, status.HTTP_500_INTERNAL_SERVER_ERROR) so internal logs preserve
details but clients receive no sensitive information.
if user and user.email:
    key = f"user:{user.email}"
    limit = AUTH_LIMIT
else:
    key = f"ip:{request.client.host}"
    limit = ANON_LIMIT
Potential AttributeError if request.client is None.
When running behind certain proxies or in test environments, request.client can be None, causing request.client.host to raise an AttributeError. Additionally, if user.email is None (allowed by the User model), the key becomes "user:None", which would incorrectly share rate limits across all users without emails.
🐛 Proposed fix
if user and user.email:
key = f"user:{user.email}"
limit = AUTH_LIMIT
+ elif user and user.id:
+ key = f"user:{user.id}"
+ limit = AUTH_LIMIT
else:
- key = f"ip:{request.client.host}"
+ client_ip = request.client.host if request.client else "unknown"
+ key = f"ip:{client_ip}"
+ limit = ANON_LIMIT

🤖 Prompt for AI Agents
In `@src/api/rate_limit.py` around lines 27 - 32, The current rate-limit key logic
can raise AttributeError when request.client is None and incorrectly uses
"user:None" when user.email is None; update the branch in the rate limiting code
to only treat requests as authenticated when user and user.email are truthy (use
user.email explicitly), and compute client_host defensively (e.g., client_host =
request.client.host if request.client and getattr(request.client, "host", None)
else request.headers.get("x-forwarded-for", "unknown")), then set key =
f"user:{user.email}" and limit = AUTH_LIMIT only when user.email is present,
otherwise set key = f"ip:{client_host}" and limit = ANON_LIMIT; reference the
variables/user checks and AUTH_LIMIT/ANON_LIMIT in the change.
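A compact sketch of the defensive key selection; the limit constants are placeholders and the `x-forwarded-for` fallback assumes the app runs behind a proxy that sets that header:

```python
from fastapi import Request

AUTH_LIMIT = 20  # placeholder values; the real module defines its own constants
ANON_LIMIT = 5


def resolve_rate_limit_key(request: Request, user) -> tuple[str, int]:
    """Pick a rate-limit bucket without assuming request.client or user.email exist."""
    if user is not None and getattr(user, "email", None):
        return f"user:{user.email}", AUTH_LIMIT
    # request.client can be None behind some proxies and in test clients.
    if request.client and getattr(request.client, "host", None):
        client_host = request.client.host
    else:
        client_host = request.headers.get("x-forwarded-for", "unknown")
    return f"ip:{client_host}", ANON_LIMIT
```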
client_ip = request.client.host if request.client else "unknown"
user_id = user.email if user else "Anonymous"

logger.info("analysis_requested", repo_url=repo_url_str, user_id=user_id, ip=client_ip)
Consider whether logging client IP constitutes PII.
Logging client_ip at Line 451 may be subject to privacy regulations (GDPR, CCPA). If this is a compliance concern, consider hashing or omitting the IP from logs.
🤖 Prompt for AI Agents
In `@src/api/recommendations.py` around lines 448 - 451, The log currently records
raw client IP via request.client.host into client_ip and passes it to
logger.info("analysis_requested", ...), which may be PII; change this by either
omitting the raw IP or replacing it with a non-reversible identifier (e.g., a
salted hash/HMAC using an application secret) before logging. Locate the
client_ip assignment and the logger.info call in recommendations.py (references:
client_ip, request.client.host, user.email, logger.info("analysis_requested"))
and update the code to compute and log a hashed_ip or simply exclude the IP
field based on a config flag (e.g., LOG_RAW_IPS=false), ensuring the hashing
uses a stable secret so the identifier can be correlated without exposing the
plain IP.
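A small sketch of the hashed-IP option; the secret name is illustrative and would come from application settings:

```python
import hashlib
import hmac


def hash_client_ip(ip: str, secret: str) -> str:
    """Return a stable, non-reversible identifier usable for log correlation."""
    digest = hmac.new(secret.encode(), ip.encode(), hashlib.sha256).hexdigest()
    return digest[:16]  # a short prefix is enough to correlate requests


# Hypothetical usage in the endpoint:
# hashed_ip = hash_client_ip(client_ip, settings.LOG_IP_HMAC_SECRET)
# logger.info("analysis_requested", repo_url=repo_url_str, user_id=user_id, ip=hashed_ip)
```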
except Exception as e:
    logger.exception("installation_check_failed", repo=repo_full_name, error=str(e))
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail=f"Failed to check installation: {str(e)}",
    ) from e
Exception details exposed in error response.
Including str(e) in the error detail may leak sensitive internal information to API clients.
🐛 Proposed fix
except Exception as e:
logger.exception("installation_check_failed", repo=repo_full_name, error=str(e))
raise HTTPException(
status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
- detail=f"Failed to check installation: {str(e)}",
+ detail="Failed to check installation status.",
) from e

📝 Committable suggestion
except Exception as e:
    logger.exception("installation_check_failed", repo=repo_full_name, error=str(e))
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail=f"Failed to check installation: {str(e)}",
    ) from e

except Exception as e:
    logger.exception("installation_check_failed", repo=repo_full_name, error=str(e))
    raise HTTPException(
        status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
        detail="Failed to check installation status.",
    ) from e
🤖 Prompt for AI Agents
In `@src/api/repos.py` around lines 40 - 45, The handler currently exposes
internal exception text in the API response; keep detailed info in logs but
return a generic message to clients—remove str(e) from the HTTPException.detail
and replace with a non-sensitive string like "Failed to check installation" (or
similar), while retaining logger.exception("installation_check_failed",
repo=repo_full_name, error=str(e)) or better: keep
logger.exception("installation_check_failed", repo=repo_full_name) so the full
exception is recorded server-side; update the raise HTTPException call
(status.HTTP_500_INTERNAL_SERVER_ERROR) to use the generic detail and do not
include the exception string.
async with session.get(url, headers=headers) as response:
    if response.status == 200:
        data = await response.json()
        return data if isinstance(data, list) else [data]

    # Raise exception for error statuses to avoid silent failures
    response.raise_for_status()
    return []
Unreachable code: return [] after raise_for_status().
When response.status != 200, raise_for_status() will raise an exception, making the subsequent return [] unreachable. This appears to be a logic error—either catch the exception and return [], or remove the unreachable return.
🐛 Proposed fix - consistent with other methods in this class
async with session.get(url, headers=headers) as response:
if response.status == 200:
data = await response.json()
return data if isinstance(data, list) else [data]
-
- # Raise exception for error statuses to avoid silent failures
- response.raise_for_status()
- return []
+ elif response.status == 404:
+ return []
+ else:
+ error_text = await response.text()
+ logger.error(
+ "list_directory_failed",
+ repo=repo_full_name,
+ path=path,
+ status=response.status,
+ error=error_text[:200],
+ )
+ return []

🤖 Prompt for AI Agents
In `@src/integrations/github/api.py` around lines 153 - 160, The code after
response.raise_for_status() is unreachable; update the async GET block (the
"async with session.get(...)" scope that calls response.raise_for_status()) to
either remove the unreachable "return []" or catch the HTTP error and return []
for non-200 responses to match other methods in this class — e.g., wrap
response.raise_for_status() in a try/except for aiohttp.ClientResponseError (or
a broad Exception), logging or swallowing the error and returning [] from the
function instead of leaving a dead return statement.
except httpx.HTTPStatusError as e:
    logger.error(
        "pr_fetch_http_error",
        repo=repo_full_name,
        status_code=e.response.status_code,
        error_type="network_error",
        error=str(e),
    )
    return []
except Exception as e:
    logger.error("pr_fetch_unexpected_error", repo=repo_full_name, error_type="unknown_error", error=str(e))
    return []
Exception type mismatch: catching httpx.HTTPStatusError in aiohttp code.
This method uses aiohttp for HTTP requests but catches httpx.HTTPStatusError, which will never be raised. The aiohttp equivalent is aiohttp.ClientResponseError.
🐛 Proposed fix
- except httpx.HTTPStatusError as e:
+ except aiohttp.ClientResponseError as e:
logger.error(
"pr_fetch_http_error",
repo=repo_full_name,
- status_code=e.response.status_code,
+ status_code=e.status,
error_type="network_error",
error=str(e),
)
return []

📝 Committable suggestion
except httpx.HTTPStatusError as e:
    logger.error(
        "pr_fetch_http_error",
        repo=repo_full_name,
        status_code=e.response.status_code,
        error_type="network_error",
        error=str(e),
    )
    return []
except Exception as e:
    logger.error("pr_fetch_unexpected_error", repo=repo_full_name, error_type="unknown_error", error=str(e))
    return []

except aiohttp.ClientResponseError as e:
    logger.error(
        "pr_fetch_http_error",
        repo=repo_full_name,
        status_code=e.status,
        error_type="network_error",
        error=str(e),
    )
    return []
except Exception as e:
    logger.error("pr_fetch_unexpected_error", repo=repo_full_name, error_type="unknown_error", error=str(e))
    return []
🤖 Prompt for AI Agents
In `@src/integrations/github/api.py` around lines 1098 - 1109, The except block is
catching httpx.HTTPStatusError in aiohttp-based code; change the exception type
to aiohttp.ClientResponseError and log the correct status attribute (use
e.status instead of e.response.status_code) so the logger call in this handler
(where logger.error("pr_fetch_http_error", repo=repo_full_name, status_code=...,
error_type="network_error", error=str(e))) reports the proper aiohttp error
details; keep the existing fallback except Exception block unchanged.
async def validate(self, parameters: dict[str, Any], event: dict[str, Any]) -> bool:
    check_commits = parameters.get("check_commits", True)

    # Get PR data
    pull_request = event.get("pull_request_details", {})
    if not pull_request:
        return True  # No violation if we can't check

    # Check PR description for linked issues
    body = pull_request.get("body", "") or ""
    title = pull_request.get("title", "") or ""

    # Check for closing keywords (closes, fixes, resolves, refs, relates) followed by issue reference
    closing_keywords = ["closes", "fixes", "resolves", "refs", "relates", "addresses"]
    issue_pattern = r"#\d+|(?:https?://)?(?:github\.com/[\w-]+/[\w-]+/)?(?:issues|pull)/\d+"

    # Check description and title
    text_to_check = (body + " " + title).lower()
    has_linked_issue = False

    # Check for closing keywords with issue references
    for keyword in closing_keywords:
        pattern = rf"\b{re.escape(keyword)}\s+{issue_pattern}"
        if re.search(pattern, text_to_check, re.IGNORECASE):
            has_linked_issue = True
            break

    # Also check for standalone issue references (e.g., #123)
    if not has_linked_issue and re.search(issue_pattern, text_to_check):
        has_linked_issue = True

    # Check commit messages if requested
    if not has_linked_issue and check_commits:
        commits = event.get("commits", [])
        if not commits:
            # Try to get from pull_request_details
            commits = pull_request.get("commits", [])

        for commit in commits:
            commit_message = commit.get("message", "") or ""
            if not commit_message:
                continue

            commit_text = commit_message.lower()
            for keyword in closing_keywords:
                pattern = rf"\b{re.escape(keyword)}\s+{issue_pattern}"
                if re.search(pattern, commit_text, re.IGNORECASE):
                    has_linked_issue = True
                    break

            # Check for standalone issue references in commit
            if not has_linked_issue and re.search(issue_pattern, commit_text):
                has_linked_issue = True
                break

            if has_linked_issue:
                break

    logger.debug(
        f"RequireLinkedIssueCondition: PR has linked issue: {has_linked_issue}, checked commits: {check_commits}"
    )
    return has_linked_issue
🧩 Analysis chain
🌐 Web query:
GitHub pull_request webhook payload structure "commits" field type
💡 Result:
In the pull_request webhook event payload, the pull_request object is a “Pull Request” resource, where:
- `pull_request.commits` is an integer (the count of commits in the PR) [1]
- `pull_request.commits_url` is a string (URL to list the PR's commits) [1]
If you’re thinking of a commits array, that’s part of the push event payload (not pull_request), where commits is an array of commit objects [2].
Sources:
[1] PyGithub PullRequest object docs (fields: commits = integer, commits_url = string) (pygithub.readthedocs.io)
[2] GitHub Docs “GitHub event types” (PushEvent payload: commits = array) (docs.github.com)
Citations:
- 1: https://pygithub.readthedocs.io/en/v1.58.1/github_objects/PullRequest.html?utm_source=openai
- 2: https://docs.github.com/developers/webhooks-and-events/events?utm_source=openai
🌐 Web query:
GitHub pull_request webhook event payload commit messages full payload structure
💡 Result:
pull_request webhook event: where are commit messages?
The pull_request webhook payload does not include the PR’s commit list or commit messages. It includes a pull_request object (plus metadata like action, number, sender, repository, etc.). To get commit messages, you must make an API call to list the PR’s commits. [1] [2]
High-level payload shape (common fields)
Depending on the action (e.g., opened, edited, synchronize, closed, assigned, labeled, etc.), the payload includes some extra fields (like assignee, label, changes). The common shape is: [1]
```
{
  "action": "opened|edited|synchronize|closed|...",
  "number": 123,
  "pull_request": {
    "id": 1,
    "number": 123,
    "state": "open|closed",
    "title": "…",
    "body": "…",
    "user": { "login": "…" },
    "head": { "ref": "branch", "sha": "HEAD_SHA", "repo": { "full_name": "owner/repo" } },
    "base": { "ref": "main", "sha": "BASE_SHA", "repo": { "full_name": "owner/repo" } }
    // also includes many other PR fields (URLs, timestamps, flags, counts, etc.)
  },
  "repository": { "full_name": "owner/repo" },
  "sender": { "login": "…" }
  // optionally: "organization", "installation", "enterprise", and action-specific fields
}
```

How to get commit messages for the PR
Call “List commits on a pull request”:
GET /repos/{owner}/{repo}/pulls/{pull_number}/commits
Each item includes `commit.message`. [2]
If you specifically want commit messages in the webhook payload
Use the push webhook event instead; it includes commits[] and each commit includes message. [3]
Sources: GitHub webhook pull_request payload docs [1], GitHub REST “List commits on a pull request” [2], GitHub push event payload (commits[][message]) [3].
Guard commits type before iterating.
event["commits"] does not exist in pull_request webhook events, and pull_request["commits"] is an integer (count of commits), not a list. Attempting to iterate over it will raise TypeError. Add type guards to normalize to a list or skip commit scanning.
🐛 Proposed fix

```diff
-            commits = event.get("commits", [])
+            commits = event.get("commits", [])
+            if not isinstance(commits, list):
+                commits = []
             if not commits:
                 # Try to get from pull_request_details
-                commits = pull_request.get("commits", [])
+                commits = pull_request.get("commits", [])
+                if not isinstance(commits, list):
+                    commits = []
```

🤖 Prompt for AI Agents
In `@src/rules/validators.py` around lines 988-1049: in `validate`, guard and normalize the commits variable before iterating. When obtaining commits from event.get("commits", []) or pull_request.get("commits", []), check isinstance(commits, list) (or an iterable of dicts) and, if not, set commits = [] (or skip commit scanning) to avoid iterating the integer commit count from pull_request; ensure the commit_message loop only runs when commits is a list of dicts with "message" keys.
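As the query results above note, the pull_request payload carries only a commit count, so commit messages have to be fetched separately via the REST endpoint mentioned above. A minimal sketch of that lookup follows; the helper name and token handling are illustrative assumptions, not the project's actual code.

```python
import aiohttp

async def list_pr_commit_messages(repo_full_name: str, pr_number: int, token: str | None = None) -> list[str]:
    """Fetch commit messages for a PR via GET /repos/{owner}/{repo}/pulls/{pull_number}/commits."""
    url = f"https://api.github.com/repos/{repo_full_name}/pulls/{pr_number}/commits"
    headers = {"Accept": "application/vnd.github+json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    async with aiohttp.ClientSession(headers=headers, raise_for_status=True) as session:
        async with session.get(url) as response:
            commits = await response.json()
    # Each list item nests a "commit" object whose "message" field holds the commit message.
    return [c.get("commit", {}).get("message", "") for c in commits]
```

A validator could then scan the returned strings for closing keywords instead of iterating the integer `commits` field from the webhook payload.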
```python
    async def _worker(self) -> None:
        """Background worker loop."""
        while True:
            task = await self.queue.get()
            try:
                # Periodic cleanup
                if (datetime.now() - last_cleanup).total_seconds() > cleanup_interval:
                    self.cleanup_old_tasks()
                    last_cleanup = datetime.now()

                # Find pending tasks
                pending_tasks = [task for task in self.tasks.values() if task.status == TaskStatus.PENDING]

                if pending_tasks:
                    # Process the oldest task
                    task = min(pending_tasks, key=lambda t: t.created_at)
                    await self._process_task(task, worker_name)
                else:
                    # No tasks, wait a bit
                    await asyncio.sleep(1)

                logger.info("task_started", task_id=task.task_id, event_type=task.event_type)
                await task.func(*task.args, **task.kwargs)
                logger.info("task_completed", task_id=task.task_id)
            except Exception as e:
                logger.error(f"Worker {worker_name} error: {e}")
                await asyncio.sleep(5)

        logger.info(f"Worker {worker_name} stopped")

    async def _process_task(self, task: Task, worker_name: str):
        """Process a single task."""
        try:
            task.status = TaskStatus.RUNNING
            task.started_at = datetime.now()

            logger.info(f"Worker {worker_name} processing task {task.id}")

            # Get the appropriate processor
            processor = self._get_processor(task.event_type)
            result = await processor.process(task)

            task.status = TaskStatus.COMPLETED
            task.completed_at = datetime.now()
            task.result = result.__dict__ if hasattr(result, "__dict__") else result

            logger.info(f"Task {task.id} completed successfully")

        except Exception as e:
            task.status = TaskStatus.FAILED
            task.completed_at = datetime.now()
            task.error = str(e)
            logger.error(f"Task {task.id} failed: {e}")

    def cleanup_old_tasks(self, max_age_hours: int = 24):
        """Clean up old completed tasks and their event hashes to prevent memory leaks."""
        cutoff_time = datetime.now().timestamp() - (max_age_hours * 3600)

        # Find old completed tasks
        old_task_ids = [
            task_id
            for task_id, task in self.tasks.items()
            if task.status in [TaskStatus.COMPLETED, TaskStatus.FAILED] and task.created_at.timestamp() < cutoff_time
        ]

        # Remove old tasks and their event hashes
        for task_id in old_task_ids:
            task = self.tasks[task_id]
            if task.event_hash and task.event_hash in self.event_hashes:
                del self.event_hashes[task.event_hash]
            del self.tasks[task_id]

        if old_task_ids:
            logger.info(f"Cleaned up {len(old_task_ids)} old tasks")

    def _get_processor(self, event_type: str):
        """Get the appropriate processor for the event type."""
        from src.event_processors.factory import EventProcessorFactory

        return EventProcessorFactory.create_processor(event_type)


# Global task queue instance
                logger.error("task_failed", task_id=task.task_id, error=str(e), exc_info=True)
            finally:
                self.queue.task_done()
```
Failed tasks are silently discarded.
When task.func raises an exception, the error is logged but the task is permanently lost with no retry or dead-letter mechanism. Combined with the pre-execution deduplication marking, failed tasks cannot be recovered. Consider implementing a bounded retry count or dead-letter queue for failed tasks.
🤖 Prompt for AI Agents
In `@src/tasks/task_queue.py` around lines 68-79: the worker currently logs failures and drops tasks; modify _worker to support bounded retries and a dead-letter queue. Add a retry counter on the Task (e.g., task.retries and task.max_retries) or maintain a separate retry map; on exception, increment task.retries and, if retries < max_retries, re-enqueue the task (optionally with a small backoff), otherwise push the task into a dead-letter queue (e.g., self.dead_letter_queue) for inspection. Ensure queue.task_done() is still called once per dequeue and include task identifiers (task.task_id) when moving to dead-letter so failed work can be recovered or inspected later.
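A minimal sketch of the suggested bounded-retry plus dead-letter approach; the simplified `Task` dataclass, the `dead_letter` list, and the backoff values are stand-ins for the project's actual models, not its real API.

```python
import asyncio
from dataclasses import dataclass, field
from typing import Any, Awaitable, Callable

@dataclass
class Task:
    task_id: str
    func: Callable[..., Awaitable[Any]]
    args: tuple = ()
    kwargs: dict = field(default_factory=dict)
    retries: int = 0
    max_retries: int = 3

class TaskQueue:
    def __init__(self) -> None:
        self.queue: asyncio.Queue[Task] = asyncio.Queue()
        self.dead_letter: list[Task] = []  # failed tasks parked here for later inspection

    async def _worker(self) -> None:
        """Worker loop: retry failed tasks a bounded number of times, then dead-letter them."""
        while True:
            task = await self.queue.get()
            try:
                await task.func(*task.args, **task.kwargs)
            except Exception:
                task.retries += 1
                if task.retries <= task.max_retries:
                    await asyncio.sleep(2**task.retries)  # simple exponential backoff
                    await self.queue.put(task)  # re-enqueue for another attempt
                else:
                    self.dead_letter.append(task)  # keep the failed task for recovery
            finally:
                self.queue.task_done()  # exactly one task_done() per get()
```

In practice the backoff sleep would usually be scheduled without blocking the worker (for example via a delayed re-enqueue), but the shape of the retry bookkeeping stays the same.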
Fixed pre-commit configuration and GitHub Actions workflow for Matas/fix/fix actions. This PR updates Ruff, enforces conventional commit messages, and enhances repository analysis with rate limiting and unauthenticated access for public repos. This improves code quality, governance, and usability.
Architecture & Core Changes
- Added rate limiting to the `/v1/rules/recommend` endpoint, limiting anonymous users to 5 requests per hour and authenticated users to 100 requests per hour (sketched below).
- Added `**kwargs` to agent `execute` methods for MyPy strict mode compatibility.
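As a rough illustration of the kind of in-memory limiter such an endpoint can use, here is a small FastAPI sketch. The thresholds come from the description above; the endpoint body, the key derivation, and the shortcut of treating any Authorization header as authenticated are assumptions made for the sketch, not the actual implementation.

```python
import time
from collections import defaultdict

from fastapi import FastAPI, HTTPException, Request

app = FastAPI()

# Requests allowed per rolling hour (values taken from the PR description).
ANON_LIMIT, AUTH_LIMIT = 5, 100
_hits: dict[str, list[float]] = defaultdict(list)  # key -> request timestamps

def check_rate_limit(key: str, limit: int, window: float = 3600.0) -> None:
    """Drop timestamps older than the window and reject the request once the limit is hit."""
    now = time.monotonic()
    _hits[key] = [t for t in _hits[key] if now - t < window]
    if len(_hits[key]) >= limit:
        raise HTTPException(status_code=429, detail="Rate limit exceeded", headers={"Retry-After": "3600"})
    _hits[key].append(now)

@app.post("/v1/rules/recommend")
async def recommend(request: Request) -> dict:
    # Hypothetical auth check: any Authorization header counts as "authenticated" in this sketch.
    authed = "authorization" in request.headers
    key = request.headers.get("authorization") or (request.client.host if request.client else "anonymous")
    check_rate_limit(key, AUTH_LIMIT if authed else ANON_LIMIT)
    return {"status": "accepted"}
```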
API & Data Changes
- `/v1/rules/recommend` now supports unauthenticated access for public GitHub repositories.
- Added an `AnalysisState` model to `repository_analysis_agent` for shared memory between agent nodes.

Performance & Reliability
Dependencies & Configuration
- Replaced `black`, `isort`, and `flake8` with `ruff` for linting and formatting.
- Updated `pre-commit` hooks to use `ruff` for linting and formatting.
- Updated `pyproject.toml` to use Ruff configuration.
- Added `linting_commands.txt` with useful linting commands.
- Added dependencies: `giturlparse`, `structlog`, `gql[all]`.
New Features
Documentation
Configuration
✏️ Tip: You can customize this high-level summary in your review settings.