Skip to content

Conversation

@FelipeFMA
Copy link

@FelipeFMA FelipeFMA commented Dec 1, 2025

/claim #634

Fixes Issue

Closes #634

Changes proposed

This PR addresses a resource leak in GitPolicyFetcher where repository objects were remaining in the internal repos cache even after a fetch operation failed (e.g., due to network issues or HTTP 500 errors from the git provider). This behavior could lead to "zombie" processes or lingering file descriptors/symbolic links as described in the issue.

Specific changes:

  • Modified packages/opal-server/opal_server/git_fetcher.py:
    • Wrapped the fetch operation within fetch_and_notify_on_changes in a try...except block.
    • Added logic to explicitly remove the repository from the GitPolicyFetcher.repos dictionary if an exception occurs during fetching.
    • This ensures that broken or unreachable repositories do not persist in memory/cache, allowing for a clean retry on the next cycle.

Check List (Check all the applicable boxes)

  • I sign off on contributing this submission to open-source
  • My code follows the code style of this project.
  • My change requires changes to the documentation.
  • I have updated the documentation accordingly.
  • All new and existing tests passed.
  • This PR does not contain plagiarized content.
  • The title of my pull request is a short description of the requested changes.

Video

opal.fix.mp4

Note to reviewers

I verified this fix using a reproduction script that simulates a remote git failure (HTTP 500). Without the fix, the repository object remained in the cache. With the fix, the repository is correctly removed from the cache.

import asyncio
import os
import shutil
from pathlib import Path
import logging
import sys
import git

# Add packages to path
sys.path.append(str(Path(__file__).parent / "packages/opal-server"))
sys.path.append(str(Path(__file__).parent / "packages/opal-common"))

from opal_server.git_fetcher import GitPolicyFetcher, PolicyFetcherCallbacks
from opal_common.schemas.policy_source import GitPolicyScopeSource, NoAuthData

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class MockCallbacks(PolicyFetcherCallbacks):
    async def on_update(self, old_head, head):
        pass

async def run_test():
    base_dir = Path("./temp_test_opal")
    if base_dir.exists():
        shutil.rmtree(base_dir)
    base_dir.mkdir()

    # Create a dummy local repo to serve as remote
    remote_repo_dir = Path("./temp_remote_repo")
    if remote_repo_dir.exists():
        shutil.rmtree(remote_repo_dir)
    remote_repo_dir.mkdir()
    
    # Initialize repo
    r = git.Repo.init(remote_repo_dir)
    # Create a commit
    (remote_repo_dir / "policy.rego").write_text("package system")
    r.index.add(["policy.rego"])
    r.index.commit("Initial commit")

    # Use the local repo as source
    source = GitPolicyScopeSource(
        source_type="GIT",
        url=str(remote_repo_dir.absolute()),
        branch="master",
        directories=["."],
        auth=NoAuthData()
    )

    fetcher = GitPolicyFetcher(
        base_dir=base_dir,
        scope_id="test_scope",
        source=source,
        callbacks=MockCallbacks()
    )

    # First fetch should succeed
    logger.info("Performing initial fetch...")
    await fetcher.fetch_and_notify_on_changes(force_fetch=True)
    logger.info("Initial fetch done.")

    # Now change the remote URL in the local repo config to a bad HTTP URL
    # We use GitPython to modify the config
    repo_path = fetcher._repo_path
    repo = git.Repo(str(repo_path))
    bad_url = "https://httpstat.us/500/repo.git"
    repo.remotes.origin.set_url(bad_url)
    
    # Update fetcher source URL to match, so verification passes
    fetcher._source.url = bad_url

    # Initial FD count
    initial_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
    logger.info(f"Initial FDs: {initial_fds}")

    repo_path_str = str(fetcher._repo_path)

    for i in range(20):
        # Check cache status
        # Note: We can't easily access GitPolicyFetcher.repos from here if it's not exposed, 
        # but we imported the class so we can access the static attribute.
        
        try:
            await fetcher.fetch_and_notify_on_changes(force_fetch=True)
        except Exception as e:
            # logger.info(f"Fetch failed as expected: {e}")
            pass
        
        if repo_path_str in GitPolicyFetcher.repos:
             logger.warning("Repo still in cache!")
        else:
             logger.info("Repo removed from cache.")

        current_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
        logger.info(f"Iteration {i+1}: FDs: {current_fds}")

    final_fds = len(os.listdir(f"/proc/{os.getpid()}/fd"))
    logger.info(f"Final FDs: {final_fds}")

    if final_fds > initial_fds + 2: # Allow small fluctuation
        logger.error(f"FD leak detected! {final_fds - initial_fds} leaked FDs.")
        exit(1)
    else:
        logger.info("No FD leak detected.")

if __name__ == "__main__":
    asyncio.run(run_test())

@netlify
Copy link

netlify bot commented Dec 1, 2025

Deploy Preview for opal-docs canceled.

Name Link
🔨 Latest commit 35d3855
🔍 Latest deploy log https://app.netlify.com/projects/opal-docs/deploys/692e04845edc360008d0c126

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

OPAL Server doesn't clean up symbolic links when github is down

1 participant