Collect a repos network dependencies #9

Open
mrthankyou opened this issue Feb 4, 2021 · 2 comments
Comments

@mrthankyou
Contributor

I've been researching potential queries that target misconfigured libraries. GitHub has a "Network Dependencies" feature that lists the repos using a given library. This is spectacular when a query targets a particular use of a library. If we can figure out how to write a GitHub API call that collects these dependent repos, we would get more positive results.

Note that I have not yet researched whether GitHub offers an API for this.

@mrthankyou
Contributor Author

mrthankyou commented Feb 22, 2021

GitHub doesn't have a dedicated API for this; however, I have found several tools that allow us to query for a repo's dependents. I'll investigate these tools to see what I can extract from them.

https://github.com/github-tooling/ghtopdep

It may also be worth pointing out that there is an npm package dedicated to gathering npm dependents. Although this ticket is about GitHub dependents, I thought it was worth mentioning.

@mrthankyou
Contributor Author

I have created a working script (branched off of #14) using ghtopdep and it works pretty well. We can grab the repositories that depend on a particular library hosted on GitHub. This is EXTREMELY helpful when you want to find potential CVEs caused by misconfigured libraries. For example...

# Template
# python3 follow_network_dependency_repos.py <GITHUB_LIBRARY_REPO_URL> <CUSTOM_LIST_NAME>

# This will find all repositories (with a minimum of 5 stars) that use the Electron Remote library. 
# We then cache the results so we can later move the repositories to the `remote-cache` LGTM custom list. 
python3 follow_network_dependency_repos.py https://github.com/electron/remote remote-cache

We can also filter the dependent repositories by a search term or by the number of stars a repo has. For now I've decided to filter only by the number of stars.
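As a rough illustration of the filtering described above, here is a minimal sketch that filters a list of dependents after it has been parsed from JSON. The `filter_dependents` helper is hypothetical and not part of the script. It assumes each entry has a `url` key (which the script relies on) and a `stars` key (an assumption about the ghtopdep output), and it matches the search term against the repo URL only, as a simplification.

```python
from typing import Dict, List, Optional


def filter_dependents(
    repos: List[Dict],
    min_stars: int = 5,
    search_term: Optional[str] = None,
) -> List[Dict]:
    """Keep repos with at least `min_stars` stars whose URL contains `search_term`.

    Hypothetical helper: the "stars" key is an assumed field name, and
    matching on the URL is a simplification of a real search filter.
    """
    kept = []
    for repo in repos:
        # Treat a missing star count as zero so the entry is filtered out.
        if repo.get("stars", 0) < min_stars:
            continue
        if search_term is not None and search_term.lower() not in repo["url"].lower():
            continue
        kept.append(repo)
    return kept


if __name__ == "__main__":
    sample = [
        {"url": "https://github.com/a/electron-app", "stars": 10},
        {"url": "https://github.com/b/tool", "stars": 3},
    ]
    print(filter_dependents(sample, min_stars=5))
```

Filtering after parsing keeps the ghtopdep call unchanged, at the cost of fetching entries that are then discarded.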

Finally, as a sneak peek, I've attached the Python script I wrote. If all of this sounds good to you, I'll submit it as a PR once #14 is merged in, since it relies on the code in #14.

Any thoughts are appreciated here.

Python script
from typing import List
from lgtm import LGTMSite, LGTMDataFilters

import utils.cacher
import utils.github_api

import sys
import time
import subprocess
import json

def save_project_to_lgtm(site: 'LGTMSite', repo_name: str) -> dict:
    print("About to save: " + repo_name)
    # Another throttle. Considering we are sending a request to Github
    # owned properties twice in a small time-frame, I would prefer for
    # this to be here.
    time.sleep(1)

    repo_url: str = 'https://github.com/' + repo_name
    project = site.follow_repository(repo_url)
    print("Saved the project: " + repo_name)
    return project

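def run_command(command: str) -> str:
    # Decode the bytes rather than calling str() on them, which would
    # return the "b'...'" repr with escaped control characters.
    result = subprocess.check_output(command, shell=True)
    return result.decode("utf-8", errors="replace")

def format_ghtopdep_output(output: str) -> str:
    # ghtopdep prints progress text ending in "repositories" plus a carriage
    # return before the JSON array; keep only the first "[" to the last "]".
    json_text = output.split("repositories\r")[-1]
    start, end = json_text.find("["), json_text.rfind("]")
    if start == -1 or end == -1:
        raise ValueError("No JSON array found in ghtopdep output")
    return json_text[start:end + 1]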

def get_network_dependency_graph_repos(repo_url: str) -> List[dict]:
    ghtopdep_command = f"ghtopdep {repo_url} --json --minstar=5 --rows=10000"
    raw_output = run_command(ghtopdep_command)
    formatted_output = format_ghtopdep_output(raw_output)
    repos = json.loads(formatted_output)
    return repos

def find_and_save_projects_to_lgtm(repo_library_url: str):
    repos = get_network_dependency_graph_repos(repo_library_url)
    saved_project_data: List[str] = []
    site = LGTMSite.create_from_file()

    github = utils.github_api.create()

    for repo in repos:
        repo_name = repo['url'].split("https://github.com/")[1]
        time.sleep(2)
        github_repo = github.get_repo(repo_name)

        if github_repo.archived or github_repo.fork:
            continue

        saved_project = save_project_to_lgtm(site, github_repo.full_name)
        time.sleep(2)

        simple_project = LGTMDataFilters.build_simple_project(saved_project)

        if not simple_project.is_valid_project:
            continue

        saved_data = f'{simple_project.display_name},{simple_project.key},{simple_project.project_type}'
        saved_project_data.append(saved_data)

    return saved_project_data


ghtopdep_help_output = run_command("ghtopdep --help")

if "Usage: ghtopdep [OPTIONS] URL" not in ghtopdep_help_output:
    print("ghtopdep is required to run this script. Please see the ghtopdep repository for installation instructions: https://github.com/github-tooling/ghtopdep")
    sys.exit(1)

repo_library_url = sys.argv[1]
saved_project_data = find_and_save_projects_to_lgtm(repo_library_url)

# If the user provided a second arg then they want to create a custom list.
if len(sys.argv) <= 2:
    sys.exit(0)

custom_list_name = sys.argv[2]
utils.cacher.write_project_data_to_file(saved_project_data, custom_list_name)
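
One possible hardening of the ghtopdep call, sketched here under assumptions: passing the arguments as a list avoids `shell=True`, so a repo URL containing shell metacharacters is never interpreted by a shell. The helper names `build_ghtopdep_args` and `run_ghtopdep` are hypothetical. The flags are the ones already used in the script.

```python
import subprocess
from typing import List


def build_ghtopdep_args(repo_url: str, min_stars: int = 5, rows: int = 10000) -> List[str]:
    """Build the ghtopdep argument list with the flags the script already uses."""
    return ["ghtopdep", repo_url, "--json", f"--minstar={min_stars}", f"--rows={rows}"]


def run_ghtopdep(repo_url: str) -> str:
    """Run ghtopdep without a shell and return its decoded stdout.

    Raises FileNotFoundError if ghtopdep is not installed and
    subprocess.CalledProcessError if it exits with a non-zero status.
    """
    completed = subprocess.run(
        build_ghtopdep_args(repo_url),
        check=True,
        capture_output=True,
    )
    # Decode manually so carriage returns in the progress output are preserved.
    return completed.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    print(build_ghtopdep_args("https://github.com/electron/remote"))
```

With this approach a missing ghtopdep install surfaces as a `FileNotFoundError`, which could stand in for the `--help` check at the top of the script.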
