Collect a repo's network dependencies #9
Comments
GitHub doesn't have a dedicated API for this; however, I have found several tools that allow us to query for repo dependents. I'll investigate these tools to see what I can extract from them. https://github.com/github-tooling/ghtopdep
It may also be worth pointing out that there is an npm package dedicated to gathering npm dependents. Although this ticket is meant for GitHub dependents, I thought it worth mentioning.
I have created a working script (branched off of #14) using ghtopdep, and it works pretty well. We can grab the repositories that depend on a particular GitHub-hosted library. This is EXTREMELY helpful when you want to find potential CVEs caused by misconfigured libraries. For example...
Also, we can filter the repositories that depend on the library by a search term or by the number of stars each repo has. For now I've decided to filter only on the number of stars. Finally, as a sneak peek, I've attached the Python script I wrote. If all of this sounds good to you, I'll submit it as a PR once #14 is merged in, since it relies on the code in #14. Any thoughts are appreciated here.
Python script
from typing import List
from lgtm import LGTMSite, LGTMDataFilters
import utils.cacher
import utils.github_api
import sys
import time
import subprocess
import json

def save_project_to_lgtm(site: 'LGTMSite', repo_name: str) -> dict:
    print("About to save: " + repo_name)
    # Another throttle. Considering we are sending a request to Github
    # owned properties twice in a small time-frame, I would prefer for
    # this to be here.
    time.sleep(1)
    repo_url: str = 'https://github.com/' + repo_name
    project = site.follow_repository(repo_url)
    print("Saved the project: " + repo_name)
    return project
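
def run_command(command: str) -> str:
    result = subprocess.check_output(command, shell=True)
    # Note: str() on bytes yields the repr (b'...') with escaped control
    # characters; format_ghtopdep_output relies on that escaped form.
    return str(result)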
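
def format_ghtopdep_output(output: str) -> str:
    # Keep only what follows the "... repositories" progress line. The "\\r"
    # is a literal backslash-r because the input is the repr of a bytes object.
    formatted_output = output.split("repositories\\r")[-1].strip()
    output_size = len(formatted_output)
    # Drop the last five characters of the repr, presumably the escaped
    # line ending and closing quote (\r\n').
    return formatted_output[:output_size - 5]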

def get_network_dependency_graph_repos(repo_url: str) -> List[dict]:
    ghtopdep_command = f"ghtopdep {repo_url} --json --minstar=5 --rows=10000"
    raw_output = run_command(ghtopdep_command)
    formatted_output = format_ghtopdep_output(raw_output)
    repos = json.loads(formatted_output)
    return repos

def find_and_save_projects_to_lgtm(repo_library_url: str) -> List[str]:
    repos = get_network_dependency_graph_repos(repo_library_url)
    saved_project_data: List[str] = []
    site = LGTMSite.create_from_file()
    github = utils.github_api.create()

    for repo in repos:
        repo_name = repo['url'].split("https://github.com/")[1]
        # Throttle between GitHub API requests.
        time.sleep(2)
        github_repo = github.get_repo(repo_name)
        # Skip archived repositories and forks.
        if github_repo.archived or github_repo.fork:
            continue

        saved_project = save_project_to_lgtm(site, github_repo.full_name)
        time.sleep(2)

        simple_project = LGTMDataFilters.build_simple_project(saved_project)
        if not simple_project.is_valid_project:
            continue

        saved_data = f'{simple_project.display_name},{simple_project.key},{simple_project.project_type}'
        saved_project_data.append(saved_data)

    return saved_project_data

ghtopdep_help_output = run_command("ghtopdep --help")
if "Usage: ghtopdep [OPTIONS] URL" not in ghtopdep_help_output:
    print("ghtopdep is required to run this script. Please see the ghtopdep repository for installation instructions: https://github.com/github-tooling/ghtopdep")
    sys.exit(1)

repo_library_url = sys.argv[1]
saved_project_data = find_and_save_projects_to_lgtm(repo_library_url)

# If the user provided a second arg then they want to create a custom list.
if len(sys.argv) <= 2:
    sys.exit(0)

custom_list_name = sys.argv[2]
utils.cacher.write_project_data_to_file(saved_project_data, custom_list_name)
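One fragile spot in the script above is that it recovers the JSON by slicing the repr of the raw bytes at fixed offsets. A possible alternative, offered only as a rough sketch, is to decode the output to text and scan for the first parseable JSON array, then apply the star filter in Python. The helper names `extract_json_array` and `filter_dependents` are hypothetical. The `stars` key is an assumption about the ghtopdep JSON shape (only `url` is confirmed by the script above), and matching the search term against the URL is a simplification of this sketch rather than how ghtopdep itself searches.

```python
import json
from typing import Dict, List, Optional


def extract_json_array(raw_output: str) -> List[Dict]:
    """Return the first JSON array found in raw_output, skipping other text."""
    decoder = json.JSONDecoder()
    start = raw_output.find("[")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores anything that follows it.
            parsed, _ = decoder.raw_decode(raw_output, start)
            return parsed
        except json.JSONDecodeError:
            # This "[" was not the start of valid JSON (e.g. a progress bar).
            start = raw_output.find("[", start + 1)
    raise ValueError("no JSON array found in output")


def filter_dependents(repos: List[Dict], min_stars: int = 0,
                      search_term: Optional[str] = None) -> List[Dict]:
    """Keep repos with at least min_stars whose URL contains search_term."""
    selected = []
    for repo in repos:
        # "stars" is an assumed key; repos without it count as zero stars.
        if repo.get("stars", 0) < min_stars:
            continue
        if search_term and search_term.lower() not in repo["url"].lower():
            continue
        selected.append(repo)
    return selected
```

With something like this, `run_command` could return `result.decode()` instead of `str(result)`, and `get_network_dependency_graph_repos` could call `extract_json_array` directly instead of `format_ghtopdep_output`.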
I've been researching potential queries that target misconfigurations of libraries. One feature GitHub has is the "Network Dependencies" feature, which provides a list of repos that use a given library. This is spectacular when a query targets a particular use of a library. If we can figure out how to write a GitHub API call that collects these network dependency repos, we would get more positive results.
It should be noted that I have done zero research into whether GitHub offers such an API.