A tool that uses several heuristics to try to get the licenses and copyright attributions of the 3rd party dependencies of a repository

DataDog/dd-license-attribution

Datadog License Attribution Tracker

Datadog License Attribution Tracker collects license and copyright information for the third-party dependencies of a project. It returns a list of those dependencies with their licenses and copyright attributions, if found.

As of today, Datadog License Attribution Tracker supports Go, Python, and Node.js projects. It will be extended in the future to support more languages.

The tool collects license and other metadata information using multiple sources, including the GitHub API, pulled source code, the go-pkg list command output, and metadata collected from PyPI and NPM. It supports gathering data from various repositories to generate a comprehensive list of third party dependencies.

Runs may take minutes or hours depending on the size of the project dependency tree and the depth of the scanning.

Getting Started

  1. Install the required dependencies (see the Requirements section below)
  2. Clone this repository
  3. Install the package:
pip install .
  4. Run the tool on a GitHub repository:
dd-license-attribution generate-sbom-csv https://github.com/owner/repo > LICENSE-3rdparty.csv

For more advanced usage, see the sections below.

Available Commands

dd-license-attribution provides two main commands:

  1. generate-sbom-csv - Generate a CSV report (SBOM) of third-party dependencies
  2. generate-overrides - Interactively generate override configuration files

Run dd-license-attribution --help to see all available commands.

Requirements

  • python3.11+ - Python install instructions
  • libmagic (only on macOS):
    • brew install libmagic
  • libicu (only on macOS):
    • brew install icu4c && brew link icu4c --force

Optional Requirements

Usage

Generating SBOM Reports

To install and run the command after cloning the repository:

# Starting at the root of the repository
pip install .

# Optionally, define a GITHUB_TOKEN. When set, it raises the throttling threshold and may speed up the calls made to the GitHub APIs.
export GITHUB_TOKEN=YOUR_TOKEN
dd-license-attribution generate-sbom-csv https://github.com/owner/repo > LICENSE-3rdparty.csv

The following optional parameters are available for generate-sbom-csv:

Scanning Options

Scope Control
  • --only-transitive-dependencies: Extracts license and copyright information only from the dependencies of the passed package, not from the package itself.
  • --only-root-project: Extracts license and copyright information only from the passed package, not from its dependencies.

Strategy Selection
  • --deep-scanning: Enables intensive source code analysis using scancode-toolkit. This will parse license and copyright information from full package source code. Note: This is a resource-intensive task that may take hours or days to process depending on package size.
  • --no-pypi-strategy: Skips the strategy that collects dependencies from PyPI.
  • --no-gopkg-strategy: Skips the strategy that collects dependencies from GoPkg.
  • --no-github-sbom-strategy: Skips the strategy that gets the dependency tree from GitHub.
  • --no-npm-strategy: Skips the strategy that collects dependencies from NPM.
  • --no-scancode-strategy: Skips the strategy that gets licenses and copyright attribution using ScanCode Toolkit.

Cache Configuration

  • --cache-dir: When a directory is passed, the dependency source code downloaded for analysis is kept in that directory and can be reused between runs. By default, nothing is reused between runs.
  • --cache-ttl: Number of seconds until cached data is considered expired. Defaults to 1 day.

For more details about the optional parameters, pass --help to the command.

Output Format

The tool generates a CSV file with the following columns:

  • Component: The name of the dependency
  • Origin: The source URL of the dependency
  • License: The detected license(s)
  • Copyright: Copyright attribution(s) if found

Example output:

Component,Origin,License,Copyright
aiohttp,https://github.com/aio-libs/aiohttp,Apache-2.0,"aio-libs"
requests,https://github.com/psf/requests,Apache-2.0,"Kenneth Reitz"
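
Because the report is plain CSV with the four columns listed above, it can be post-processed with standard tooling. The following is a minimal sketch that lists rows with an empty License or Copyright column. It assumes missing values appear as empty fields, which may not match every report, and the `example-pkg` row is made up for illustration.

```python
import csv
import io


def find_incomplete_rows(csv_text: str) -> list[dict[str, str]]:
    """Return the rows whose License or Copyright column is empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row
        for row in reader
        if not (row.get("License") or "").strip()
        or not (row.get("Copyright") or "").strip()
    ]


# The second data row is a hypothetical entry with no license or copyright.
sample = '''Component,Origin,License,Copyright
aiohttp,https://github.com/aio-libs/aiohttp,Apache-2.0,"aio-libs"
example-pkg,https://github.com/example/example-pkg,,""
'''

for row in find_incomplete_rows(sample):
    print(row["Component"], row["Origin"])
```

To check a real report, read LICENSE-3rdparty.csv from disk and pass its contents to `find_incomplete_rows`.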

Output string configuration

The file src/dd_license_attribution/config/string_formatting_config.py can be customized to control how the "Copyright" column of the output is formatted. It lists strings that often follow a comma (like the Inc in "Datadog, Inc.") and that should be treated as exceptions when the copyright string is split on commas.
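
To make the idea concrete, here is a small sketch of comma splitting with suffix exceptions. The exception list and the function are hypothetical and written only for this example; the tool's actual list and logic live in string_formatting_config.py and may differ.

```python
# Hypothetical exception list, for illustration only.
SUFFIX_EXCEPTIONS = frozenset({"Inc", "Inc.", "LLC", "Ltd."})


def split_copyright_holders(
    text: str, exceptions: frozenset[str] = SUFFIX_EXCEPTIONS
) -> list[str]:
    """Split on commas, but glue known suffixes back onto the previous holder."""
    holders: list[str] = []
    for part in (piece.strip() for piece in text.split(",")):
        if not part:
            continue
        if holders and part in exceptions:
            # "Inc." belongs to the holder before the comma, not to a new one.
            holders[-1] = f"{holders[-1]}, {part}"
        else:
            holders.append(part)
    return holders


print(split_copyright_holders("Datadog, Inc., Example Author"))
# prints ['Datadog, Inc.', 'Example Author']
```

Without the exception list, the same input would be split into three holders, with "Inc." wrongly reported as its own copyright owner.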

Manual repository override configuration

In some cases, the code we want to scan is not in the main branch of a GitHub repository, or we do not have access to it. This happens, for example, when reviewing a PR, preparing one on a local machine, or evaluating alternative dependency sources. In those cases, we want to replace what is scanned for a particular GitHub URL.

To do so, create a JSON file that maps full repositories to mirror repositories and, optionally, remaps internal references, for example to use a PR branch in place of the main branch.

  • --use-mirrors: Path to a JSON file containing mirror specifications for repositories. This is useful when you need to use alternative repository URLs to fetch source code. The JSON file should contain an array of mirror configurations, where each configuration has:
    • original_url: The original repository URL
    • mirror_url: The URL of the mirror repository
    • ref_mapping (optional): A mapping of references between the original and mirror repositories

Example mirror configuration file:

[
    {
        "original_url": "https://github.com/DataDog/test",
        "mirror_url": "https://github.com/mirror/test",
        "ref_mapping": {
            "branch:main": "branch:development",
            "tag:v1.0": "branch:development"
        }
    }
]

Note: Currently, only branch-to-branch mapping is supported. The mirror URLs must also be GitHub repositories.
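
Before passing a mirror file to --use-mirrors, it can help to check its shape. The sketch below validates only the structure described above (an array of objects with original_url, mirror_url, and an optional ref_mapping). The function name and the `https://github.com/` prefix test are assumptions for this example, not the tool's own validation.

```python
import json

REQUIRED_KEYS = {"original_url", "mirror_url"}
ALLOWED_KEYS = REQUIRED_KEYS | {"ref_mapping"}
# Assumption: a GitHub repository URL starts with this prefix.
GITHUB_PREFIX = "https://github.com/"


def validate_mirrors(raw_json: str) -> list[str]:
    """Return a list of problems found; an empty list means the shape looks fine."""
    entries = json.loads(raw_json)
    if not isinstance(entries, list):
        return ["top-level value must be an array"]
    problems: list[str] = []
    for index, entry in enumerate(entries):
        if not isinstance(entry, dict):
            problems.append(f"entry {index}: must be an object")
            continue
        missing = sorted(REQUIRED_KEYS - set(entry))
        if missing:
            problems.append(f"entry {index}: missing {', '.join(missing)}")
        unknown = sorted(set(entry) - ALLOWED_KEYS)
        if unknown:
            problems.append(f"entry {index}: unknown key {', '.join(unknown)}")
        mirror_url = entry.get("mirror_url")
        if isinstance(mirror_url, str) and not mirror_url.startswith(GITHUB_PREFIX):
            problems.append(f"entry {index}: mirror_url is not a GitHub URL")
    return problems


example = """
[
    {
        "original_url": "https://github.com/DataDog/test",
        "mirror_url": "https://github.com/mirror/test",
        "ref_mapping": {"branch:main": "branch:development"}
    }
]
"""
print(validate_mirrors(example))
```

An empty list means no structural problem was found; it does not guarantee that the referenced repositories or branches exist.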

Override Configuration

Sometimes dd-license-attribution may not detect all dependencies correctly, or the detected license information may be inaccurate. For these cases, you can provide an override configuration file to:

  • Fix incorrect license information detected by automated tools
  • Add related dependencies that weren't automatically discovered
  • Remove false positives from your dependency report
  • Update copyright information when the detected data is wrong

Creating Overrides Interactively (Recommended)

The easiest way to create overrides is using the interactive generate-overrides command:

# Generate the SBOM first
dd-license-attribution generate-sbom-csv https://github.com/owner/repo > LICENSE-3rdparty.csv

# Interactively fix entries with missing information
dd-license-attribution generate-overrides LICENSE-3rdparty.csv

# Regenerate with overrides applied
dd-license-attribution generate-sbom-csv https://github.com/owner/repo --override-spec .ddla-overrides > LICENSE-3rdparty.csv

The generate-overrides command will:

  • Analyze your CSV file for entries with missing license or copyright
  • Prompt you interactively to provide the correct information
  • Generate a properly formatted .ddla-overrides file

Options:

  • --output or -o: Specify custom output file location
  • --only-license: Only fix entries with missing license information
  • --only-copyright: Only fix entries with missing copyright information

Creating Overrides Manually

Alternatively, you can manually create an override configuration file:

Quick Example:

[
  {
    "override_type": "replace",
    "target": {"component": "package-name"},
    "replacement": {
      "name": "package-name",
      "license": ["MIT"],
      "copyright": ["Copyright 2024 Author"]
    }
  }
]

Then use it with the --override-spec parameter:

dd-license-attribution generate-sbom-csv --override-spec .ddla-overrides https://github.com/your-org/your-project
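
When several packages need the same kind of fix, the override file can be generated rather than typed by hand. This sketch builds entries with the same keys as the quick example above and renders them as JSON. The helper names are hypothetical, and only the "replace" override type shown in this README is covered; other types are described in the Override Configuration Guide.

```python
import json


def make_replace_override(
    component: str, licenses: list[str], copyrights: list[str]
) -> dict:
    """Build one 'replace' entry shaped like the quick example."""
    return {
        "override_type": "replace",
        "target": {"component": component},
        "replacement": {
            "name": component,
            "license": licenses,
            "copyright": copyrights,
        },
    }


def render_overrides(overrides: list[dict]) -> str:
    """Serialize the entries as a JSON array, ready to be saved to a file."""
    return json.dumps(overrides, indent=2) + "\n"


overrides = [
    make_replace_override("package-name", ["MIT"], ["Copyright 2024 Author"]),
]
print(render_overrides(overrides))
```

Saving the rendered text as .ddla-overrides (for example with `pathlib.Path(".ddla-overrides").write_text(...)`) yields a file that can be passed to --override-spec.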

📖 For complete documentation, examples, and best practices, see the Override Configuration Guide.

Recommendation: When using overrides, consider creating a PR or feature request to improve dd-license-attribution or the target dependency to add missing information upstream. Overrides should ideally be a temporary measure.

Common Use Cases

Basic License Attribution

dd-license-attribution generate-sbom-csv https://github.com/owner/repo > LICENSE-3rdparty.csv

Deep Scanning with Caching

dd-license-attribution generate-sbom-csv --deep-scanning --cache-dir ./cache https://github.com/owner/repo > LICENSE-3rdparty.csv

Working with Private Repositories

export GITHUB_TOKEN=your_token
dd-license-attribution generate-sbom-csv https://github.com/owner/private-repo > LICENSE-3rdparty.csv

Using Mirror Repositories

# Create mirrors.json with your mirror configurations
dd-license-attribution generate-sbom-csv --use-mirrors=mirrors.json https://github.com/owner/repo > LICENSE-3rdparty.csv

Interactive Override Generation

# Step 1: Generate initial SBOM
dd-license-attribution generate-sbom-csv https://github.com/owner/repo > LICENSE-3rdparty.csv

# Step 2: Fix entries with missing information interactively
dd-license-attribution generate-overrides LICENSE-3rdparty.csv

# Step 3: Regenerate with overrides
dd-license-attribution generate-sbom-csv --override-spec .ddla-overrides https://github.com/owner/repo > LICENSE-3rdparty.csv

Development and Contributing

For instructions on how to develop or contribute to the project, read our CONTRIBUTING.md guidelines.

Current Development State

  • The initial set of dependencies is collected via the GitHub SBOM API, gopkg listing, and PyPI.
  • Action packages are ignored.
  • PyPI metadata collection is limited to pure Python projects. If there are native dependencies or requirements that are not on PyPI, failures are expected. The PyPI strategy can be disabled in those cases with --no-pypi-strategy, but doing so reduces the coverage of the tool.
