Distinguish between ABOUT files found in the codebase vs. curations #834

Open
pombredanne opened this issue Jul 31, 2023 · 11 comments

@pombredanne
Member

In the d2d pipeline, the "curations" are ABOUT files designed to document code on the "To" side, and they are stored on the "From" side.

There could also be "regular" ABOUT files in any of the codebases, created by the maintainers of the repo to document code that lives alongside these ABOUT files, as is done in SCTK or SCIO.

These are different!!!

@pombredanne
Member Author

@AyanSinhaMahapatra ping... this needs design.

@tdruez
Contributor

tdruez commented Jul 31, 2023

These are different!!!

How do we know what type (curations/regular) we are dealing with?

@AyanSinhaMahapatra
Member

How do we know what type (curations/regular) we are dealing with?

From @pombredanne this could be either:

  1. some different extension from .ABOUT which has the curations
  2. or we introduce a new attribute in the ABOUT spec to distinguish the two

I think the latter is much easier and cleaner; maybe something like an is_curated attribute could work. Then we can distinguish these cases.

@pombredanne
Member Author

Another consideration is that a curation may be for the current directory or it could be for a target deployed file. We should have a way to track these.
An idea:

  • we use a flag is_curation set to yes or no
  • the about_resource: should be for the current local resource (e.g. on the development From/ Side)
  • we add a new path that would be the "deployed_resource" or something along these lines
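A minimal sketch of how a tool might read these three fields, assuming the ABOUT file has already been parsed into a dict. The function name, the "from"/"to" side labels, and the sample paths are hypothetical and only illustrate the idea; they are not an existing ScanCode.io or AboutCode Toolkit API.

```python
def target_resources(about):
    """
    Return a list of (side, path_or_pattern) tuples documented by an
    ABOUT record given as a dict.

    Assumptions for this sketch (not an existing API):
    - about_resource always refers to the local, development ("from") side
    - deployed_resource is only honored when is_curation is set to yes
    """
    targets = [("from", about["about_resource"])]

    # A YAML parser may load "yes" as a boolean, so accept both forms.
    flag = str(about.get("is_curation", "no")).lower()
    is_curation = flag in ("yes", "true")

    deployed = about.get("deployed_resource")
    if is_curation and deployed:
        targets.append(("to", deployed))
    return targets


# Hypothetical sample records
curation = {
    "about_resource": "src/glowroot",
    "is_curation": "yes",
    "deployed_resource": "lib/glowroot.jar",
}
regular = {"about_resource": "src/glowroot"}

print(target_resources(curation))
print(target_resources(regular))
```

With this shape, a regular ABOUT file without the flag keeps its current meaning, and only flagged files contribute a deployed target.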

AyanSinhaMahapatra self-assigned this Sep 25, 2023
AyanSinhaMahapatra added a commit to aboutcode-org/aboutcode-toolkit that referenced this issue Sep 27, 2023
Adds two new attributes:
- deployed_resource
- is_curated
These are used to curate resources in deployed code.

Reference: aboutcode-org/scancode.io#834
Signed-off-by: Ayan Sinha Mahapatra <[email protected]>
@pombredanne
Member Author

Here are a few more thoughts on this issue:

about_resource: '*/glowroot/*'
ignored_resources:
  - '*/glowroot/plugins/*'
name: glowroot
version: 0.13.6

We can now include and exclude patterns using about_resource and ignored_resources. This is not yet in AboutCode Toolkit, BUT it is already used in ScanCode.io.
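To illustrate the include/exclude idea, here is a small sketch that applies the two patterns above to a few paths. It assumes shell-style glob semantics from the Python standard library fnmatch module, where * also matches path separators; the actual matching rules used by ScanCode.io may differ. The function name and sample paths are hypothetical.

```python
from fnmatch import fnmatchcase


def is_documented(path, about_resource, ignored_resources=()):
    """
    Return True if path is matched by the about_resource pattern and is
    not matched by any of the ignored_resources patterns.
    """
    if not fnmatchcase(path, about_resource):
        return False
    return not any(fnmatchcase(path, pattern) for pattern in ignored_resources)


about_resource = "*/glowroot/*"
ignored_resources = ["*/glowroot/plugins/*"]

# Hypothetical deployed paths
paths = [
    "to/lib/glowroot/agent.jar",
    "to/lib/glowroot/plugins/servlet.jar",
    "to/lib/other/util.jar",
]
for path in paths:
    print(path, is_documented(path, about_resource, ignored_resources))
```

Here only the first path is documented by the ABOUT file: the second is excluded by ignored_resources and the third never matches about_resource.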

An ABOUT file is like a package manifest. Or it is like a mini PurlDB.

The cases we have are:

  1. an ABOUT file present in the codebase designed to document the files present there. The about_resource field points to these existing files. This is the easy case.

  2. OR, an ABOUT file present in the codebase designed to document files not yet present, such as files that will be fetched or compiled or created as a result of the build. This is the case we need and use in d2d pipelines today and the more complex case.

Issues arise when we mix the two cases, as we cannot distinguish whether an ABOUT file present in the codebase is designed to be used as a missing package manifest or as a deployed codebase curation/override.

Therefore we need to determine whether an ABOUT file is of type 1. or 2., and this can be done in several different ways:

  • In the ABOUT files

    • A. implicitly based on match paths (this may be brittle). We could have ABOUT paths prefixed with to/ or from/ .... but this would be weird.
    • B. explicitly with an attribute (like is_curated but another name). We really want to state which "codebase" (deployment or development) an ABOUT file applies to.
    • C. explicitly with a deployed path pattern attribute: this would be a fairly general purpose option that would be usable by SCIO and other tools too. Since we already have "ignore_patterns", we would also need ignore patterns for deployed paths, which makes this more complex.
  • Externally in ScanCode.io

    • D. designate with a setting that from/ ABOUT files are meant to be type 2. This may be complex since types 1. and 2. can be mixed. Another possibility could be to designate the type 2. files with patterns.
    • E. always use a specific separate archive to upload in the project when using ABOUT files as type 2. Or use a pattern to point to which ones are which.

Solutions B. and E. are the most appealing at first.

The B. way seems to be the most appealing and general purpose approach.
If we go with this, an ABOUT file about deployed code would look like this:

about_resource: '*/glowroot/*'
ignored_resources:
  - '*/glowroot/plugins/*'
codebase: deployment
name: glowroot
version: 0.13.6

Another ABOUT file to document the source side could look like this (and we could have both):

about_resource: '*/src/glowroot/*'
codebase: development
name: glowroot
version: 0.13.6

In this case, codebase: development would be the default and not needed.

With this it would not matter WHERE the ABOUT file lives in the codebase, it would be usable in all places. Existing ABOUT files would be still valid. So there is no compatibility breakage. And there is no need for solution E.
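As a rough sketch of how a tool could consume this, assuming ABOUT files are already parsed into dicts: records are grouped by their codebase value, and a missing value falls back to development. The function and constant names are hypothetical, not an existing API.

```python
DEFAULT_CODEBASE = "development"


def group_by_codebase(about_records):
    """
    Group ABOUT records (dicts) by their "codebase" attribute.
    Records without the attribute fall back to the default codebase,
    so existing ABOUT files keep working unchanged.
    """
    groups = {}
    for about in about_records:
        codebase = about.get("codebase") or DEFAULT_CODEBASE
        groups.setdefault(codebase, []).append(about)
    return groups


# The two example ABOUT files above, with the default left implicit
# in the second one.
deployed_about = {
    "about_resource": "*/glowroot/*",
    "ignored_resources": ["*/glowroot/plugins/*"],
    "codebase": "deployment",
    "name": "glowroot",
    "version": "0.13.6",
}
source_about = {
    "about_resource": "*/src/glowroot/*",
    "name": "glowroot",
    "version": "0.13.6",
}

groups = group_by_codebase([deployed_about, source_about])
print(sorted(groups))
```

Note that the grouping never looks at where the ABOUT file itself is stored, which is the point of option B.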

There are interesting cases that this approach enables:

  1. we can have different licenses/copyright and details for the source and the deployed code
    It is common to have a source with GPL and LGPL, but we deploy only the LGPL library
  2. we can document multiple deployed places. Here we just add another ABOUT file for each place.

@mjherzog
Member

For case 2 (an ABOUT file present in the codebase designed to document files not yet present, such as files that will be fetched or compiled or created as a result of the build. This is the case we need and use in d2d pipelines today and the more complex case.) I am concerned that terms like development and deployment are not sufficient (and not well-understood/agreed outside AboutCode). Don't we want to distinguish between a package that is vendored in the development codebase vs one that is fetched from a separate repository?

@pombredanne
Member Author

re:

I am concerned that terms like development and deployment are not sufficient (and not well-understood/agreed outside AboutCode). Don't we want to distinguish between a package that is vendored in the development codebase vs one that is fetched from a separate repository?

That's what we are trying to do here. We could therefore have multiple ABOUT files: some may document code that is in the current codebase, while others document code that is fetched/built and found only in some other codebase, typically the deployed binaries.
Here the idea is to say we have an ABOUT file that documents code found elsewhere. Maybe development/deployed is not explicit and obvious enough, and then we can use something like any of these beyond "deployed", though "deployed" is fairly common IMHO:

  • binary
  • deployed-binaries
  • distributed
  • published
  • build
  • release
  • debug
  • shipped

Eventually, we could leave it to the user to decide what this means, but I would rather come up with a few simple and obvious codebase names.

@pombredanne
Member Author

See also https://www.cisa.gov/sites/default/files/2023-04/sbom-types-document-508c.pdf

Table 1: SBOM Type Definition and Composition lists these SBOM types: Design, Source, Build, Analyzed, Deployed, Runtime.

We could use a subset of these as a list. This is not perfect, but is published and exists.
The difficulty is that these are not entirely obvious to me, so I would likely only support source (default) and deployed

With this definition:

  • about_codebase: An ABOUT file describes a package and its resources in a codebase of a certain "type". This codebase type is based on the "SBOM types" defined by CISA. We support only these values for a codebase type in the about_codebase field: source and deployed. It is optional and defaults to source.

    • The about_codebase: source codebase is the default. It refers to the development environment, typically the version control code checkouts, the source files, and the included or embedded dependencies present in the source code used to build the code. In this case, the resources documented by this ABOUT file are expected to be present in this codebase. ABOUT files are commonly stored in source and co-located with the code they document.

    • The about_codebase: deployed codebase refers to the deployment environment, such as the code that is deployed to run on a system or device (typically a "production" system) with all its effective bundled dependencies and packages as compiled, built, copied, or used in such a running system. In this case, the resources documented by this ABOUT file are expected to be present in the deployed codebase, but may not be present in the current, source codebase. In effect, ABOUT files are commonly stored in the source codebase with about_codebase: deployed to document code that is present in the deployed codebase, such as code that is fetched or provisioned during the build (like an Apache Maven JAR or a JavaScript npm package), or to document the deployed subset of a larger source package, as is common with Linux "userland" utilities composed of a library and a command tool when only the library is deployed.
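A possible reading of this definition as a small validation helper, assuming the ABOUT file is parsed into a dict. The helper name and the choice to raise ValueError on an unsupported value are assumptions of this sketch; the proposal only says which values are supported and what the default is.

```python
SUPPORTED_CODEBASES = ("source", "deployed")
DEFAULT_ABOUT_CODEBASE = "source"


def get_about_codebase(about):
    """
    Return the codebase type for an ABOUT record (a dict).
    The about_codebase field is optional and defaults to "source".
    Raise ValueError for values other than "source" and "deployed".
    """
    value = about.get("about_codebase")
    if value is None or value == "":
        return DEFAULT_ABOUT_CODEBASE
    if value not in SUPPORTED_CODEBASES:
        raise ValueError(
            f"Unsupported about_codebase: {value!r}. "
            f"Supported values: {', '.join(SUPPORTED_CODEBASES)}"
        )
    return value


print(get_about_codebase({"name": "glowroot"}))
print(get_about_codebase({"name": "glowroot", "about_codebase": "deployed"}))
```

Rejecting other CISA types such as build or runtime keeps the field small for now; the supported tuple could be extended later if more types turn out to be useful.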

@chinyeungli @mjherzog @DennisClark @AyanSinhaMahapatra ping, feedback welcomed.

@AyanSinhaMahapatra
Member

@pombredanne
Agree with you on #834 (comment) that option B. is the most appealing here, as we discussed, and I now like about_codebase: source or about_codebase: deployed much more here (than codebase: development and codebase: deployed) in the comment above.

@mjherzog
Member

If you read the CISA type descriptions closely, I think that we should be using "Build" rather than "Deployed". The description for Build is: "SBOM generated as part of the process of building the software to create a releasable artifact (e.g., executable or package) from data such as source files, dependencies, built components, build process ephemeral data, and other SBOMs."
The CISA description for Deployed "Highlights software components installed on a system, including other configurations and system components used to run an application", which seems to include runtime components like a JVM, database, app server, etc.
I think that it is good to consider the CISA definitions, but it may not be a good framework for what we are trying to do here.

@mjherzog
Member

We need to come up with good definitions here and use them to improve our AboutCode-internal terminology about codebases. The CISA types are for SBOMs, which seems to be a different use case.
