Releases · clearlydefined/crawler

11 Jun 02:00

elrayle

v2.1.0

1715ccb

v2.1.0 Latest

Latest

Release Highlights

Release tag: v2.1.0

This release includes a single configuration change that allows deploys to specify the location of the queues that hold input for the crawler.

Upgrade Notes

No Action Required.

Optionally, you can set any of the new environment variables. See sections below that describe how these are used.

CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS (default 0) - The default is the same as the hardcoded value in the previous release. Recommended value is 5.
CRAWLER_LICENSEE_PARALLELISM (default 10) - The default is the same as the hardcoded value in the previous release
CRAWLER_AZBLOB_IS_SPN_AUTH (default false) - boolean which needs to be true if using SPN auth
CRAWLER_AZBLOB_SPN_AUTH (default ‘’) - The Azure SPN for the Azblob will only be used if CRAWLER_AZBLOB_IS_SPN_AUTH is true
CRAWLER_QUEUE_AZURE_IS_SPN_AUTH (default false) -boolean needs to be true if using SPN auth
CRAWLER_QUEUE_AZURE_SPN_AUTH (default ‘’) - The Azure SPN for the queue account will only be used if CRAWLER_QUEUE_AZURE_IS_SPN_AUTH is true

What’s changed

Changes: v2.0.1..v2.1.0

Minor Changes

Add support for visibility timeout in AzureQueueStore

Add support for visibility timeout in AzureQueueStore (#639) @qtomlinson
Updated the default of CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS (#642) @qtomlinson

There are 4 tools the crawler runs each of which generate tool results for package-version coordinates. As tools complete, a message is placed on the harvest queue identifying the coordinates and tool. The service will compute the definition if the current definition does not already have the results for that tool (at the tool’s version). If the tools complete spread across time, there is the potential for the service to compute the definition for the coordinates up to 4 times.

The addition of a visibility timeout for the AzureQueueStore allows messages to be hidden for a specified duration after being pushed onto the harvest queue. This allows for more of the crawler results to complete before the service computes the definition. This has the potential to reduce the number of definition computes from 4 down to 1. This enhancement works in conjunction with the improved definition computation on the service side, reducing the number of definition computations when component harvest results are available.

The visibility timeout for the queue store is controlled by the CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS environment variable. The current default is set to 300 seconds (5 minutes). To maintain the existing behavior of not hiding messages upon adding them to the queue, set CRAWLER_HARVESTS_QUEUE_VISIBILITY_TIMEOUT_SECONDS to 0.

Support SPN Authentication while maintaining Connection String approach for backward compatibility

Upgrade Azure Storage SDK to a modern version (#629) @RomanIakovlev, @ljones140
Separate crawler queue connection from harvest (#633) @ljones140

This change is a prerequisite to enable token-based authentication for the Azure Storage operations (blobs and queues). It updates the SDK to a modern version that is currently supported. See Issue #622 for information on why this change was necessary.

Old approach:

Prior to this change the connection string for both blobs and queues was used for authentication and was set in the environment variable CRAWLER_AZBLOB_CONNECTION_STRING. This approach is still allowed, but not recommended.

New approach:

The recommended best practice is to use SPN for authentication. As part of this work, authentication can now be done separately for blob storage and queues, or you can use the same SPN for both. To use this for authentication, set up the following environment variables.

To use SPN authentication for Azure Blob storage, set the following two environment variables..

CRAWLER_AZBLOB_IS_SPN_AUTH (default false) - boolean needs to be true if using SPN auth
CRAWLER_AZBLOB_SPN_AUTH (default ‘’) - Azure SPN for the Azblob will only be used if CRAWLER_AZBLOB_IS_SPN_AUTH is true

To use SPN authentication for Azure queues storage, set the following two environment variables..

CRAWLER_QUEUE_AZURE_IS_SPN_AUTH (default false) - boolean needs to be true if using SPN auth
CRAWLER_QUEUE_AZURE_SPN_AUTH (default ‘’) - Azure SPN for the queue account will only be used if CRAWLER_QUEUE_AZURE_IS_SPN_AUTH is true

You can use different SPN for each or use the same for both.

Update fetch file to centralize default headers (#620) @yashkohli88

Refactor all fetch functions into fetch.js and use the centralized user-agent header across fetch functions for various package managers.

Persist manifest information for sourcearchive components (#625) @qtomlinson

Update Tool Version for mavenExtract and nugetExtract (#631) @yashkohli88

Tool versions updated to 1.3.1 for mavenExtract and 1.2.3 for nugetExtract

Make licensee's max degree of parallelism configurable (#641) @jkbschmid

The max degree of parallelism for the licensee tool was hardcoded to 10. This adds CRAWLER_LICENSEE_PARALLELISM environment variable to allow for more control of crawler behavior. The default is 10 to maintain backward compatibility.

Patch Updates

Use the latest @clearlydefined/spdx 0.1.10 (#643) @qtomlinson

Contributors

RomanIakovlev, ljones140, and 3 other contributors

Assets 2

01 Nov 10:42

elrayle

v2.0.1

a1d12ac

v2.0.1

What's Changed

update workflow to operations/v3.2.0 which adds latest tag for Docker Hub by @elrayle in #624

Full Changelog: v2.0.0...v2.0.1

Contributors

elrayle

Assets 2

28 Oct 16:40

elrayle

v2.0.0

284d8f8

v2.0.0

Release tag: v2.0.0

Upgrade Notes

No steps are required to upgrade to this release as a user of ClearlyDefined. Any local harvesters will need to get the latest crawler image from Docker Hub and restart their crawler.

All major changes are related to data output changes brought in by updates to license identification tools and the license extraction process.

Note: Requests for definitions do not initiate a harvest request when a definition already exists. A harvest request is required to update raw tool results from which the definition will be constructed. Note as well that harvesting takes significant time. There will be a delay from the time the harvest request is made before the results will be reflected in a definition request.

What’s changed

Major Changes

Update license detection tools

Update licensee scan tool updated from v9.12 to v9.16.1 by @yashkohli88 in #549
Update scancode-toolkit from v30.1.0 to v32.1.0 by @lumaxis in #537

Modifications to ClearlyDefined license extraction

Update PodExtract tool version by @qtomlinson in #566
Derive license from info.license over classifiers in pypi registry data by @qtomlinson in #586

Minor Changes

New traversal policy

Introduce “reharvestAlways” traversal policy to make re-harvest simpler by @qtomlinson in #598

New “reharvestAlways” policy behavior:

When the tool result for a component is available, the tool will be rerun and tool result updated, similar to the "always" policy.
When the tool result for a component is not available, the component will be fetched and the tool will be run. This differs from the “always” policy which skips running when the results do not already exist.

Other minor changes

Remove rimraf by @lumaxis in #558
Update spdx parsing which includes support for passing in LicenseRef map by @ljones140 in #606

Bug Fixes and Patches

Development related

add sha and version to ‘/‘ endpoint by @elrayle in #574
Fix fetching latest version for some pod components by @qtomlinson in #588
Make scancode parallelism configurable by @RomanIakovlev in #612

DevOps

Deploy production crawler to Clearly Defined’s Azure account, along with MSFT by @ljones140 in #608
Deploys to dev on master merge by @ljones140 in #601
Deploy dev crawler via GitHub action by @ljones140 in #599
tests should run for changes in prod and have the option to run manually by @elrayle in #592
Add separate workflow step for testing Docker build by @lumaxis in #580
docs: add SECURITY.md by @nickvidal in #584

Dependencies

Bump express from 4.18.2 to 4.19.2 by @dependabot in #564
Bump debug from 4.1.1 to 4.3.5 by @dependabot in #581
Bump braces and patch-package by @dependabot in #582
Updated deprecated dependency request-promise-native by @yashkohli88 in #576
Cleanup dependencies by @lumaxis in #557

New Contributors

@nickvidal made their first contribution in #584
@ljones140 made their first contribution in #591

Full Changelog: v1.2.0...v2.0.0

Contributors

lumaxis, RomanIakovlev, and 6 other contributors

Assets 2

13 Aug 12:58

elrayle

v1.2.0

2a5e0a5

v1.2.0

Release Highlights

Release tag: v1.2.0

This release includes a single configuration change that allows deploys to specify the location of the queues that hold input for the crawler.

Upgrade Notes

No Action Required. Optionally, you can set a configuration to control where input queues will be constructed.

What’s changed

Changes: v1.1.0..v1.2.0

Minor Changes

Configure location of queues

Make Crawler queue in Azure separate from Azure results storage (#591) (@ljones140)

Release v1.2.0 adds the support of running Crawler queues in a separate Azure
account as the results storage blobs.

Requirement came from organizations that want to submit results to clearlydefinedprod Azure but don't want to have the queues in the same Azure account.

The crawler configuration takes an additional env var CRAWLER_QUEUE_AZURE_CONNECTION_STRING
If provided the crawler will use this storage account for the queues.
If not provided it will use same connection defined in CRAWLER_AZBLOB_CONNECTION_STRING

Bug Fixes and Patches

docs: add SECURITY.md (#584) (@nickvidal)
Add separate job for testing Docker build (#580) (@lumaxis)

Contributors

lumaxis, ljones140, and nickvidal

Assets 2

16 May 03:05

elrayle

v1.1.0

fcf83f0

v1.1.0

Release Highlights

Release tag: v1.1.0

There is one change of interest:

Conda was added as a package manager source. Details on usage are provided below under the Add Conda support section.

Upgrade Notes

No Action Required. Optionally, you can start requesting harvests for Conda packages. See details below.

What’s changed

Changes: v1.0.2..v1.1.0

Minor Changes

Add Conda support

There is one significant change in this release to add support for Conda package manager. It is classified as minor because it is additive. It does not impact the functioning of previously supported package managers.

conda crawler implementation (#532) (@lamarrr)

Conda exposes packages in a different format from other Python repositories like pypi. Conda is a Python environment locked to a specific Python version. It deals with packages locked to a specific version for a version of the channel, this ensures packages do not break due to one incompatibility or another as the packages are managed for compatibility, similar to how you'd ship a docker container.

The primary consumption point is the "packages" themselves which are accompanied with scripts to modify the environment and setup the packages and dependencies which are then consumed by the setup application. The packages may also contain DLLs, scripts, compiled Python binary (.pyc), python code. etc.

The structure of Conda repositories and their indexing process are described in Channels and generating an index (Conda docs).

Conda has three main channels: anaconda-main, anaconda-r, and conda-forge which is more geared toward business uses

We crawl both the packages and the source code (not always specified) for the licensing metadata and other metadata about the package.

The source from which the Conda packages are created is often, but not always, provided via a URL that links a compressed source file hosted externally, sometimes via GitHub, or another website. Note that this is a file and not a git repository.

The main Conda package is hosted on the Conda channels themselves and is compressed and contains necessary licensing information, compilers, environment configuration scripts, dependencies, etc. that are needed to make the package work.

Coordinates syntax:

type (required) - identifies to use the Conda provider (values: conda | condasource)
provider (required) - channel on which the package will be crawled. (values: conda-forge | anaconda-main | anaconda-r)
namespace (optional) - architecture and OS of the package to be crawled (e.g. win64, linux-aarch64). If no architecture is specified, any architecture is chosen.
package name (required): name of the package
revision (optional): package version and optional build version (format: (${version} | )-(${buildversion} | )) (e.g. 0.3.0, 0.3.0-py36hffe2fc). If it is a conda coordinate type, the build version of the package is usually a conda-specific representation of the build tools and environment configuration, and build iteration of the package (e.g. for a Python 3.9 environment, buildversion is py39H443E). If none is specified, the latest one will be selected using the package's timestamp.

Examples:

conda/conda-forge/linux-aarch64/numpy/1.13.0
condasource/conda-forge/linux-aarch64/numpy/1.13.0
conda/conda-forge/-/numpy/1.13.0/
conda/conda-forge/linux-aarch64/numpy/-py36

Conda-forge is a community effort and packages are published by opening PRs on their GitHub repository as described in Contributing packages (Conda Forge docs).

Bug Fixes and Patches

Development related

Pin reuse version to the most recent 3.0.1 (#559) (@qtomlinson)
Fix ENOENT error during harvesting Conda components (#575) (@qtomlinson)

DevOps

Improve formatting and linting setup (#538) (@lumaxis)

Dependencies

Bump follow-redirects from 1.15.5 to 1.15.6 (#562) (@dependabot[bot])

Contributors

lumaxis, lamarrr, and 2 other contributors

Assets 2

13 May 17:03

elrayle

v1.0.2

63526a8

v1.0.2

Release Highlights

Release tag: v1.0.2

This is a patch release with bug fixes.

Upgrade Notes

No Action Required

What’s changed

Changes: v1.0.1..v1.0.2

Bug Fixes and Patches

Bug Fixes

Rename variable for consistency (#570) (@lumaxis)
Fix URL to fetch Go packages with latest version (#569) (@yashkohli88)

Contributors

lumaxis and yashkohli88

Assets 2

27 Apr 13:26

elrayle

v1.0.0

9ac0a7e

v1.0.0

Release v1.0.0 is a re-release of the current production crawler which was last released Dec 5, 2022. There was a release recently on Apr 2, 2024. This was triggered by a merge of master into prod. I would expect this to be release 1.1.0. The purpose of the v1.0.0 release is to establish a known baseline as the starting point for the transition to using Semantic Versioning for the released versions. The purpose of the v1.1.0 release is to capture the changes that exist in master at this moment in time.

Releases are published as Docker images to Docker Hub. Future releases will be published to Docker Hub and GitHub Packages.

Release Highlights

Release tag: v1.0.0

NOTE: The version in package.json differs from the release tag because it was previously set and could not be changed.

Breaking Changes

none

Upgrade Notes

No Action Required

What’s changed

This release is identical to the code that has been the production release since Dec 5, 2022.

previous-release:

tag: v0.1.1 tagged but not published as a release
date: 8-3-2022

Changes: v0.1.1..v1.0.0

Assets 2

27 Apr 14:26

elrayle

v1.0.1

5465642

v1.0.1

Release Highlights

Release tag: v1.0.1

This is a patch release with bug fixes, dependency updates, documentation improvements, and devops maintenance related to running tests.

Upgrade Notes

No Action Required

What’s changed

Changes: v1.0.0..v1.0.1

Bug Fixes and Patches

Bug Fixes

Fix extracting license information for pypi packages (#518) (@qtomlinson)
Fix harvesting git components (#517) (@qtomlinson)
fixing bundler install error by locking verison (#512) (@mpcen)
Exclude .git directory content when calculating package file count (#525) (@qtomlinson)
lowercasing package names for nuget api fetching (#515) (@mpcen)

Documentation

Update README.md - fix docker run for Mac OS (#560) (@yashkohli88)
Update readme - describe request type (#526) (@qtomlinson)

Update Dependencies

Update Node version and dependencies (#522) (@lumaxis)
Bump @babel/traverse from 7.12.9 to 7.23.9 (#545) (@dependabot[bot])
Bump follow-redirects from 1.15.1 to 1.15.5 (#544) (@dependabot[bot])
Bump axios from 0.27.2 to 1.6.0 (#543) (@dependabot[bot])
Bump xml2js from 0.4.23 to 0.5.0 (#542) (@dependabot[bot])
Bump luxon from 2.3.0 to 2.5.2 (#511) (@dependabot[bot])
update @clearlydefined/spdx to 0.1.7 (#530) (@qtomlinson)

DevOps/Maintenance

Delete azure-pipelines.yml (#556) (@lumaxis)
Add GitHub Actions workflow to run tests (#551) (@lumaxis)

Contributors

lumaxis, mpcen, and 3 other contributors

Assets 2

Releases: clearlydefined/crawler

v2.1.0

Release Highlights

Upgrade Notes

What’s changed

Minor Changes

Add support for visibility timeout in AzureQueueStore

Support SPN Authentication while maintaining Connection String approach for backward compatibility

Update fetch file to centralize default headers (#620) @yashkohli88

Persist manifest information for sourcearchive components (#625) @qtomlinson

Update Tool Version for mavenExtract and nugetExtract (#631) @yashkohli88

Make licensee's max degree of parallelism configurable (#641) @jkbschmid

Patch Updates

Contributors

Uh oh!

v2.0.1

What's Changed

Contributors

Uh oh!

v2.0.0

Upgrade Notes

What’s changed

Major Changes

Update license detection tools

Modifications to ClearlyDefined license extraction

Minor Changes

New traversal policy

Other minor changes

Bug Fixes and Patches

New Contributors

Contributors

Uh oh!

v1.2.0

Release Highlights

Upgrade Notes

What’s changed

Minor Changes

Configure location of queues

Bug Fixes and Patches

Contributors

Uh oh!

v1.1.0

Release Highlights

Upgrade Notes

What’s changed

Minor Changes

Add Conda support

Bug Fixes and Patches

Contributors

Uh oh!

v1.0.2

Release Highlights

Upgrade Notes

What’s changed

Bug Fixes and Patches

Contributors

Uh oh!

v1.0.0

Release Highlights

Breaking Changes

Upgrade Notes

What’s changed

Uh oh!

v1.0.1

Release Highlights

Upgrade Notes

What’s changed

Bug Fixes and Patches

Contributors

Uh oh!