Filter unreleased phantom versions from registry build #65984

Merged
kaxil merged 1 commit into apache:main from astronomer:registry-phantom-version-filter
Apr 28, 2026

@kaxil
Member

@kaxil kaxil commented Apr 28, 2026

Summary

dev/registry/extract_metadata.py took the top entry of provider.yaml's versions: list as a provider's "latest" version with no verification that a real release tag exists. Two concrete failure modes shipped phantom pointers to the live registry:

  • providers/celery/provider.yaml lists 3.19.0 at the top, but only providers-celery/3.19.0rc1 and rc2 tags exist — no final. The registry would advertise celery 3.19.0 as latest while pip install apache-airflow-providers-celery resolves to 3.18.0.
  • providers/akeyless/ is brand-new in-tree with versions: [1.0.0] but no providers-akeyless/* tag. The provider would appear on the registry with broken outbound links to its non-existent PyPI release, GitHub tag, and docs page.

Design rationale

One-shot tag load: load_release_tags() runs git tag --list 'providers-*' once and returns a set, so per-provider lookups are O(1). Wave tags (providers/<YYYY-MM-DD>) don't match the glob (different prefix); only per-provider release tags are loaded.
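
A minimal sketch of the one-shot tag load described above, not the code from this PR. The function signature, the `repo_root` parameter, the `parse_release_tags` helper, and the warning text are assumptions made for illustration; only the `git tag --list 'providers-*'` command and the "return a set" behaviour come from the description.

```python
import subprocess


def parse_release_tags(git_output: str) -> set[str]:
    """Turn `git tag --list` output into a set of non-empty tag names."""
    return {line.strip() for line in git_output.splitlines() if line.strip()}


def load_release_tags(repo_root: str = ".") -> set[str]:
    """Run `git tag --list 'providers-*'` once; return an empty set on failure."""
    try:
        result = subprocess.run(
            ["git", "tag", "--list", "providers-*"],
            cwd=repo_root,
            capture_output=True,
            text=True,
            check=True,
        )
    except (OSError, subprocess.CalledProcessError) as exc:
        # Missing git binary or a failing git command: warn and return nothing.
        print(f"warning: could not load release tags: {exc}")
        return set()
    return parse_release_tags(result.stdout)
```

Because the result is a set, each later membership check such as `"providers-celery/3.18.0" in tags` is a constant-time lookup rather than another `git` call.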

Walk newest-first: find_latest_released_version walks versions: in document order and returns the first entry with a matching tag. All current provider.yaml files are newest-first.
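
The walk can be sketched roughly as follows. The function name is taken from the description, but the parameter list and the `None` return for "nothing released" are assumptions; the tag shape `providers-<id>/<version>` follows the commit message.

```python
from __future__ import annotations

from collections.abc import Iterable


def find_latest_released_version(
    provider_id: str, versions: Iterable[str], release_tags: set[str]
) -> str | None:
    """Return the first entry (document order) that has a matching release tag."""
    for version in versions:
        if f"providers-{provider_id}/{version}" in release_tags:
            return version
    # No entry is tagged: the provider is treated as unreleased.
    return None


tags = {
    "providers-celery/3.19.0rc1",
    "providers-celery/3.19.0rc2",
    "providers-celery/3.18.0",
}
print(find_latest_released_version("celery", ["3.19.0", "3.18.0"], tags))  # 3.18.0
print(find_latest_released_version("akeyless", ["1.0.0"], tags))  # None
```

The exact-match lookup is what keeps `3.19.0rc1` from satisfying a `3.19.0` entry, and the provider id in the tag prefix is what keeps one provider's tags from matching another's.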

Filter the list, not just the latest pointer: filters both Provider.version (singular, the latest) AND Provider.versions (the list). The latter is read by extract_versions.py's backfill, which would otherwise try to git show from non-existent phantom tags.
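
One way to keep the two fields consistent is to derive both from the same filtered list. This is a sketch under assumptions: `released_view` is a hypothetical helper name, and the `3.17.1` entry is made-up example data.

```python
from __future__ import annotations


def released_view(
    provider_id: str, versions: list[str], release_tags: set[str]
) -> tuple[str | None, list[str]]:
    """Return (latest, versions), both restricted to entries that have a tag."""
    released = [
        v for v in versions if f"providers-{provider_id}/{v}" in release_tags
    ]
    # The latest pointer is simply the head of the filtered list, if any.
    latest = released[0] if released else None
    return latest, released
```

With tags for `3.18.0` and `3.17.1` only, a `versions:` list of `["3.19.0", "3.18.0", "3.17.1"]` yields `("3.18.0", ["3.18.0", "3.17.1"])`, so a backfill that iterates the list never sees the phantom `3.19.0`.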

Skip vs. fallback: when no entry in versions: has a tag (truly unreleased provider), the provider is skipped from providers.json rather than emitted with version="0.0.0". The old 0.0.0 fallback shipped a registry page with broken outbound links — strictly worse than the page not existing.
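
The skip-instead-of-fallback decision might look like the loop below. The entry shape, the `build_registry_entries` name, and the log wording are illustrative assumptions, not the actual `providers.json` schema or the script's output.

```python
from __future__ import annotations


def build_registry_entries(
    providers: dict[str, list[str]], release_tags: set[str]
) -> tuple[list[dict], list[str]]:
    """Emit one entry per released provider and collect the ids that were skipped."""
    entries: list[dict] = []
    skipped: list[str] = []
    for provider_id, versions in providers.items():
        released = [
            v for v in versions if f"providers-{provider_id}/{v}" in release_tags
        ]
        if not released:
            # No "0.0.0" placeholder: an unreleased provider is left out entirely.
            print(f"skipping {provider_id}: no released version found")
            skipped.append(provider_id)
            continue
        entries.append(
            {"id": provider_id, "version": released[0], "versions": released}
        )
    if skipped:
        print(f"skipped {len(skipped)} unreleased provider(s): {', '.join(skipped)}")
    return entries, skipped
```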

--allow-unreleased opt-out for staging: maintainers preview newly-bumped versions on staging before tagging. The new flag bypasses the filter for staging dispatches and local dev. Default (no flag) = filter, which is correct for live. The registry-build.yml workflow auto-sets the flag when destination=staging.
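
As a rough illustration of the opt-out, the flag can be modelled as a boolean that short-circuits the filter. `argparse` is used here only to keep the sketch self-contained; how the real script and `breeze` parse the flag is not shown in this PR description, and `select_versions` is a hypothetical helper.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Registry metadata extraction (sketch)")
    parser.add_argument(
        "--allow-unreleased",
        action="store_true",
        help="Bypass the release-tag filter (staging previews and local dev).",
    )
    return parser


def select_versions(
    provider_id: str,
    versions: list[str],
    release_tags: set[str],
    allow_unreleased: bool,
) -> list[str]:
    """Return every version when the flag is set, otherwise only tagged ones."""
    if allow_unreleased:
        return list(versions)
    return [v for v in versions if f"providers-{provider_id}/{v}" in release_tags]
```

Leaving the flag off by default means a forgotten argument produces the stricter live behaviour rather than publishing phantom versions.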

Workflow change: registry-build.yml's checkout sets fetch-tags: true. The default fetch-depth: 1 checkout has no tags, which would silently return an empty set from load_release_tags() and trigger the same versions[0] fallback the rest of this PR is removing. registry-backfill.yml's primary checkout already uses fetch-depth: 0, so tags are present there.

Behaviour change

For destination=live (default):

  • Previously: provider with no tagged version → emitted with version="0.0.0" and broken links.
  • Now: provider with no tagged version → skipped from the registry entirely, with a per-provider "skipping" log line and an end-of-run summary.

For destination=staging:

  • No behaviour change — workflow passes --allow-unreleased, filter is bypassed, all providers including unreleased ones are emitted as before. Maintainers can preview on staging before tagging.

For breeze registry extract-data invoked locally without --allow-unreleased: same as live. Pass --allow-unreleased for local previewing.

Inspecting current main HEAD: exactly two providers are affected for live: akeyless (brand-new, never released) falls into the new "skip" category, and celery has its latest pointer fall back (3.19.0 → 3.18.0).

Gotchas

  • Run on a machine without git installed → load_release_tags() returns an empty set, falls back to the previous behaviour with a one-line warning. Acceptable for local dev; production runs through CI which always has git.
  • A bogus tag (e.g. providers-amazon/99.99.0 exists but provider.yaml doesn't exist at that commit) would still pass the existence check. Defending against that requires a git show validation per tag and is left as follow-up — current tag hygiene is good enough that no real provider has this issue today.
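
For the follow-up mentioned in the second gotcha, a per-tag check could be sketched as below. This is not part of the PR. It swaps in `git cat-file -e` (an existence test that prints nothing) in place of the `git show` the description mentions, and both helper names and the `yaml_path` parameter are hypothetical. The caller must supply the path because the location of `provider.yaml` at an older tag may differ from the current layout.

```python
from __future__ import annotations

import subprocess


def provider_yaml_spec(tag: str, yaml_path: str) -> str:
    """Build the `<rev>:<path>` object name that git resolves to a file at a tag."""
    return f"{tag}:{yaml_path}"


def tag_contains_provider_yaml(
    tag: str, yaml_path: str, repo_root: str = "."
) -> bool:
    """Return True only if `yaml_path` exists in the tree that `tag` points to."""
    try:
        result = subprocess.run(
            ["git", "cat-file", "-e", provider_yaml_spec(tag, yaml_path)],
            cwd=repo_root,
            capture_output=True,
            check=False,
        )
    except OSError:
        # git is not installed or repo_root is unusable: treat as "not verified".
        return False
    return result.returncode == 0
```

Running this for every tag costs one subprocess per tag, which is the trade-off against the single `git tag --list` call used for the existence check.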

Known follow-ups (not in scope here)

  • extract_parameters.py runs against provider.yaml directly and doesn't yet honour the skip signal — an unreleased provider's modules can still leak into modules.json. Tracked separately.

`extract_metadata.py` took the top entry of `provider.yaml`'s `versions:`
list as a provider's "latest" version with no verification that a real
release tag exists. Provider release prep prepends the next version to
`versions:` BEFORE the tag lands, and pre-release-only versions match
`versions:` but have no final tag. Without filtering, the registry ships
phantom "latest" pointers to non-existent PyPI releases / GitHub tags /
docs pages.

Concrete cases this PR catches:

- `providers/celery/provider.yaml` lists `3.19.0` at the top, but only
  `providers-celery/3.19.0rc1` and `rc2` tags exist -- no final.
- `providers/akeyless/` is brand-new in-tree with `versions: [1.0.0]`
  but no `providers-akeyless/*` tag.

The fix loads all `providers-<id>/<version>` git tags once via
`git tag --list 'providers-*'`, walks each provider's `versions:` list
newest-first, picks the first entry with a matching tag for the singular
`version` (latest) field, and filters the `versions` (list) field to the
same tagged subset. Providers with NO version that has a matching tag are
skipped from the registry entirely (rather than emitted with phantom
pointers).

Also filters the `versions` list -- not just the singular `version` -- so
downstream consumers like `extract_versions.py`'s backfill don't try to
extract from non-existent tags.

`registry-build.yml`'s checkout now sets `fetch-tags: true`. Without it
the default `fetch-depth: 1` checkout has no tags, the filter silently
returns an empty set, and the script falls back to the unfiltered
behaviour. `registry-backfill.yml`'s primary checkout already uses
`fetch-depth: 0` so tags are present there.

Tests: TestLoadReleaseTags (3 cases: parsing, subprocess error, missing
git binary), TestFindLatestReleasedVersion (6 cases including phantom
top, RC-only, cross-provider mismatch, empty list), and
TestVersionsListFiltering (3 cases asserting the list is filtered in
parallel with the latest pointer).
@potiuk
Member

potiuk commented Apr 28, 2026

Hmm. We should likely release celery? Possibly even automatically?

@kaxil kaxil merged commit 38d8d41 into apache:main Apr 28, 2026
136 checks passed
@kaxil kaxil deleted the registry-phantom-version-filter branch April 28, 2026 01:06
@github-actions
Contributor

Backport failed to create: v3-2-test. View the failure log Run details

Note: As of "Merging PRs targeted for Airflow 3.X", the committer who merges the PR is responsible for backporting the PRs that are bug fixes (generally speaking) to the maintenance branches.

In case of doubt, please ask in the #release-management Slack channel.

Status Branch Result
v3-2-test Commit Link

You can attempt to backport this manually by running:

cherry_picker 38d8d41 v3-2-test

This should apply the commit to the v3-2-test branch and leave the commit in conflict state marking
the files that need manual conflict resolution.

After you have resolved the conflicts, you can continue the backport process by running:

cherry_picker --continue

If you don't have cherry-picker installed, see the installation guide.

@vatsrahul1001
Contributor

Manual backport #66902

vatsrahul1001 added a commit that referenced this pull request May 14, 2026

(cherry picked from commit 38d8d41)

Co-authored-by: Kaxil Naik <kaxilnaik@gmail.com>

Labels

area:dev-tools, area:registry, backport-to-v3-2-test

Projects

Status: Done
