Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Disable broken and outdated CI #39467

Open
wants to merge 9 commits into
base: develop
Choose a base branch
from

Conversation

tobiasdiez
Copy link
Contributor

@tobiasdiez tobiasdiez commented Feb 6, 2025

Many of the CI runs after a new release are failing, for months now, see eg https://github.com/sagemath/sage/actions/runs/12979684199/job/36218126145. Some of these failures are genuine (eg a certain package cannot be built on a certain system) and some others are due to constraints of the build system (eg running out of free space). Since there is very little point in senselessly burning energy, all runs that were failing for the last releases are disabled. Once the underlying issues are fixed, they can be easily be re-enabled.

Moreover, the "minimal" runs where only a couple of systems packages are installed and most are build using sage are removed, keeping only the "maximal" where all available system packages are installed.

New test run: https://github.com/tobiasdiez/sage/actions/runs/13199372232/job/36847711005

📝 Checklist

  • The title is concise and informative.
  • The description explains in detail what this PR is about.
  • I have linked a relevant issue or discussion.
  • I have created tests covering the changes.
  • I have updated the documentation and checked the documentation preview.

⌛ Dependencies

Copy link

github-actions bot commented Feb 6, 2025

Documentation preview for this PR (built with commit b8639d7; changes) is ready! 🎉
This preview will update shortly after each push to this PR.

@user202729
Copy link
Contributor

There's a risk of we just forget to reenable it. Is there a good way to do this dynamically (i.e. check which tests the last release fail and skip these)?

@tobiasdiez
Copy link
Contributor Author

There's a risk of we just forget to reenable it.

I'll open an issue where we can track which systems are failing, so we don't forget it. The issue tracker might actually be a more visible place in the first place as most people will not check the CI runs.

Is there a good way to do this dynamically (i.e. check which tests the last release fail and skip these)?

Not that I know.

@tobiasdiez tobiasdiez marked this pull request as ready for review February 7, 2025 12:04
@tobiasdiez tobiasdiez requested review from dimpase and kwankyu February 7, 2025 12:04
@dimpase
Copy link
Member

dimpase commented Feb 7, 2025

Great, thanks for taking care of this.

@dimpase
Copy link
Member

dimpase commented Feb 7, 2025

There's a risk of we just forget to reenable it. Is there a good way to do this dynamically (i.e. check which tests the last release fail and skip these)?

we had a heated argument at some point whether 'minimal' configurations are meaningful beyond someone's curiosity - and they are certainly the longest ones (building gcc, python, and a bunch of equally mainstream packages from sources, that's just a contribution to global warming)
I think they should be just gone.

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 8, 2025

Removing things in CI may break other things elsewhere. Especially the developer guide. In particular, https://doc-release--sagemath.netlify.app/html/en/developer/portability_testing. Did you check?

The docker images created by CI linux are used by developers. The sage binder repo (https://github.com/sagemath/sage-binder-env) is using them (in particular the "minimal" image). The interactive sage doc relies on them. This PR will break the doc.

Changes affecting developers widely should be discussed or at least posted on sage-devel. This PR falls in the category.

all runs that were failing for the last releases are disabled. Once the underlying issues are fixed, they can be easily be re-enabled.

We do not remove code for temporary change. Do not invent a new protocol that code is removed to be disabled.

Improving CI is good, but the present PR is also destructive. I object to this PR.

@kwankyu kwankyu added disputed PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ and removed s: positive review labels Feb 8, 2025
@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

@kwankyu What exactly is broken by this PR?
Please point it out. This PR is removing CI bits which have been broken for a long time. There is nothing wrong with this.

@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

And if you think this PR needs work, please label it so, not "disputed".

@dimpase dimpase added s: needs info and removed disputed PR is waiting for community vote, see https://groups.google.com/g/sage-devel/c/IgBYUJl33SQ labels Feb 8, 2025
@user202729
Copy link
Contributor

Technically you're right in that this merely remove part of CI that has been broken by some other change made months ago (thus doesn't "in addition" break anything), and reverting this is as simple as a git revert, but I think there is a small disadvantage in deleting the code entirely (compared to e.g. just comment it out): people unfamiliar with the change, coming across the file, would have no indication that something was temporarily disabled.

I don't personally use the minimal configurations and have no strong opinion on whether they're actually useful to developers. Of course having more configurations lead to additional maintenance burden and spent computing power (such as the maintenance cost needed now to make the minimal configuration build again), the cost may not be justified by the usefulness to developers/users.

Anyway, since the intention of this PR is to temporarily disable the run until the bug is fixed, can't we just comment it out?

I mean, we're all volunteers, if anyone volunteers to fix the minimal configurations then of course that will supersede this pull request and everyone are happy [1]… but if nobody does, the feature unfortunately will remain broken.

[1] maybe except those who prefer the minimal configurations to be deleted anyway, but I think the usual workflow is deleting a feature requires 1-year deprecation period and/or announcement of some sort.

@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

One doesn't need a deprecation period to remove broken code, and minimal configurations are

  • mostly broken

  • make installation harder, especially for people who don't know better

  • increase maintenance burden (several times a month I have to tell a misguided user on sage-support or sage-devel to please install compilers and/or Python system-wide, and do not torture themselves and us with attempts to build these from source).

  • bloat the codebase of sage the distro with heaps of packages readily available either in the OS, or in a semi standard add-on such as Homebrew, or on PyPI

  • are as un-pythonic as one can imagine (no popular python project commits this sort of stupidity)

Why do we keep killing the project this way, just because few people find these minimal configs interesting or something?

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 8, 2025

@kwankyu What exactly is broken by this PR? Please point it out. This PR is removing CI bits which have been broken for a long time. There is nothing wrong with this.

Did you see the section of the developer guide that I pointed out?

The removed chunks of ci-linux are responsible to push various docker images. One of the image, ubuntu-jammy-standard, is used in the CI check workflows for PRs. This PR will collapse the whole CI checks.

Other images, ubuntu-jammy-minimal-with-targets and ubuntu-jammy-minimal-with-system-packages, are used in sage-binder-env. This PR will collapse it and the interactive sage doc that is using it.

@user202729
Copy link
Contributor

user202729 commented Feb 8, 2025

Can we not escalate this, thanks.

Anyway, I think a good compromise is

  • just merge this
  • open an issue to track which systems it fails on (i.e. equivalent to saying "it is a known bug that the build fails on system X")
  • if someone comes forward with a pull request to fix it later we should incorporate it (the burden is on them to show it works well across multiple operating systems, and we may add a big red warning that tool X or package Y is better installed with operating system installer)

Do you think this is reasonable? (anyway I'm not a maintainer and would be happy with anything)

p/s. @kwankyu , I don't really completely understand what's going on, but if the particular CI has been failing for months, how come the incremental things etc. still work (this is test-new right?) Or does this PR also remove working runs?

# Make sure that all "standard" jobs can start simultaneously,
# so that runners are available by the time that "default" starts.
max_parallel: 50

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this? What is broken?

fi; \\
rm -rf /sage/src; \\
mv src /sage/src; \\
cd /sage && ./bootstrap && ./config.status; \\
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

By the way what is this part supposed to do? As I read retrofit-worktree does

    echo >&2 "usage: $0 WORKTREE_NAME WORKTREE_DIRECTORY"
    echo >&2 "Ensures that the current working directory is a git repository,"
    echo >&2 "then makes WORKTREE_DIRECTORY a git worktree named WORKTREE_NAME."

but I don't understand why. Your commit would make sense if you assume the retrofit-worktree always fail (but why would it?)

max_parallel: 8

minimal:
if: ${{ success() || failure() }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

There are broken platforms. We may selectively turn them off until fixed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"gentoo-python3.10" is dead, as Gentoo doesn't ship python 3.10 any more (and 3.11 will soon be gone, too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

See PR description.

maximal-pre:
if: ${{ success() || failure() }}
needs: [minimal]
uses: ./.github/workflows/docker.yml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

There are broken platforms. We may selectively turn them off until fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

This is not deleted, only renamed to maximal.


optional:
if: ${{ success() || failure() }}
needs: [maximal-pre]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

We may turn this off until fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

We may turn this off until fixed.

You cannot turn single steps off as far as I know. And all of them are currently failing.

targets_optional: '$(echo $(export PATH=build/bin:$PATH && sage-package list :optional: --has-file "spkg-install.in|spkg-install|requirements.txt" --no-file "huge|has_nonfree_dependencies" | grep -v sagemath_doc | grep -v ^_))'

experimental:
if: ${{ success() || failure() }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why delete this?

We may turn this off until fixed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above


needs: [dist]

runs-on: ${{ matrix.os }}
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this also failing? I cannot find a run.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 8, 2025

Anyway, I think a good compromise is

  • just merge this
    ...
    Do you think this is reasonable?

No.

p/s. @kwankyu , I don't really completely understand what's going on, but if the particular CI has been failing for months,

CI-linux runs portability tests on supported platforms. As you can see in

https://github.com/sagemath/sage/actions/runs/12979684199

some jobs like "standard" fail for some platforms while some other jobs like "optional" fail for all patforms.

We should selectively turn them off until fixed, not remove the jobs.

how come the incremental things etc. still work (this is test-new right?)

because the "default" job running tests on ubuntu-jammy-standard succeeds, its docker image is pushed, and the incremental things are using the docker image to test PRs incrementally.

Or does this PR also remove working runs?

Yes.

@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

ubuntu-jammy-minimal-with-targets and ubuntu-jammy-minimal-with-system-packages why minimal in Binder? What's the point of building e.g. gfortran and python3 from source there?

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 8, 2025

Sorry. They are irrelevant questions. The point is that those docker images have been a part of our infrastructure. Removing the infrastructure should not be done without community-wide discussion.

@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

Sorry. They are irrelevant questions. The point is that those docker images have been a part of our infrastructure. Removing the infrastructure should not be done without community-wide discussion.

No, the point it that there is no need to use these particular images. Standard images would do just as well. Please don't tell me that we must keep this minimal nonsense due to historical reasons.

@user202729
Copy link
Contributor

Sorry. They are irrelevant questions. The point is that those docker images have been a part of our infrastructure. Removing the infrastructure should not be done without community-wide discussion.

No, the point it that there is no need to use these particular images. Standard images would do just as well. Please don't tell me that we must keep this minimal nonsense due to historical reasons.

That's true, but I think that isn't in the scope of this pull request. If you want to migrate the generation of the docker images to a standard image you can open a separate one (which sounds like a good thing because the docker image is available more quickly, right?)

For this pull request, @kwankyu 's argument is we should not take advantage of the failing runs to also remove the successful runs. Whether -minimal should be removed can be discussed later.

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 8, 2025

Sorry. They are irrelevant questions. The point is that those docker images have been a part of our infrastructure. Removing the infrastructure should not be done without community-wide discussion.

No, the point it that there is no need to use these particular images. Standard images would do just as well.

Please don't tell me that I chose minimal images just out of my frivolity. For Binder, creating lightest image was my priority.

For the scope of this PR: I agree that portability-test jobs that fails on all platforms should stop wasting energy until fixed. That could be achieved by adding some "if: false" conditions and commenting out and slight refactoring. Removing code is not a way to disable things temporarily.

"archlinux-latest",
"opensuse-15.5-gcc_11-python3.11",
"opensuse-tumbleweed-python3.10",
"opensuse-tumbleweed",
"conda-forge-python3.11",
"ubuntu-bionic-gcc_8-i386",
"debian-bullseye-i386",
]
# 'tox -e update_docker_platforms' updates above
tox_packages_factors:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the comment at L52. The main list is in tox.ini.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I only changed the ci flows on purpose so that people trying to fix these other systems have an easy way to do this via tox/dev container. But I can completely remove them everywhere if you don't think that anyone will actively work on them.

@dimpase
Copy link
Member

dimpase commented Feb 8, 2025

Sorry. They are irrelevant questions. The point is that those docker images have been a part of our infrastructure. Removing the infrastructure should not be done without community-wide discussion.

No, the point it that there is no need to use these particular images. Standard images would do just as well.

Please don't tell me that I chose minimal images just out of my frivolity. For Binder, creating lightest image was my priority.

For the scope of this PR: I agree that portability-test jobs that fails on all platforms should stop wasting energy until fixed. That could be achieved by adding some "if: false" conditions and commenting out and slight refactoring. Removing code is not a way to disable things temporarily.

please don't try to justify the need for default (!) minimal configurations this way. You could have looked at a more lightweight distro such as Gentoo. Not screwing users and developers over with them has a clear priority over a slightly bigger binder image.

I am sick and tired of the constant pushback against bringing any normality to the project by removing this minimal nonsense.

@tobiasdiez
Copy link
Contributor Author

@kwankyu What exactly is broken by this PR? Please point it out. This PR is removing CI bits which have been broken for a long time. There is nothing wrong with this.

Did you see the section of the developer guide that I pointed out?

The removed chunks of ci-linux are responsible to push various docker images. One of the image, ubuntu-jammy-standard, is used in the CI check workflows for PRs. This PR will collapse the whole CI checks.

Thanks, I indeed forgot to change this. Fixed now.

Other images, ubuntu-jammy-minimal-with-targets and ubuntu-jammy-minimal-with-system-packages, are used in sage-binder-env. This PR will collapse it and the interactive sage doc that is using it.

The point of these docker images according to https://doc-release--sagemath.netlify.app/html/en/developer/portability_testing#using-our-pre-built-docker-images-published-on-ghcr-io is to "make[] it easy for developers to debug problems that showed up in the build logs for a given platform." You should use the official docker image and not these portability images for anything else. (Happy to review a PR that creates an official "minimal" docker images that is optimized for space; but note that "minimal" here refers to a completely different "minimal" than in the portability images. As Dima said, you probably want to choose a different base image than ubuntu in this case.) In either case, the old docker images are not deleted so everything should continue working; it's just that you don't get a new image.

Removing things in CI may break other things elsewhere. Especially the developer guide. In particular, https://doc-release--sagemath.netlify.app/html/en/developer/portability_testing. Did you check?

I don't think anything on there is broken on there. Could you please be more specific what you mean? I'm adding a sentence now that clarifies that some images are no longer pushed due to various errors.

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 9, 2025

... You should use the official docker image and not these portability images for anything else.

Let's compare the ubuntu-jammy-minimal as an internal python variable prefixed with an underline. No one can tell me that I should not use the internal python variable if that fits my need. The internal python variable can be removed for their own need, and then, of course, I cannot complain.

Back to the present PR: I object to removing "-minimal" images. I didn't agree with Dima about his arguments for removing them. I think it is a sound idea of Matthias that sage is tested for "minimal", "standard", "maximal" environment. (For the audience, "minimal" means the minimal set of system packages installed before building sage.) They are "stress" tests to keep sage robust in diverse user platforms.

Moreover the "-minimal" images are useful for Binder.

Digression: The testing machine (consisting of ci-linux, ci-macos, tox, etc.) was maintained by Matthias. He is temporarily inactive. Apparently no one understand and can maintain the machine like him. So the machine is crumbling. Then what is a reasonable action in this situation? Fix the broken parts as far as we can. Turn off malfunctioning parts. But do not trash the broken parts.

Removing things in CI may break other things elsewhere. Especially the developer guide. In particular, https://doc-release--sagemath.netlify.app/html/en/developer/portability_testing. Did you check?

I don't think anything on there is broken on there. Could you please be more specific what you mean? I'm adding a sentence now that clarifies that some images are no longer pushed due to various errors.

For example, the section mentions "-minimal" images. If there are actually no "-minimal" images available, then the documentation is "broken".

@kwankyu
Copy link
Collaborator

kwankyu commented Feb 9, 2025

I am sick and tired of the constant pushback against bringing any normality to the project by removing this minimal nonsense.

Why don't you create a PR to remove the "minimal nonsense"? I would object, but if other developers agree, then you can achieve your welfare through normal procedure.

@dimpase
Copy link
Member

dimpase commented Feb 9, 2025

I am sick and tired of the constant pushback against bringing any normality to the project by removing this minimal nonsense.

Why don't you create a PR to remove the "minimal nonsense"? I would object, but if other developers agree, then you can achieve your welfare through normal procedure.

I don't want to do this precisely because I am sick and tired of this, not the least of your insistence on solving technical issues by vote. Because it's not normal procedure, and I think everyone who participated in the recent, unfinished, discussion on Sage project governance models agrees with this.

This thing trigger a PTSD for me.
We did try this "normal procedure", right, only to end up with a fork of the project! It does not work.

And these minimal configurations also don't work at all, or don't work well, it's very far from being a good idea--- they are mostly broken, leading to a broken CI.
Something this PR is trying to fix.

@dimpase
Copy link
Member

dimpase commented Feb 9, 2025

... You should use the official docker image and not these portability images for anything else.

Let's compare the ubuntu-jammy-minimal as an internal python variable prefixed with an underline. No one can tell me that I should not use the internal python variable if that fits my need. The internal python variable can be removed for their own need, and then, of course, I cannot complain.

Back to the present PR: I object to removing "-minimal" images. I didn't agree with Dima about his arguments for removing them. I think it is a sound idea of Matthias that sage is tested for "minimal", "standard", "maximal" environment. (For the audience, "minimal" means the minimal set of system packages installed before building sage.) They are "stress" tests to keep sage robust in diverse user platforms.

Moreover the "-minimal" images are useful for Binder.
How exactly are they useful? Making them 5% smaller?

Digression: The testing machine (consisting of ci-linux, ci-macos, tox, etc.) was maintained by Matthias. He is temporarily inactive. Apparently no one understand and can maintain the machine like him. So the machine is crumbling. Then what is a reasonable action in this situation? Fix the broken parts as far as we can. Turn off malfunctioning parts. But do not trash the broken parts.

I don't think he's coming back, and if he comes back and starts on his tricks again, I and I am sure many others will leave.
Anyhow, it is absolutely normal to remove broken code no-one understands.
Last but not the least, it's not "trashing", it can be restored and worked on if anyone cares.

Removing things in CI may break other things elsewhere. Especially the developer guide. In particular, https://doc-release--sagemath.netlify.app/html/en/developer/portability_testing. Did you check?

I don't think anything on there is broken on there. Could you please be more specific what you mean? I'm adding a sentence now that clarifies that some images are no longer pushed due to various errors.

For example, the section mentions "-minimal" images. If there are actually no "-minimal" images available, then the documentation is "broken".

Fine, this is not hard to fix, I am sure.

@user202729
Copy link
Contributor

it can be restored and worked on if anyone cares.

So @dimpase , hypothetically if someone comes forward and fix the thing to a working state, would you be okay with merging it?

(I believe yes, as it would be a reasonable answer in this case.)

@dimpase
Copy link
Member

dimpase commented Feb 9, 2025

it can be restored and worked on if anyone cares.

So @dimpase , hypothetically if someone comes forward and fix the thing to a working state, would you be okay with merging it?

(I believe yes, as it would be a reasonable answer in this case.)

There are profound reasons for it being very hard, if not just impossible.

  • E.g. building a relatively new gcc with have with an old gcc often doesn't work, you need these two versions to be close enough. As well, I would continue to object having gcc and g++ from the OS in a minimal configuration, but not having gfortran, because it makes no sense.
  • E.g. building a working Python in some configurations is quite tricky/impossible, due to cryptography libraries not being easily available.

But the main point is that maintaining something that is only ever used in a few rather esoteric cases, if at all, is not worth it, and we don't have resources to do it properly (as we see...). These minimal configs are basically leftovers from a mostly unsuccessful software engineering experiment, why do we need to be saddled with them ?

@user202729
Copy link
Contributor

(as we see...)

indeed… as much as I want to avoid heated discussions, it would be too much unpaid work if I were to try to fix the thing to defuse the issues here.

These minimal configs are basically leftovers from a mostly unsuccessful software engineering experiment, why do we need to be saddled with them ?

Evidently @kwankyu disagree that minimal test are useless, and this disagreement seems unsolvable, I think it would be more helpful for your case to show none of the working runs are removed instead.

Last but not the least, it's not "trashing", it can be restored and worked on if anyone cares.

I agree, but commenting out achieves the same effect; and also leaves indication for someone reading the code that the feature was there. I think this can be compared to marking a test case as # known bug (or flaky) instead of delete it entirely.


side note, why is test-new failing? (is it just a temporary failure because of network condition or is it caused by this pull request?)

@dimpase
Copy link
Member

dimpase commented Feb 9, 2025

it would be too much unpaid work

Indeed, and as an active developer with by far the most current package updates on my name, I don't like people who do much less of this kind of work -- directly related to the unhealthy number of packages we vendor -- to tell me to keep up with it.

My current job title is "research professor", in a CS department. I don't see how this sort of work is any kind of research, at a professorial level. Wearing my CS prof hood, I am saying that it's wrong from the CS/IT point of view to keep doing this kind of minimal configurations.

@tobiasdiez
Copy link
Contributor Author

I am very sorry that this PR triggered such an emotionally involved discussion. Perhaps it's best if everyone involved goes out for a walk, has a relaxed dinner with their partner or friends, or does something else to reset their mind. Then we can continue our discussion in a few days, hopefully with fresh eyes and more understanding for each other. See you then!

@dimpase
Copy link
Member

dimpase commented Feb 9, 2025

@tobiasdiez - before I forgot, please remove "fedora-30" from the list of platforms to test on. Fedora 30 reached EOL 5 years ago, it's pretty much impossible to get new Sage to build on it, and there is no point in this test anyway

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants