Skip to content

Conversation

tollsimy
Copy link

@tollsimy tollsimy commented Apr 3, 2025

This PR aims to fix and improve containers for local deployments by:

  • remove obsolete Dockerfiles (move to ./kci docker)
  • merging all compose files in one (pipeline + lava-callback + pipeline-kcidb)
  • refactoring compose files and removing obsolete entries
  • add missing config path argument to services
  • updating relative docs

This PR is tightly related to recent fixes and improvements I am proposing in kernelci/kernelci-deploy.

@tollsimy tollsimy force-pushed the bugfix/containers branch 2 times, most recently from 06e042e to c9fc75f Compare April 3, 2025 09:03
Copy link
Member

@nuclearcat nuclearcat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How to reproduce bug this patch (rename timeout) is fixing?

@tollsimy
Copy link
Author

tollsimy commented Apr 3, 2025

Rename timeout patch fixes:

cycle detected at path: services.timeout
cycle detected at path: services.timeout

Environment:

docker --version
Docker version 28.0.2, build 0442a73
docker-compose --version
Docker Compose version v2.30.2

@tollsimy tollsimy force-pushed the bugfix/containers branch from c9fc75f to 5030564 Compare April 4, 2025 09:55
@tollsimy
Copy link
Author

tollsimy commented Apr 4, 2025

What about something like this? I removed obsolete Dockerfiles. I will open another PR in deployment to add missing container builds.

@nuclearcat
Copy link
Member

I will need to check if renaming will cause any issues with existing deployment scripts, it might take a bit of time, but likely will do this week.

@tollsimy tollsimy marked this pull request as draft April 10, 2025 11:37
@tollsimy
Copy link
Author

tollsimy commented Apr 10, 2025

Edit: apparently the cycle detected at path is a docker-compose bug related to yaml anchors. Does not happen if using the new docker compose command. Time to remove depreated docker-compose from local deployment scripts.
I will change and push new changes.

@tollsimy tollsimy changed the title Fix containers for local deployments Fix and improve local deployment Apr 10, 2025
@tollsimy tollsimy changed the title Fix and improve local deployment Fix and improve for local deployment Apr 10, 2025
@tollsimy tollsimy marked this pull request as ready for review April 10, 2025 16:51
@nuclearcat nuclearcat added the staging-skip Don't test automatically on staging.kernelci.org label Apr 14, 2025
@nuclearcat
Copy link
Member

Probably this PR causes breakage at staging script

ERROR: Version mismatch: file ./docker-compose.yaml specifies version 1 but extension file ./docker-compose-kcidb.yaml uses version 3.0

@tollsimy
Copy link
Author

tollsimy commented Apr 14, 2025

You're right @nuclearcat , since this PR merges pipeline-lava-callback, pipeline-kcidb and all other pipeline services in one compose file in 03042de, staging.kernelci.org in kernelci-deploy should be modified to execute only one compose.
Moreover, the version mismatch is due to the fact the the old (and deprecated) docker-compose is used.
Upgrading to the new docker compose plugin should remove that error (since "version" property is also deprecated).

Not sure but maybe upgrading to the new docker compose could also fix issue reported kernelci/kernelci-api#598 and #1045.

@tollsimy tollsimy force-pushed the bugfix/containers branch 3 times, most recently from d9882c7 to 24ea02e Compare April 15, 2025 10:40
@nuclearcat nuclearcat removed the staging-skip Don't test automatically on staging.kernelci.org label Apr 22, 2025
@nuclearcat
Copy link
Member

nuclearcat commented Apr 22, 2025

https://github.com/kernelci/kernelci-core/actions/runs/14589446598/job/40921114087


2025-04-22 07:43:10,999 - INFO - Checking PR 1102: `Fix and improve for local deployment` by tollsimy
2025-04-22 07:43:10,999 - INFO - Processing PR 1102
error: callback-requirements.txt: No such file or directory
error: patch failed: docker-compose.yaml:106
error: docker-compose.yaml: patch does not apply
error: patch failed: docker-compose.yaml:155
error: docker-compose.yaml: patch does not apply
2025-04-22 07:43:11,294 - ERROR - Failed to apply patch for PR 1102

@nuclearcat
Copy link
Member

I think you need to rebase your patch

@nuclearcat nuclearcat added the staging-skip Don't test automatically on staging.kernelci.org label Apr 22, 2025
@tollsimy tollsimy force-pushed the bugfix/containers branch from ce5cbf1 to 78a84a1 Compare April 23, 2025 07:28
@tollsimy
Copy link
Author

tollsimy commented Apr 23, 2025

Done, this should work, thanks @nuclearcat !

@tollsimy tollsimy force-pushed the bugfix/containers branch from 78a84a1 to 6253e95 Compare April 29, 2025 13:01
@tollsimy tollsimy force-pushed the bugfix/containers branch from 6253e95 to 5c449a1 Compare May 14, 2025 16:48
@tollsimy tollsimy force-pushed the bugfix/containers branch from 5c449a1 to fc674bb Compare May 14, 2025 16:59
@tollsimy
Copy link
Author

In order not to break staging this would require to change staging.kernelci.org in kernelci-deploy from:

cmd_api_pipeline() {
    compose_files="\
-f docker-compose.yaml \
-f docker-compose-kcidb.yaml \
-f docker-compose-lava.yaml \
-f docker-compose-production.yaml \
"

to

cmd_api_pipeline() {
    compose_files="\
-f docker-compose.yaml \
-f docker-compose-production.yaml \
"

@padovan
Copy link
Contributor

padovan commented Aug 14, 2025

Due to the changes required in the staging deployment, we are not able to afford merging these patches for the time being. So closing it.

@padovan padovan closed this Aug 14, 2025
@tollsimy
Copy link
Author

Thank you @padovan for your response.

I’d like to clarify that making this compatible with the current staging setup only requires negligible changes (in this case literally 2 LOC). During the work with RISC-V, we extensively tested this patch, as well as the other related but still unmerged changes, using our self-hosted deployment. The results were solid, and we shared them publicly as part of our efforts.

I completely understand the maintenance burden and am more than willing to help with any testing or fixes needed to move this forward. Please consider me available to assist with whatever is required.

As far as I know, KernelCI remains the only project that enables anyone in the community to run kernel testing pipelines without needing to set up and maintain custom internal infrastructure. These are significant technical and operational hurdles that discourage participation from many individual contributors and smaller organizations.

The changes we proposed are designed to make KernelCI more accessible and extend its utility to a wider audience, practically making it available for an AIO (but still non-production ready) deployment. It's disheartening to see hesitation in putting some effort into work that could significantly empower the community and lower the barrier to contributing high-quality kernel testing.

I truly hope there’s still an opportunity to revisit this.

Simone

@nuclearcat
Copy link
Member

Hi Simone,
Thanks for your detailed message and for your willingness to help - it's genuinely appreciated.
Let me explain why we're hesitant with this PR. The main issue isn't the 2 LOC you mentioned, but rather that the full changeset introduces multiple breaking changes that would require significant testing time we simply don't have right now.
We've learned the hard way that keeping changes small and incremental is crucial. When something breaks, we need to pinpoint the issue quickly. With bulk PRs, this becomes a nightmare - if the first commit breaks staging, I honestly just stop there rather than dig through everything.
Let me give you some concrete examples from your PR that illustrate the concerns:
Your commit 0e1455a says it removes obsolete Dockerfiles, but it actually changes the image registry from ghcr.io back to Docker Hub. We migrated to ghcr.io specifically because Docker Hub's pull limits were killing us during high-load testing. This single line would reintroduce that whole mess - when we hit those limits again, everything would grind to a halt.
The logger name changes in 136748f break our existing monitoring scripts. Sure, they might work fine in your setup, but we'd need to update and retest all our tooling.
The path changes from absolute to relative in several commits might seem harmless, but they introduce risks without clear benefits. For instance, not using shebangs is actually a deliberate choice - it's more reliable than hoping the executable bit is set correctly.
That volume mount you removed in beefb80? We use that for hot-patching staging without rebuilding Docker images. It's a debugging lifesaver for us.
That's not all, but what i was able to look quickly, as i said PR is a bit too big.

Here's what would really help:

Break this into individual PRs, one logical change per PR. Start with the most critical fixes.
Before making operational changes (like switching registries or removing mounts), let's discuss them. You might not be aware of all the operational context - Docker Hub limits, debugging workflows, etc.
Focus on changes that benefit the broader project. If something is specific to one deployment, it needs to be weighed against the testing burden on the team.

I get that KernelCI's accessibility is important - we want that too! But we need to balance that with keeping the existing infrastructure stable. Small, well-discussed changes are much more likely to land than large, surprising ones.
Would you be up for breaking this down and starting with the most critical pieces? Happy to discuss which changes would be most valuable to prioritize.

@tollsimy
Copy link
Author

Thanks @nuclearcat for the response! I really appreciate the feedback!
I'll try to do my best and get back to you as soon as I have something ready.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
staging-skip Don't test automatically on staging.kernelci.org
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants