Handle multiple sets of job logs for restarted jobs #552

@vitodb

Description

Some jobs configured with the AutoRelease feature try to copy back their logs each time they run, but the second log copy fails because ifdh cannot overwrite the existing file.

An example job is 67623825.0@jobsub01.fnal.gov
The job was part of POMS4_SUBMISSION_ID:1712364.
Fifebatch Events details show the job got held and released.
IFDH logs for the job show the log copy back failed the second time:

gfal-copy error: 17 (File exists) - Destination https://[redacted]/fermigrid/jobsub/jobs/2024_03_12/6f20c05e-8023-4248-966f-0233d5a3c089/fife_wrap2024_03_12_1822276f20c05e-8023-4248-966f-0233d5a3c089cluster.67623825.0.err exists and overwrite is not set

Kibana reports the job with exit code 0, whereas the stdout log shows:

executable was killed: exiting 1
Wed Mar 13 07:20:24 UTC 2024 fife_wrap COMPLETED with exit status 1

This is confusing.
It happens because the log is from the first time the job ran, while the job exit state in Kibana is possibly from the second time the job ran.

As discussed at the Jobsub weekly meeting, we could use the NumJobStarts ClassAd attribute, or something similar, as a suffix for the log filename. This would keep the logs from each run of a restarted job separate, so that all of them can be copied back and possibly made available to users.
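
A minimal sketch of the suffix idea, not the jobsub_lite implementation. The helper names `parse_num_job_starts` and `log_name_for_run` and the `.runN` naming scheme are hypothetical, chosen only for illustration. It assumes the job ClassAd is available as plain `Name = value` text; whether NumJobStarts already counts the current run at the moment the wrapper reads it would need to be verified.

```python
import re
from pathlib import PurePosixPath


def parse_num_job_starts(job_ad_text: str, default: int = 0) -> int:
    """Return the NumJobStarts value from ClassAd text in 'Name = value' form.

    Falls back to `default` when the attribute is absent.
    """
    match = re.search(
        r"^\s*NumJobStarts\s*=\s*(\d+)\s*$",
        job_ad_text,
        re.MULTILINE | re.IGNORECASE,  # ClassAd attribute names are case-insensitive
    )
    return int(match.group(1)) if match else default


def log_name_for_run(log_name: str, num_job_starts: int) -> str:
    """Insert a per-run suffix before the file extension.

    Hypothetical naming scheme: 'job.err' becomes 'job.run1.err'.
    """
    path = PurePosixPath(log_name)
    return str(path.with_name(f"{path.stem}.run{num_job_starts}{path.suffix}"))


if __name__ == "__main__":
    ad_text = "ClusterId = 67623825\nNumJobStarts = 1\n"
    run = parse_num_job_starts(ad_text)
    print(log_name_for_run("cluster.67623825.0.err", run))
    # prints: cluster.67623825.0.run1.err
```

With distinct destination names per run, the second copy would no longer collide with the file written by the first run, so the "File exists" error above should not occur and both sets of logs would be kept.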

Activity

added this to the 1.13 milestone on Jul 23, 2025
assigned and unassigned on Jul 23, 2025
Metadata

Labels

enhancement: New feature or request

Participants

@shreyb, @vitodb