V2 worker errors when clearing up data from a failed run #1080

Open
sambles opened this issue Jul 11, 2024 · 0 comments

Issue Description

The analysis appears to have failed after running out of memory (OOM). The clean-up that followed raised some unusual errors, which suggest that something is not being handled correctly.

  1. botocore.exceptions.ClientError: An error occurred (MalformedXML), possibly raised when deleting the temporarily stored data?
  2. find: ‘output’: No such file or directory, from the failed bash script
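
A possible angle on the first error, offered as a guess rather than a diagnosis: the logs do not show what the delete request contained, but S3 is known to reject a DeleteObjects call with MalformedXML when the object list is empty or holds more than 1000 keys. The sketch below shows one defensive way to build the requests. The helper names `build_delete_requests` and `delete_keys` are hypothetical and are not taken from `aws_storage.py`; only `bucket.delete_objects(Delete=...)` comes from the traceback.

```python
from typing import Iterable, Iterator

# Per-request key limit for the S3 DeleteObjects operation.
S3_DELETE_MAX_KEYS = 1000


def build_delete_requests(
    keys: Iterable[str], batch_size: int = S3_DELETE_MAX_KEYS
) -> Iterator[dict]:
    """Yield DeleteObjects payloads, never empty and never over batch_size keys."""
    batch = []
    for key in keys:
        batch.append({"Key": key})
        if len(batch) == batch_size:
            yield {"Objects": batch, "Quiet": True}
            batch = []
    if batch:  # a trailing partial batch; nothing is yielded for an empty input
        yield {"Objects": batch, "Quiet": True}


def delete_keys(bucket, keys: Iterable[str]) -> int:
    """Delete keys through a boto3-style Bucket resource (assumed interface).

    Returns the number of keys submitted for deletion.
    """
    submitted = 0
    for request in build_delete_requests(keys):
        bucket.delete_objects(Delete=request)
        submitted += len(request["Objects"])
    return submitted
```

With this shape, a directory that was never uploaded (for example because the run died before writing any output) results in no API call at all instead of an empty request.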

Version / Environment information

  • Worker 2.3.6

Example data / logs

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/worker/.local/lib/python3.10/site-packages/celery/utils/dispatch/signal.py", line 276, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/home/worker/src/model_execution_worker/distributed_tasks.py", line 1033, in handle_task_failure
    filestore.delete_dir(dir_remote_data)
  File "/home/worker/src/model_execution_worker/backends/aws_storage.py", line 347, in delete_dir
    rsp = self.bucket.delete_objects(Delete=del_request)
  File "/home/worker/.local/lib/python3.10/site-packages/boto3/resources/factory.py", line 581, in do_action
    response = action(self, *args, **kwargs)
  File "/home/worker/.local/lib/python3.10/site-packages/boto3/resources/action.py", line 88, in __call__
    response = getattr(parent.meta.client, operation_name)(*args, **params)
  File "/home/worker/.local/lib/python3.10/site-packages/botocore/client.py", line 565, in _api_call
    return self._make_api_call(operation_name, kwargs)
  File "/home/worker/.local/lib/python3.10/site-packages/botocore/client.py", line 1021, in _make_api_call
    raise error_class(parsed_response, operation_name)
botocore.exceptions.ClientError: An error occurred (MalformedXML) when calling the DeleteObjects operation: The XML you provided was not well-formed or did not validate against our published schema
[2024-07-11 00:18:34,411: ERROR/ForkPoolWorker-13] Task generate_losses_chunk[6ec3cac5-2219-4707-890b-d9dee361582c] raised unexpected: OasisException('Ktools run Error: non-zero exit code or error/warning messages detected in STDERR output.\nKilling all processes. To disable this automated check run with `--ktools-disable-guard`.\nLogs stored in: /tmp/run/analysis-1782_losses-b7c033e1f07d4b6c9572691fabcddd83/run-data/log/46')
[2024-07-11 00:18:32,151: INFO/ForkPoolWorker-6] generate_losses_chunk[452036cf-c668-474a-8fa1-2a3b4e3d019f]: WARNING: task requeue detected - retry 2                                                                                                         
[2024-07-11 00:18:32,177: INFO/ForkPoolWorker-6] RUNNING: oasislmf.manager.interface
[2024-07-11 00:18:32,179: INFO/ForkPoolWorker-6] Generated loss Chunk 63 of 64 in, /tmp/run/analysis-1781_losses-bce6b4453c234bafa7d047b69613e25e/run-data
[2024-07-11 00:18:32,179: INFO/ForkPoolWorker-6] RUNNING: oasislmf.execution.runner.run_analysis
find: ‘output’: No such file or directory
[2024-07-11 00:18:32,241: INFO/ForkPoolWorker-6] 
KTOOLS_STDERR:

[2024-07-11 00:18:32,241: INFO/ForkPoolWorker-6] 
[2024-07-11 00:18:32,241: ERROR/ForkPoolWorker-6] generate_losses_chunk[452036cf-c668-474a-8fa1-2a3b4e3d019f]: Error occured in 'loss_generation_task':
Traceback (most recent call last):
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/computation/generate/losses.py", line 520, in run 
    return model_runner_module.run_analysis(**bash_params)
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/utils/log.py", line 123, in wrapper
    result = func(*args, **kwargs)
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/execution/runner.py", line 119, in run_analysis
    bash_trace = subprocess.check_output(['bash', params['filename']]).decode('utf-8')
  File "/usr/lib/python3.10/subprocess.py", line 421, in check_output
    return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
  File "/usr/lib/python3.10/subprocess.py", line 526, in run 
    raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['bash', '/tmp/run/analysis-1781_losses-bce6b4453c234bafa7d047b69613e25e/run-data/63.run_analysis.sh']' died with <Signals.SIGKILL: 9>. 

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/worker/src/model_execution_worker/distributed_tasks.py", line 861, in run 
    return fn(self, params, *args, analysis_id=analysis_id, **kwargs)
  File "/home/worker/src/model_execution_worker/distributed_tasks.py", line 943, in generate_losses_chunk
    OasisManager().generate_losses_partial(**chunk_params)
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/utils/log.py", line 123, in wrapper
    result = func(*args, **kwargs)
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/manager.py", line 94, in interface
    return computation_cls(**kwargs).run()
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/computation/generate/losses.py", line 523, in run 
    self._print_error_logs(log_fp, e)
  File "/home/worker/.local/lib/python3.10/site-packages/oasislmf/computation/generate/losses.py", line 158, in _print_error_logs
    raise OasisException(
oasis_data_manager.errors.OasisException: Ktools run Error: non-zero exit code or error/warning messages detected in STDERR output.
Killing all processes. To disable this automated check run with `--ktools-disable-guard`.
Logs stored in: /tmp/run/analysis-1781_losses-bce6b4453c234bafa7d047b69613e25e/run-data/log/63
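
On the OOM side, the `CalledProcessError` above reports that the bash script died with SIGKILL, which is consistent with (though not proof of) the kernel OOM killer. A minimal sketch of how a caller could tell a signal death apart from an ordinary non-zero exit, using only the standard library, follows. `describe_failure` and `run_script` are hypothetical names for illustration and are not part of oasislmf or the worker.

```python
import signal
import subprocess


def describe_failure(exc: subprocess.CalledProcessError) -> str:
    """Summarise why a child process failed.

    On POSIX, a negative returncode means the child was terminated by
    the signal with that number.
    """
    if exc.returncode < 0:
        sig = signal.Signals(-exc.returncode)
        if sig is signal.SIGKILL:
            return "killed by SIGKILL (possibly the kernel OOM killer)"
        return f"terminated by {sig.name}"
    return f"exited with status {exc.returncode}"


def run_script(cmd: list) -> str:
    """Run cmd and return 'ok' or a short description of the failure."""
    try:
        subprocess.check_output(cmd)
    except subprocess.CalledProcessError as exc:
        return describe_failure(exc)
    return "ok"
```

A failure handler could use this kind of check to decide whether output such as the `output` directory can be expected to exist before trying to clean it up.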
