Megamerge #4464

Merged
merged 37 commits into from
May 5, 2023
37 commits
8155e0a
Update version_template.py
DailyDreaming Feb 3, 2023
bbb234c
Update version_template.py
DailyDreaming Feb 4, 2023
54bfe0b
Update with fix.
DailyDreaming Feb 4, 2023
1638a93
Catch 'Device or resource busy'
jfennick Apr 18, 2023
3cfa675
Catch FileNotFoundError
jfennick Apr 18, 2023
77e1c9a
added @retry to _readJobState
jfennick Apr 19, 2023
68d20a9
Catch FileNotFoundError
jfennick Apr 19, 2023
a8a45ec
Allow use of mesos roles (resolves #4455)
stephanaime Apr 26, 2023
8dd4099
Dont let --user-space-docker-cmd and --custom-net be mutually exclusi…
misterbrandonwalker Apr 28, 2023
c4d84a0
Update src/toil/cwl/cwltoil.py
misterbrandonwalker Apr 28, 2023
0646950
Update src/toil/cwl/cwltoil.py
misterbrandonwalker Apr 28, 2023
beaedd1
Change CWL internal job system to generic local job system
adamnovak Mar 20, 2023
8c20455
Make all WDL jobs except running or attempting to run a task local
adamnovak Mar 20, 2023
4df797e
Add a non-downloading file size implementation
adamnovak Mar 20, 2023
c7ff08c
Satisfy MyPy
adamnovak Mar 20, 2023
6270586
Eliminate the old stack and adopt a slightly more controlled successo…
adamnovak Mar 20, 2023
a50e3ef
Appease MyPy
adamnovak Mar 20, 2023
1b34313
Drop a lingering use of the old stack
adamnovak Mar 20, 2023
4915eab
Get mini tests to pass
adamnovak Mar 20, 2023
f8a5e7c
Get WDL workflow to start
adamnovak Mar 20, 2023
5fa9e4e
Add debugging to show that jobs are missing from the registry
adamnovak Mar 21, 2023
76c92ef
Fix ordering to actually look at the right job
adamnovak Mar 21, 2023
88e2e6e
Quiet debugging
adamnovak Mar 21, 2023
b7b4f68
Change log message
adamnovak Mar 21, 2023
d5a2211
Stop making unwanted changes
adamnovak May 2, 2023
f048e59
Remove redundant and wrongly-typed check
adamnovak May 2, 2023
5cd7702
Get docker-compose from apt instead of pip where it fights Toil deps
adamnovak May 4, 2023
f956b44
Ban urllib3 2.0+ until Docker module supports it
adamnovak May 4, 2023
894d0f6
Use docker-compose in the prebake
adamnovak May 4, 2023
4b824bb
Move Slurm tests to earlier stage
adamnovak May 4, 2023
7fc9b7f
Account for how docker compose plugin names containers
adamnovak May 4, 2023
83db89c
Revert "Move Slurm tests to earlier stage"
adamnovak May 4, 2023
1cf6cdc
Merge branch 'fix_docker_network' of https://github.com/misterbrandon…
adamnovak May 4, 2023
ca784fc
Merge branch 'robust_rmtree' of https://github.com/jfennick/toil into…
adamnovak May 4, 2023
8480abb
Merge branch 'issues/4455-support-mesos-roles' of https://github.com/…
adamnovak May 4, 2023
13f17d4
Undo unwanted version change
adamnovak May 4, 2023
781c361
Document new Mesos options and don't apply sort and high always
adamnovak May 4, 2023
2 changes: 1 addition & 1 deletion .gitlab-ci.yml
@@ -164,7 +164,7 @@ slurm_test:
   script:
     - pwd
     - cd contrib/slurm-test/
-    - pip install docker-compose
+    - docker compose version
     - ./slurm_test.sh

 cwl_v1.0:
1 change: 1 addition & 0 deletions contrib/slurm-test/docker-compose.yml
@@ -1,3 +1,4 @@
+# This is a v3 compose file
 services:
   slurmmaster:
     image: rancavil/slurm-master:19.05.5-1
31 changes: 18 additions & 13 deletions contrib/slurm-test/slurm_test.sh
@@ -1,22 +1,27 @@
 #!/bin/bash
 set -e
-docker-compose up -d
+# With the docker compose plugin, containers are named like slurm-test-slurmmaster-1
+# If your containers are named like ${LEADER} you have the old docker-compose Python version instead.
+# Try running with NAME_SEP=_
+NAME_SEP=${CONTAINER_NAME_SEP:--}
+LEADER="slurm-test${NAME_SEP}slurmmaster${NAME_SEP}1"
+docker compose up -d
 docker ps
-docker cp toil_workflow.py slurm-test_slurmmaster_1:/home/admin
-docker cp -L sort.py slurm-test_slurmmaster_1:/home/admin
-docker cp fileToSort.txt slurm-test_slurmmaster_1:/home/admin
-docker cp toil_workflow.py slurm-test_slurmmaster_1:/home/admin
+docker cp toil_workflow.py ${LEADER}:/home/admin
+docker cp -L sort.py ${LEADER}:/home/admin
+docker cp fileToSort.txt ${LEADER}:/home/admin
+docker cp toil_workflow.py ${LEADER}:/home/admin
 GIT_COMMIT=$(git rev-parse HEAD)
-docker exec slurm-test_slurmmaster_1 sudo apt install python3-pip -y
-docker exec slurm-test_slurmmaster_1 pip3 install "git+https://github.com/DataBiosphere/toil.git@${GIT_COMMIT}"
-docker exec slurm-test_slurmmaster_1 sinfo -N -l
+docker exec ${LEADER} sudo apt install python3-pip -y
+docker exec ${LEADER} pip3 install "git+https://github.com/DataBiosphere/toil.git@${GIT_COMMIT}"
+docker exec ${LEADER} sinfo -N -l
 # Test 1: A really basic workflow to check Slurm is working correctly
-docker exec slurm-test_slurmmaster_1 python3 /home/admin/toil_workflow.py file:my-job-store --batchSystem slurm --disableCaching --retryCount 0
-docker cp slurm-test_slurmmaster_1:/home/admin/output.txt output_Docker.txt
+docker exec ${LEADER} python3 /home/admin/toil_workflow.py file:my-job-store --batchSystem slurm --disableCaching --retryCount 0
+docker cp ${LEADER}:/home/admin/output.txt output_Docker.txt
 # Test 2: Make sure that "sort" workflow runs under slurm
-docker exec slurm-test_slurmmaster_1 python3 /home/admin/sort.py file:my-job-store --batchSystem slurm --disableCaching --retryCount 0
-docker cp slurm-test_slurmmaster_1:/home/admin/sortedFile.txt sortedFile.txt
-docker-compose stop
+docker exec ${LEADER} python3 /home/admin/sort.py file:my-job-store --batchSystem slurm --disableCaching --retryCount 0
+docker cp ${LEADER}:/home/admin/sortedFile.txt sortedFile.txt
+docker compose stop
 ./check_out.sh
 rm sort.py
 echo "Sucessfully ran workflow on slurm cluster"
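To make the naming difference concrete, the test can be pointed at either container naming scheme; these are illustrative invocations only, and CONTAINER_NAME_SEP is the single override the script reads:

# docker compose v2 plugin (default): containers are named slurm-test-slurmmaster-1
./slurm_test.sh

# legacy Python docker-compose: containers are named slurm-test_slurmmaster_1
CONTAINER_NAME_SEP=_ ./slurm_test.sh
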
3 changes: 2 additions & 1 deletion contrib/toil-ci-prebake/Dockerfile
@@ -18,7 +18,8 @@ ENV DEBIAN_FRONTEND=noninteractive

 RUN mkdir -p ~/.docker/cli-plugins/
 RUN curl -L https://github.com/docker/buildx/releases/download/v0.6.3/buildx-v0.6.3.linux-amd64 > ~/.docker/cli-plugins/docker-buildx
-RUN chmod u+x ~/.docker/cli-plugins/docker-buildx
+RUN curl -L https://github.com/docker/compose/releases/download/v2.17.2/docker-compose-linux-x86_64 > ~/.docker/cli-plugins/docker-compose
+RUN chmod u+x ~/.docker/cli-plugins/*

 RUN apt-get -q -y update && \
     apt-get -q -y upgrade && \
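As a quick sanity check on the prebaked image, both binaries dropped into ~/.docker/cli-plugins/ should resolve as docker subcommands; these verification commands are not part of the diff, just an assumed way to confirm the plugins are picked up:

docker buildx version
docker compose version
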
6 changes: 6 additions & 0 deletions docs/running/cliOptions.rst
@@ -186,6 +186,12 @@ levels in toil are based on priority from the logging module:
   --mesosEndpoint MESOSENDPOINT
                         The host and port of the Mesos server separated by a
                         colon. (default: <leader IP>:5050)
+  --mesosFrameworkId MESOSFRAMEWORKID
+                        Use a specific Mesos framework ID.
+  --mesosRole MESOSROLE
+                        Use a Mesos role.
+  --mesosName MESOSNAME
+                        The Mesos name to use. (default: toil)
   --kubernetesHostPath KUBERNETES_HOST_PATH
                         Path on Kubernetes hosts to use as shared inter-pod temp
                         directory.
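A hypothetical command line combining the new Mesos flags; the workflow script, endpoint address, and role name here are invented for illustration:

python3 my_workflow.py file:my-job-store \
    --batchSystem mesos \
    --mesosEndpoint 10.1.2.3:5050 \
    --mesosRole analytics \
    --mesosName my-toil-run
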
2 changes: 2 additions & 0 deletions requirements.txt
@@ -1,6 +1,8 @@
 dill>=0.3.2, <0.4
 requests>=2, <3
 docker>=3.7.2, <6
+# Work around https://github.com/docker/docker-py/issues/3113
+urllib3>=1.26.0, <2.0.0
 python-dateutil
 psutil >= 3.0.1, <6
 py-tes>=0.4.2,<1
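A minimal sketch of checking that an installed environment honours the new pin; this is not part of the PR and only assumes urllib3 is importable:

import urllib3

# docker-py releases below 6.x break under urllib3 2.x (docker-py issue 3113),
# hence the <2.0.0 pin in requirements.txt.
major = int(urllib3.__version__.split(".")[0])
assert major < 2, f"urllib3 {urllib3.__version__} is too new for the pinned docker module"
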
7 changes: 3 additions & 4 deletions src/toil/batchSystems/local_support.py
@@ -18,7 +18,6 @@
                                                    UpdatedBatchJobInfo)
 from toil.batchSystems.singleMachine import SingleMachineBatchSystem
 from toil.common import Config
-from toil.cwl.utils import CWL_INTERNAL_JOBS
 from toil.job import JobDescription

 logger = logging.getLogger(__name__)
@@ -40,11 +39,11 @@ def handleLocalJob(self, jobDesc: JobDescription) -> Optional[int]:
         Returns the jobID if the jobDesc has been submitted to the local queue,
         otherwise returns None
         """
-        if (not self.config.runCwlInternalJobsOnWorkers
-                and jobDesc.jobName.startswith(CWL_INTERNAL_JOBS)):
+        if (not self.config.run_local_jobs_on_workers
+                and jobDesc.local):
             # Since singleMachine.py doesn't typecheck yet and MyPy is ignoring
             # it, it will raise errors here unless we add type annotations to
-            # everything we get back from it. THe easiest way to do that seems
+            # everything we get back from it. The easiest way to do that seems
             # to be to put it in a variable with a type annotation on it. That
             # somehow doesn't error whereas just returning the value complains
             # we're returning an Any. TODO: When singleMachine.py typechecks,
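A condensed sketch of the resulting dispatch logic, paraphrased from the diff; it assumes self.localBatch is the SingleMachineBatchSystem held by BatchSystemLocalSupport, as in the surrounding file:

def handleLocalJob(self, jobDesc: JobDescription) -> Optional[int]:
    # Jobs now self-identify via JobDescription.local instead of being matched
    # against a hard-coded tuple of CWL job name prefixes.
    if not self.config.run_local_jobs_on_workers and jobDesc.local:
        local_id: int = self.localBatch.issueBatchJob(jobDesc)
        return local_id
    return None
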
24 changes: 21 additions & 3 deletions src/toil/batchSystems/mesos/batchSystem.py
@@ -94,6 +94,11 @@ def __init__(self, config, maxCores, maxMemory, maxDisk):

         # Address of the Mesos master in the form host:port where host can be an IP or a hostname
         self.mesos_endpoint = config.mesos_endpoint
+        if config.mesos_role is not None:
+            self.mesos_role = config.mesos_role
+        self.mesos_name = config.mesos_name
+        if config.mesos_framework_id is not None:
+            self.mesos_framework_id = config.mesos_framework_id

         # Written to when Mesos kills tasks, as directed by Toil.
         # Jobs must not enter this set until they are removed from runningJobMap.
@@ -160,7 +165,7 @@ def __init__(self, config, maxCores, maxMemory, maxDisk):

         self.ignoredNodes = set()

-        self._startDriver()
+        self._startDriver(config)

     def setUserScript(self, userScript):
         self.userScript = userScript
@@ -310,14 +315,18 @@ def _buildExecutor(self):
         info.source = pwd.getpwuid(os.getuid()).pw_name
         return info

-    def _startDriver(self):
+    def _startDriver(self, config):
         """
         The Mesos driver thread which handles the scheduler's communication with the Mesos master
         """
         framework = addict.Dict()
         framework.user = get_user_name()  # We must determine the user name ourselves with pymesos
-        framework.name = "toil"
+        framework.name = config.mesos_name
         framework.principal = framework.name
+        if config.mesos_role is not None:
+            framework.roles = config.mesos_role
+            framework.capabilities = [dict(type='MULTI_ROLE')]
+
         # Make the driver which implements most of the scheduler logic and calls back to us for the user-defined parts.
         # Make sure it will call us with nice namespace-y addicts
         self.driver = MesosSchedulerDriver(self, framework,
@@ -839,8 +848,17 @@ def get_default_mesos_endpoint(cls) -> str:
     def add_options(cls, parser: Union[ArgumentParser, _ArgumentGroup]) -> None:
         parser.add_argument("--mesosEndpoint", "--mesosMaster", dest="mesos_endpoint", default=cls.get_default_mesos_endpoint(),
                             help="The host and port of the Mesos master separated by colon. (default: %(default)s)")
+        parser.add_argument("--mesosFrameworkId", dest="mesos_framework_id",
+                            help="Use a specific Mesos framework ID.")
+        parser.add_argument("--mesosRole", dest="mesos_role",
+                            help="Use a Mesos role.")
+        parser.add_argument("--mesosName", dest="mesos_name", default="toil",
+                            help="The Mesos name to use. (default: %(default)s)")

     @classmethod
     def setOptions(cls, setOption: OptionSetter):
         setOption("mesos_endpoint", None, None, cls.get_default_mesos_endpoint(), old_names=["mesosMasterAddress"])
+        setOption("mesos_name", None, None, "toil")
+        setOption("mesos_role")
+        setOption("mesos_framework_id")

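A condensed sketch of how the new options reach Mesos at framework registration time, in the pymesos/addict style the diff uses; build_framework is an invented helper name, not a function in the PR:

import addict

def build_framework(config, user_name: str) -> addict.Dict:
    framework = addict.Dict()
    framework.user = user_name
    framework.name = config.mesos_name              # --mesosName, defaults to "toil"
    framework.principal = framework.name
    if config.mesos_role is not None:               # --mesosRole
        framework.roles = config.mesos_role
        framework.capabilities = [dict(type='MULTI_ROLE')]
    return framework
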
11 changes: 6 additions & 5 deletions src/toil/batchSystems/options.py
@@ -74,7 +74,7 @@ def set_batchsystem_options(batch_system: Optional[str], set_option: OptionSetter) -> None:
     set_option("coalesceStatusCalls")
     set_option("maxLocalJobs", int)
     set_option("manualMemArgs")
-    set_option("runCwlInternalJobsOnWorkers", bool, default=False)
+    set_option("run_local_jobs_on_workers", bool, default=False)
     set_option("statePollingWait")
     set_option("batch_logs_dir", env=["TOIL_BATCH_LOGS_DIR"])

@@ -124,13 +124,14 @@ def add_all_batchsystem_options(parser: Union[ArgumentParser, _ArgumentGroup]) -> None:
              "Requires that TOIL_GRIDGENGINE_ARGS be set.",
     )
     parser.add_argument(
+        "--runLocalJobsOnWorkers"
         "--runCwlInternalJobsOnWorkers",
-        dest="runCwlInternalJobsOnWorkers",
+        dest="run_local_jobs_on_workers",
         action="store_true",
         default=None,
-        help="Whether to run CWL internal jobs (e.g. CWLScatter) on the worker nodes "
-        "instead of the primary node. If false (default), then all such jobs are run on "
-        "the primary node. Setting this to true can speed up the pipeline for very large "
+        help="Whether to run jobs marked as local (e.g. CWLScatter) on the worker nodes "
+        "instead of the leader node. If false (default), then all such jobs are run on "
+        "the leader node. Setting this to true can speed up CWL pipelines for very large "
         "workflows with many sub-workflows and/or scatters, provided that the worker "
        "pool is large enough.",
     )
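For callers using the Python API rather than the command line, the renamed destination can be set on the options namespace; a hypothetical example, with the attribute name taken from the new dest above:

from toil.job import Job

options = Job.Runner.getDefaultOptions("file:my-job-store")
options.run_local_jobs_on_workers = True  # renamed from runCwlInternalJobsOnWorkers
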
3 changes: 1 addition & 2 deletions src/toil/common.py
@@ -109,7 +109,7 @@ class Config:
     logRotating: bool
     cleanWorkDir: str
     maxLocalJobs: int
-    runCwlInternalJobsOnWorkers: bool
+    run_local_jobs_on_workers: bool
     tes_endpoint: str
     tes_user: str
     tes_password: str
@@ -438,7 +438,6 @@ def check_nodestoreage_overrides(overrides: List[str]) -> bool:
         set_option("environment", parseSetEnv)
         set_option("disableChaining")
         set_option("disableJobStoreChecksumVerification")
-        set_option("runCwlInternalJobsOnWorkers")
         set_option("statusWait", int)
         set_option("disableProgress")

19 changes: 13 additions & 6 deletions src/toil/cwl/cwltoil.py
@@ -121,6 +121,7 @@
 from toil.version import baseVersion
 from toil.exceptions import FailedJobsException

+
 logger = logging.getLogger(__name__)

 # Find the default temporary directory
@@ -1863,6 +1864,7 @@ def __init__(
         tool_id: Optional[str] = None,
         parent_name: Optional[str] = None,
         subjob_name: Optional[str] = None,
+        local: Optional[bool] = None,
     ) -> None:
         """
         Make a new job and set up its requirements and naming.
@@ -1905,6 +1907,7 @@ def __init__(
             accelerators=accelerators,
             unitName=unit_name,
             displayName=display_name,
+            local=local,
         )


@@ -1918,7 +1921,7 @@ class ResolveIndirect(CWLNamedJob):

     def __init__(self, cwljob: Promised[CWLObjectType], parent_name: Optional[str] = None):
         """Store the dictionary of promises for later resolution."""
-        super().__init__(parent_name=parent_name, subjob_name="_resolve")
+        super().__init__(parent_name=parent_name, subjob_name="_resolve", local=True)
         self.cwljob = cwljob

     def run(self, file_store: AbstractFileStore) -> CWLObjectType:
@@ -2094,7 +2097,10 @@ def __init__(
     ):
         """Store our context for later evaluation."""
         super().__init__(
-            tool_id=tool.tool.get("id"), parent_name=parent_name, subjob_name="_wrapper"
+            tool_id=tool.tool.get("id"),
+            parent_name=parent_name,
+            subjob_name="_wrapper",
+            local=True,
         )
         self.cwltool = remove_pickle_problems(tool)
         self.cwljob = cwljob
@@ -2482,7 +2488,7 @@ def __init__(
         conditional: Union[Conditional, None],
     ):
         """Store our context for later execution."""
-        super().__init__(cores=1, memory="1GiB", disk="1MiB")
+        super().__init__(cores=1, memory="1GiB", disk="1MiB", local=True)
         self.step = step
         self.cwljob = cwljob
         self.runtime_context = runtime_context
@@ -2642,7 +2648,7 @@ def __init__(
         outputs: Promised[Union[CWLObjectType, List[CWLObjectType]]],
     ):
         """Collect our context for later gathering."""
-        super().__init__(cores=1, memory="1GiB", disk="1MiB")
+        super().__init__(cores=1, memory="1GiB", disk="1MiB", local=True)
         self.step = step
         self.outputs = outputs

@@ -2743,7 +2749,7 @@ def __init__(
         conditional: Union[Conditional, None] = None,
     ):
         """Gather our context for later execution."""
-        super().__init__(tool_id=cwlwf.tool.get("id"), parent_name=parent_name)
+        super().__init__(tool_id=cwlwf.tool.get("id"), parent_name=parent_name, local=True)
         self.cwlwf = cwlwf
         self.cwljob = cwljob
         self.runtime_context = runtime_context
@@ -3264,7 +3270,8 @@ def main(args: Optional[List[str]] = None, stdout: TextIO = sys.stdout) -> int:
         help="Do not delete Docker container used by jobs after they exit",
         dest="rm_container",
     )
-    dockergroup.add_argument(
+    extra_dockergroup = parser.add_argument_group()
+    extra_dockergroup.add_argument(
         "--custom-net",
         help="Specify docker network name to pass to docker run command",
     )
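A minimal sketch of the pattern the coordination jobs now follow, assuming CWLNamedJob forwards local= to the Job constructor as shown above; MyResolveLikeJob and its subjob name are invented for illustration:

class MyResolveLikeJob(CWLNamedJob):
    """Bookkeeping-only step that should stay on the leader unless
    --runLocalJobsOnWorkers is set."""

    def __init__(self, cwljob, parent_name=None):
        super().__init__(parent_name=parent_name, subjob_name="_resolve_like", local=True)
        self.cwljob = cwljob

    def run(self, file_store):
        # No real work; just pass the stored inputs along.
        return self.cwljob
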
10 changes: 0 additions & 10 deletions src/toil/cwl/utils.py
@@ -37,16 +37,6 @@

 # Customized CWL utilities

-# Define internal jobs we should avoid submitting to batch systems and logging
-CWL_INTERNAL_JOBS: Tuple[str, ...] = (
-    "CWLJobWrapper",
-    "CWLWorkflow",
-    "CWLScatter",
-    "CWLGather",
-    "ResolveIndirect",
-)
-
-
 # What exit code do we need to bail with if we or any of the local jobs that
 # parse workflow files see an unsupported feature?
 CWL_UNSUPPORTED_REQUIREMENT_EXIT_CODE = 33
2 changes: 2 additions & 0 deletions src/toil/fileStores/nonCachingFileStore.py
@@ -40,6 +40,7 @@
 from toil.lib.compatibility import deprecated
 from toil.lib.conversions import bytes2human
 from toil.lib.io import make_public_dir, robust_rmtree
+from toil.lib.retry import retry
 from toil.lib.threading import get_process_name, process_name_exists

 logger: logging.Logger = logging.getLogger(__name__)
@@ -301,6 +302,7 @@ def _getAllJobStates(cls, coordination_dir: str) -> Iterator[Dict[str, str]]:
             raise

     @staticmethod
+    @retry(errors=[OSError])
     def _readJobState(jobStateFileName: str) -> Dict[str, str]:
         with open(jobStateFileName, 'rb') as fH:
             state = dill.load(fH)
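For context, toil.lib.retry.retry re-invokes the wrapped callable when one of the listed error types is raised; a small self-contained illustration with an invented reader function (FileNotFoundError and "Device or resource busy" errors, which the commits above target, are both raised as OSError):

from toil.lib.retry import retry

@retry(errors=[OSError])
def read_job_state(path: str) -> bytes:
    # Retried if a concurrent cleanup removes or locks the file between
    # listing it and opening it (FileNotFoundError is a subclass of OSError).
    with open(path, 'rb') as f:
        return f.read()
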