GitHub - lucasjliu/task-scheduler: A fault tolerant distributed task scheduler simulation

This is a fault tolerant distributed task scheduler simulation.

Design

Problem Statement

Tasks are created and stored in a persistent task pool. A master gets tasks from it and assign to workers. Each time a task runs, the worker will just sleep for the given sleeptime seconds, after than finish. Each worker can work on one task at a time, and has no access to the task pool. For fault tolerance, a simulation result is assumed to be correct only if:

All tasks eventually finish and are logged
Killing the master does not affect running tasks
Killing a worker will kill tasks running on it
Killed tasks will be restarted from scratch
Master is able to find out worker failures

Scheduling

For each task, keep waiting until one of the available workers is picked randomly (simple load balance) to serve it.

Scalability

There are two optional types of worker: short-lived and persistent. A short-lived worker has no persistent storage so will be considered a new one upon each restart; a persistent worker stores running status and recovers from last failure. When a worker is started, it first try to recover (i.e. re-sending pending messages), and then register a unique identifier (UID) from target masters which it will receive tasks from. So a worker can serve multiple masters, and can be registered by simply sending a request to the masters.

Fault Tolenrance

There are two possible ways to handle worker failure: master send heartbeat message to workers regularly so is able to know the failure; otherwise if the worker is up before next heartbeat, it will notify master itself. Worker can know previous unfinished tasks by always keeping on-going tasks in persistence, if persistence storage is available. For either handling the failed tasks are put back to the task pool.

For master failure, when master is restarted previous tasks will be either reported by the worker or considered failed if the worker is also down. Here a complicated situation may happen: master is killed, task-0 is done, worker A fails to notify master, and later A is also killed, then master is up and restarts task-0 (master does not wait for A because A could be down forever). After this when A is up, if A is short-lived the task will be just restarted; otherwise A should send the task-0 success message to master. To distinguish these two task-0, current_tasks is maintained in persistence by master in order to accept the on-going task-0 only.

Architecture

Data persistence is handled by MongoDB. HTTP is chosen to be the protocol for the communication among nodes, and the interfaces are design in a RESTful manner. A Flask server is set up within each module to serve the HTTP requests. Following is the modules and their interfaces design.

TaskPool: keeps tasks in MongoDB, accessed by master.

/get: to get a task from the list

/put: to put a task

/update: to update status of a task: running, failure, or sucess

Master: for monitoring and tasks assigning.

/notify: for worker to notify master a success/failed task

/register: get a new worker and assign a UID for it

Worker: nodes that do the tasks.

/ping: to determine if the node is down, and let the node notify master any previous task status. Note that this interface does not return the status to avoid race condition with \notify.

/doTask: assign a task by master

Documentation

./docker
    dockerfile for the image
./logs
    master: log file of master.py
./scheduler
    common.py: common utilities
    master.py: entry point for master node
    taskpool.py: entry point for taskpool node
    worker.py: entry point for worker node
./tests
    echo.py: a simple http application to verify essential libraries are installed

Usage Instructions

Nodes are run in docker containers. All following command are peformed in project directory.

Use setup_nodes.sh to build the docker image and start containers for: a master, a taskpool, three workers
Run mongodb service, set /etc/hosts according to the comment in setup_nodes.sh
(Optionl) Try ./tests/echo.py, no error message should appear
Start the taskpool: ./scheduler/taskpool.py
Start the master: ./scheduler/master.py
Start the workers with hostname and port for falsk serving, and a list of master hosts as the parameters:

./scheduler/worker.py worker1 5000 master:5000
./scheduler/worker.py worker2 5000 master:5000
./scheduler/worker.py worker3 5000 master:5000

Kill master or workers with Ctrl-C and restart it. The failed tasks should appear in the bottom of the log file.

Sample Result

log_simple_10_tasks is a short piece of example running 10 tasks. Two parallel workers were scheduled correctly. And the failed task-5 due to worker1 was handled.

log_3_workers_100_tasks is an example of 100 tasks with 3 workers. Here is what happened at running:

All workers were killed around line 15, and restarted one by one. Failed tasks 9, 11, 12 were handled (and appeared at the end of log file).
At line 40 master was killed and restarted and no error happened.
Master was killed again around line 55 with all workers also killed afterwards. However, worker1 was restarted before master is up, worker2 after master is up, and worker3 was never restarted. Worker1 did not issue a failure since it recovered by itself. Either failures issued by worker2 and 3 is properly handled.
At line 88 worker2 was killed before master was killed, the failure captured.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Design

Problem Statement

Scheduling

Scalability

Fault Tolenrance

Architecture

Documentation

Usage Instructions

Sample Result

About

Uh oh!

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 13 Commits
docker		docker
logs		logs
scheduler		scheduler
tests		tests
LICENSE		LICENSE
README.md		README.md
setup_nodes.sh		setup_nodes.sh

License

lucasjliu/task-scheduler

Folders and files

Latest commit

History

Repository files navigation

Design

Problem Statement

Scheduling

Scalability

Fault Tolenrance

Architecture

Documentation

Usage Instructions

Sample Result

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages