Commit 273ef77

Added reproducibility instructions
1 parent 27a2388 commit 273ef77

File tree: 3 files changed (+120 −5 lines)

EXP_CONFIGS.md (+73)

# Experiment Configurations for ECRec

## Keys for Evaluation

Note: ECRM, an earlier name of our system, refers to ECRec in this document.

- Modes of running the DLRM training system. These are mostly controlled by which branch to run.
  - XDL: the original XDL system with no fault tolerance. This is the `master` branch.
  - ECRM-kX: ECRM with the parameter k set to X. This is the version of ECRM described in the paper, including 2PC, neural network replication, etc. This is the `ecrec` branch. The parameter `k` is controlled by `PARITY_K` in [`xdl/ps-plus/ps-plus/common/base_parity_utils.h`](xdl/ps-plus/ps-plus/common/base_parity_utils.h). You will need to change `PARITY_N` at the same time.
  - Ckpt-X: XDL with checkpointing, where checkpoints are written every X minutes. This is the `master` branch. The checkpoint frequency is specified in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py).
- DLRMs on which to train. These are controlled by the variable settings in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py). The x and y in Criteo-xS-yD are controlled by the 4th and 5th arguments to the function `xdl.embedding` in these files. Note that we have other scripts available in [`xdl/examples/criteo`](xdl/examples/criteo).
  - Criteo: the original Criteo DLRM
  - Criteo-2S: as described in the paper, with 2x the number of embedding table entries
  - Criteo-2S-2D: as described in the paper, with 2x the number of embedding table entries and each entry twice as wide (e.g., 64-wide dense vectors rather than 32)
    - Important note: as described in the paper, a different instance type is used when running this setup
- Batch size is 2048. This is set in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py).

## Evaluations

- Normal mode 1
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S]
    - For each of [XDL, Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4]
      - Run the DLRM for 2 hours. Keep all of the logs
  - At the end of this we will be able to plot Figures 5, 6, 7, 8, and 9
- Recovery mode 1
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S]
    - For each of [Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4]
      - Run the DLRM normally for 15 minutes, trigger a failure, and let each mode run for 60 additional minutes
  - At the end of this we want to be able to plot Figures 10, 11, and 12
- Normal mode 4: effect of NN replication, 2PC, etc.
  - The goal here is to determine how much slower features like NN replication and 2PC make ECRM
  - New keys for this section (with k=4):
    - ECRM-kX: as described above
    - ECRM-kX-minusNNRep: same as ECRM-kX, but without replicating NN params
      - Another way of saying this is "without optimizer state", because then we can recover the "approximate" NN from one of the workers
    - [Maybe don't do this one] ECRM-kX-minus2PC: same as ECRM-kX, but without using 2PC
    - ECRM-kX-minusNNRep-minus2PC: same as ECRM-kX, but without replicating NN params and without using 2PC
      - I think this is the same as Kaige's original branch
  - For each mode in [ECRM-k4, ECRM-k4-minusNNRep, ECRM-k4-minusNNRep-minus2PC (Kaige's branch)]
    - For each DLRM in [Criteo-8S, Criteo-4S, Criteo-2S, Criteo, Criteo-2S-2D]
      - Run in normal mode. Save all the logs
  - Here, we will want to compare the average throughput across each mode
- Recovery mode 2: effect of lock granularity
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo-8S, Criteo] (later do Criteo-4S, Criteo-2S, Criteo-2S-2D)
    - For each mode in [ECRM-k4]
      - For num_locks in [1, 10]
        - Trigger a failure, keep all the logs
        - Measure how long it took to fully recover
  - At the end of this we want to compare how long it takes to fully recover from a failure with 1 lock vs. with 10 locks
- Normal mode 2: effect of limited resources
  - Which DLRM we choose to run here depends on the results we see from Normal mode 1
  - For each mode in [ECRM-k4]
    - For each DLRM in [Criteo-8S, Criteo]
      - For each server instance type in [x1e.2xlarge, r5.8xlarge (note that this is different from r5n.8xlarge)]
        - Run the DLRM. Keep all of the logs
  - At the end of this, we want to compare the average throughput to that when running on the original instances we considered
- Normal mode 3: effect of an increased number of workers
  - For num_workers in [5, 10, 15, 20, 25]
    - For each DLRM in [Criteo-8S]
      - For each mode in [XDL, ECRM-k4]
        - Run the DLRM in normal mode. Save all the logs
  - Normal mode 3.1
    - For num_workers in [5, 10, 15, 20, 25]
      - For each DLRM in [Criteo-4S, Criteo-2S, Criteo, Criteo-2S-2D]
        - For each mode in [XDL, ECRM-k4]
          - Run the DLRM in normal mode. Save all the logs
  - We want to plot something like Figure 9 from this
  - Note that we don't need to rerun checkpointing for this, as the "steady-state" overhead from checkpointing can be computed from the time it takes to write a checkpoint, the throughput of normal XDL, and the checkpointing frequency.
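To make the last note concrete, here is a minimal sketch of that steady-state overhead estimate. All numeric values are illustrative assumptions, not measured results, and the model (training fully pauses while a checkpoint is written) is itself an assumption:

```python
# Sketch of the "steady-state" checkpointing-overhead estimate described above.
# All numbers below are illustrative assumptions, not measured values.
ckpt_write_min = 5.0       # assumed time to write one checkpoint (minutes)
ckpt_period_min = 30.0     # Ckpt-30: a checkpoint every 30 minutes
xdl_throughput = 10_000.0  # assumed normal-XDL throughput (samples/sec)

# Assuming training pauses while a checkpoint is written, the effective
# throughput is scaled by the fraction of time actually spent training.
training_fraction = 1.0 - ckpt_write_min / ckpt_period_min
effective_throughput = xdl_throughput * training_fraction
print(round(effective_throughput, 1))  # 10000 * (25/30) = 8333.3
```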

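As a sanity check on the experiment count, the nested sweeps above can be enumerated programmatically. This hypothetical helper (not part of the repo) lists the (DLRM, mode) pairs for Normal mode 1, each corresponding to one 2-hour run:

```python
from itertools import product

# Hypothetical helper (not part of the repo): enumerate the "Normal mode 1"
# sweep as (DLRM, mode) pairs, one per 2-hour training run.
DLRMS = ["Criteo", "Criteo-2S", "Criteo-2S-2D", "Criteo-4S", "Criteo-8S"]
MODES = ["XDL", "Ckpt-30", "Ckpt-60", "ECRM-k2", "ECRM-k4"]

runs = list(product(DLRMS, MODES))
print(len(runs))  # 5 DLRMs x 5 modes = 25 runs
```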
README.md (+18 −5)

## Bulk Run (Experiments)

XDL/ECRec is designed to run in a distributed fashion on a set of hosts over a network. To enable repeatable and reproducible experiments, we provide a reference experiment-launching program that spawns ECRec clusters on AWS and trains on them with simple commands. The program can be found in this repo at [`launch_exp.py`](launch_exp.py). You will need to fill in AWS EC2 keypair and GitHub credential information in the program script. Additional useful variables to modify are described below.

Common usage includes:

* Spawn a cluster: `python launch_exp.py init <branch> <num_workers>`. This launches and initializes AWS EC2 instances with the environment and Docker images.
  - The `INSTANCE_SPECS` variable in the script specifies the number of servers/workers and the corresponding EC2 instance types, following the current tuple format.
  - `S3_FILES`: an array of files used as training data. Each file corresponds to one worker, so the array size must be >= the number of workers.
  - `PS_MEMORY_MB`: the server memory restriction.
  - `PS_NUM_CORES`: the number of server cores.
* Launch experiments: `python launch_exp.py run <branch> <num_workers>`. This kills existing Docker containers and runs the scheduler and all servers and workers.
  - `TRAINING_FILE`: which Python file in the directory to use for training.
* Run workers only: `python launch_exp.py run_workers_only <branch> <num_workers>`. This kills the workers' existing Docker containers and reruns the workers, keeping the scheduler and servers alive.
  - This is useful when conducting recovery experiments, where you would want to SSH into servers and rerun the workers with this script.
7986
To trigger recovery, SSH into a PS host and kill and rerun its Docker container.

Throughput metrics will be logged to the path specified by the `OUTPUT_DIR` variable in the experiment-launching program. Refer to line 15 of [`criteo_training.py`](xdl/examples/criteo/criteo_training.py) to understand the numbers in the logged tuple. We provide a simple script, [`parse_results.py`](parse_results.py), to aggregate the throughput metrics across all hosts.
## Reproducing Results in Paper

We ran a multitude of experiments to obtain the results presented in the paper. Given the intricate nature of fault tolerance in distributed training, we regret that we cannot offer an end-to-end script to execute all experiments and generate the results. However, to ensure reproducibility, we have provided a comprehensive [list of configurations](EXP_CONFIGS.md) used in our experiments.

To replicate our results, please refer to the [Bulk Run (Experiments)](#bulk-run-experiments) section and perform a bulk run for each specified configuration. By following these instructions, you should be able to obtain results consistent with those reported in the paper.

## Support
We graciously acknowledge support from the National Science Foundation (NSF) in the form of a Graduate Research Fellowship (DGE-1745016 and

parse_results.py (+29)

import ast
import os


def parse(path, max_step=100):
    """Sum per-step throughput counts across all worker log files under `path`."""
    # Collect only the worker logs produced by the experiment launcher.
    _dirpath, _dirnames, names = next(os.walk(path))
    files = [f'{path}/{f}' for f in names if f.startswith('exp_init_run_worker')]
    tot = [0] * max_step
    for log in files:
        records = []
        with open(log, 'r') as f:
            for line in f.read().splitlines():
                # Metric lines are logged as literal tuples: (step, diff, time, acc).
                if line.startswith('('):
                    try:
                        # literal_eval only parses literals, so it is safer than eval().
                        records.append(ast.literal_eval(line))
                    except (ValueError, SyntaxError):
                        continue
        for step, diff, _time, _acc in records:
            if step >= max_step:
                break
            tot[step] += diff
    return tot


os.makedirs('results', exist_ok=True)

tup = parse('/path/to/your/results/dir')
with open('results/out.txt', 'w') as f:
    f.write(str(tup))
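As a usage note on the log format assumed by `parse`, each metric line is a literal Python tuple of the form `(step, diff, time, acc)`; a minimal sketch of parsing one such line safely (the concrete numbers are made up):

```python
import ast

# Each metric line in a worker log is assumed to be a literal tuple of the
# form (step, diff, time, acc); ast.literal_eval parses it without the
# arbitrary-code-execution risk of eval().
line = "(3, 2048, 12.5, 0.79)"  # made-up example line
step, diff, elapsed, acc = ast.literal_eval(line)
print(step, diff)  # 3 2048
```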
