Commit 273ef77

Added reproducibility instructions
1 parent 27a2388 commit 273ef77

File tree: 3 files changed (+120 −5 lines)

EXP_CONFIGS.md (+73)

# Experiment Configurations for ECRec

## Keys for Evaluation

Note: ECRM, an earlier name of our system, refers to ECRec in this document.

- Modes of running the DLRM training system. These are mostly controlled by which branch to run.
  - XDL: the original XDL system with no fault tolerance. This is the `master` branch.
  - ECRM-kX: ECRM with the parameter k set to X. This is the version of ECRM described in the paper, including 2PC, neural network replication, etc. This is the `ecrec` branch. The parameter `k` is controlled by `PARITY_K` in [`xdl/ps-plus/ps-plus/common/base_parity_utils.h`](xdl/ps-plus/ps-plus/common/base_parity_utils.h). You will need to change `PARITY_N` at the same time.
  - Ckpt-X: XDL with checkpointing, where checkpoints are written every X minutes. This is the `master` branch. The checkpoint frequency is specified in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py).
- DLRMs on which to train. These are controlled by the variable settings in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py). The x and y in Criteo-xS-yD are controlled by the 4th and 5th arguments to the function `xdl.embedding` in these files. Note that we have other scripts available in [`xdl/examples/criteo`](xdl/examples/criteo).
  - Criteo: the original Criteo DLRM
  - Criteo-2S: as described in the paper, with 2x the number of embedding table entries
  - Criteo-2S-2D: as described in the paper, with 2x the number of embedding table entries and each entry twice as wide (e.g., 64-wide dense vectors rather than 32)
    - Important note: as described in the paper, a different instance type is used when running this setup
- Batch size is 2048. This is set in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py).

## Evaluations

- Normal mode 1
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S]
    - For each of [XDL, Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4]
      - Run the DLRM for 2 hours. Keep all of the logs
  - At the end of this we will be able to plot Figures 5, 6, 7, 8, and 9
- Recovery mode 1
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S]
    - For each of [Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4]
      - Run the DLRM normally for 15 minutes, trigger a failure, and let each mode run for 60 additional minutes
  - At the end of this we want to be able to plot Figures 10, 11, and 12
- Normal mode 4: effect of NN replication, 2PC, etc.
  - The goal here is to determine how much slower features like NN replication and 2PC make ECRM
  - New keys for this section (with k=4):
    - ECRM-kX: as described above
    - ECRM-kX-minusNNRep: same as ECRM-kX, but without replicating NN params
      - Another way of saying this is "without optimizer state", because then we can recover the "approximate" NN from one of the workers
    - [Maybe don't do this one] ECRM-kX-minus2PC: same as ECRM-kX, but without using 2PC
    - ECRM-kX-minusNNRep-minus2PC: same as ECRM-kX, but without replicating NN params and without using 2PC
      - I think this is the same as Kaige's original branch
  - For each mode in [ECRM-k4, ECRM-k4-minusNNRep, ECRM-k4-minusNNRep-minus2PC (Kaige's branch)]
    - For each DLRM in [Criteo-8S, Criteo-4S, Criteo-2S, Criteo, Criteo-2S-2D]
      - Run in normal mode. Save all the logs
  - Here, we will want to compare the average throughput across each mode
- Recovery mode 2: effect of lock granularity
  - Uses the same evaluation setup described in the paper
  - For each of [Criteo-8S, Criteo] (later do Criteo-4S, Criteo-2S, Criteo-2S-2D)
    - For each mode in [ECRM-k4]
      - For num_locks in [1, 10]
        - Trigger a failure, keep all the logs
        - Measure how long it took to fully recover
  - At the end of this we want to compare how long it takes to fully recover from a failure with 1 lock vs. with 10 locks
- Normal mode 2: effect of limited resources
  - Which DLRM we choose to run here depends on the results we see from Normal mode 1
  - For each mode in [ECRM-k4]
    - For each DLRM in [Criteo-8S, Criteo]
      - For each server instance type in [x1e.2xlarge, r5.8xlarge (note that this is different from r5n.8xlarge)]
        - Run the DLRM. Keep all of the logs
  - At the end of this, we want to compare the average throughput to that when running on the original instances we considered
- Normal mode 3: effect of an increased number of workers
  - For num_workers in [5, 10, 15, 20, 25]
    - For each DLRM in [Criteo-8S]
      - For each mode in [XDL, ECRM-k4]
        - Run the DLRM in normal mode. Save all the logs
  - Normal mode 3.1
    - For num_workers in [5, 10, 15, 20, 25]
      - For each DLRM in [Criteo-4S, Criteo-2S, Criteo, Criteo-2S-2D]
        - For each mode in [XDL, ECRM-k4]
          - Run the DLRM in normal mode. Save all the logs
  - We want to plot something like Figure 9 from this
  - Note that we don't need to rerun checkpointing for this, as the "steady-state" overhead from checkpointing can be computed from the time it takes to write a checkpoint, the throughput of normal XDL, and the checkpointing frequency.
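To make the last note concrete, here is a minimal sketch of that steady-state overhead estimate. All numeric values are illustrative assumptions, not measured results, and the model (training fully pauses while a checkpoint is written) is itself an assumption:

```python
# Sketch of the "steady-state" checkpointing-overhead estimate described above.
# All numbers below are illustrative assumptions, not measured values.
ckpt_write_min = 5.0       # assumed time to write one checkpoint (minutes)
ckpt_period_min = 30.0     # Ckpt-30: a checkpoint every 30 minutes
xdl_throughput = 10_000.0  # assumed normal-XDL throughput (samples/sec)

# Assuming training pauses while a checkpoint is written, the effective
# throughput is scaled by the fraction of time actually spent training.
training_fraction = 1.0 - ckpt_write_min / ckpt_period_min
effective_throughput = xdl_throughput * training_fraction
print(round(effective_throughput, 1))  # 10000 * (25/30) = 8333.3
```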

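As a sanity check on the experiment count, the nested sweeps above can be enumerated programmatically. This hypothetical helper (not part of the repo) lists the (DLRM, mode) pairs for Normal mode 1, each corresponding to one 2-hour run:

```python
from itertools import product

# Hypothetical helper (not part of the repo): enumerate the "Normal mode 1"
# sweep as (DLRM, mode) pairs, one per 2-hour training run.
DLRMS = ["Criteo", "Criteo-2S", "Criteo-2S-2D", "Criteo-4S", "Criteo-8S"]
MODES = ["XDL", "Ckpt-30", "Ckpt-60", "ECRM-k2", "ECRM-k4"]

runs = list(product(DLRMS, MODES))
print(len(runs))  # 5 DLRMs x 5 modes = 25 runs
```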
README.md (+18 −5)

## Bulk Run (Experiments)

XDL/ECRec is designed to run in a distributed fashion on a set of hosts over a network. To enable repeatable and reproducible experiments, we provide a reference experiment-launching program that spawns ECRec clusters on AWS and trains on them with simple commands. The program can be found in this repo at [`launch_exp.py`](launch_exp.py). You will need to fill in AWS EC2 keypair and GitHub credential information in the program script. Additional useful variables to modify are described below.

Common usage includes:

* Spawn a cluster: `python launch_exp.py init <branch> <num_workers>`. This launches and initializes AWS EC2 instances with the environment and Docker images.
  - The `INSTANCE_SPECS` variable in the script specifies the number of servers/workers and the corresponding EC2 instance types, following the current tuple format.
  - `S3_FILES`: an array of files used as training data. Each file corresponds to one worker, so the array size must be >= the number of workers.
  - `PS_MEMORY_MB`: the server memory restriction.
  - `PS_NUM_CORES`: the number of server cores.
* Launch experiments: `python launch_exp.py run <branch> <num_workers>`. This kills existing Docker containers and runs the scheduler and all servers and workers.
  - `TRAINING_FILE`: which Python file in the directory to use for training.
* Run workers only: `python launch_exp.py run_workers_only <branch> <num_workers>`. This kills the workers' existing Docker containers and reruns the workers, keeping the scheduler and servers alive.
  - This is useful when conducting recovery experiments, where you would want to SSH into servers and rerun the workers with this script.
7986
To trigger recovery, SSH into a PS host and kill and rerun its Docker container.

Throughput metrics will be logged to the path specified by the `OUTPUT_DIR` variable in the experiment-launching program. Refer to line 15 of [`criteo_training.py`](xdl/examples/criteo/criteo_training.py) to understand the numbers in the logged tuple. We provide a simple script, [`parse_results.py`](parse_results.py), to aggregate the throughput metrics across all hosts.
## Reproducing Results in Paper

We ran a multitude of experiments to obtain the results presented in the paper. Given the intricate nature of fault tolerance in distributed training, we regret that we cannot offer an end-to-end script to execute all experiments and generate the results. However, to ensure reproducibility, we have provided a comprehensive [list of configurations](EXP_CONFIGS.md) used in our experiments.

To replicate our results, please refer to the [Bulk Run (Experiments)](#bulk-run-experiments) section and perform a bulk run for each specified configuration. By following these instructions, you should be able to obtain results consistent with those reported in the paper.

## Support
We graciously acknowledge support from the National Science Foundation (NSF) in the form of a Graduate Research Fellowship (DGE-1745016 and

parse_results.py (+29)

import ast
import os


def parse(path, max_step=100):
    """Sum per-step throughput counts across all worker log files under `path`."""
    # Collect only the worker logs produced by the experiment launcher.
    _dirpath, _dirnames, names = next(os.walk(path))
    files = [f'{path}/{f}' for f in names if f.startswith('exp_init_run_worker')]
    tot = [0] * max_step
    for log in files:
        records = []
        with open(log, 'r') as f:
            for line in f.read().splitlines():
                # Metric lines are logged as literal tuples: (step, diff, time, acc).
                if line.startswith('('):
                    try:
                        # literal_eval only parses literals, so it is safer than eval().
                        records.append(ast.literal_eval(line))
                    except (ValueError, SyntaxError):
                        continue
        for step, diff, _time, _acc in records:
            if step >= max_step:
                break
            tot[step] += diff
    return tot


os.makedirs('results', exist_ok=True)

tup = parse('/path/to/your/results/dir')
with open('results/out.txt', 'w') as f:
    f.write(str(tup))
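As a usage note on the log format assumed by `parse`, each metric line is a literal Python tuple of the form `(step, diff, time, acc)`; a minimal sketch of parsing one such line safely (the concrete numbers are made up):

```python
import ast

# Each metric line in a worker log is assumed to be a literal tuple of the
# form (step, diff, time, acc); ast.literal_eval parses it without the
# arbitrary-code-execution risk of eval().
line = "(3, 2048, 12.5, 0.79)"  # made-up example line
step, diff, elapsed, acc = ast.literal_eval(line)
print(step, diff)  # 3 2048
```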
