|
| 1 | +# Experiment Configurations for ECRec |
| 2 | + |
| 3 | +## Keys for Evaluation |
| 4 | + |
| 5 | +Note: ECRM, an earlier name of our system, refers to ECRec in this document. |
| 6 | + |
| 7 | +- Modes of running the DLRM training system. These are mostly controlled by which branch to run. |
| 8 | + |
| 9 | + - XDL: original XDL system with no fault tolerance. This is the `master` branch. |
| 10 | + - ECRM-kX: ECRM with value of parameter k being X. This is the version of ECRM described above including 2PC, neural network replication, etc. This is the `ecrec` branch. The parameter `k` is controlled by `PARITY_K` in [`xdl/ps-plus/ps-plus/common/base_parity_utils.h`](xdl/ps-plus/ps-plus/common/base_parity_utils.h). You will need to change `PARITY_N` at the same time. |
| 11 | + - Ckpt-X: XDL with checkpointing, where checkpoints are written every X minutes. This is the `master` branch. The checkpoint frequency is specified in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py). |
| 12 | +- DLRMs on which to train. These are controlled by the variable settings in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py). The x and y in Criteo-xS-yD are controlled by the 4th and 5th argument to function `xdl.embedding` in these files. Note we have other scripts available in [`xdl/examples/criteo`](xdl/examples/criteo). |
| 13 | + - Criteo: original Criteo DLRM |
| 14 | + - Criteo-2S: as described in paper—2x the number of embedding table entries |
| 15 | + - Criteo-2S-2D: as described in paper—2x the number of embedding table entries, and each entry being twice as wide (e.g., 64 dense rather than 32) |
| 16 | + - Important note: see in paper that a difference instance type is used when running this setup |
| 17 | +- Batch size is 2048. This is set in [`xdl/examples/criteo/criteo_training.py`](xdl/examples/criteo/criteo_training.py). |
| 18 | + |
| 19 | +## Evaluations |
| 20 | + |
| 21 | +- Normal mode 1 |
| 22 | + - Using the same evaluation setup described in the paper |
| 23 | + - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S] |
| 24 | + - For each of [XDL, Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4] |
| 25 | + - Run the DLRM for 2 hours. Keep all of the logs |
| 26 | + - At the end of this we will be able to plot Figures 5, 6, 7, 8, and 9 |
| 27 | +- Recovery mode 1 |
| 28 | + - Using the same evaluation setup described in the paper |
| 29 | + - For each of [Criteo, Criteo-2S, Criteo-2S-2D, Criteo-4S, Criteo-8S] |
| 30 | + - For each of [Ckpt-30, Ckpt-60, ECRM-k2, ECRM-k4] |
| 31 | + - Run the DLRM normally for 15 minutes, trigger a failure, and let each mode run for 60 additional minutes |
| 32 | + - At the end of this we want to be able to plot Figures 10, 11, and 12 |
| 33 | +- Normal mode 4: effect of NN replication, 2PC, etc. |
| 34 | + - The goal here is to determine how much slower things like NN replication and 2PC make ECRM |
| 35 | + - New key for this section: |
| 36 | + - k=4. |
| 37 | + - ECRM-kX: As described above |
| 38 | + - ECRM-kX-minusNNRep: same as ECRM-kX, but without replicating NN params |
| 39 | + - Another way of saying this is "without optimizer state" because then we can recover the "approximate" NN from one of the workers |
| 40 | + - [Maybe don't do this one] ECRM-kX-minus2PC: same as ECRM-kX, but without using 2PC |
| 41 | + - ECRM-kX-minusNNRep-minus2PC: same as ECRM-kX, but without replicating NN params and without using 2PC |
| 42 | + - I think this is the same as Kaige's original branch |
| 43 | + - For mode in [ECRM-k4, ECRM-k4-minusNNRep, ECRM-k4-minusNNRep-minus2PC (Kaige's branch)] |
| 44 | + - For DLRM in [Criteo-8S, Criteo-4S, Criteo-2S, Criteo, Criteo-2S-2D] |
| 45 | + - Run in normal mode. Save all the logs |
| 46 | + - Here, we will want to compare the average throughput across each mode |
| 47 | +- Recovery mode 2: effect of lock granularity |
| 48 | + - Using the same evaluation setup described in the paper |
| 49 | + - For each of [Criteo-8S, Criteo] (later do Criteo-4S, Criteo-2S, Criteo-2S2D) |
| 50 | + - For each mode in [ECRM-k4] |
| 51 | + - For num\_locks in [1, 10] |
| 52 | + - Trigger a failure, keep all the logs |
| 53 | + - Measure how long it took to fully recover |
| 54 | + - At the end of this we want to compare how long it takes to fully recover from a failure with 1 lock vs. with 10 locks |
| 55 | +- Normal mode 2: effect of limited resources |
| 56 | + - Which DLRM we choose to run here depends on the results we see from Normal Mode 1 |
| 57 | + - For each mode in [ECRM-k4] |
| 58 | + - For each DLRM in [Criteo-8S, Criteo] |
| 59 | + - For each server instance type in [x1e.2xlarge, r5.8xlarge (note that this is different from r5n.8xlarge)] |
| 60 | + - Run DLRM. Keep all of the logs |
| 61 | + - At the end of this, we want to compare the average throughput to that when running on the original instances we considered |
| 62 | +- Normal mode 3: effect of increased number of workers |
| 63 | + - For num\_workers in [5, 10, 15, 20, 25] |
| 64 | + - For each DLRM in [Criteo-8S] |
| 65 | + - For each mode in [XDL, ECRM-k4] |
| 66 | + - Run the DLRM in normal mode. Save all the logs |
| 67 | +- Normal mode 3.1 |
| 68 | + - For num\_workers in [5, 10, 15, 20, 25] |
| 69 | + - For each DLRM in [Criteo-4S, Criteo-2S, Criteo, Criteo-2S2D] |
| 70 | + - For each mode in [XDL, ECRM-k4] |
| 71 | + - Run the DLRM in normal mode. Save all the logs |
| 72 | + - We want to plot something like Figure 9 from this |
| 73 | + - Note that we don't need to rerun checkpointing for this, as the "steady-state" overhead from checkpointing can be computed based on the time it takes to write a checkpoint, the throughput of normal XDL, and the checkpointing frequency. |
0 commit comments