This repository contains the source code, experiment logs, and result analyses for our ICDE 2023 paper "The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in Deep Learning Pipelines". A preprint of the paper is available at [1].
(DL)² is an experiment sandbox that we built for our research.
(DL)² is developed on Ubuntu using NVIDIA GPUs and the following software:
- Python 3.8
- CUDA 10.2 and CUDA 11.1, depending on the GPU
- PyTorch 1.9
PyTorch 1.9 is necessary because earlier releases show suboptimal performance on NVIDIA Ampere GPUs (see [2]). For the other libraries used in this project, please see the respective requirements files for CUDA 10 and CUDA 11.
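Once the packages are installed (see the setup instructions below), a quick way to verify that the expected PyTorch and CUDA versions are picked up is:

```bash
# Prints the PyTorch version, the CUDA version it was built against, and
# whether a GPU is visible. Assumes PyTorch is already installed.
python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"
```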
Note: The requirements files only cover requirements to execute experiments
and gather log data. To run the analysis notebooks, a common Jupyter setup
with numpy, pandas, and matplotlib is required. To generate plots in the
same format as they appear in the paper, a LaTeX installation is necessary as
well. To analyze image quality with the analyze_image_dataset script, ImageMagick's
magick command must be installed and accessible on your $PATH.
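A minimal check that the magick binary is available (ImageMagick 7 ships the magick command; older packages may only provide convert):

```bash
# Verify that ImageMagick's magick command is on the $PATH.
command -v magick && magick -version
```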
Clone the repository to your machine and export its location as an environment variable.
```bash
git clone https://github.com/lbhm/dl2.git
export DL2_HOME=$(pwd)/dl2
```

We recommend that you either directly copy or symlink your benchmark datasets
in a data/ directory at the top level of this repository.
```bash
ln -s <path_to_your_data> $DL2_HOME/data
```

As we compare different storage types in the paper, we created multiple data-x/
directories, with x referring to the storage type.
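For example, assuming your disks are mounted under hypothetical paths like the ones below, you could create one symlink per storage type:

```bash
# Example mount points only; adjust to your machine.
ln -s /mnt/ssd/datasets $DL2_HOME/data-ssd
ln -s /mnt/hdd/datasets $DL2_HOME/data-hdd
ln -s /mnt/sas/datasets $DL2_HOME/data-sas
```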
You can run the experiments in a Docker container or directly on your system.
Switch to the docker/ directory and build the docker image.
```bash
cd docker
make build
```

The docker container uses CUDA 11.1.
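To confirm that the build succeeded, you can list your local images (the exact image name and tag are defined in docker/Makefile):

```bash
# The freshly built (DL)² image should show up in this list.
docker image ls
```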
We recommend creating a virtual environment and installing the required packages into that environment.
```bash
virtualenv venv
source venv/bin/activate
pip install -r requirements_cuda11.txt
```

The main user interface of (DL)² is the dl2/main.py script. To get an overview
of possible parameters and their usage, run
```bash
python dl2/main.py -h
```

Other scripts and helper tools also provide CLI documentation via -h.
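For instance, the image quality script mentioned in the note above can be queried the same way (the path below is an assumption; adjust it to wherever the script lives in your checkout):

```bash
# Hypothetical path; check the repository for the actual location of the script.
python scripts/analyze_image_dataset.py -h
```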
We provide a convenience script called run_docker.sh that starts a docker
container with some sensible runtime parameters set. Start the container by
running
```bash
run_docker.sh
```

This will start a bash shell in which you can run experiments such as:
```bash
python dl2/main.py -d <path_to_data> -a resnet50 -l dali-cpu -w 8 -b 256 -e 50 \
    -p 100 --label-smoothing 0.1 --lr 0.256 --lr-policy cosine --mom 0.875 \
    --wd 3.0517578125e-05 --amp --static-loss-scale 128 --memory-format nhwc \
    -n docker-test
```

Alternatively, you can directly pass your command to the run_docker.sh script:
```bash
run_docker.sh python <command>
```

To execute code directly on your system, run the dl2/main.py script. For example:
```bash
python dl2/main.py -d <path_to_data> -a resnet50 -l dali-cpu -w 8 -b 256 -e 50 \
    -p 100 --label-smoothing 0.1 --lr 0.256 --lr-policy cosine --mom 0.875 \
    --wd 3.0517578125e-05 --amp --static-loss-scale 128 --memory-format nhwc \
    -n native-test
```

To run an experiment with multiple GPUs or in a distributed setup, prepend your
command with the torch.distributed.run module. For example:
```bash
python -m torch.distributed.run --nproc_per_node 4 dl2/main.py <args>
```
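For a multi-node run, torch.distributed.run additionally needs rendezvous information. A minimal sketch for two nodes with four GPUs each (addresses, port, and <args> are placeholders):

```bash
# Run this on every node, changing --node_rank (0 on the master node, 1 on the other).
python -m torch.distributed.run --nnodes 2 --node_rank 0 \
    --master_addr <master_ip> --master_port 29500 \
    --nproc_per_node 4 dl2/main.py <args>
```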
Experiment logs for all results that we report in the paper are in logs/. The
logs are organized by the corresponding hypothesis that we investigate in the paper.
The experiment names, such as inet-alex-ssd-raw-pytorch, encode important parameters
of the respective experiment. The full list of parameters is always logged in
the first line of each experiment_report.json file. In addition to the experiment
logs, logs/misc/ contains some additional summary plots about the datasets we
used.
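Assuming each line of a report is a self-contained JSON record (as the note about the first line suggests), you can pretty-print the parameter dump without parsing the whole file; the path below is a placeholder for any experiment directory:

```bash
# Pretty-print the parameter list from the first line of an experiment report.
head -n 1 logs/<hypothesis>/<experiment_name>/experiment_report.json | python -m json.tool
```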
Note: Some of the experiment folders in h5/ (learned compression) are empty
since the experiments did not finish within their time limit as we describe in
the paper.
The notebooks/ directory contains the Jupyter notebooks that we used for analyzing
the experiment results and creating the plots in our paper. All files in plots/
can be recreated by running the respective notebooks.
Note: A matplotlib-compatible TeX installation is required to recreate the
plots as they appear in our paper. Alternatively, with the DEBUG flag set to
True, the plots can be recreated without TeX, though with a different layout.
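If you prefer to re-run a notebook non-interactively, for example on a machine without TeX after setting the DEBUG flag, nbconvert can execute it in place (the notebook name is a placeholder):

```bash
# Executes a notebook headlessly and overwrites it with the executed version.
# Set DEBUG = True inside the notebook first if no TeX installation is available.
jupyter nbconvert --to notebook --execute --inplace notebooks/<notebook>.ipynb
```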
To reproduce the results from our paper, execute the instructions provided in
scripts/command_list.sh. The commands assume a server infrastructure as we describe
in our paper and refer to three data folders (data-ssd, data-hdd, and data-sas)
that link to disks of the respective type.
Note: The command list was not designed to be executed fully automatically, so please read the comments. For example, we cannot provide a copy of the datasets that we use. To acquire a copy of ImageNet and Places365, please see the download instructions at [3] and [4].
Warning: Executing all the commands will take a very long time.
```bibtex
@inproceedings{behme_art_2023,
    title = {
        The Art of Losing to Win: Using Lossy Image Compression to Improve Data Loading in
        Deep Learning Pipelines
    },
    author = {
        Behme, Lennart and Thirumuruganathan, Saravanan and Mahdiraji, Alireza Rezaei and
        Quian\'{e}-Ruiz, Jorge-Arnulfo and Markl, Volker
    },
    year = 2023,
    booktitle = {{IEEE} 39th International Conference on Data Engineering ({ICDE})},
    address = {Anaheim, CA, USA},
    pages = {936--949},
    doi = {10.1109/ICDE55515.2023.00077},
    eventtitle = {{ICDE} '23}
}
```