
Commit f19daac

Improve documentation (CoactiveAI#2)
* Added docstrings to main function
* Simplify print statements
* Update readme with detailed instructions for alpha. Minor changes to other files to ensure readme instructions work as specified
* Minor grammar fixes
* Update links in readme
1 parent 4f4f89e commit f19daac

File tree

5 files changed (+245, −25 lines)


Dockerfile

+1-2
@@ -10,5 +10,4 @@ COPY *.py /app/
 COPY *.yaml /app/
 
 WORKDIR /app/
-
-ENTRYPOINT ["python3", "main.py"]
+ENTRYPOINT python3 main.py --docker_flag True

README.md

+208-2
@@ -1,2 +1,208 @@
-# Readme
-In progress ;)

# `dataperf-visual-selection`: A Data-Centric Visual Benchmark for Training Data Selection
### **Current version:** alpha
This GitHub repo serves as the starting point for offline evaluation of submissions for the training data selection visual benchmark. The offline evaluation can be run both in your local environment and as a containerized image for reproducibility of score results.

For a detailed summary of the benchmark, refer to the provided benchmark documentation.

## Requirements
### Download resources
The following resources need to be downloaded locally in order to run the offline evaluation:
- Embeddings for the candidate pool of training images (.parquet file)
- Test sets for each classification task (.parquet files)

These resources can be downloaded as a .zip file from the following URL:

```
https://drive.google.com/drive/folders/1wyb-EhmF5i2w7f8Ybqokdnrs6yjduiyL?usp=sharing
```

### Install dependencies
For running as a containerized image:
- `docker` for building the containerized image
- `docker-compose` for running the scoring service with the appropriate resources

Installation instructions can be found at the following links: [Docker](https://docs.docker.com/get-docker/), [Docker Compose](https://docs.docker.com/compose/install/)

For running locally:
- Python (>= 3.7)
- An [appropriate version of Java](https://spark.apache.org/docs/latest/) for your versions of `python` and `pyspark`

The current version of this repo has only been tested locally on Python 3.9 and Java openjdk-11.
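
If you are unsure whether your local Java and PySpark installations are compatible, a quick sanity check is to start and stop a local Spark session. This is a minimal sketch, assuming `pyspark` has already been installed (e.g. via `requirements.txt` below):

```
# Minimal environment sanity check (assumes pyspark is installed).
# If Java and PySpark are compatible, a local Spark session starts cleanly.
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("env-check").getOrCreate()
print("Spark version:", spark.version)
spark.stop()
```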

## Installation

Clone this repo to your local machine

```
git clone git@github.com:CoactiveAI/dataperf-vision-selection.git
```

If you want to run the offline evaluation in your local environment, install the required Python packages

```
pip install -r dataperf-vision-selection/requirements.txt
```

A template filesystem with the following structure is provided in the repo. Move the embeddings file and the test sets to the appropriate folders in this template filesystem

```
unzip dataperf-visual-selection-resources.zip
mv dataperf-visual-selection-resources/embeddings/* dataperf-vision-selection/data/embeddings/
mv dataperf-visual-selection-resources/test_sets/* dataperf-vision-selection/data/test_sets/
```

The resulting filesystem in the repo should look as follows
```
|____data
| |____embeddings
| | |____train_emb_256_dataperf.parquet
| |____test_sets
| | |____alpha_test_set_Hawk_256.parquet
| | |____alpha_test_set_Cupcake_256.parquet
| | |____alpha_test_set_Sushi_256.parquet
| |____train_sets
| | |____random_500.csv
| |____results
| | |____result_for_random_500.json
```
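
To confirm the embeddings file landed where the evaluation expects it, you can peek at the parquet file. This is a small sketch, assuming you run it from the directory containing the cloned repo and that `pandas` plus a parquet engine (e.g. `pyarrow`) are installed:

```
# Quick check that the embeddings parquet is readable (assumes pandas + pyarrow).
import pandas as pd

emb = pd.read_parquet("dataperf-vision-selection/data/embeddings/train_emb_256_dataperf.parquet")
print(emb.shape)              # rows = candidate pool images
print(emb.columns.tolist())   # column layout of the embeddings file
```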

With the resources in place, you can now test that the system is functioning as expected.

To test the containerized offline evaluation, run

```
cd dataperf-vision-selection
docker-compose up
```

Similarly, to test the local Python offline evaluation, run

```
cd dataperf-vision-selection
python3 main.py
```

Either test will run the offline evaluation using the setup specified in `task_setup.yaml`, which uses a training set of randomly sampled and labeled data points (`data/train_sets/random_500.csv`) to generate a score results file in `data/results/` with a unique UTC timestamp

```
|____data
| |____results
| | |____result_for_random_500.json
| | |____result_UTC-2022-03-31-20-19-24.json
```

The generated scores in this new results file should be identical to those in `data/results/result_for_random_500.json`.
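
A quick way to verify this programmatically is to compare the two JSON files. This is a minimal sketch, assuming it is run from the repo root and that you substitute the timestamped filename produced by your own run:

```
# Compare the freshly generated results against the reference results.
# Replace the timestamped filename with the one produced by your run.
import json

with open("data/results/result_for_random_500.json") as f:
    reference = json.load(f)
with open("data/results/result_UTC-2022-03-31-20-19-24.json") as f:  # hypothetical timestamp
    generated = json.load(f)

for task, ref_scores in reference.items():
    for metric, ref_value in ref_scores.items():
        gen_value = generated[task][metric]
        assert abs(gen_value - ref_value) < 1e-9, f"{task}/{metric}: {gen_value} != {ref_value}"
print("Generated scores match the reference results.")
```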

# Guidelines (alpha version)
For the alpha version of this benchmark we will only support submissions and offline evaluation for the open division.

## Open Division: Creating a submission
A valid submission for the open division includes the following:
- A description of the data selection algorithm/strategy used
- A training set for each classification task, as specified below
- (Optional) A script of the algorithm/strategy used

Each training set file must be a .csv file containing two columns: `ImageID` (the unique identifier for the image) and `Confidence` (the binary label, either a `0` or `1`). The `ImageID`s in the training set files must be limited to the provided candidate pool of training images (i.e. the `ImageID`s in the downloaded embeddings file).

The included training set file serves as a template of a single training set:
```
cat dataperf-vision-selection/data/train_sets/random_500.csv

ImageID,Confidence
0002643773a76876,0
0016a0f096337445,0
0036043ce525479b,1
00526f123f84db2f,1
0080db2599d54447,1
00978577e9fdd967,1
...
```
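
A training set in this format can be produced with nothing more than the standard library. The sketch below is a hypothetical example: the output filename is made up, and the `(ImageID, label)` pairs are placeholders that in a real submission must come from the provided embeddings file:

```
# Hypothetical example: write a training set CSV in the required two-column format.
# The selected pairs below are placeholders; real ImageIDs must come from the
# downloaded embeddings file.
import csv

selected = [
    ("0002643773a76876", 0),
    ("0036043ce525479b", 1),
]

with open("data/train_sets/MyTask.csv", "w", newline="") as f:   # hypothetical filename
    writer = csv.writer(f)
    writer.writerow(["ImageID", "Confidence"])
    for image_id, label in selected:
        assert label in (0, 1), "Confidence must be a binary label"
        writer.writerow([image_id, label])
```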

## Open Division: Offline evaluation of a submission

The configuration for the offline evaluation is specified in the `task_setup.yaml` file. For simplicity, the repo comes pre-configured such that for offline evaluation you can simply:
1. Copy your training sets to the template filesystem
2. Modify the config file to specify the training set for each task
3. Run offline evaluation
4. See results in stdout and the results file in `data/results/`

For example
```
# 1. Copy training sets for each task
cd dataperf-vision-selection
cp /path/to/your/training/sets/Cupcake.csv data/train_sets/
cp /path/to/your/training/sets/Hawk.csv data/train_sets/
cp /path/to/your/training/sets/Sushi.csv data/train_sets/

# 2. task_setup.yaml: modify the training set relative path for each classification task
Cupcake: ['train_sets/Cupcake.csv', 'test_sets/alpha_test_set_Cupcake_256.parquet']
Hawk: ['train_sets/Hawk.csv', 'test_sets/alpha_test_set_Hawk_256.parquet']
Sushi: ['train_sets/Sushi.csv', 'test_sets/alpha_test_set_Sushi_256.parquet']

# 3a. Run offline evaluation (docker)
docker-compose up --build --force-recreate

# 3b. Run offline evaluation (local python)
python3 main.py

# 4. See results (the file name will include the run's timestamp)
cat data/results/result_UTC-2022-03-31-20-19-24.json

{
    "Cupcake": {
        "accuracy": 0.5401459854014599,
        "recall": 0.463768115942029,
        "precision": 0.5517241379310345,
        "f1": 0.5039370078740157
    },
    "Hawk": {
        "accuracy": 0.296551724137931,
        "recall": 0.16831683168316833,
        "precision": 0.4857142857142857,
        "f1": 0.25000000000000006
    },
    "Sushi": {
        "accuracy": 0.5185185185185185,
        "recall": 0.6261682242990654,
        "precision": 0.638095238095238,
        "f1": 0.6320754716981132
    }
}
```
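
If you prefer to pick up the newest results file programmatically rather than with `cat`, a small sketch (assuming the timestamped `result_UTC-*.json` naming shown above and the repo root as the working directory) is:

```
# Load and pretty-print the most recently written results file.
import glob
import json
import os

latest = max(glob.glob("data/results/result_UTC-*.json"), key=os.path.getmtime)
with open(latest) as f:
    print(latest)
    print(json.dumps(json.load(f), indent=4))
```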

Though we recommend working as described above, you can specify a custom task setup .yaml file and/or data folder if needed.

For the containerized offline evaluation, modify the following files and run as follows
```
# docker-compose.yaml: modify the volume source
volumes:
  - path/to/your/data/folder:/app/data

# Dockerfile: modify the COPY *.yaml command and specify the new file in the entrypoint
COPY path/to/your/custom_task_setup.yaml /app/
...
ENTRYPOINT python3 main.py --docker_flag True --setup_yaml_path 'custom_task_setup.yaml'

# Run and force rebuild
docker-compose up --build --force-recreate
```

For the local Python offline evaluation, modify the following file and run as follows
```
# path/to/your/custom_task_setup.yaml: modify data_dir
data_dir: 'path/to/your/data/folder'

# Run and specify custom .yaml file
python3 main.py --setup_yaml_path 'path/to/your/custom_task_setup.yaml'
```

*Note: when specifying a data folder, ensure all relative paths in the task setup .yaml file are valid*
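
A check along the following lines can catch broken relative paths before a run. This is a hypothetical helper, assuming `PyYAML` is installed and relying only on the config keys shown in `task_setup.yaml` (`data_dir`, `emb_file`, `eval_tasks`, and the per-task `[train_set, test_set]` entries):

```
# Hypothetical helper: verify that the relative paths in a task setup YAML resolve.
import os
import yaml

def check_setup_paths(setup_yaml_path: str) -> None:
    with open(setup_yaml_path) as f:
        setup = yaml.safe_load(f)
    data_dir = setup["data_dir"]
    candidates = [setup["emb_file"]]
    for task in setup["eval_tasks"]:
        candidates.extend(setup[task])  # [train_set, test_set] relative paths
    missing = [p for p in candidates if not os.path.exists(os.path.join(data_dir, p))]
    if missing:
        raise FileNotFoundError(f"Missing files under '{data_dir}': {missing}")
    print("All task setup paths resolve.")

check_setup_paths("task_setup.yaml")
```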

## Closed Division: Creating a submission
TBD.

## Closed Division: Offline evaluation of a submission
TBD.

constants.py

+2-1
@@ -1,6 +1,7 @@
 # For setup file
 DEFAULT_SETUP_YAML_PATH = 'task_setup.yaml'
-SETUP_YAML_DATA_DIR_KEY = 'data_dir'
+SETUP_YAML_LOCAL_DATA_DIR_KEY = 'data_dir'
+SETUP_YAML_DOCKER_DATA_DIR_KEY = 'docker_data_dir'
 SETUP_YAML_DIM_KEY = 'dim'
 SETUP_YAML_TASKS_KEY = 'eval_tasks'
 SETUP_YAML_EMB_KEY = 'emb_file'

main.py

+19-13
@@ -7,17 +7,27 @@
 import eval as eval
 
 
-def main(setup_yaml_path=c.DEFAULT_SETUP_YAML_PATH):
+def run_tasks(
+        setup_yaml_path: str = c.DEFAULT_SETUP_YAML_PATH,
+        docker_flag: bool = False) -> None:
+    """Runs visual benchmark tasks based on config yaml file.
+
+    Args:
+        setup_yaml_path (str, optional): Path for config file. Defaults
+            to path in constants.DEFAULT_SETUP_YAML_PATH.
+        docker_flag (bool, optional): True when running in container
+    """
     task_setup = utils.load_yaml(setup_yaml_path)
-    data_dir = task_setup[c.SETUP_YAML_DATA_DIR_KEY]
+    data_dir_key = c.SETUP_YAML_DOCKER_DATA_DIR_KEY if docker_flag \
+        else c.SETUP_YAML_LOCAL_DATA_DIR_KEY
+    data_dir = task_setup[data_dir_key]
     dim = task_setup[c.SETUP_YAML_DIM_KEY]
     emb_path = os.path.join(data_dir, task_setup[c.SETUP_YAML_EMB_KEY])
 
     ss = utils.get_spark_session(task_setup[c.SETUP_YAML_SPARK_MEM_KEY])
 
-    print('\nLoading embeddings...', end='')
+    print('Loading embeddings\n')
     emb_df = utils.load_emb_df(ss=ss, path=emb_path, dim=dim)
-    print('Done\n')
 
     task_paths = {
         task: task_setup[task] for task in task_setup[c.SETUP_YAML_TASKS_KEY]}
@@ -26,26 +36,22 @@ def main(setup_yaml_path=c.DEFAULT_SETUP_YAML_PATH):
         print(f'Evaluating task: {task}')
         train_path, test_path = [os.path.join(data_dir, p) for p in paths]
 
-        print(f'Loading training data for {task}...', end='')
+        print(f'Loading training data for {task}...')
         train_df = utils.load_train_df(ss=ss, path=train_path)
         train_df = utils.add_emb_col(df=train_df, emb_df=emb_df)
-        print('Done')
 
-        print(f'Loading test data for {task}...', end='')
+        print(f'Loading test data for {task}...')
         test_df = utils.load_test_df(ss=ss, path=test_path, dim=dim)
-        print('Done')
 
-        print(f'Training classifier for {task}...', end='')
+        print(f'Training classifier for {task}...')
         clf = eval.get_trained_classifier(df=train_df)
-        print('Done')
 
-        print(f'Scoring trained classifier for {task}...', end='')
+        print(f'Scoring trained classifier for {task}...\n')
         task_scores[task] = eval.score_classifier(df=test_df, clf=clf)
-        print('Done\n')
 
     save_dir = os.path.join(data_dir, task_setup[c.SETUP_YAML_RESULTS_KEY])
     utils.save_results(data=task_scores, save_dir=save_dir, verbose=True)
 
 
 if __name__ == "__main__":
-    fire.Fire(main)
+    fire.Fire(run_tasks)

task_setup.yaml

+15-7
@@ -1,7 +1,4 @@
-# Embedding dimensionality
-dim: 256
-
-# Spark driver memory (recommend > 6g)
+# Memory used by Spark driver (recommend > 6g)
 spark_driver_memory: '6g'
 
 # Path for data directory. Note that all other paths in this setup
@@ -13,13 +10,24 @@ data_dir: 'data'
 # Relative path for embedding file
 emb_file: 'embeddings/train_emb_256_dataperf.parquet'
 
-# Task name: [relative path to training set, relative path to test set]
+# Paths to training and test files for each task. Task names must match names
+# in eval_tasks below. See example below:
+# task_name: ['relative/path/to/training_set.csv', 'relative/path/to/test_set.csv']
 Cupcake: ['train_sets/random_500.csv', 'test_sets/alpha_test_set_Cupcake_256.parquet']
 Hawk: ['train_sets/random_500.csv', 'test_sets/alpha_test_set_Hawk_256.parquet']
 Sushi: ['train_sets/random_500.csv', 'test_sets/alpha_test_set_Sushi_256.parquet']
 
+# Relative path for results file
+results_dir: 'results'
+
+# Parameters specific to alpha below (likely not useful to modify)
+
+# Embedding dimensionality
+dim: 256
+
 # List the names of tasks to evaluate
 eval_tasks: ['Cupcake','Hawk','Sushi']
 
-# Relative path for results file
-results_dir: 'results'
+# Path for data directory in docker image.
+# DO NOT MODIFY
+docker_data_dir: 'data'
