Hi,
I got some strange results when trying to train the KNN estimator with a gzipped CSV (which should be possible according to the docs).
This is my input setup:
from sagemaker.session import s3_input

# BUCKET and CSV_FILENAME are placeholders for the actual S3 location.
train_channel = s3_input(
    f's3://{BUCKET}/{CSV_FILENAME}',
    s3_data_type='S3Prefix',
    compression='Gzip',
    input_mode='Pipe',
    content_type='text/csv')
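For completeness, the channel is then passed to the estimator's fit call (a minimal sketch; the KNN estimator construction with k, sample_size, etc. is elided and knn is a placeholder name):

# 'train' matches the channel name that shows up in the log below;
# knn is the already-configured KNN Estimator (construction elided).
knn.fit({'train': train_channel})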
With this input, KNN only processes a fraction of the samples. If the CSV contains 1k samples and sample_size is set to the same value, the output is:
[10/29/2019 08:54:52 INFO 140597917878080] Using default worker.
[10/29/2019 08:54:52 INFO 140597917878080] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[10/29/2019 08:54:52 INFO 140597917878080] nvidia-smi took: 0.0251350402832 secs to identify 0 gpus
[10/29/2019 08:54:52 INFO 140597917878080] Create Store: dist_async
[10/29/2019 08:54:53 ERROR 140597917878080] nvidia-smi: failed to run (127): /bin/sh: nvidia-smi: command not found
[10/29/2019 08:54:53 INFO 140597917878080] Using per-worker sample size = 998 (Available virtual memory = 63325450240 bytes, GPU free memory = 0 bytes, number of workers = 1). If an out-of-memory error occurs, choose a larger instance type, use dimension reduction, decrease sample_size, and/or decrease mini_batch_size.
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 0.0, "min": 0}}, "EndTime": 1572339293.719873, "Dimensions": {"Host": "algo-1", "Meta": "init_train_data_iter", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339293.719826}
[2019-10-29 08:54:53.720] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 0, "duration": 1216, "num_examples": 1, "num_bytes": 0}
[2019-10-29 08:54:54.088] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 1, "duration": 367, "num_examples": 1, "num_bytes": 0}
[10/29/2019 08:54:54 INFO 140597917878080] push reservoir to kv... 1 num_workers 0 rank
[10/29/2019 08:54:54 INFO 140597917878080] ...done (32)
[10/29/2019 08:54:54 INFO 140597917878080] #progress_metric: host=algo-1, completed 100 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Total Batches Seen": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Total Records Seen": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Max Records Seen Between Resets": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Reset Count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1572339294.140734, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/KNN", "epoch": 0}, "StartTime": 1572339293.720163}
[10/29/2019 08:54:54 INFO 140597917878080] #throughput_metric: host=algo-1, train throughput=76.0667939191 records/second
[10/29/2019 08:54:54 INFO 140597917878080] pulled row count... worker 0 rows 32
[10/29/2019 08:54:54 INFO 140597917878080] pulled... worker 0 data (32, 25088) labels (32, 1) nans 0
[10/29/2019 08:54:54 INFO 140597917878080] calling index.train...
[10/29/2019 08:54:54 INFO 140597917878080] ...done calling index.train
[10/29/2019 08:54:54 INFO 140597917878080] calling index.add...
[10/29/2019 08:54:54 INFO 140597917878080] ...done calling index.add
#metrics {"Metrics": {"epochs": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "model.serialize.time": {"count": 1, "max": 2.814054489135742, "sum": 2.814054489135742, "min": 2.814054489135742}, "finalize.time": {"count": 1, "max": 245.2390193939209, "sum": 245.2390193939209, "min": 245.2390193939209}, "initialize.time": {"count": 1, "max": 841.8731689453125, "sum": 841.8731689453125, "min": 841.8731689453125}, "update.time": {"count": 1, "max": 420.35484313964844, "sum": 420.35484313964844, "min": 420.35484313964844}}, "EndTime": 1572339294.389102, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339292.502922}
[10/29/2019 08:54:54 INFO 140597917878080] Test data is not provided.
#metrics {"Metrics": {"totaltime": {"count": 1, "max": 2139.147996902466, "sum": 2139.147996902466, "min": 2139.147996902466}, "setuptime": {"count": 1, "max": 18.950939178466797, "sum": 18.950939178466797, "min": 18.950939178466797}}, "EndTime": 1572339294.40538, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339294.389177}
2019-10-29 08:55:02 Uploading - Uploading generated training model
2019-10-29 08:55:02 Completed - Training job completed
Here you can see that only 32 samples were processed ("Total Records Seen": 32), and the number changes on each run.
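To rule out the input file itself, the row count of the gzipped CSV can be checked locally, e.g. like this (a minimal sketch; 'train.csv.gz' is a placeholder for the actual file):

import csv
import gzip

# Count the data rows in the gzipped CSV ('train.csv.gz' is a placeholder).
with gzip.open('train.csv.gz', 'rt') as f:
    n_rows = sum(1 for _ in csv.reader(f))
print(n_rows)  # should print 1000 for the setup described above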
If I use uncompressed CSVs instead, training works as expected.
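The working channel then looks like this (same placeholder names as above, just pointing at the uncompressed file and without the compression argument):

# Workaround: same channel, but the object is an uncompressed CSV
# (UNCOMPRESSED_CSV_FILENAME is a placeholder name).
train_channel = s3_input(
    f's3://{BUCKET}/{UNCOMPRESSED_CSV_FILENAME}',
    s3_data_type='S3Prefix',
    input_mode='Pipe',
    content_type='text/csv')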
Thanks in advance,
Rob