KNN: s3_input with Pipe/CSV/GZIP not working #1109

@moebelde-rs

Hi,

I got some strange results when trying to train the KNN estimator with a gzipped CSV (which should be possible according to the docs).
This is my input setup:

from sagemaker.session import s3_input

train_channel = s3_input(
    's3://{}/{}'.format(BUCKET, CSV_FILENAME),  # S3 prefix of the gzipped CSV
    s3_data_type='S3Prefix',
    compression='Gzip',
    input_mode='Pipe',
    content_type='text/csv')
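
For completeness, the channel is fed to the built-in KNN image roughly like this (a minimal sketch; REGION, ROLE, the instance type, and the k/predictor_type values are placeholders rather than my exact setup, while sample_size and feature_dim match the values visible in the log below):

from sagemaker.amazon.amazon_estimator import get_image_uri
from sagemaker.estimator import Estimator

knn = Estimator(
    get_image_uri(REGION, 'knn'),  # built-in KNN algorithm image
    ROLE,
    train_instance_count=1,
    train_instance_type='ml.m5.4xlarge')
knn.set_hyperparameters(
    feature_dim=25088,            # width of each CSV row (see log below)
    k=10,                         # placeholder
    predictor_type='classifier',  # placeholder
    sample_size=1000)             # equals the number of rows in the CSV
knn.fit({'train': train_channel})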

With this input, KNN only processes a fraction of the samples. With a CSV containing 1k samples and sample_size also set to 1000, the output is:

[10/29/2019 08:54:52 INFO 140597917878080] Using default worker.
[10/29/2019 08:54:52 INFO 140597917878080] The channel 'train' is in pipe input mode under /opt/ml/input/data/train.
[10/29/2019 08:54:52 INFO 140597917878080] nvidia-smi took: 0.0251350402832 secs to identify 0 gpus
[10/29/2019 08:54:52 INFO 140597917878080] Create Store: dist_async
[10/29/2019 08:54:53 ERROR 140597917878080] nvidia-smi: failed to run (127): /bin/sh: nvidia-smi: command not found
[10/29/2019 08:54:53 INFO 140597917878080] Using per-worker sample size = 998 (Available virtual memory = 63325450240 bytes, GPU free memory = 0 bytes, number of workers = 1). If an out-of-memory error occurs, choose a larger instance type, use dimension reduction, decrease sample_size, and/or decrease mini_batch_size.
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Batches Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Number of Records Since Last Reset": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Batches Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Total Records Seen": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Max Records Seen Between Resets": {"count": 1, "max": 0, "sum": 0.0, "min": 0}, "Reset Count": {"count": 1, "max": 0, "sum": 0.0, "min": 0}}, "EndTime": 1572339293.719873, "Dimensions": {"Host": "algo-1", "Meta": "init_train_data_iter", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339293.719826}

[2019-10-29 08:54:53.720] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 0, "duration": 1216, "num_examples": 1, "num_bytes": 0}
[2019-10-29 08:54:54.088] [tensorio] [info] epoch_stats={"data_pipeline": "/opt/ml/input/data/train", "epoch": 1, "duration": 367, "num_examples": 1, "num_bytes": 0}
[10/29/2019 08:54:54 INFO 140597917878080] push reservoir to kv... 1 num_workers 0 rank
[10/29/2019 08:54:54 INFO 140597917878080] ...done (32)
[10/29/2019 08:54:54 INFO 140597917878080] #progress_metric: host=algo-1, completed 100 % of epochs
#metrics {"Metrics": {"Max Batches Seen Between Resets": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Batches Since Last Reset": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Number of Records Since Last Reset": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Total Batches Seen": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "Total Records Seen": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Max Records Seen Between Resets": {"count": 1, "max": 32, "sum": 32.0, "min": 32}, "Reset Count": {"count": 1, "max": 1, "sum": 1.0, "min": 1}}, "EndTime": 1572339294.140734, "Dimensions": {"Host": "algo-1", "Meta": "training_data_iter", "Operation": "training", "Algorithm": "AWS/KNN", "epoch": 0}, "StartTime": 1572339293.720163}

[10/29/2019 08:54:54 INFO 140597917878080] #throughput_metric: host=algo-1, train throughput=76.0667939191 records/second
[10/29/2019 08:54:54 INFO 140597917878080] pulled row count... worker 0 rows 32
[10/29/2019 08:54:54 INFO 140597917878080] pulled... worker 0 data (32, 25088) labels (32, 1) nans 0
[10/29/2019 08:54:54 INFO 140597917878080] calling index.train...
[10/29/2019 08:54:54 INFO 140597917878080] ...done calling index.train
[10/29/2019 08:54:54 INFO 140597917878080] calling index.add...
[10/29/2019 08:54:54 INFO 140597917878080] ...done calling index.add
#metrics {"Metrics": {"epochs": {"count": 1, "max": 1, "sum": 1.0, "min": 1}, "model.serialize.time": {"count": 1, "max": 2.814054489135742, "sum": 2.814054489135742, "min": 2.814054489135742}, "finalize.time": {"count": 1, "max": 245.2390193939209, "sum": 245.2390193939209, "min": 245.2390193939209}, "initialize.time": {"count": 1, "max": 841.8731689453125, "sum": 841.8731689453125, "min": 841.8731689453125}, "update.time": {"count": 1, "max": 420.35484313964844, "sum": 420.35484313964844, "min": 420.35484313964844}}, "EndTime": 1572339294.389102, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339292.502922}

[10/29/2019 08:54:54 INFO 140597917878080] Test data is not provided.
#metrics {"Metrics": {"totaltime": {"count": 1, "max": 2139.147996902466, "sum": 2139.147996902466, "min": 2139.147996902466}, "setuptime": {"count": 1, "max": 18.950939178466797, "sum": 18.950939178466797, "min": 18.950939178466797}}, "EndTime": 1572339294.40538, "Dimensions": {"Host": "algo-1", "Operation": "training", "Algorithm": "AWS/KNN"}, "StartTime": 1572339294.389177}


2019-10-29 08:55:02 Uploading - Uploading generated training model
2019-10-29 08:55:02 Completed - Training job completed 

Here you can see that only 32 samples were processed ("Total Records Seen": 32), and the number changes on each run.

If I use uncompressed CSVs, training works as expected.
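
For comparison, the working variant is the same channel pointed at the uncompressed file, with the compression argument dropped (CSV_FILENAME_UNCOMPRESSED is a placeholder):

train_channel = s3_input(
    's3://{}/{}'.format(BUCKET, CSV_FILENAME_UNCOMPRESSED),  # uncompressed CSV
    s3_data_type='S3Prefix',
    input_mode='Pipe',
    content_type='text/csv')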

Thanks in advance
Rob
