This repository was archived by the owner on May 9, 2024. It is now read-only.

[WIN] Plasticc benchmark crashes on the original plasticc dataset on HDK tasks execution #581

@gshimansky

Apparently plasticc no longer completes successfully when it is run on the original plasticc dataset instead of the synthetic one. This is true for both HDK versions 0.6 and 0.7. On some systems execution ends silently; on others there is an error message:

[2023-07-13 17:34:49.912154] [0x00005874] [info] 0 71 BufferMgr.cpp:720 Check failed: buffer_it->second->buffer

Debugging shows that it happens on the line that triggers HDK execution:

df_meta.shape # to trigger real execution

You can reproduce the problem by checking out the benchmarks repo https://github.com/gshimansky/data-science-processing-workload.

To execute the benchmark on the original dataset, download test_set.csv, training_set.csv, test_set_metadata.csv and training_set_metadata.csv from the modin datasets S3 bucket s3://modin-datasets/plasticc. You can execute benchmarks/plasticc.py directly like this:

set MODIN_STORAGE_FORMAT=hdk
set MODIN_ENGINE=native
set MODIN_EXPERIMENTAL=true
python benchmarks/plasticc.py training_set.csv test_set.csv training_set_metadata.csv test_set_metadata.csv
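
The same invocation can be scripted, which may help tell a silent exit apart from a reported failure by looking at the process exit code. This is only a minimal sketch: the helper names build_env, build_command and run_benchmark are hypothetical and not part of the benchmarks repo, and it assumes the four CSV files sit in the current directory.

```python
import os
import subprocess
import sys

# Values copied from the `set` commands above.
HDK_SETTINGS = {
    "MODIN_STORAGE_FORMAT": "hdk",
    "MODIN_ENGINE": "native",
    "MODIN_EXPERIMENTAL": "true",
}

# Same argument order as the direct command above.
DATASET_FILES = [
    "training_set.csv",
    "test_set.csv",
    "training_set_metadata.csv",
    "test_set_metadata.csv",
]


def build_env(base_env):
    """Return a copy of base_env with the HDK settings applied."""
    env = dict(base_env)
    env.update(HDK_SETTINGS)
    return env


def build_command(script="benchmarks/plasticc.py", files=DATASET_FILES):
    """Return the argv list for the direct benchmark invocation."""
    return [sys.executable, script, *files]


def run_benchmark():
    """Run the benchmark in a child process and return its exit code.

    Hypothetical helper: assumes it is called from the repo root.
    """
    result = subprocess.run(build_command(), env=build_env(os.environ))
    return result.returncode
```

Calling run_benchmark() from the repository root and printing the returned value shows whether the child process ended with a non-zero code even when nothing was written to the console.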

Alternatively, rename these files to plasticc_training_set.csv, plasticc_test_set.csv, plasticc_training_set_metadata.csv and plasticc_test_set_metadata.csv respectively, and run launcher.py with the -ru (reuse) option:

python launcher.py -m plasticc -ru --hdk

With -ru, the launcher skips the generation stage and reuses dataset files already present in the current directory.
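
The renaming step for the launcher can be sketched as a small script. This is an illustration only: rename_for_launcher is a hypothetical helper, and it assumes the four downloaded files are in the directory passed to it.

```python
from pathlib import Path

# File names as downloaded from the bucket.
ORIGINAL_NAMES = [
    "training_set.csv",
    "test_set.csv",
    "training_set_metadata.csv",
    "test_set_metadata.csv",
]

PREFIX = "plasticc_"


def rename_for_launcher(directory="."):
    """Add the plasticc_ prefix to the four dataset files.

    Returns the list of new paths. Raises FileNotFoundError before
    renaming anything if one of the source files is missing.
    """
    directory = Path(directory)
    missing = [n for n in ORIGINAL_NAMES if not (directory / n).exists()]
    if missing:
        raise FileNotFoundError("missing dataset files: " + ", ".join(missing))

    renamed = []
    for name in ORIGINAL_NAMES:
        target = directory / (PREFIX + name)
        # Note: on Windows, Path.rename raises if the target already exists.
        (directory / name).rename(target)
        renamed.append(target)
    return renamed
```

After rename_for_launcher() has run in the directory holding the files, the launcher command above should find them under the names it expects.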
