Description
What I need help with / What I was wondering
I’m trying to prepare a custom Beam dataset called `librilight` using TFDS on Google Cloud Dataflow. I followed the instructions on `tfds new` and Beam datasets, and was able to run the Beam pipeline successfully with DirectRunner locally. But it failed on the Dataflow workers with `ModuleNotFoundError: No module named 'librilight'`, where `librilight` is my custom dataset module name.
What I've tried so far
- To tell Dataflow workers to install TFDS with my custom dataset, I pointed a requirements file at my git repo:

  ```shell
  echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt
  ```

  However, the workers still failed with the module-not-found error.
- I have the `save_main_session` option enabled in `beam_pipeline_options`, but it didn't help.
- I also tried to build a Docker image, but it failed with the same module-not-found error.
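One alternative I have not tried yet is to package the dataset module with its own `setup.py` and let Beam build and install it on every worker through the standard `setup_file` pipeline option. A minimal sketch (the version number and dependency list are assumptions, not from my repo):

```shell
# Sketch: a setup.py at the repo root, so Beam's setup_file option can
# build an sdist locally and pip-install it on each Dataflow worker.
cat > setup.py <<'EOF'
import setuptools

setuptools.setup(
    name="librilight",  # the module name the workers currently fail to import
    version="0.1.0",
    packages=setuptools.find_packages(),
    install_requires=["wrapt", "pydub"],
)
EOF
```

With that in place, `setup_file=./setup.py` would go into `--beam_pipeline_options` instead of (or alongside) `requirements_file`.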
Here is the full error log.
```
Traceback (most recent call last):
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 104, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 99, in main
    args.subparser_fn(args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 233, in _build_datasets
    _download_and_prepare(args, builder)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 435, in _download_and_prepare
    builder.download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 600, in download_and_prepare
    self._download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1405, in _download_and_prepare
    split_info_futures.append(future)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 198, in maybe_beam_pipeline
    self._beam_pipeline.__exit__(None, None, None)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/pipeline.py", line 598, in __exit__
    self.result.wait_until_finish()
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1641, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'librilight'
```
Here is my script:

```shell
echo "https://github.com/zhiyun/datasets/archive/v4.8.0-branch.tar.gz" > /tmp/beam_requirements.txt
echo "wrapt" >> /tmp/beam_requirements.txt
echo "pydub" >> /tmp/beam_requirements.txt

tfds build tensorflow_datasets/datasets/librilight/ \
  --manual_dir=${SOURCE_DIR} \
  --data_dir=${DATA_DIR} \
  --beam_pipeline_options=\
"runner=DataflowRunner,"\
"region=${REGION},"\
"project=${GCP_PROJECT},"\
"job_name=librilight-gen-${DATE},"\
"staging_location=gs://${TEMP_BUCKET}/binaries/,"\
"temp_location=gs://${TEMP_BUCKET}/tmp/,"\
"service_account_email=ml-training@cybertron-gcp-island-test-0rxn.iam.gserviceaccount.com,"\
"network=cybertron-gcp-island-test-0rxn-usc1-island-vpc,"\
"subnetwork=https://www.googleapis.com/compute/v1/projects/gns-network-prod-0d38/regions/us-central1/subnetworks/cybertron-gcp-island-test-0rxn-usc1-priv-island,"\
"dataflow_service_options=enable_secure_boot,"\
"experiments=use_network_tags=allow-internet-egress,"\
"no_use_public_ips,"\
"requirements_file=/tmp/beam_requirements.txt,"\
"save_main_session"
```
My Dockerfile:

```dockerfile
# syntax=docker/dockerfile:1
FROM apache/beam_python3.8_sdk:2.43.0

# Pre-built python dependencies
RUN pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz
RUN pip install wrapt
RUN pip install pydub

# Pre-built other dependencies
# RUN apt-get update \
#     && apt-get dist-upgrade \
#     && apt-get install -y --no-install-recommends ffmpeg

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
```
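For the Docker route, my understanding is that building the image locally is not enough on its own: Dataflow only uses a custom image if the pipeline is told about it. A sketch of the deployment steps, where the image name/tag is a placeholder:

```shell
# Build and push the custom SDK container so Dataflow workers can pull it
# (image name "beam-librilight" is a placeholder, not from the repo).
docker build -t "gcr.io/${GCP_PROJECT}/beam-librilight:latest" .
docker push "gcr.io/${GCP_PROJECT}/beam-librilight:latest"
```

The job would then need `sdk_container_image=gcr.io/${GCP_PROJECT}/beam-librilight:latest` in `--beam_pipeline_options` (and, depending on SDK version, `experiments=use_runner_v2`); otherwise the workers run the stock Beam image and the Dockerfile above never takes effect.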
It would be nice if...
the tfds documentation covered this case, where TFDS with a custom dataset needs to be installed on Google Cloud Dataflow workers.
Environment information
(if applicable)
- Operating System:
- Python version: 3.8
- `tensorflow-datasets`/`tfds-nightly` version: I have tried both tensorflow-datasets 4.8.1 and tfds-nightly; both failed with the same error.
- `tensorflow`/`tensorflow-gpu`/`tf-nightly`/`tf-nightly-gpu` version: