How to install tfds on the workers when generating a custom beam dataset on Google Cloud Dataflow? #4616

Open
@zhiyun

Description

What I need help with / What I was wondering
I’m trying to prepare a custom Beam dataset called librilight with TFDS on Google Cloud Dataflow. I followed the instructions for `tfds new` and for generating big datasets with Beam, and the pipeline runs successfully with DirectRunner locally.

But it fails on the Dataflow workers with ModuleNotFoundError: No module named 'librilight', where librilight is the name of my custom dataset module.

What I've tried so far

  • To get the Dataflow workers to install TFDS together with my custom dataset, I pointed the Beam requirements file at my git repo: echo "https://github.com/zhiyun/datasets/archive/librilight.tar.gz" > /tmp/beam_requirements.txt. The workers still failed with the module-not-found error.

  • I enabled the save_main_session option in beam_pipeline_options, but it didn't help.

  • I also tried building a custom worker Docker image (Dockerfile below), but it failed with the same module-not-found error.
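Since the workers unpickle the pipeline and then try to import a top-level librilight module, one approach worth trying is to package the dataset directory so Beam can stage and install it on each worker. A minimal, unverified sketch; the package name, version, and the idea of putting a setup.py next to the dataset directory are assumptions, not something from the original report:

```shell
# Sketch (unverified): make the dataset importable on Dataflow workers by
# packaging it. Writes a minimal setup.py; name/version are illustrative.
cat > setup.py <<'EOF'
from setuptools import setup, find_packages

setup(
    name="librilight",
    version="0.1.0",
    packages=find_packages(),
)
EOF

# Next steps (not run here):
#   python3 setup.py sdist        # builds dist/librilight-0.1.0.tar.gz
# then either stage the tarball explicitly via the Beam option
#   extra_package=dist/librilight-0.1.0.tar.gz
# or let Beam run setup.py on each worker via
#   setup_file=./setup.py
```

Either Beam option stages the dataset code itself, rather than relying on pip resolving a git tarball URL from a requirements file on the workers.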

Here is the full error log.

Traceback (most recent call last):
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/bin/tfds", line 8, in <module>
    sys.exit(launch_cli())
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 104, in launch_cli
    app.run(main, flags_parser=_parse_flags)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 308, in run
    _run_main(main, args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/absl/app.py", line 254, in _run_main
    sys.exit(main(argv))
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/main.py", line 99, in main
    args.subparser_fn(args)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 233, in _build_datasets
    _download_and_prepare(args, builder)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/scripts/cli/build.py", line 435, in _download_and_prepare
    builder.download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 600, in download_and_prepare
    self._download_and_prepare(
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/dataset_builder.py", line 1405, in _download_and_prepare
    split_info_futures.append(future)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/contextlib.py", line 120, in __exit__
    next(self.gen)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/tensorflow_datasets/core/split_builder.py", line 198, in maybe_beam_pipeline
    self._beam_pipeline.__exit__(None, None, None)
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/pipeline.py", line 598, in __exit__
    self.result.wait_until_finish()
  File "/Users/zhiyunlu/miniforge3/envs/tfdsn/lib/python3.8/site-packages/apache_beam/runners/dataflow/dataflow_runner.py", line 1641, in wait_until_finish
    raise DataflowRuntimeException(
apache_beam.runners.dataflow.dataflow_runner.DataflowRuntimeException: Dataflow pipeline failed. State: FAILED, Error:
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/site-packages/apache_beam/internal/dill_pickler.py", line 285, in loads
    return dill.loads(s)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 275, in loads
    return load(file, ignore, **kwds)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 270, in load
    return Unpickler(file, ignore=ignore, **kwds).load()
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 472, in load
    obj = StockUnpickler.load(self)
  File "/usr/local/lib/python3.8/site-packages/dill/_dill.py", line 826, in _import_module
    return __import__(import_name)
ModuleNotFoundError: No module named 'librilight'

Here is my script

echo "https://github.com/zhiyun/datasets/archive/v4.8.0-branch.tar.gz" > /tmp/beam_requirements.txt
echo "wrapt" >> /tmp/beam_requirements.txt
echo "pydub" >> /tmp/beam_requirements.txt

tfds build tensorflow_datasets/datasets/librilight/ \
--manual_dir=${SOURCE_DIR} \
--data_dir=${DATA_DIR} \
--beam_pipeline_options=\
"runner=DataflowRunner,"\
"region=${REGION},"\
"project=${GCP_PROJECT},"\
"job_name=librilight-gen-${DATE},"\
"staging_location=gs://${TEMP_BUCKET}/binaries/,"\
"temp_location=gs://${TEMP_BUCKET}/tmp/,"\
"service_account_email=ml-training@cybertron-gcp-island-test-0rxn.iam.gserviceaccount.com,"\
"network=cybertron-gcp-island-test-0rxn-usc1-island-vpc,"\
"subnetwork=https://www.googleapis.com/compute/v1/projects/gns-network-prod-0d38/regions/us-central1/subnetworks/cybertron-gcp-island-test-0rxn-usc1-priv-island,"\
"dataflow_service_options=enable_secure_boot,"\
"experiments=use_network_tags=allow-internet-egress,"\
"no_use_public_ips,"\
"requirements_file=/tmp/beam_requirements.txt,"\
"save_main_session"
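An unverified variant of the command above: stage the dataset code itself through Beam's setup_file option instead of a requirements file. This assumes a setup.py at the working-directory root that installs a top-level librilight module; options unrelated to code staging are abbreviated:

```shell
# Unverified sketch: same build, but shipping the dataset code via Beam's
# setup_file option rather than requirements_file. Networking and service
# account options from the full script are elided for brevity.
tfds build tensorflow_datasets/datasets/librilight/ \
  --manual_dir=${SOURCE_DIR} \
  --data_dir=${DATA_DIR} \
  --beam_pipeline_options=\
"runner=DataflowRunner,"\
"region=${REGION},"\
"project=${GCP_PROJECT},"\
"temp_location=gs://${TEMP_BUCKET}/tmp/,"\
"setup_file=./setup.py,"\
"save_main_session"
```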

My Dockerfile

# syntax=docker/dockerfile:1

FROM apache/beam_python3.8_sdk:2.43.0

# Pre-install Python dependencies
RUN pip install https://github.com/zhiyun/datasets/archive/librilight.tar.gz
RUN pip install wrapt
RUN pip install pydub

# Pre-install other dependencies
# RUN apt-get update \
#  && apt-get dist-upgrade \
#  && apt-get install -y --no-install-recommends ffmpeg

# Set the entrypoint to the Apache Beam SDK launcher.
ENTRYPOINT ["/opt/apache/beam/boot"]
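If the custom-container route is retried, the image has to be pushed to a registry the job can reach and referenced from the pipeline options. A sketch, assuming Artifact Registry and Dataflow's sdk_container_image option; the registry path is hypothetical:

```shell
# Unverified sketch: build and push the worker image, then point the
# Dataflow job at it. The registry path is illustrative only.
IMAGE="us-central1-docker.pkg.dev/${GCP_PROJECT}/beam/librilight-worker:latest"

docker build -t "${IMAGE}" .
docker push "${IMAGE}"

# Then add to --beam_pipeline_options:
#   "sdk_container_image=${IMAGE},"
#   "experiments=use_runner_v2,"
```

Without sdk_container_image in the pipeline options, Dataflow starts workers from the stock Beam SDK image and the custom image is never used, which would explain the identical module-not-found error.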

It would be nice if...
the TFDS documentation covered how to install TFDS with a custom dataset on Google Cloud Dataflow workers.

Environment information
(if applicable)

  • Operating System:
  • Python version: python 3.8
  • tensorflow-datasets/tfds-nightly version: I have tried both tensorflow-datasets 4.8.1 and tfds-nightly; both fail with the same error.
  • tensorflow/tensorflow-gpu/tf-nightly/tf-nightly-gpu version:
