Description
What I need help with / What I was wondering
I need help downloading the LVIS dataset to my EC2 instance.
What I've tried so far
First, I copied the changes from #5094.
Then I tried using the SDK to call download_and_prepare on the dataset, as follows:
import tensorflow_datasets as tfds
builder = tfds.builder("lvis")
builder.download_and_prepare()
I also tried passing more options to the DirectRunner:
import apache_beam as beam
import tensorflow_datasets as tfds
builder = tfds.builder("lvis")
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)
After about 10 minutes I can see 4 CPUs at nearly 100% utilization, so I think the builder is working. It runs for a while, from 30 minutes to a couple of hours depending on how many workers I specify, and then either hits an error or runs out of memory and gets killed. If I remember correctly, this dataset is about 25 GB in size. My machine has 64 GB of RAM.
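To tell those two failure modes apart, one option is to run the build in a child process and look at its exit status. The following is a minimal sketch, not part of TFDS: `describe_exit`, `run_build`, and `BUILD_SNIPPET` are hypothetical names, and it assumes a POSIX system, where `subprocess` reports a negative return code when the child was terminated by a signal.

```python
import signal
import subprocess
import sys

# Hypothetical build script run in the child process (same calls as above).
BUILD_SNIPPET = (
    "import tensorflow_datasets as tfds\n"
    "tfds.builder('lvis').download_and_prepare()\n"
)


def describe_exit(returncode: int) -> str:
    """Turn a subprocess return code into a short description."""
    if returncode == 0:
        return "finished successfully"
    if returncode < 0:
        # On POSIX, a return code of -N means the child was killed by signal N.
        try:
            name = signal.Signals(-returncode).name
        except ValueError:
            name = f"signal {-returncode}"
        return f"killed by {name}"
    return f"exited with error code {returncode}"


def run_build() -> str:
    """Run the dataset build in a child process and describe how it ended."""
    result = subprocess.run([sys.executable, "-c", BUILD_SNIPPET])
    return describe_exit(result.returncode)
```

A result of "killed by SIGKILL" would be consistent with the kernel's OOM killer stopping the process, although the kernel log (for example via `dmesg`) would be needed to confirm that. A positive error code points to a Python exception or another ordinary failure instead.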
It would be nice if...
It would be most convenient for me if I could just download an already built version of the dataset so I could avoid needing to build it myself. I don't really understand what goes on during the build. I just need this dataset locally in TFDS format so I can train a model that's been written to consume this dataset in this format. I'd rather not have to learn about Apache Beam and set up Google Cloud infrastructure just to get a 25 GB dataset.
If that's not possible, then it would be nice if I could build the LVIS dataset locally more easily.
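For reference, a lower-memory variant of the DirectRunner attempt earlier in this post might look like the sketch below. It assumes that fewer workers reduce peak memory, which is untested for LVIS. `direct_runner_flags` and `build_lvis` are hypothetical helper names; the `--direct_num_workers` and `--direct_running_mode` flags and the `DownloadConfig` arguments are the same ones used earlier.

```python
# Running modes accepted by Beam's DirectRunner.
VALID_MODES = ("in_memory", "multi_threading", "multi_processing")


def direct_runner_flags(num_workers: int = 1,
                        running_mode: str = "multi_threading") -> list[str]:
    """Build the DirectRunner flag list for a given worker count and mode."""
    if num_workers < 1:
        raise ValueError("num_workers must be at least 1")
    if running_mode not in VALID_MODES:
        raise ValueError(f"running_mode must be one of {VALID_MODES}")
    return [
        f"--direct_num_workers={num_workers}",
        f"--direct_running_mode={running_mode}",
    ]


def build_lvis(num_workers: int = 1,
               running_mode: str = "multi_threading") -> None:
    """Build LVIS locally with the given DirectRunner settings (untested sketch)."""
    # Imported here so the flag helper can be used without these packages.
    import apache_beam as beam
    import tensorflow_datasets as tfds

    flags = direct_runner_flags(num_workers, running_mode)
    builder = tfds.builder("lvis")
    builder.download_and_prepare(
        download_config=tfds.download.DownloadConfig(
            beam_runner="DirectRunner",
            beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
        )
    )
```

Calling `build_lvis(num_workers=1)` would trade build time for what is assumed to be a smaller memory footprint; whether that is enough to fit in 64 GB of RAM is an open question.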
Environment information
- Operating System: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1103-aws aarch64)
- Python version: 3.10.13
- tensorflow version: 2.14.0
- tensorflow-cpu-aws version: 2.14.0
- tensorflow-datasets version: 4.9.3
- tensorflow-io-gcs-filesystem version: 0.34.0
- apache-beam version: 2.51.0
- EC2 instance type: r6g.2xlarge