
Need help building LVIS locally #5113

Open
@JKelle

What I need help with / What I was wondering
I need help downloading the LVIS dataset to my EC2 instance.

What I've tried so far
First, I copied the changes from #5094.
Then I tried using the SDK to call download_and_prepare on the dataset, as follows:

import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
builder.download_and_prepare()

I also tried passing extra options to the DirectRunner:

import apache_beam as beam
import tensorflow_datasets as tfds

builder = tfds.builder("lvis")
flags = ["--direct_num_workers=4", "--direct_running_mode=multi_processing"]
builder.download_and_prepare(
    download_config=tfds.download.DownloadConfig(
        beam_runner="DirectRunner",
        beam_options=beam.options.pipeline_options.PipelineOptions(flags=flags),
    )
)
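
Since the run time and the memory use both depend on the worker count, it may help to derive the flags from a memory budget instead of hard-coding them. This is only a minimal sketch: the helper name `direct_runner_flags` and the per-worker RAM figure are hypothetical and not part of TFDS or Beam; only the two flag names come from the snippet above.

```python
import os


def direct_runner_flags(total_ram_gb, ram_per_worker_gb, max_workers=None):
    """Build DirectRunner flags with a worker count capped by a RAM budget.

    ram_per_worker_gb is a guess to be tuned per machine, not a measured value.
    """
    if total_ram_gb <= 0 or ram_per_worker_gb <= 0:
        raise ValueError("RAM figures must be positive")
    # Cap by the CPU count (or an explicit limit) and by the memory budget.
    cpu_cap = max_workers if max_workers is not None else (os.cpu_count() or 1)
    by_memory = int(total_ram_gb // ram_per_worker_gb)
    # Always keep at least one worker.
    workers = max(1, min(cpu_cap, by_memory))
    return [
        f"--direct_num_workers={workers}",
        "--direct_running_mode=multi_processing",
    ]


# Assumed budget for illustration: 64 GB total, 16 GB per worker.
flags = direct_runner_flags(total_ram_gb=64, ram_per_worker_gb=16, max_workers=8)
print(flags)
```

The resulting list can be passed as `flags` to `PipelineOptions` exactly as in the snippet above.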

After about 10 minutes I can see 4 CPUs at nearly 100% utilization, so I think the builder is working. It runs for a while, anywhere from 30 minutes to a couple of hours depending on how many workers I specify, then either hits an error or runs out of memory and gets killed. If I remember correctly, this dataset is about 25 GB in size. My machine has 64 GB of RAM.
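
To tell the two failure modes apart, the build could be launched as a child process and its exit status inspected. The following is a rough sketch under assumptions: `build_lvis.py` is a hypothetical script containing the download_and_prepare call, and a SIGKILL exit is only a hint of the kernel OOM killer, not proof of it.

```python
import signal
import subprocess
import sys


def describe_exit(returncode):
    """Turn a subprocess return code into a short human-readable summary."""
    if returncode == 0:
        return "finished successfully"
    # On POSIX, a negative return code means the child was killed by that signal.
    if returncode == -signal.SIGKILL:
        return "killed by SIGKILL (often the kernel OOM killer)"
    if returncode < 0:
        return f"terminated by signal {-returncode}"
    return f"exited with error code {returncode}"


def run_and_describe(cmd):
    """Run cmd to completion and summarize how it ended."""
    proc = subprocess.run(cmd)
    return describe_exit(proc.returncode)


# Hypothetical usage; build_lvis.py is an assumed file name:
# print(run_and_describe([sys.executable, "build_lvis.py"]))
```

An ordinary Python exception in the build would show up as a positive exit code, while a process killed for memory would typically show up as a signal.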

It would be nice if...
It would be most convenient for me if I could just download an already built version of the dataset so I could avoid needing to build it myself. I don't really understand what goes on during the build. I just need this dataset locally in TFDS format so I can train a model that's been written to consume this dataset in this format. I'd rather not have to learn about Apache Beam and set up Google Cloud infrastructure just to get a 25 GB dataset.

If that's not possible, then it would be nice if I could build the LVIS dataset locally more easily.

Environment information

  • Operating System: Ubuntu 18.04.5 LTS (GNU/Linux 5.4.0-1103-aws aarch64)
  • Python version: 3.10.13
  • tensorflow version: 2.14.0
  • tensorflow-cpu-aws version: 2.14.0
  • tensorflow-datasets version: 4.9.3
  • tensorflow-io-gcs-filesystem version: 0.34.0
  • apache-beam version: 2.51.0
  • EC2 instance type: r6g.2xlarge
