Description
For TFDS 4.9.7 on Dataflow 2.60.0, I have a company-internal Dataflow job that fails. The input collection to train_write/GroupShards reports:
Elements added: 332,090
Estimated size: 1.74 TB
while the output collection reports:
Elements added: 2
Estimated size: 1.8 GB
and it then fails on the next element with
"E0123 207 recordwriter.cc:401] Record exceeds maximum record size (1096571470 > 1073741823)."
Workaround
By installing the TFDS prerelease after 3700745 and setting --num_shards=4096
(auto-detection chose 2048), the DatasetBuilder runs to completion on Dataflow. I'm still curious why the auto-detection didn't choose more file shards, though, since all training examples should be roughly the same size in this DatasetBuilder; a sketch of how I understand the estimation is below.
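My mental model of the auto-detection (an assumption on my part, a simplified sketch rather than the actual TFDS writer code) is roughly: divide the estimated total size by a maximum shard size and round up to a "nice" shard count, which for 1.74 TB lands on 2048:

```python
# Hypothetical sketch of shard-count auto-detection, not the actual TFDS code.
import math

MAX_SHARD_SIZE = 1024**3  # assumed ~1 GiB target, matching the record limit

def estimate_num_shards(total_size: float, max_shard_size: int = MAX_SHARD_SIZE) -> int:
    """Smallest power of two such that the average shard fits under max_shard_size."""
    min_shards = max(1, math.ceil(total_size / max_shard_size))
    return 2 ** math.ceil(math.log2(min_shards))

print(estimate_num_shards(1.74e12))  # -> 2048, i.e. ~0.85 GB per shard on average
```

Under that kind of scheme the estimate only looks at the average, so a shard count that leaves the average just under the limit has no margin for the per-shard variation.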
Suggested fix
Maybe something like this, applying a safety margin to the shard size and accounting for per-example overhead:
max_shard_size = 0.9 * cls.max_shard_size
overhead: int = 16
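To make that concrete, here is a hypothetical sketch (names, rounding and defaults are mine; only the 0.9 factor and the 16-byte per-example overhead come from the suggestion above) of how the margin changes the minimum shard count for this job:

```python
# Hypothetical sketch only; the 0.9 safety factor and 16-byte per-example
# overhead mirror the suggested fix, everything else is assumed.
import math

RECORD_LIMIT = 1_073_741_823  # hard limit from the recordwriter.cc error

def min_shards(total_size: float, num_examples: int,
               max_shard_size: int = RECORD_LIMIT,
               safety_factor: float = 1.0,
               overhead: int = 0) -> int:
    """Minimum shard count keeping the *estimated* average shard under the budget."""
    padded = total_size + num_examples * overhead   # add per-example serialization overhead
    budget = safety_factor * max_shard_size         # leave headroom below the hard limit
    return math.ceil(padded / budget)

total, n = 1.74e12, 332_090
print(min_shards(total, n))                                  # 1621 without any margin
print(min_shards(total, n, safety_factor=0.9, overhead=16))  # 1801 with the margin
```

Whether 1801 then gets rounded up to 2048 or 4096 depends on how the writer picks the final shard count from that minimum, which is where the headroom for per-shard skew would actually have to come from.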
Side remark
Surprisingly, the Dataflow limits documentation mentions
"Maximum size for a single element (except where stricter conditions apply, for example Streaming Engine): 2 GB"
which doesn't seem to hold in practice, since the GroupBy fails at ~1 GB as per the logged error.