This repository was archived by the owner on Nov 11, 2022. It is now read-only.
Dataflow uses incorrect full file size with GS file using Content-Encoding: gzip #517
Open
Description
To reproduce:
- Upload a simple file (10000 sequential numbers, one per line) to Google Cloud Storage, specifying gzip compression:
gsutil cp -Z numbers.txt gs://<bucket>/numbers.txt
- Execute a simple Dataflow pipeline that just reads, then writes, these numbers:
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt"))
.apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
Expected: Either all 10000 numbers are written, or alternatively gibberish is written (the raw compressed bytes).
Actual: Only a subset of the numbers is written (1-4664). It looks like Dataflow reads the decompressed stream but treats its length as the size of the file before decompression (i.e. the compressed size), truncating the output.
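The observed cutoff is consistent with the reader stopping after compressed-size bytes of the decompressed stream. A rough Python sketch of that arithmetic (the exact line count where truncation occurs depends on the gzip implementation and compression level, so it will only approximate the 4664 seen above):

```python
import gzip

# Recreate the uploaded file: 10000 sequential numbers, one per line.
data = "".join(f"{i}\n" for i in range(1, 10001)).encode()

# Size of the object as stored in GCS when uploaded with `gsutil cp -Z`.
compressed_size = len(gzip.compress(data))

# If the reader treats the decompressed stream as only `compressed_size`
# bytes long, it stops mid-file; count how many complete lines survive.
lines_kept = data[:compressed_size].count(b"\n")

print(f"uncompressed={len(data)} compressed={compressed_size} lines kept={lines_kept}")
```

This shows a strict prefix of the 10000 lines surviving, matching the truncated output described above.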
Specifying GZIP decompression mode explicitly works as expected (all 10000 numbers are written):
p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt")
.withCompressionType(CompressionType.GZIP))
.apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));