This repository was archived by the owner on Nov 11, 2022. It is now read-only.

Dataflow uses incorrect full file size with GS file using Content-Encoding: gzip #517

Open
@rfevang

Description


To reproduce:

  • Upload a simple file (10000 sequential numbers, one per line) to Google Cloud Storage with GZIP compression: gsutil cp -Z numbers.txt gs://<bucket>/numbers.txt.

  • Run a simple Dataflow pipeline that just reads and then rewrites these numbers:

p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt"))
 .apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));

Expected: Either all 10000 numbers are written, or alternatively gibberish (the raw compressed bytes).
Actual: Only a subset of the numbers is written (1-4664). It looks like the source reads the decompressed stream but treats its size as that of the compressed file.
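The suspected mechanism can be reproduced outside Dataflow with plain java.util.zip. The sketch below (hypothetical class and method names, not Dataflow code) gzips 10000 sequential numbers, then reads the decompressed stream but stops once it has consumed as many decompressed bytes as the compressed object size, which is what GCS metadata would report for the object. The result is a truncated subset of lines, matching the observed behaviour.

```java
import java.io.*;
import java.util.zip.*;

// Hypothetical standalone sketch of the suspected bug: the reader
// decompresses the stream but bounds its read range by the object's
// *compressed* size, as reported by storage metadata.
public class GzipTruncation {

    static int linesReadWithBuggySize() throws IOException {
        // Build a gzip "object" of 10000 sequential numbers, one per line.
        ByteArrayOutputStream compressed = new ByteArrayOutputStream();
        try (Writer w = new OutputStreamWriter(new GZIPOutputStream(compressed))) {
            for (int i = 1; i <= 10000; i++) {
                w.write(i + "\n");
            }
        }
        byte[] gz = compressed.toByteArray();
        long reportedSize = gz.length; // the size metadata reports: compressed bytes

        // Read the decompressed stream, but stop once the count of
        // decompressed bytes consumed reaches the compressed size.
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(new ByteArrayInputStream(gz))));
        long consumed = 0;
        int lines = 0;
        String line;
        while (consumed < reportedSize && (line = r.readLine()) != null) {
            consumed += line.length() + 1; // +1 for the newline
            lines++;
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Reports fewer than 10000 lines, mirroring the truncated output.
        System.out.println("lines read: " + linesReadWithBuggySize());
    }
}
```

Because the decompressed data (roughly 49 KB here) is larger than the compressed object, the loop exits early and only a prefix of the numbers survives, just as in the pipeline.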

Specifying the GZIP compression type explicitly works as expected (all 10000 numbers are written):

p.apply(TextIO.Read.from("gs://<bucket>/numbers.txt")
       .withCompressionType(CompressionType.GZIP))
 .apply(TextIO.Write.to("gs://<bucket>/out").withSuffix(".txt"));
