Commit e3473f6

support custom row group sizes in ParquetSerializer (tensorflow#291)
1 parent: e7e907d

10 files changed: +2095 −1968 lines

docs/serialisation.md (+10)

@@ -318,6 +318,16 @@ await ParquetSerializer.SerializeAsync(dataBatch3, ms, new ParquetSerializerOpti

By following this pattern, you can easily append data to a Parquet file using `ParquetSerializer`.
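That pattern (its full listing sits just above this hunk) boils down to passing `Append = true` in `ParquetSerializerOptions` on every call after the first; a condensed sketch, with `dataBatch1`/`dataBatch2` standing in for batches of the serialized class:

```csharp
using var ms = new MemoryStream();

// the first call creates the file
await ParquetSerializer.SerializeAsync(dataBatch1, ms);

// every subsequent call must set Append = true,
// which writes a new row group at the end
await ParquetSerializer.SerializeAsync(dataBatch2, ms,
    new ParquetSerializerOptions { Append = true });
```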

## Specifying Row Group Size
Row groups are a logical division of data in a Parquet file. They allow efficient filtering and scanning of data based on predicates. By default, all class instances are serialized into a single row group, which is fine for most cases. If you need a custom row group size, you can specify it in `ParquetSerializerOptions` like so:
```csharp
await ParquetSerializer.SerializeAsync(data, stream, new ParquetSerializerOptions { RowGroupSize = 10_000_000 });
```
Note that small row groups make Parquet files very inefficient in general, so you should use this parameter only when you are sure you need it. For example, if you have a very large dataset that needs to be processed in chunks by a distributed engine, you might want a smaller row group size so that no single row group holds too many rows. However, this also increases file size and metadata overhead, so balance the trade-offs carefully.
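To make the effect concrete, here is a small sketch (the `Record` class is hypothetical and the reader-side check is not part of this commit): 25,000 instances serialized with `RowGroupSize = 10_000` should produce three row groups, which `ParquetReader` can confirm:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using Parquet;
using Parquet.Serialization;

// hypothetical data: 25,000 instances of a simple record class
List<Record> data = Enumerable.Range(0, 25_000)
    .Select(i => new Record { Id = i })
    .ToList();

using var ms = new MemoryStream();

// RowGroupSize caps each row group at 10,000 rows
await ParquetSerializer.SerializeAsync(data, ms,
    new ParquetSerializerOptions { RowGroupSize = 10_000 });

// 10,000 + 10,000 + 5,000 rows => 3 row groups
ms.Position = 0;
using ParquetReader reader = await ParquetReader.CreateAsync(ms);
Console.WriteLine(reader.RowGroupCount); // 3

class Record {
    public int Id { get; set; }
}
```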
## FAQ

**Q.** Can I specify a schema for serialisation/deserialisation?

docs/writing.md (-1)

@@ -53,7 +53,6 @@ using(ParquetWriter parquetWriter = await ParquetWriter.CreateAsync(schema, file
}
```

### Appending to Files

This library supports pseudo-appending to files; however, it's worth keeping in mind that *row groups are immutable* by design, so the only way to append is to create a new row group at the end of the file. It's also worth mentioning that small row groups make data compression and reading extremely ineffective, so the larger your row group, the better.
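In practice (a minimal sketch assuming the current `ParquetWriter` API; the `id` field and file name are made up for illustration), appending means reopening the file with `append: true` and writing a new row group:

```csharp
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// the schema must match the one already stored in the file
var schema = new ParquetSchema(new DataField<int>("id"));

using Stream fs = File.Open("data.parquet", FileMode.Open, FileAccess.ReadWrite);

// append: true adds a new row group instead of overwriting the file
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs, append: true)) {
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    await rg.WriteColumnAsync(new DataColumn(schema.DataFields[0], new[] { 4, 5, 6 }));
}
```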
