Commit be8a953

twitu and alamb authored
Add note on using larger row group size (#8745)
* Add note on using larger row group size
* Nit
* prettier
* prettier
* update test

---------

Co-authored-by: Andrew Lamb <[email protected]>
1 parent b3e17e7 commit be8a953

4 files changed: 23 additions, 21 deletions

datafusion/common/src/config.rs

Lines changed: 3 additions & 1 deletion
@@ -350,7 +350,9 @@ config_namespace! {
 /// default parquet writer setting
 pub max_statistics_size: Option<usize>, default = None
 
-/// Sets maximum number of rows in a row group
+/// Target maximum number of rows in each row group (defaults to 1M
+/// rows). Writing larger row groups requires more memory to write, but
+/// can get better compression and be faster to read.
 pub max_row_group_size: usize, default = 1024 * 1024
 
 /// Sets "created by" property

datafusion/sqllogictest/test_files/information_schema.slt

Lines changed: 1 addition & 1 deletion
@@ -242,7 +242,7 @@ datafusion.execution.parquet.dictionary_enabled NULL Sets if dictionary encoding
 datafusion.execution.parquet.dictionary_page_size_limit 1048576 Sets best effort maximum dictionary page size, in bytes
 datafusion.execution.parquet.enable_page_index true If true, reads the Parquet data page level metadata (the Page Index), if present, to reduce the I/O and number of rows decoded.
 datafusion.execution.parquet.encoding NULL Sets default encoding for any column Valid values are: plain, plain_dictionary, rle, bit_packed, delta_binary_packed, delta_length_byte_array, delta_byte_array, rle_dictionary, and byte_stream_split. These values are not case sensitive. If NULL, uses default parquet writer setting
-datafusion.execution.parquet.max_row_group_size 1048576 Sets maximum number of rows in a row group
+datafusion.execution.parquet.max_row_group_size 1048576 Target maximum number of rows in each row group (defaults to 1M rows). Writing larger row groups requires more memory to write, but can get better compression and be faster to read.
 datafusion.execution.parquet.max_statistics_size NULL Sets max statistics size for any column. If NULL, uses default parquet writer setting
 datafusion.execution.parquet.maximum_buffered_record_batches_per_stream 2 By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame.
 datafusion.execution.parquet.maximum_parallel_row_group_writers 1 By default parallel parquet writer is tuned for minimum memory usage in a streaming execution plan. You may see a performance benefit when writing large parquet files by increasing maximum_parallel_row_group_writers and maximum_buffered_record_batches_per_stream if your system has idle cores and can tolerate additional memory usage. Boosting these values is likely worthwhile when writing out already in-memory data, such as from a cached data frame.

docs/source/user-guide/configs.md

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ Environment variables are read during `SessionConfig` initialisation so they mus
 | datafusion.execution.parquet.dictionary_page_size_limit | 1048576 | Sets best effort maximum dictionary page size, in bytes |
 | datafusion.execution.parquet.statistics_enabled | NULL | Sets if statistics are enabled for any column Valid values are: "none", "chunk", and "page" These values are not case sensitive. If NULL, uses default parquet writer setting |
 | datafusion.execution.parquet.max_statistics_size | NULL | Sets max statistics size for any column. If NULL, uses default parquet writer setting |
-| datafusion.execution.parquet.max_row_group_size | 1048576 | Sets maximum number of rows in a row group |
+| datafusion.execution.parquet.max_row_group_size | 1048576 | Target maximum number of rows in each row group (defaults to 1M rows). Writing larger row groups requires more memory to write, but can get better compression and be faster to read. |
 | datafusion.execution.parquet.created_by | datafusion version 34.0.0 | Sets "created by" property |
 | datafusion.execution.parquet.column_index_truncate_length | NULL | Sets column index truncate length |
 | datafusion.execution.parquet.data_page_row_count_limit | 18446744073709551615 | Sets best effort maximum number of rows in data page |

docs/source/user-guide/sql/write_options.md

Lines changed: 18 additions & 18 deletions
@@ -100,21 +100,21 @@ The following options are available when writing CSV files. Note: if any unsuppo
 
 The following options are available when writing parquet files. If any unsupported option is specified an error will be raised and the query will fail. If a column specific option is specified for a column which does not exist, the option will be ignored without error. For default values, see: [Configuration Settings](https://arrow.apache.org/datafusion/user-guide/configs.html).
 
-| Option | Can be Column Specific? | Description |
-| ---------------------------- | ----------------------- | ------------------------------------------------------------------------------------------------------------- |
-| COMPRESSION | Yes | Sets the compression codec and if applicable compression level to use |
-| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows that can be encoded in a single row group |
-| DATA_PAGESIZE_LIMIT | No | Sets the best effort maximum page size in bytes |
-| WRITE_BATCH_SIZE | No | Maximum number of rows written for each column in a single batch |
-| WRITER_VERSION | No | Parquet writer version (1.0 or 2.0) |
-| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
-| CREATED_BY | No | Sets the "created by" property in the parquet file |
-| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
-| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |
-| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written into the file. |
-| ENCODING | Yes | Sets the encoding that should be used (e.g. PLAIN or RLE) |
-| DICTIONARY_ENABLED | Yes | Sets if dictionary encoding is enabled. Use this instead of ENCODING to set dictionary encoding. |
-| STATISTICS_ENABLED | Yes | Sets if statistics are enabled at PAGE or ROW_GROUP level. |
-| MAX_STATISTICS_SIZE | Yes | Sets the maximum size in bytes that statistics can take up. |
-| BLOOM_FILTER_FPP | Yes | Sets the false positive probability (fpp) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true. |
-| BLOOM_FILTER_NDV | Yes | Sets the number of distinct values (ndv) for the bloom filter. Implicitly sets bloom_filter_enabled to true. |
+| Option | Can be Column Specific? | Description |
+| ---------------------------- | ----------------------- | ----------------------------------------------------------------------------------------------------------------------------------- |
+| COMPRESSION | Yes | Sets the compression codec and if applicable compression level to use |
+| MAX_ROW_GROUP_SIZE | No | Sets the maximum number of rows that can be encoded in a single row group. Larger row groups require more memory to write and read. |
+| DATA_PAGESIZE_LIMIT | No | Sets the best effort maximum page size in bytes |
+| WRITE_BATCH_SIZE | No | Maximum number of rows written for each column in a single batch |
+| WRITER_VERSION | No | Parquet writer version (1.0 or 2.0) |
+| DICTIONARY_PAGE_SIZE_LIMIT | No | Sets best effort maximum dictionary page size in bytes |
+| CREATED_BY | No | Sets the "created by" property in the parquet file |
+| COLUMN_INDEX_TRUNCATE_LENGTH | No | Sets the max length of min/max value fields in the column index. |
+| DATA_PAGE_ROW_COUNT_LIMIT | No | Sets best effort maximum number of rows in a data page. |
+| BLOOM_FILTER_ENABLED | Yes | Sets whether a bloom filter should be written into the file. |
+| ENCODING | Yes | Sets the encoding that should be used (e.g. PLAIN or RLE) |
+| DICTIONARY_ENABLED | Yes | Sets if dictionary encoding is enabled. Use this instead of ENCODING to set dictionary encoding. |
+| STATISTICS_ENABLED | Yes | Sets if statistics are enabled at PAGE or ROW_GROUP level. |
+| MAX_STATISTICS_SIZE | Yes | Sets the maximum size in bytes that statistics can take up. |
+| BLOOM_FILTER_FPP | Yes | Sets the false positive probability (fpp) for the bloom filter. Implicitly sets BLOOM_FILTER_ENABLED to true. |
+| BLOOM_FILTER_NDV | Yes | Sets the number of distinct values (ndv) for the bloom filter. Implicitly sets bloom_filter_enabled to true. |
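
Reviewer note: the new wording describes `max_row_group_size` as a target row count per row group, with a default of `1024 * 1024`. As a rough illustration of what that means for file layout, the sketch below estimates how many row groups a file would contain for a given row count. It assumes the writer fills each row group up to the target before starting the next, which is a simplification. `estimated_row_groups` is a hypothetical helper for illustration only, not a DataFusion or parquet API.

```rust
/// Default taken from the diff above: `max_row_group_size` is 1024 * 1024 rows.
const DEFAULT_MAX_ROW_GROUP_SIZE: usize = 1024 * 1024;

/// Hypothetical helper (not part of DataFusion): estimates the number of row
/// groups in a file with `total_rows` rows, assuming every row group except
/// possibly the last is filled exactly to `max_row_group_size`.
fn estimated_row_groups(total_rows: usize, max_row_group_size: usize) -> usize {
    assert!(max_row_group_size > 0, "row group size must be positive");
    // Ceiling division: a partially filled final group still counts as one.
    (total_rows + max_row_group_size - 1) / max_row_group_size
}

fn main() {
    let total_rows = 10_000_000;
    // Compare a smaller target, the default, and a larger target.
    for size in [
        DEFAULT_MAX_ROW_GROUP_SIZE / 4,
        DEFAULT_MAX_ROW_GROUP_SIZE,
        DEFAULT_MAX_ROW_GROUP_SIZE * 4,
    ] {
        println!(
            "{} rows per group -> about {} row groups",
            size,
            estimated_row_groups(total_rows, size)
        );
    }
}
```

Fewer, larger row groups mean each group buffers more rows while being written, which is the memory trade-off the updated description calls out.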
