`Field` is a schema field that defines this column. You can obtain this field from a schema you define.
`DefinedData` is raw data as defined by `Field`'s type. If your field is nullable, `DefinedData` contains only the non-null values. `Data`, on the other hand, represents the data as-is, including nulls. If you are reading a `DataColumn` and need to access its data, `Data` is the field you want; if you need the data exactly as it is stored in the parquet file, use `DefinedData`. The names were chosen mostly for backward-compatibility reasons.
Going further, if you need to access the *repetition and definition levels* as they are stored in the parquet file, you can use the corresponding `DefinitionLevels` and `RepetitionLevels` fields.
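
To make these fields concrete, here is a minimal sketch for a flat, nullable column (namespaces and property shapes follow recent Parquet.Net versions; treat them as assumptions if your version differs):

```csharp
using Parquet.Data;
using Parquet.Schema;

// a flat, nullable int field
var field = new DataField<int?>("age");

// the convenience constructor decomposes the data into levels and defined data
var column = new DataColumn(field, new int?[] { 30, null, 25 });

// column.Data             -> { 30, null, 25 } : data as-is, including nulls
// column.DefinedData      -> { 30, 25 }       : non-null values, as stored in the file
// column.DefinitionLevels -> { 1, 0, 1 }      : 1 = value present, 0 = null
```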
## Creating DataColumn
There are two public constructors available (see the diagram above). For convenience and backward compatibility, the second constructor accepts a `DataField` and two parameters:
1. `data` is the data to write, including nulls if the field is nullable. `DataColumn` decomposes the data array into `DefinitionLevels` and `DefinedData` on construction.
2. `repetitionLevels` is only required if the field is part of a nested type.
The first constructor is more granular and allows you to specify all three parts explicitly when constructing a column.
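
As a sketch of the two options (the argument order for the granular constructor is an assumption based on the description above and may differ between versions):

```csharp
var field = new DataField<int?>("age");

// 1. convenience constructor: pass the data as-is, including nulls;
//    definition levels and defined data are derived on construction
var fromData = new DataColumn(field, new int?[] { 30, null, 25 });

// 2. granular constructor: supply the already-decomposed parts yourself
var fromParts = new DataColumn(
    field,
    new int[] { 30, 25 },   // defined (non-null) data
    new int[] { 1, 0, 1 },  // definition levels
    null);                  // repetition levels (not needed for a flat field)
```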
## Writing Files

Writing files is a multi-stage process, giving you full flexibility on what goes into the file:
4. When required, repeat from step (2) to create more row groups. A row group is like a physical data partition that should fit in memory for processing. How much data should go into a single row group is a guessing game, but at least 5,000 rows per column is a good start. Remember that the parquet format works best on large chunks of data.
```csharp
// create file schema
var schema = new ParquetSchema(
    new DataField<int>("id"),
    new DataField<string>("city"));

// create data columns with schema metadata and the data you need
var idColumn = new DataColumn(schema.DataFields[0], new int[] { 1, 2 });
var cityColumn = new DataColumn(schema.DataFields[1], new string[] { "London", "Derby" });
```
To read more about DataColumn, see [this page](column.md).
### Specifying Compression Method and Level
After constructing `ParquetWriter` you can optionally set the compression method ([`CompressionMethod`](../src/Parquet/CompressionMethod.cs)), which defaults to `Snappy`, and/or the compression level ([`CompressionLevel`](https://learn.microsoft.com/en-us/dotnet/api/system.io.compression.compressionlevel?view=net-7.0)). Unless you have specific needs to override compression, the defaults are very reasonable.
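
For example, a minimal sketch (property names per recent Parquet.Net versions; `schema` and `fileStream` are assumed to come from the earlier example):

```csharp
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fileStream)) {
    // override the default Snappy compression
    writer.CompressionMethod = CompressionMethod.Gzip;
    writer.CompressionLevel = System.IO.Compression.CompressionLevel.Optimal;
    // ... write row groups as usual
}
```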
### Appending to Files

This library supports pseudo-appending to files; however, it's worth keeping in mind that *row groups are immutable* by design, so the only way to append is to create a new row group at the end of the file. Note that small row groups make data compression and reading extremely inefficient, so the larger your row group, the better.
The following code snippet illustrates this (a minimal sketch; newer versions create the writer via `ParquetWriter.CreateAsync`, older ones via the constructor shown below):
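
```csharp
var id = new DataField<int>("id");
var schema = new ParquetSchema(id);

using var ms = new MemoryStream();

// write the first row group to a new file
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms)) {
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    await rg.WriteColumnAsync(new DataColumn(id, new[] { 1, 2 }));
}

// append: row groups are immutable, so this creates a new one at the end
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms, append: true)) {
    using ParquetRowGroupWriter rg = writer.CreateRowGroup();
    await rg.WriteColumnAsync(new DataColumn(id, new[] { 3, 4 }));
}
```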
Note that you have to explicitly specify that you are opening `ParquetWriter` in **append** mode in its constructor - `new ParquetWriter(new Schema(id), ms, append: true)`. Doing so makes parquet.net open the file, find the file footer and delete it, rewinding the current stream position to the end of the actual data. Creating more row groups then simply writes data to the file as usual, and `.Dispose()` on `ParquetWriter` generates a new file footer, writes it to the file and closes down the stream.
Please keep in mind that row groups are designed to hold a large amount of data (around 5,000 rows on average), so try to accumulate a large enough batch before appending to the file. Do not treat a parquet file as a row stream by creating a row group and placing 1-2 rows in it; this will both increase the file size massively and cause a huge performance degradation for any client reading the file.
### Custom Metadata
To read and write custom file metadata, you can use the `CustomMetadata` property on `ParquetFileReader` and `ParquetFileWriter`.
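
A minimal sketch, assuming the `ParquetWriter`/`ParquetReader` types and the in-memory stream `ms` from the earlier examples:

```csharp
// writing custom key-value metadata
using (ParquetWriter writer = await ParquetWriter.CreateAsync(schema, ms)) {
    writer.CustomMetadata = new Dictionary<string, string> {
        ["created-by"] = "example",
        ["environment"] = "test"
    };
    // ... write row groups as usual
}

// reading it back
ms.Position = 0;
using (ParquetReader reader = await ParquetReader.CreateAsync(ms)) {
    Console.WriteLine(reader.CustomMetadata["created-by"]);
}
```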