
Commit 11f76bb

v4.6.0: Class Serializer
This release contains a completely rewritten class serializer built on .NET expression trees, and all of the class serialisation documentation has been updated to reflect it. The existing `ParquetConvert` serializer is left untouched and marked obsolete, hence there is no increase in the major version number. If you are using `ParquetConvert`, consider switching to `ParquetSerializer` soon, because no new features or fixes will be added to `ParquetConvert`. The new class serializer supports all primitive types and nested types (structs, lists, maps and their combinations). It also fully conforms to the Dremel specification, which was a massive headache to implement properly.

Other improvements:

- Documentation updated for the low-level and serializer APIs on how to use nested data types.
- Corrected how definition and repetition levels are calculated, making them more conformant to the Parquet specification.
- Schema path calculation logic improved so it no longer relies on string splitting/joining, which allows you to use any characters anywhere in column names (#278).
1 parent 1a9ec40 commit 11f76bb
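To give a flavour of the new API, here is a minimal round trip with `ParquetSerializer` that includes a nested list of structs. The `Order`/`OrderLine` classes and the file path are hypothetical, and the `Parquet.Serialization` namespace is assumed; only the `SerializeAsync`/`DeserializeAsync` calls themselves appear in the docs updated by this commit.

```csharp
using System;
using System.Collections.Generic;
using Parquet.Serialization;

// sample data containing a nested list of structs
var orders = new List<Order> {
    new() { Id = 1, Lines = new List<OrderLine> { new() { Sku = "A1", Quantity = 2 } } },
    new() { Id = 2, Lines = new List<OrderLine>() }
};

// write with the new serializer, then read it back
await ParquetSerializer.SerializeAsync(orders, "/tmp/orders.parquet");
IList<Order> roundTripped = await ParquetSerializer.DeserializeAsync<Order>("/tmp/orders.parquet");
Console.WriteLine(roundTripped.Count); // 2

class Order {
    public int Id { get; set; }
    public List<OrderLine>? Lines { get; set; }  // nested list (repeated group)
}

class OrderLine {
    public string? Sku { get; set; }
    public int Quantity { get; set; }
}
```

Per the commit message, these property mappings are compiled with .NET expression trees, so the first call per type pays a small one-off cost and subsequent calls are fast.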

84 files changed: +2937 −891 lines


.github/workflows/full.yml

Lines changed: 2 additions & 35 deletions
@@ -1,8 +1,8 @@
 name: 'Full Workflow'

 env:
-  VERSION: 4.5.4
-  ASM_VERSION: 4.0.0
+  VERSION: 4.6.0
+  ASM_VERSION: 4.6.0

 on:
   push:
@@ -32,39 +32,6 @@ jobs:
     - name: 'test on ${{ matrix.os }}'
       run: dotnet test src/Parquet.sln -c release

-  #run-benchmarks:
-  #  runs-on: ${{ matrix.os }}
-  #  strategy:
-  #    matrix:
-  #      os: [ubuntu-latest, windows-latest, macos-latest]
-  #    fail-fast: false
-  #  steps:
-  #    - uses: actions/checkout@v3
-  #    - name: Setup .NET
-  #      uses: actions/setup-dotnet@v3
-  #      with:
-  #        dotnet-version: |
-  #          3.1.x
-  #          6.0.x
-  #          7.0.x
-  #    - name: 'Write Performance'
-  #      run: dotnet run -c release -- write
-  #      working-directory: src/Parquet.PerfRunner
-  #    - name: 'Prep'
-  #      run: mv results ${{ matrix.os }}
-  #      working-directory: src/Parquet.PerfRunner/BenchmarkDotNet.Artifacts
-
-  # - name: debug
-  #   run: ls -R
-  #   working-directory: src/Parquet.PerfRunner
-
-  # - uses: actions/upload-artifact@v3
-  #   name: Collect Results
-  #   with:
-  #     name: benchmarks
-  #     path: |
-  #       src/Parquet.PerfRunner/BenchmarkDotNet.Artifacts/${{ matrix.os }}/
-

   build:

docs/README.md

Lines changed: 33 additions & 17 deletions
@@ -7,15 +7,22 @@

 **Fully portable, managed** .NET library to 📖read and ✍️write [Apache Parquet](https://parquet.apache.org/) files. Targets `.NET 7`, `.NET 6.0`, `.NET Core 3.1`, `.NET Standard 2.1` and `.NET Standard 2.0`.

-Runs everywhere .NET runs Linux, MacOS, Windows, iOS, Android, Tizen, Xbox, PS4, Raspberry Pi, Samsung TVs and much more.
+Whether you want to build apps for Linux, MacOS, Windows, iOS, Android, Tizen, Xbox, PS4, Raspberry Pi, Samsung TVs or much more, Parquet.NET has you covered.

-## Quick Start
+## Why
+
+Parquet is a great format for storing and processing large amounts of data, but it can be tricky to use with .NET. That's why this library is here to help. It's a pure library that doesn't need any external dependencies, and it's super fast - faster than Python and Java, and other C# solutions. It's also native to .NET, so you don't have to deal with any wrappers or adapters that might slow you down or limit your options.
+
+This library is the best option for parquet files in .NET. It has a simple and intuitive API, supports all the parquet features you need, and handles complex scenarios with ease.

-Why should I use this? I think you shouldn't. Go away and look at better alternatives, like [PyArrow](https://arrow.apache.org/docs/python/) that does it much better in Python. Also I'd rather you use [Apache Spark](https://spark.apache.org/) with native support for Parquet and other commercial alternatives. Seriously. Comparing to those, this library is just pure shite, developed in spare time by one person. Despite that, it's a de facto standard for .NET when it comes to reading and writing Parquet files. Why? Because:
+Also it:

-- It has zero dependencies - pure library that just works.
-- It's really fast. Faster than Python and Java implementations.
-- It's .NET native. Designed to utilise .NET and made for .NET developers.
+- Has zero dependencies - pure library that just works.
+- Really fast. Faster than Python and Java, and alternative C# implementations out there. It's often even faster than native C++ implementations.
+- .NET native. Designed to utilise .NET and made for .NET developers, not the other way around.
+- Not a "wrapper" that forces you to fit in. It's the other way around - forces parquet to fit into .NET.
+
+## Quick Start

 Parquet is designed to handle *complex data in bulk*. It's *column-oriented* meaning that data is physically stored in columns rather than rows. This is very important for big data systems if you want to process only a subset of columns - reading just the right columns is extremely efficient.

@@ -54,7 +61,7 @@ var data = Enumerable.Range(0, 1_000_000).Select(i => new Record {
 Now, to write these to a file in say `/mnt/storage/data.parquet` you can use the following **line** of code:

 ```csharp
-await ParquetConvert.SerializeAsync(data, "/mnt/storage/data.parquet");
+await ParquetSerializer.SerializeAsync(data, "/mnt/storage/data.parquet");
 ```

 That's pretty much it! You can [customise many things](serialisation.md) in addition to the magical magic process, but if you are a really lazy person that will do just fine for today.
@@ -134,7 +141,6 @@ using(Stream fs = System.IO.File.OpenWrite("/mnt/storage/data.parquet")) {
             await groupWriter.WriteColumnAsync(column1);
             await groupWriter.WriteColumnAsync(column2);
             await groupWriter.WriteColumnAsync(column3);
-
         }
     }
 }
@@ -147,7 +153,7 @@ What's going on?:?:
 3. Row group is like a data partition inside the file. In this example we have just one, but you can create more if there are too many values that are hard to fit in computer memory.
 4. Three calls to row group writer write out the columns. Note that those are performed sequentially, and in the same order as schema defines them.

-Read more on writing [here](writing.md).
+Read more on writing [here](writing.md) which also includes guides on writing [nested types](nested_types.md) such as lists, maps, and structs.

 ### 📖Reading Data

@@ -158,7 +164,7 @@ Reading data also has three different approaches, so I'm going to unwrap them he
 Provided that you have written the data, or just have some external data with the same structure as above, you can read those by simply doing the following:

 ```csharp
-Record[] data2 = await ParquetConvert.DeserializeAsync<Record>("/mnt/storage/data.parquet");
+IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>("/mnt/storage/data.parquet");
 ```

 This will give us an array with one million class instances similar to this:
@@ -216,15 +222,25 @@ This is what's happening:

 If you have a choice, then the choice is easy - use Low Level API. They are the fastest and the most flexible. But what if you for some reason don't have a choice? Then think about this:

-| Feature               | 🚤Class Serialisation  | 🌛Table API      | ⚙️Low Level API  |
-| --------------------- | ---------------------- | ---------------- | ---------------- |
-| Performance           | high                   | very low         | very high        |
-| Developer Convenience | feels like C# (great!) | feels like Excel | close to Parquet |
-| Row based access      | easy                   | easy             | hard             |
-| Column based access   | hard                   | hard             | easy             |
+| Feature               | 🚤Class Serialisation | 🌛Table API      | ⚙️Low Level API            |
+| --------------------- | -------------------- | ---------------- | -------------------------- |
+| Performance           | high                 | very low         | very high                  |
+| Developer Convenience | C# native            | feels like Excel | close to Parquet internals |
+| Row based access      | easy                 | easy             | hard                       |
+| Column based access   | C# native            | hard             | easy                       |



 ## Contributing

-Any contributions are welcome, in any form. Documentation, code, tests, donations or anything else. I don't like processes so anything goes. If you happen to get interested in parquet development, there are some [interesting links](parquet-getting-started-md).
+Any contributions are welcome, in any form. Documentation, code, tests, donations or anything else. I don't like processes so anything goes. If you happen to get interested in parquet development, there are some [interesting links](parquet-getting-started.md).
+
+## Special Thanks
+
+Without these tools development would be really painful.
+
+- [Visual Studio Community](https://visualstudio.microsoft.com/vs/community/) - free IDE from Microsoft. The best in class C# and C++ development tool. It's worth using Windows just because Visual Studio exists there.
+- [JetBrains Rider](https://www.jetbrains.com/rider/) - for their cross-platform C# IDE, which has some great features.
+- [IntelliJ IDEA](https://www.jetbrains.com/idea/) - the best Python, Scala and Java IDE.
+- [LINQPad](https://www.linqpad.net/) - extremely powerful C# REPL with unique visualisation features, IL decompiler, expression tree visualiser, benchmarking, charting and so on. Again it's worth having Windows just for this tool. Please support the author and purchase it.
+- [Benchmarkdotnet](https://benchmarkdotnet.org/) - the best cross-platform tool that can microbenchmark C# code. This library is faster than native ones only thanks for this.
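The README hunk above references the low-level write path (`groupWriter.WriteColumnAsync(...)` called once per column, in schema order). The following is a minimal sketch of that path, assuming the v4.x API shape (`ParquetSchema`, `DataField<T>`, `DataColumn`, `ParquetWriter.CreateAsync`); column names, values and the output path are illustrative only.

```csharp
using System.IO;
using Parquet;
using Parquet.Data;
using Parquet.Schema;

// define a schema with three columns
var idField = new DataField<int>("id");
var cityField = new DataField<string>("city");
var tempField = new DataField<double>("temperature");
var schema = new ParquetSchema(idField, cityField, tempField);

// one DataColumn per schema field, holding the raw values
var column1 = new DataColumn(idField, new[] { 1, 2, 3 });
var column2 = new DataColumn(cityField, new[] { "London", "Derby", "Paris" });
var column3 = new DataColumn(tempField, new[] { 12.3, 13.4, 14.5 });

using Stream fs = File.OpenWrite("/tmp/data.parquet");
using ParquetWriter writer = await ParquetWriter.CreateAsync(schema, fs);
using ParquetRowGroupWriter groupWriter = writer.CreateRowGroup();

// columns must be written sequentially, in the same order the schema defines them
await groupWriter.WriteColumnAsync(column1);
await groupWriter.WriteColumnAsync(column2);
await groupWriter.WriteColumnAsync(column3);
```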

docs/complex-types.md

Lines changed: 0 additions & 43 deletions
This file was deleted (superseded by the updated nested types documentation).

docs/img/struct-path.png

165 KB

docs/legacy_serialisation.md

Lines changed: 2 additions & 0 deletions
@@ -1,5 +1,7 @@
 # Class Serialisation

+> This document refers to legacy serialisation, which is still in the library, but is marked as obsolete and will be removed by the end of 2023. No new features will be added and you should [migrate](serialisation.md).
+
 Parquet library is generally extremely flexible in terms of supporting internals of the Apache Parquet format and allows you to do whatever the low level API allow to. However, in many cases writing boilerplate code is not suitable if you are working with business objects and just want to serialise them into a parquet file.

 Class serialisation is **really fast** as it generates [MSIL](https://en.wikipedia.org/wiki/Common_Intermediate_Language) on the fly. That means there is a tiny bit of delay when serialising a first entity, which in most cases is negligible. Once the class is serialised at least once, further operations become blazingly fast (around *x40* speed improvement comparing to reflection on relatively large amounts of data (~5 million records)).
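For readers migrating off the legacy path, the change is mostly mechanical. Below is a hedged before/after sketch based on the two calls shown elsewhere in this commit; the `Record` class is a hypothetical POCO, and the file path is illustrative.

```csharp
using System.Collections.Generic;
using Parquet.Serialization;

// Before (obsolete ParquetConvert, no new features or fixes):
// Record[] old = await ParquetConvert.DeserializeAsync<Record>("/mnt/storage/data.parquet");
// await ParquetConvert.SerializeAsync(old, "/mnt/storage/data.parquet");

// After (new expression-tree based ParquetSerializer):
IList<Record> data = await ParquetSerializer.DeserializeAsync<Record>("/mnt/storage/data.parquet");
await ParquetSerializer.SerializeAsync(data, "/mnt/storage/data.parquet");

class Record {
    public int Id { get; set; }
    public string? City { get; set; }
}
```

Note the deserialize return type changes from an array to `IList<T>`, as shown in the updated README in this commit.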
