|
21 | 21 | Arrow Columnar Format
|
22 | 22 | *********************
|
23 | 23 |
|
24 |
| -*Version: 1.3* |
| 24 | +*Version: 1.4* |
25 | 25 |
|
26 | 26 | The "Arrow Columnar Format" includes a language-agnostic in-memory
|
27 | 27 | data structure specification, metadata serialization, and a protocol
|
@@ -108,6 +108,10 @@ the different physical layouts defined by Arrow:
|
108 | 108 | * **Variable-size Binary**: a sequence of values each having a variable
|
109 | 109 | byte length. Two variants of this layout are supported using 32-bit
|
110 | 110 | and 64-bit length encoding.
|
| 111 | +* **Views of Variable-size Binary**: a sequence of values each having a |
| 112 | + variable byte length. In contrast to Variable-size Binary, the values |
| 113 | + of this layout are distributed across potentially multiple buffers |
| 114 | + instead of densely and sequentially packed in a single buffer. |
111 | 115 | * **Fixed-size List**: a nested layout where each value has the same
|
112 | 116 | number of elements taken from a child data type.
|
113 | 117 | * **Variable-size List**: a nested layout where each value is a
|
@@ -350,6 +354,51 @@ will be represented as follows: ::
|
350 | 354 | |----------------|-----------------------|
|
351 | 355 | | joemark | unspecified (padding) |
|
352 | 356 |
|
| 357 | +Variable-size Binary View Layout |
| 358 | +-------------------------------- |
| 359 | + |
| 360 | +.. versionadded:: Arrow Columnar Format 1.4 |
| 361 | + |
| 362 | +Each value in this layout consists of 0 or more bytes. These bytes' |
| 363 | +locations are indicated using a **views** buffer, which may point to one |
| 364 | +of potentially several **data** buffers or may contain the bytes |
| 365 | +inline. |
| 366 | + |
| 367 | +The views buffer contains `length` view structures with the following layout: |
| 368 | + |
| 369 | +:: |
| 370 | + |
| 371 | + * Short strings, length <= 12 |
| 372 | + | Bytes 0-3 | Bytes 4-15 | |
| 373 | + |------------|---------------------------------------| |
| 374 | + | length | data (padded with 0) | |
| 375 | + |
| 376 | + * Long strings, length > 12 |
| 377 | + | Bytes 0-3 | Bytes 4-7 | Bytes 8-11 | Bytes 12-15 | |
| 378 | + |------------|------------|------------|-------------| |
| 379 | + | length | prefix | buf. index | offset | |
| 380 | + |
| 381 | +In both the long and short string cases, the first four bytes encode the |
| 382 | +length of the string and can be used to determine how the rest of the view |
| 383 | +should be interpreted. |
| 384 | + |
| 385 | +In the short string case, the string's bytes are inlined: stored inside the |
| 386 | +view itself, in the twelve bytes which follow the length. |
| 387 | + |
| 388 | +In the long string case, a buffer index indicates which data buffer |
| 389 | +stores the data bytes and an offset indicates where in that buffer the |
| 390 | +data bytes begin. Buffer index 0 refers to the first data buffer, i.e. |
| 391 | +the first buffer **after** the validity buffer and the views buffer. |
| 392 | +The half-open range ``[offset, offset + length)`` must be entirely contained |
| 393 | +within the indicated buffer. A copy of the first four bytes of the string is |
| 394 | +stored inline in the prefix, after the length. This prefix enables a |
| 395 | +profitable fast path for string comparisons, which are frequently determined |
| 396 | +within the first four bytes. |
| 397 | + |
| 398 | +All integers (length, buffer index, and offset) are signed. |
| 399 | + |
| 400 | +This layout is adapted from TU Munich's `UmbraDB`_. |
| 401 | + |
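To make the 16-byte view structure described in this hunk concrete, here is a minimal Python sketch that packs and unpacks a single view. It is an illustration only, not part of the patch or of any Arrow library: the names ``encode_view`` and ``decode_view`` are hypothetical, and the sketch assumes little-endian byte order (Arrow's default) and signed 32-bit integers, as stated in the text above.

```python
import struct

INLINE_MAX = 12  # values of at most 12 bytes are stored inside the view itself


def encode_view(value, buffer_index=0, offset=0):
    """Build one 16-byte view (hypothetical helper, little-endian assumed)."""
    if len(value) <= INLINE_MAX:
        # int32 length, then the bytes themselves, zero-padded to 12 bytes
        return struct.pack("<i12s", len(value), value)
    # int32 length, 4-byte prefix, int32 buffer index, int32 offset
    return struct.pack("<i4sii", len(value), value[:4], buffer_index, offset)


def decode_view(view, data_buffers):
    """Resolve one 16-byte view to the bytes it denotes."""
    (length,) = struct.unpack_from("<i", view, 0)
    if length <= INLINE_MAX:
        # Short case: the value lives in bytes 4..4+length of the view
        return view[4:4 + length]
    prefix, buffer_index, offset = struct.unpack_from("<4sii", view, 4)
    buf = data_buffers[buffer_index]  # index 0 is the first data buffer
    # The half-open range [offset, offset + length) must lie inside the buffer
    if offset < 0 or offset + length > len(buf):
        raise ValueError("view range falls outside the data buffer")
    value = buf[offset:offset + length]
    if value[:4] != prefix:
        raise ValueError("prefix does not match the referenced data")
    return value


# Two data buffers; the long value starts at offset 4 of the second one
data_buffers = [b"xxxx", b"....hello, arrow views"]
short_view = encode_view(b"joe")
long_view = encode_view(b"hello, arrow views", buffer_index=1, offset=4)
```

Because the length occupies the first four bytes in both cases, a reader can branch on it before interpreting the remaining twelve bytes, which is what ``decode_view`` does.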
353 | 402 | .. _variable-size-list-layout:
|
354 | 403 |
|
355 | 404 | Variable-size List Layout
|
@@ -880,19 +929,20 @@ For the avoidance of ambiguity, we provide listing the order and type
|
880 | 929 | of memory buffers for each layout.
|
881 | 930 |
|
882 | 931 | .. csv-table:: Buffer Layouts
|
883 |
| - :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2" |
884 |
| - :widths: 30, 20, 20, 20 |
885 |
| - |
886 |
| - "Primitive",validity,data, |
887 |
| - "Variable Binary",validity,offsets,data |
888 |
| - "List",validity,offsets, |
889 |
| - "Fixed-size List",validity,, |
890 |
| - "Struct",validity,, |
891 |
| - "Sparse Union",type ids,, |
892 |
| - "Dense Union",type ids,offsets, |
893 |
| - "Null",,, |
894 |
| - "Dictionary-encoded",validity,data (indices), |
895 |
| - "Run-end encoded",,, |
| 932 | + :header: "Layout Type", "Buffer 0", "Buffer 1", "Buffer 2", "Variadic Buffers" |
| 933 | + :widths: 30, 20, 20, 20, 20 |
| 934 | + |
| 935 | + "Primitive",validity,data,, |
| 936 | + "Variable Binary",validity,offsets,data, |
| 937 | + "Variable Binary View",validity,views,,data |
| 938 | + "List",validity,offsets,, |
| 939 | + "Fixed-size List",validity,,, |
| 940 | + "Struct",validity,,, |
| 941 | + "Sparse Union",type ids,,, |
| 942 | + "Dense Union",type ids,offsets,, |
| 943 | + "Null",,,, |
| 944 | + "Dictionary-encoded",validity,data (indices),, |
| 945 | + "Run-end encoded",,,, |
896 | 946 |
|
897 | 947 | Logical Types
|
898 | 948 | =============
|
@@ -1071,6 +1121,39 @@ bytes. Since this metadata can be used to communicate in-memory pointer
|
1071 | 1121 | addresses between libraries, it is recommended to set ``size`` to the actual
|
1072 | 1122 | memory size rather than the padded size.
|
1073 | 1123 |
|
| 1124 | +Variadic buffers |
| 1125 | +^^^^^^^^^^^^^^^^ |
| 1126 | + |
| 1127 | +Some types such as Utf8View are represented using a variable number of buffers. |
| 1128 | +For each such Field in the pre-ordered flattened logical schema, there will be |
| 1129 | +an entry in ``variadicBufferCounts`` to indicate the number of variadic buffers |
| 1130 | +which belong to that Field in the current RecordBatch. |
| 1131 | + |
| 1132 | +For example, consider the schema :: |
| 1133 | + |
| 1134 | + col1: Struct<a: Int32, b: BinaryView, c: Float64> |
| 1135 | + col2: Utf8View |
| 1136 | + |
| 1137 | +This has two fields with variadic buffers, so ``variadicBufferCounts`` will |
| 1138 | +have two entries in each RecordBatch. For a RecordBatch of this schema with |
| 1139 | +``variadicBufferCounts = [3, 2]``, the flattened buffers would be:: |
| 1140 | + |
| 1141 | + buffer 0: col1 validity |
| 1142 | + buffer 1: col1.a validity |
| 1143 | + buffer 2: col1.a values |
| 1144 | + buffer 3: col1.b validity |
| 1145 | + buffer 4: col1.b views |
| 1146 | + buffer 5: col1.b data |
| 1147 | + buffer 6: col1.b data |
| 1148 | + buffer 7: col1.b data |
| 1149 | + buffer 8: col1.c validity |
| 1150 | + buffer 9: col1.c values |
| 1151 | + buffer 10: col2 validity |
| 1152 | + buffer 11: col2 views |
| 1153 | + buffer 12: col2 data |
| 1154 | + buffer 13: col2 data |
| 1155 | + |
| 1156 | + |
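The flattened buffer listing in this hunk can be reproduced with a short pre-order walk over the schema. The following Python sketch is illustrative only: the tuple-based schema representation and the function ``flatten_buffers`` are assumptions made for this example and do not correspond to any Arrow API. It handles just the three layouts that appear in the example schema.

```python
def flatten_buffers(fields, variadic_buffer_counts):
    """List buffers in pre-order for a schema given as (name, layout, children).

    Hypothetical helper: layout is one of "struct", "primitive" or "view".
    """
    counts = iter(variadic_buffer_counts)  # one entry per view field, in order
    buffers = []

    def visit(name, layout, children):
        if layout == "struct":
            buffers.append(f"{name} validity")
            for child_name, child_layout, grandchildren in children:
                visit(f"{name}.{child_name}", child_layout, grandchildren)
        elif layout == "primitive":
            buffers.extend([f"{name} validity", f"{name} values"])
        elif layout == "view":
            buffers.extend([f"{name} validity", f"{name} views"])
            # Consume this field's entry from variadicBufferCounts
            buffers.extend([f"{name} data"] * next(counts))
        else:
            raise ValueError(f"unsupported layout: {layout}")

    for field in fields:
        visit(*field)
    return buffers


schema = [
    ("col1", "struct", [
        ("a", "primitive", []),   # Int32
        ("b", "view", []),        # BinaryView
        ("c", "primitive", []),   # Float64
    ]),
    ("col2", "view", []),         # Utf8View
]
buffers = flatten_buffers(schema, [3, 2])
for i, name in enumerate(buffers):
    print(f"buffer {i}: {name}")
```

Each view field consumes exactly one entry of ``variadicBufferCounts``, in the same pre-order in which the fields are visited, so the counts list needs no field identifiers of its own.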
1074 | 1157 | Byte Order (`Endianness`_)
|
1075 | 1158 | ---------------------------
|
1076 | 1159 |
|
@@ -1346,3 +1429,4 @@ the Arrow spec.
|
1346 | 1429 | .. _Endianness: https://en.wikipedia.org/wiki/Endianness
|
1347 | 1430 | .. _SIMD: https://software.intel.com/en-us/cpp-compiler-developer-guide-and-reference-introduction-to-the-simd-data-layout-templates
|
1348 | 1431 | .. _Parquet: https://parquet.apache.org/docs/
|
| 1432 | +.. _UmbraDB: https://db.in.tum.de/~freitag/papers/p29-neumann-cidr20.pdf |