 DataFusion is written in Rust and it uses a standard rust toolkit:

-* `cargo build`
-* `cargo fmt` to format the code
-* `cargo test` to test
-* etc.
+- `cargo build`
+- `cargo fmt` to format the code
+- `cargo test` to test
+- etc.

 ## How to add a new scalar function

 Below is a checklist of what you need to do to add a new scalar function to DataFusion:

-* Add the actual implementation of the function:
-  * [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
-  * [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
-  * [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](datafusion/src/physical_plan) for other functions
-* In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
-  * a new variant to `BuiltinScalarFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_physical_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
-* In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
-  * a new entry of the `unary_scalar_expr!` macro for the new function.
-* In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
-  * a new entry in the `pub use expr::{}` set.
+- Add the actual implementation of the function:
+  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
+  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
+  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
+  - create a new module [here](datafusion/src/physical_plan) for other functions
+- In [src/physical_plan/functions](datafusion/src/physical_plan/functions.rs), add:
+  - a new variant to `BuiltinScalarFunction`
+  - a new entry to `FromStr` with the name of the function as called by SQL
+  - a new line in `return_type` with the expected return type of the function, given an incoming type
+  - a new line in `signature` with the signature of the function (number and types of its arguments)
+  - a new line in `create_physical_expr` mapping the built-in to the implementation
+  - tests to the function.
+- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+- In [src/logical_plan/expr](datafusion/src/logical_plan/expr.rs), add:
+  - a new entry of the `unary_scalar_expr!` macro for the new function.
+- In [src/logical_plan/mod](datafusion/src/logical_plan/mod.rs), add:
+  - a new entry in the `pub use expr::{}` set.

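Reviewer sketch: the scalar-function checklist above can be modelled in miniature. The block below is a self-contained toy, not DataFusion's actual code. `DataType`, `Value`, `ScalarImpl` and the `cube` function are hypothetical stand-ins invented for illustration, and the real `signature`, `return_type` and `create_physical_expr` in `functions.rs` have different signatures. It only shows how the touch points named in the checklist (variant, `FromStr` entry, `signature`, `return_type`, mapping to the implementation) fit together.

```rust
use std::str::FromStr;

/// Hypothetical stand-in for Arrow's data types (an assumption of this sketch).
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    Float64,
    Utf8,
}

/// Hypothetical single-value stand-in for the arrays a real kernel receives.
#[derive(Debug, Clone, PartialEq)]
pub enum Value {
    Float64(f64),
    Utf8(String),
}

/// Checklist item: a new variant (`Cube` is the made-up function being added).
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BuiltinScalarFunction {
    Sqrt,
    Upper,
    Cube,
}

/// Checklist item: a new `FromStr` entry with the name used in SQL.
impl FromStr for BuiltinScalarFunction {
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "sqrt" => Ok(Self::Sqrt),
            "upper" => Ok(Self::Upper),
            "cube" => Ok(Self::Cube),
            other => Err(format!("no built-in function named {}", other)),
        }
    }
}

/// Checklist item: a new line in `signature` (here: the exact argument types).
pub fn signature(fun: BuiltinScalarFunction) -> Vec<DataType> {
    match fun {
        BuiltinScalarFunction::Sqrt | BuiltinScalarFunction::Cube => vec![DataType::Float64],
        BuiltinScalarFunction::Upper => vec![DataType::Utf8],
    }
}

/// Checklist item: a new line in `return_type`, given the incoming types.
pub fn return_type(
    fun: BuiltinScalarFunction,
    arg_types: &[DataType],
) -> Result<DataType, String> {
    if arg_types != signature(fun).as_slice() {
        return Err(format!("{:?} does not accept {:?}", fun, arg_types));
    }
    match fun {
        BuiltinScalarFunction::Sqrt | BuiltinScalarFunction::Cube => Ok(DataType::Float64),
        BuiltinScalarFunction::Upper => Ok(DataType::Utf8),
    }
}

/// Shape of an implementation in this sketch (hypothetical).
pub type ScalarImpl = fn(&[Value]) -> Result<Value, String>;

fn sqrt(args: &[Value]) -> Result<Value, String> {
    match args {
        [Value::Float64(x)] => Ok(Value::Float64(x.sqrt())),
        other => Err(format!("sqrt: unsupported arguments {:?}", other)),
    }
}

fn upper(args: &[Value]) -> Result<Value, String> {
    match args {
        [Value::Utf8(s)] => Ok(Value::Utf8(s.to_uppercase())),
        other => Err(format!("upper: unsupported arguments {:?}", other)),
    }
}

/// Checklist item: the actual implementation of the new function.
fn cube(args: &[Value]) -> Result<Value, String> {
    match args {
        [Value::Float64(x)] => Ok(Value::Float64(x * x * x)),
        other => Err(format!("cube: unsupported arguments {:?}", other)),
    }
}

/// Checklist item: a new line mapping the built-in to its implementation.
pub fn create_physical_expr(fun: BuiltinScalarFunction) -> ScalarImpl {
    match fun {
        BuiltinScalarFunction::Sqrt => sqrt,
        BuiltinScalarFunction::Upper => upper,
        BuiltinScalarFunction::Cube => cube,
    }
}
```

Because every lookup is an exhaustive `match` on the enum, the compiler flags each place that still needs a new arm once the variant is added. The `FromStr` entry is the exception: it matches on strings, so forgetting it compiles cleanly, which is one reason the checklist also asks for a test that calls the function through SQL.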
 ## How to add a new aggregate function

 Below is a checklist of what you need to do to add a new aggregate function to DataFusion:

-* Add the actual implementation of an `Accumulator` and `AggregateExpr`:
-  * [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
-  * [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
-  * [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
-  * create a new module [here](datafusion/src/physical_plan) for other functions
-* In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
-  * a new variant to `BuiltinAggregateFunction`
-  * a new entry to `FromStr` with the name of the function as called by SQL
-  * a new line in `return_type` with the expected return type of the function, given an incoming type
-  * a new line in `signature` with the signature of the function (number and types of its arguments)
-  * a new line in `create_aggregate_expr` mapping the built-in to the implementation
-  * tests to the function.
-* In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
+- Add the actual implementation of an `Accumulator` and `AggregateExpr`:
+  - [here](datafusion/src/physical_plan/string_expressions.rs) for string functions
+  - [here](datafusion/src/physical_plan/math_expressions.rs) for math functions
+  - [here](datafusion/src/physical_plan/datetime_expressions.rs) for datetime functions
+  - create a new module [here](datafusion/src/physical_plan) for other functions
+- In [src/physical_plan/aggregates](datafusion/src/physical_plan/aggregates.rs), add:
+  - a new variant to `BuiltinAggregateFunction`
+  - a new entry to `FromStr` with the name of the function as called by SQL
+  - a new line in `return_type` with the expected return type of the function, given an incoming type
+  - a new line in `signature` with the signature of the function (number and types of its arguments)
+  - a new line in `create_aggregate_expr` mapping the built-in to the implementation
+  - tests to the function.
+- In [tests/sql.rs](datafusion/tests/sql.rs), add a new test where the function is called through SQL against well known data and returns the expected result.
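
Reviewer sketch: the aggregate checklist above centres on an accumulator that is updated per input value and merged across partitions. The block below is a minimal, self-contained illustration under loud assumptions. The `Accumulator` trait here is a simplified, hypothetical version that works on `Option<f64>` (with `None` modelling SQL NULL), the `range` aggregate (maximum minus minimum) is made up for the example, and `AggregateExpr` is collapsed into the `create_aggregate_expr` factory. DataFusion's real trait and function signatures differ.

```rust
use std::str::FromStr;

/// Checklist item: a new variant. Existing variants are elided in this sketch;
/// `Range` is a hypothetical aggregate computing max - min.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum BuiltinAggregateFunction {
    Range,
}

/// Checklist item: a new `FromStr` entry with the name used in SQL.
impl FromStr for BuiltinAggregateFunction {
    type Err = String;

    fn from_str(name: &str) -> Result<Self, Self::Err> {
        match name {
            "range" => Ok(Self::Range),
            other => Err(format!("no built-in aggregate named {}", other)),
        }
    }
}

/// Simplified, hypothetical accumulator interface (not DataFusion's trait).
pub trait Accumulator {
    /// Fold one input value into the running state; `None` is ignored.
    fn update(&mut self, value: Option<f64>);
    /// Export the partial state so another partition's accumulator can merge it.
    fn state(&self) -> Vec<Option<f64>>;
    /// Fold a partial state produced by `state` into this accumulator.
    fn merge(&mut self, state: &[Option<f64>]);
    /// Produce the final result; `None` if no non-null value was seen.
    fn evaluate(&self) -> Option<f64>;
}

/// Checklist item: the actual implementation of the accumulator.
#[derive(Debug, Default)]
pub struct RangeAccumulator {
    min: Option<f64>,
    max: Option<f64>,
}

impl Accumulator for RangeAccumulator {
    fn update(&mut self, value: Option<f64>) {
        if let Some(v) = value {
            self.min = Some(self.min.map_or(v, |m| m.min(v)));
            self.max = Some(self.max.map_or(v, |m| m.max(v)));
        }
    }

    fn state(&self) -> Vec<Option<f64>> {
        vec![self.min, self.max]
    }

    fn merge(&mut self, state: &[Option<f64>]) {
        // The partial state is just [min, max]; feeding both bounds back
        // through `update` widens this accumulator's bounds correctly.
        for bound in state {
            self.update(*bound);
        }
    }

    fn evaluate(&self) -> Option<f64> {
        match (self.min, self.max) {
            (Some(lo), Some(hi)) => Some(hi - lo),
            _ => None,
        }
    }
}

/// Checklist item: a new line mapping the built-in to its implementation.
pub fn create_aggregate_expr(fun: BuiltinAggregateFunction) -> Box<dyn Accumulator> {
    match fun {
        BuiltinAggregateFunction::Range => Box::new(RangeAccumulator::default()),
    }
}
```

The `state`/`merge` pair is what lets an aggregate run in parallel over partitions: each partition accumulates independently, and the partial states are combined before `evaluate` is called once.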
README.md (+52 -63)
@@ -30,7 +30,7 @@ logical query plans as well as a query optimizer and execution engine
 capable of parallel execution against partitioned data sources (CSV
 and Parquet) using threads.

-DataFusion also supports distributed query execution via the
+DataFusion also supports distributed query execution via the
 [Ballista](ballista/README.md) crate.

 ## Use Cases
@@ -42,24 +42,24 @@ the convenience of an SQL interface or a DataFrame API.

 ## Why DataFusion?

-* *High Performance*: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
-* *Easy to Connect*: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
-* *Easy to Embed*: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
-* *High Quality*: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.
+- _High Performance_: Leveraging Rust and Arrow's memory model, DataFusion achieves very high performance
+- _Easy to Connect_: Being part of the Apache Arrow ecosystem (Arrow, Parquet and Flight), DataFusion works well with the rest of the big data ecosystem
+- _Easy to Embed_: Allowing extension at almost any point in its design, DataFusion can be tailored for your specific usecase
+- _High Quality_: Extensively tested, both by itself and with the rest of the Arrow ecosystem, DataFusion can be used as the foundation for production systems.

 ## Known Uses

 Here are some of the projects known to use DataFusion:
 (if you know of another project, please submit a PR to add a link!)

@@ -122,8 +122,6 @@ Both of these examples will produce
 +---+--------+
 ```

-
-
 ## Using DataFusion as a library

 DataFusion is [published on crates.io](https://crates.io/crates/datafusion), and is [well documented on docs.rs](https://docs.rs/datafusion/).
@@ -230,7 +228,6 @@ DataFusion also includes a simple command-line interactive SQL utility. See the
 - [x] Parquet primitive types
 - [ ] Parquet nested types

-
 ## Extensibility

 DataFusion is designed to be extensible at all points. To that end, you can provide your own custom:
@@ -242,35 +239,32 @@ DataFusion is designed to be extensible at all points. To that end, you can prov
 - [x] User Defined `LogicalPlan` nodes
 - [x] User Defined `ExecutionPlan` nodes

-
 # Supported SQL

 This library currently supports many SQL constructs, including

-* `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
-* `SELECT ... FROM ...` together with any expression
-* `ALIAS` to name an expression
-* `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
-* most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
-* `WHERE` to filter
-* `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
-* `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`
-
+- `CREATE EXTERNAL TABLE X STORED AS PARQUET LOCATION '...';` to register a table's locations
+- `SELECT ... FROM ...` together with any expression
+- `ALIAS` to name an expression
+- `CAST` to change types, including e.g. `Timestamp(Nanosecond, None)`
+- most mathematical unary and binary expressions such as `+`, `/`, `sqrt`, `tan`, `>=`.
+- `WHERE` to filter
+- `GROUP BY` together with one of the following aggregations: `MIN`, `MAX`, `COUNT`, `SUM`, `AVG`
+- `ORDER BY` together with an expression and optional `ASC` or `DESC` and also optional `NULLS FIRST` or `NULLS LAST`

 ## Supported Functions

 DataFusion strives to implement a subset of the [PostgreSQL SQL dialect](https://www.postgresql.org/docs/current/functions.html) where possible. We explicitly choose a single dialect to maximize interoperability with other tools and allow reuse of the PostgreSQL documents and tutorials as much as possible.

-Currently, only a subset of the PosgreSQL dialect is implemented, and we will document any deviations.
+Currently, only a subset of the PostgreSQL dialect is implemented, and we will document any deviations.

 ## Schema Metadata / Information Schema Support

 DataFusion supports the showing metadata about the tables available. This information can be accessed using the views of the ISO SQL `information_schema` schema or the DataFusion specific `SHOW TABLES` and `SHOW COLUMNS` commands.

 More information can be found in the [Postgres docs](https://www.postgresql.org/docs/13/infoschema-schema.html)).

-
-To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:
+To show tables available for use in DataFusion, use the `SHOW TABLES` command or the `information_schema.tables` view:

 ```sql
 > show tables;
@@ -291,7 +285,7 @@ To show tables available for use in DataFusion, use the `SHOW TABLES` command o
 There is no formal document describing DataFusion's architecture yet, but the following presentations offer a good overview of its different components and how they interact together.

-* (March 2021): The DataFusion architecture is described in *Query Engine Design and the Rust-Based DataFusion in Apache Arrow*: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
-* (Feburary 2021): How DataFusion is used within the Ballista Project is described in *Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)
-
+- (March 2021): The DataFusion architecture is described in _Query Engine Design and the Rust-Based DataFusion in Apache Arrow_: [recording](https://www.youtube.com/watch?v=K6eCAVEk4kU) (DataFusion content starts ~ 15 minutes in) and [slides](https://www.slideshare.net/influxdata/influxdb-iox-tech-talks-query-engine-design-and-the-rustbased-datafusion-in-apache-arrow-244161934)
+- (Feburary 2021): How DataFusion is used within the Ballista Project is described in \*Ballista: Distributed Compute with Rust and Apache Arrow: [recording](https://www.youtube.com/watch?v=ZZHQaOap9pQ)