Commit cd1d146

alamb authored and comphead committed

Improve DataFrame Users Guide (apache#11324)

* Improve `DataFrame` Users Guide
* typo
* Update docs/source/user-guide/dataframe.md

Co-authored-by: Oleks V <[email protected]>

1 parent df35358 commit cd1d146

File tree

2 files changed (+53, -76 lines)

datafusion/core/src/lib.rs

Lines changed: 6 additions & 0 deletions

````diff
@@ -626,6 +626,12 @@ doc_comment::doctest!(
     user_guide_configs
 );
 
+#[cfg(doctest)]
+doc_comment::doctest!(
+    "../../../docs/source/user-guide/dataframe.md",
+    user_guide_dataframe
+);
+
 #[cfg(doctest)]
 doc_comment::doctest!(
     "../../../docs/source/user-guide/expressions.md",
````
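The six added lines register `dataframe.md` with the `doc_comment` crate, so every ```rust block in that guide page is compiled and run by `cargo test` as a doctest (which is why the guide's example is rewritten below to be self-contained). As a rough sketch of the mechanism, the macro attaches the markdown file's contents as documentation on a hidden item, approximately like this (an approximation, not the crate's exact expansion):

```rust
// Approximation only: doc_comment::doctest!'s real expansion may differ.
#[cfg(doctest)]
mod user_guide_dataframe {
    // Embeds the markdown file as this module's docs; rustdoc then
    // treats the file's ```rust blocks as doctests.
    #![doc = include_str!("../../../docs/source/user-guide/dataframe.md")]
}
```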

docs/source/user-guide/dataframe.md

Lines changed: 47 additions & 76 deletions

````diff
@@ -19,17 +19,30 @@
 
 # DataFrame API
 
-A DataFrame represents a logical set of rows with the same named columns, similar to a [Pandas DataFrame](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html) or
-[Spark DataFrame](https://spark.apache.org/docs/latest/sql-programming-guide.html).
+A DataFrame represents a logical set of rows with the same named columns,
+similar to a [Pandas DataFrame] or [Spark DataFrame].
 
-DataFrames are typically created by calling a method on
-`SessionContext`, such as `read_csv`, and can then be modified
-by calling the transformation methods, such as `filter`, `select`, `aggregate`, and `limit`
-to build up a query definition.
+DataFrames are typically created by calling a method on [`SessionContext`], such
+as [`read_csv`], and can then be modified by calling the transformation methods,
+such as [`filter`], [`select`], [`aggregate`], and [`limit`] to build up a query
+definition.
 
-The query can be executed by calling the `collect` method.
+The query can be executed by calling the [`collect`] method.
 
-The DataFrame struct is part of DataFusion's prelude and can be imported with the following statement.
+DataFusion DataFrames use lazy evaluation, meaning that each transformation
+creates a new plan but does not actually perform any immediate actions. This
+approach allows for the overall plan to be optimized before execution. The plan
+is evaluated (executed) when an action method is invoked, such as [`collect`].
+See the [Library Users Guide] for more details.
+
+The DataFrame API is well documented in the [API reference on docs.rs].
+Please refer to the [Expressions Reference] for more information on
+building logical expressions (`Expr`) to use with the DataFrame API.
+
+## Example
+
+The DataFrame struct is part of DataFusion's `prelude` and can be imported with
+the following statement.
 
 ```rust
 use datafusion::prelude::*;
````
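The rewritten example below wraps the query in `async fn main() -> Result<()>` so that `?` can propagate errors from each step. A stdlib-only sketch of that error-propagation pattern (DataFusion and tokio are assumed dependencies of the real example and are not used here):

```rust
use std::num::ParseIntError;

// Each fallible step returns Result; `?` short-circuits on the first error,
// just as the guide's example chains read_csv/filter/aggregate with `?`.
fn pipeline(input: &str) -> Result<i32, ParseIntError> {
    let n: i32 = input.parse()?; // may fail, analogous to read_csv(...)?
    Ok(n * 2) // a simple "transformation" step
}

fn main() -> Result<(), ParseIntError> {
    assert_eq!(pipeline("21")?, 42);
    assert!(pipeline("not a number").is_err());
    println!("ok");
    Ok(())
}
```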
````diff
@@ -38,73 +51,31 @@ use datafusion::prelude::*;
 Here is a minimal example showing the execution of a query using the DataFrame API.
 
 ```rust
-let ctx = SessionContext::new();
-let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
-let df = df.filter(col("a").lt_eq(col("b")))?
-           .aggregate(vec![col("a")], vec![min(col("b"))])?
-           .limit(0, Some(100))?;
-// Print results
-df.show().await?;
+use datafusion::prelude::*;
+use datafusion::error::Result;
+
+#[tokio::main]
+async fn main() -> Result<()> {
+    let ctx = SessionContext::new();
+    let df = ctx.read_csv("tests/data/example.csv", CsvReadOptions::new()).await?;
+    let df = df.filter(col("a").lt_eq(col("b")))?
+               .aggregate(vec![col("a")], vec![min(col("b"))])?
+               .limit(0, Some(100))?;
+    // Print results
+    df.show().await?;
+    Ok(())
+}
 ```
 
-The DataFrame API is well documented in the [API reference on docs.rs](https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html).
-
-Refer to the [Expressions Reference](expressions) for available functions for building logical expressions for use with the
-DataFrame API.
-
-## DataFrame Transformations
-
-These methods create a new DataFrame after applying a transformation to the logical plan that the DataFrame represents.
-
-DataFusion DataFrames use lazy evaluation, meaning that each transformation is just creating a new query plan and
-not actually performing any transformations. This approach allows for the overall plan to be optimized before
-execution. The plan is evaluated (executed) when an action method is invoked, such as `collect`.
-
-| Function            | Notes                                                                                                                                     |
-| ------------------- | ----------------------------------------------------------------------------------------------------------------------------------------- |
-| aggregate           | Perform an aggregate query with optional grouping expressions.                                                                             |
-| distinct            | Filter out duplicate rows.                                                                                                                 |
-| distinct_on         | Filter out duplicate rows based on provided expressions.                                                                                   |
-| drop_columns        | Create a projection with all but the provided column names.                                                                                |
-| except              | Calculate the exception of two DataFrames. The two DataFrames must have exactly the same schema.                                           |
-| filter              | Filter a DataFrame to only include rows that match the specified filter expression.                                                        |
-| intersect           | Calculate the intersection of two DataFrames. The two DataFrames must have exactly the same schema.                                        |
-| join                | Join this DataFrame with another DataFrame using the specified columns as join keys.                                                       |
-| join_on             | Join this DataFrame with another DataFrame using arbitrary expressions.                                                                    |
-| limit               | Limit the number of rows returned from this DataFrame.                                                                                     |
-| repartition         | Repartition a DataFrame based on a logical partitioning scheme.                                                                            |
-| sort                | Sort the DataFrame by the specified sorting expressions. Any expression can be turned into a sort expression by calling its `sort` method. |
-| select              | Create a projection based on arbitrary expressions. Example: `df.select(vec![col("c1"), abs(col("c2"))])?`                                 |
-| select_columns      | Create a projection based on column names. Example: `df.select_columns(&["id", "name"])?`.                                                 |
-| union               | Calculate the union of two DataFrames, preserving duplicate rows. The two DataFrames must have exactly the same schema.                    |
-| union_distinct      | Calculate the distinct union of two DataFrames. The two DataFrames must have exactly the same schema.                                      |
-| with_column         | Add an additional column to the DataFrame.                                                                                                 |
-| with_column_renamed | Rename one column by applying a new projection.                                                                                            |
-
-## DataFrame Actions
-
-These methods execute the logical plan represented by the DataFrame and either collect the results into memory, print them to stdout, or write them to disk.
-
-| Function                   | Notes                                                                                                                       |
-| -------------------------- | --------------------------------------------------------------------------------------------------------------------------- |
-| collect                    | Executes this DataFrame and collects all results into a vector of RecordBatch.                                              |
-| collect_partitioned        | Executes this DataFrame and collects all results into a vector of vectors of RecordBatch, maintaining the input partitioning. |
-| count                      | Executes this DataFrame to get the total number of rows.                                                                    |
-| execute_stream             | Executes this DataFrame and returns a stream over a single partition.                                                       |
-| execute_stream_partitioned | Executes this DataFrame and returns one stream per partition.                                                               |
-| show                       | Execute this DataFrame and print the results to stdout.                                                                     |
-| show_limit                 | Execute this DataFrame and print a subset of results to stdout.                                                             |
-| write_csv                  | Execute this DataFrame and write the results to disk in CSV format.                                                         |
-| write_json                 | Execute this DataFrame and write the results to disk in JSON format.                                                        |
-| write_parquet              | Execute this DataFrame and write the results to disk in Parquet format.                                                     |
-| write_table                | Execute this DataFrame and write the results via the insert_into method of the registered TableProvider.                    |
-
-## Other DataFrame Methods
-
-| Function            | Notes                                                                                                                                                        |
-| ------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
-| explain             | Return a DataFrame with the explanation of its plan so far.                                                                                                  |
-| registry            | Return a `FunctionRegistry` used to plan UDF calls.                                                                                                          |
-| schema              | Returns the schema describing the output of this DataFrame in terms of columns returned, where each column has a name, data type, and nullability attribute. |
-| to_logical_plan     | Return the optimized logical plan represented by this DataFrame.                                                                                             |
-| to_unoptimized_plan | Return the unoptimized logical plan represented by this DataFrame.                                                                                           |
+[pandas dataframe]: https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.html
+[spark dataframe]: https://spark.apache.org/docs/latest/sql-programming-guide.html
+[`sessioncontext`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html
+[`read_csv`]: https://docs.rs/datafusion/latest/datafusion/execution/context/struct.SessionContext.html#method.read_csv
+[`filter`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.filter
+[`select`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.select
+[`aggregate`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.aggregate
+[`limit`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.limit
+[`collect`]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html#method.collect
+[library users guide]: ../library-user-guide/using-the-dataframe-api.md
+[api reference on docs.rs]: https://docs.rs/datafusion/latest/datafusion/dataframe/struct.DataFrame.html
+[expressions reference]: expressions
````
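The lazy-evaluation behavior the rewritten guide text describes has a close stdlib analogy: Rust iterator adapters also just record the computation and do no work until a consuming call such as `collect`. A minimal illustration (an analogy only, not DataFusion code):

```rust
use std::cell::Cell;

fn main() {
    let evaluated = Cell::new(0);

    // Adapters like `map` are lazy: nothing runs here, just as
    // DataFrame::filter/select only extend the logical plan.
    let plan = (1..=10).map(|x| {
        evaluated.set(evaluated.get() + 1);
        x * 2
    });
    assert_eq!(evaluated.get(), 0); // no work has been performed yet

    // The consuming call runs the whole pipeline, like DataFrame::collect.
    let results: Vec<i32> = plan.collect();
    assert_eq!(evaluated.get(), 10);
    assert_eq!(results, (1..=10).map(|x| x * 2).collect::<Vec<_>>());
    println!("ok");
}
```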
