Commit b60d358: FindDataSourceTable Logical Resolution Rule
1 parent 6fd4753

1 file changed: docs/logical-analysis-rules/FindDataSourceTable.md (78 additions, 12 deletions)

---
title: FindDataSourceTable
---

# FindDataSourceTable Logical Resolution Rule

`FindDataSourceTable` is a [Catalyst rule](../catalyst/Rule.md) to [resolve UnresolvedCatalogRelation logical operators](#apply) (of Spark and Hive tables) in a logical query plan (`Rule[LogicalPlan]`).

`FindDataSourceTable` is used by the [Hive](../hive/HiveSessionStateBuilder.md#analyzer) and [Spark](../BaseSessionStateBuilder.md#analyzer) Analyzers as part of their [extendedResolutionRules](../Analyzer.md#extendedResolutionRules).
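
As a quick sanity check (a sketch for `spark-shell`; assumes an active `SparkSession` named `spark`), the rule can be found among the analyzer's extended resolution rules:

```scala
import org.apache.spark.sql.execution.datasources.FindDataSourceTable

// Confirm FindDataSourceTable is registered with this session's Analyzer
val registered = spark.sessionState.analyzer.extendedResolutionRules
  .exists(_.isInstanceOf[FindDataSourceTable])
assert(registered)
```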

## Creating Instance

`FindDataSourceTable` takes the following to be created:

* <span id="sparkSession"> [SparkSession](../SparkSession.md)

`FindDataSourceTable` is created when:

* `HiveSessionStateBuilder` is requested for the [Analyzer](../hive/HiveSessionStateBuilder.md#analyzer)
* `BaseSessionStateBuilder` is requested for the [Analyzer](../BaseSessionStateBuilder.md#analyzer)
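
Since `FindDataSourceTable` is a `Rule[LogicalPlan]` with a public constructor, it can also be created and applied by hand (a sketch; the [Demo](#demo) below does exactly this end to end):

```scala
import org.apache.spark.sql.execution.datasources.FindDataSourceTable

// Create the rule with the active SparkSession (assumed to be `spark`)
val findTables = new FindDataSourceTable(spark)
```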

## Execute Rule { #apply }

??? note "Rule"

    ```scala
    apply(
      plan: LogicalPlan): LogicalPlan
    ```

    `apply` is part of the [Rule](../catalyst/Rule.md#apply) abstraction.

`apply` traverses the given [LogicalPlan](../logical-operators/LogicalPlan.md) (from top to leaves) to resolve `UnresolvedCatalogRelation`s of the following logical operators:

1. [InsertIntoStatement](../logical-operators/InsertIntoStatement.md) with a non-streaming `UnresolvedCatalogRelation` of a [Spark (DataSource) table](../connectors/DDLUtils.md#isDatasourceTable)
1. [InsertIntoStatement](../logical-operators/InsertIntoStatement.md) with a non-streaming `UnresolvedCatalogRelation` of a Hive table
1. [AppendData](../logical-operators/AppendData.md) (that is not [by name](../logical-operators/AppendData.md#isByName)) with a [DataSourceV2Relation](../logical-operators/DataSourceV2Relation.md) of a [V1Table](../connector/V1Table.md)
1. A non-streaming `UnresolvedCatalogRelation` of a [Spark (DataSource) table](../connectors/DDLUtils.md#isDatasourceTable)
1. A non-streaming `UnresolvedCatalogRelation` of a Hive table
1. A streaming `UnresolvedCatalogRelation`
1. A `StreamingRelationV2` ([Spark Structured Streaming]({{ book.structured_streaming }}/logical-operators/StreamingRelationV2/)) over a streaming `UnresolvedCatalogRelation`

??? note "Streaming and Non-Streaming `UnresolvedCatalogRelation`s"
    The difference between a streaming and a non-streaming `UnresolvedCatalogRelation` is the [isStreaming](../logical-operators/LogicalPlan.md#isStreaming) flag, which is disabled (`false`) by default.

`apply`...FIXME
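
To watch the rule resolve a single non-streaming `UnresolvedCatalogRelation` (a sketch for `spark-shell`; `demo_t` is a throwaway table created only for this check):

```scala
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.execution.datasources.FindDataSourceTable

sql("CREATE TABLE demo_t (id LONG) USING parquet")

// SessionCatalog.lookupRelation returns a SubqueryAlias over an UnresolvedCatalogRelation
val unresolved = spark.sessionState.catalog.lookupRelation(TableIdentifier("demo_t"))

// For a Spark (DataSource) table the leaf becomes a LogicalRelation
// (for a Hive table it would become a HiveTableRelation)
val resolved = new FindDataSourceTable(spark).apply(unresolved)
resolved.collectLeaves().foreach(leaf => println(leaf.getClass.getSimpleName))
```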

### Create StreamingRelation { #getStreamingRelation }

```scala
getStreamingRelation(
  table: CatalogTable,
  extraOptions: CaseInsensitiveStringMap): StreamingRelation
```

`getStreamingRelation` creates a `StreamingRelation` ([Spark Structured Streaming]({{ book.structured_streaming }}/logical-operators/StreamingRelation/)) over a [DataSource](../DataSource.md#creating-instance) created with the following properties:

Property | Value
-|-
[DataSource provider](../DataSource.md#className) | The [provider](../CatalogTable.md#provider) of the given [CatalogTable](../CatalogTable.md)
[User-specified schema](../DataSource.md#userSpecifiedSchema) | The [schema](../CatalogTable.md#schema) of the given [CatalogTable](../CatalogTable.md)
[Options](../DataSource.md#options) | [DataSource options](../connectors/DataSourceUtils.md#generateDatasourceOptions) based on the given `extraOptions` and the [CatalogTable](../CatalogTable.md)
[CatalogTable](../DataSource.md#catalogTable) | The given [CatalogTable](../CatalogTable.md)

---

`getStreamingRelation` is used when:

* `FindDataSourceTable` is requested to resolve streaming `UnresolvedCatalogRelation`s
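
The construction in the table above amounts to roughly the following (a sketch; `mkStreamingRelation` is a hypothetical stand-in for the private helper):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.catalyst.catalog.CatalogTable
import org.apache.spark.sql.execution.datasources.{DataSource, DataSourceUtils}
import org.apache.spark.sql.execution.streaming.StreamingRelation
import org.apache.spark.sql.util.CaseInsensitiveStringMap

// Sketch: build a StreamingRelation over a DataSource with the four properties above
def mkStreamingRelation(
    sparkSession: SparkSession,
    table: CatalogTable,
    extraOptions: CaseInsensitiveStringMap): StreamingRelation =
  StreamingRelation(
    DataSource(
      sparkSession,
      className = table.provider.get,              // DataSource provider
      userSpecifiedSchema = Some(table.schema),    // user-specified schema
      options = DataSourceUtils.generateDatasourceOptions(extraOptions, table),
      catalogTable = Some(table)))                 // the given CatalogTable
```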

## Demo

```text
scala> :type spark
org.apache.spark.sql.SparkSession
```

```scala
// Example: InsertIntoTable with UnresolvedCatalogRelation
// Drop tables to make the example reproducible
val db = spark.catalog.currentDatabase
Seq("t1", "t2").foreach { t =>
  spark.sharedState.externalCatalog.dropTable(db, t, ignoreIfNotExists = true, purge = true)
}
```

```scala
// Create tables
sql("CREATE TABLE t1 (id LONG) USING parquet")
sql("CREATE TABLE t2 (id LONG) USING orc")
```

```text
import org.apache.spark.sql.catalyst.dsl.plans._
val plan = table("t1").insertInto(tableName = "t2", overwrite = true)
scala> println(plan.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- 'UnresolvedRelation `t1`
```

```text
// Transform the logical plan with ResolveRelations logical rule first
// so UnresolvedRelations become UnresolvedCatalogRelations
import spark.sessionState.analyzer.ResolveRelations
val planWithUnresolvedCatalogRelations = ResolveRelations(plan)
scala> println(planWithUnresolvedCatalogRelations.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- 'SubqueryAlias t1
02    +- 'UnresolvedCatalogRelation `default`.`t1`, org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
```

```text
// Let's resolve UnresolvedCatalogRelations then
import org.apache.spark.sql.execution.datasources.FindDataSourceTable
val r = new FindDataSourceTable(spark)
val tablesResolvedPlan = r(planWithUnresolvedCatalogRelations)
scala> println(tablesResolvedPlan.numberedTreeString)
00 'InsertIntoTable 'UnresolvedRelation `t2`, true, false
01 +- SubqueryAlias t1
02    +- Relation[id#10L] parquet
```
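
The Analyzer applies `FindDataSourceTable` as part of regular query analysis, so the same resolution can be observed end to end (a sketch; assumes the `t1` table from above):

```scala
// The analyzed plan of a simple scan shows the resolved relation directly
val analyzed = spark.table("t1").queryExecution.analyzed
println(analyzed.numberedTreeString)
// expected: a parquet Relation leaf rather than an UnresolvedCatalogRelation
```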

<!---
## Review Me

## Executing Rule { #apply }

`apply` resolves `UnresolvedCatalogRelation`s for Spark (Data Source) and Hive tables:

* `apply` [creates LogicalRelation logical operators](#readDataSourceTable) for `UnresolvedCatalogRelation`s of Spark (Data Source) tables (incl. `InsertIntoTable`s)

* `apply` [creates HiveTableRelation logical operators](#readHiveTable) for `InsertIntoTable`s with an `UnresolvedCatalogRelation` of a Hive table or `UnresolvedCatalogRelation`s of a Hive table

=== [[readHiveTable]] Creating HiveTableRelation Logical Operator -- `readHiveTable` Internal Method

[source, scala]

readDataSourceTable(

If not available, `readDataSourceTable` [creates a new DataSource](../DataSource.md) for the [provider](../CatalogTable.md#provider) of the input `CatalogTable`, with an extra `path` option (based on the `locationUri` of the [storage](../CatalogTable.md#storage) of the input `CatalogTable`). `readDataSourceTable` then requests the `DataSource` to [resolve the relation and create a corresponding BaseRelation](../DataSource.md#resolveRelation), which is used to create a [LogicalRelation](../logical-operators/LogicalRelation.md) with the input [CatalogTable](../CatalogTable.md).
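
In code, the flow above amounts to roughly the following (a sketch only, not Spark's exact source; `sparkSession` and `table` are assumed, and the preceding caching step is omitted):

```scala
import org.apache.spark.sql.execution.datasources.{DataSource, LogicalRelation}

// Extra `path` option from the locationUri of the table's storage
val pathOption = table.storage.locationUri.map("path" -> _.toString)
val dataSource = DataSource(
  sparkSession,
  className = table.provider.get,  // the provider of the CatalogTable
  options = table.storage.properties ++ pathOption,
  catalogTable = Some(table))
// Resolve the relation and wrap the BaseRelation in a LogicalRelation
LogicalRelation(dataSource.resolveRelation(checkFilesExist = false), table)
```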

NOTE: `readDataSourceTable` is used when `FindDataSourceTable` is requested to <<apply, resolve an UnresolvedCatalogRelation in a logical plan>> (for data source tables).
-->
