Questions about catalog design #8887
-
/// Represent a list of named catalogs
pub trait CatalogList: Sync + Send {
    /// Returns the catalog list as [`Any`]
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Adds a new catalog to this catalog list
    /// If a catalog of the same name existed before, it is replaced in the list and returned.
    fn register_catalog(
        &self,
        name: String,
        catalog: Arc<dyn CatalogProvider>,
    ) -> Option<Arc<dyn CatalogProvider>>;

    /// Retrieves the list of available catalog names
    fn catalog_names(&self) -> Vec<String>;

    /// Retrieves a specific catalog by name, provided it exists.
    fn catalog(&self, name: &str) -> Option<Arc<dyn CatalogProvider>>;
}
/// Represents a catalog, comprising a number of named schemas.
pub trait CatalogProvider: Sync + Send {
    /// Returns the catalog provider as [`Any`]
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Retrieves the list of available schema names in this catalog.
    fn schema_names(&self) -> Vec<String>;

    /// Retrieves a specific schema from the catalog by name, provided it exists.
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;

    /// Adds a new schema to this catalog.
    ///
    /// If a schema of the same name existed before, it is replaced in
    /// the catalog and returned.
    ///
    /// By default returns a "Not Implemented" error
    fn register_schema(
        &self,
        name: &str,
        schema: Arc<dyn SchemaProvider>,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        // use variables to avoid unused variable warnings
        let _ = name;
        let _ = schema;
        not_impl_err!("Registering new schemas is not supported")
    }

    /// Removes a schema from this catalog. Implementations of this method should return
    /// errors if the schema exists but cannot be dropped. For example, in DataFusion's
    /// default in-memory catalog, [`MemoryCatalogProvider`], a non-empty schema
    /// will only be successfully dropped when `cascade` is true.
    /// This is equivalent to how DROP SCHEMA works in PostgreSQL.
    ///
    /// Implementations of this method should return None if schema with `name`
    /// does not exist.
    ///
    /// By default returns a "Not Implemented" error
    fn deregister_schema(
        &self,
        _name: &str,
        _cascade: bool,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        not_impl_err!("Deregistering new schemas is not supported")
    }
}
/// Represents a schema, comprising a number of named tables.
#[async_trait]
pub trait SchemaProvider: Sync + Send {
    /// Returns the schema provider as [`Any`](std::any::Any)
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Retrieves the list of available table names in this schema.
    fn table_names(&self) -> Vec<String>;

    /// Retrieves a specific table from the schema by name, provided it exists.
    async fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;

    /// If supported by the implementation, adds a new table to this schema.
    /// If a table of the same name existed before, it returns "Table already exists" error.
    #[allow(unused_variables)]
    fn register_table(
        &self,
        name: String,
        table: Arc<dyn TableProvider>,
    ) -> Result<Option<Arc<dyn TableProvider>>> {
        exec_err!("schema provider does not support registering tables")
    }

    /// If supported by the implementation, removes an existing table from this schema and returns it.
    /// If no table of that name exists, returns Ok(None).
    #[allow(unused_variables)]
    fn deregister_table(&self, name: &str) -> Result<Option<Arc<dyn TableProvider>>> {
        exec_err!("schema provider does not support deregistering tables")
    }

    /// If supported by the implementation, checks the table exist in the schema provider or not.
    /// If no matched table in the schema provider, return false.
    /// Otherwise, return true.
    fn table_exist(&self, name: &str) -> bool;
}
1. Why doesn't CatalogList provide a delete function?
Replies: 4 comments 3 replies
-
I don't think there is a good reason. Perhaps you could file a new ticket with the request?
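As a rough illustration of what such a request might ask for, here is a minimal in-memory sketch. Everything in it is an assumption made for illustration: `Catalog` is a stand-in for `CatalogProvider`, `MemoryCatalogList` is a toy type, and `deregister_catalog` is a hypothetical method name that is not part of the `CatalogList` trait quoted above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for `CatalogProvider`, so the sketch needs no external crates.
trait Catalog: Send + Sync {}

/// A catalog with no schemas, used only to have something to register.
struct EmptyCatalog;
impl Catalog for EmptyCatalog {}

/// Toy in-memory catalog list guarded by a lock, so methods can take `&self`.
struct MemoryCatalogList {
    catalogs: RwLock<HashMap<String, Arc<dyn Catalog>>>,
}

impl MemoryCatalogList {
    fn new() -> Self {
        Self {
            catalogs: RwLock::new(HashMap::new()),
        }
    }

    /// Mirrors `register_catalog` above: returns the replaced catalog, if any.
    fn register_catalog(
        &self,
        name: String,
        catalog: Arc<dyn Catalog>,
    ) -> Option<Arc<dyn Catalog>> {
        self.catalogs.write().unwrap().insert(name, catalog)
    }

    /// Hypothetical delete function: removes the catalog and returns it,
    /// or returns `None` if no catalog of that name exists.
    fn deregister_catalog(&self, name: &str) -> Option<Arc<dyn Catalog>> {
        self.catalogs.write().unwrap().remove(name)
    }

    fn catalog_names(&self) -> Vec<String> {
        self.catalogs.read().unwrap().keys().cloned().collect()
    }
}
```

Returning the removed catalog mirrors how `register_catalog` returns the replaced one; whether a real implementation should also take a `cascade` flag, as `deregister_schema` does, is left open here.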
The theory is that using a remote procedure call to list table names and perform other functions is likely to perform very poorly (most systems would want to batch access to the remote catalog).
What DataFusion itself does to plan SQL queries is to walk over the query to find all schema / table references (in an `async` function, which could potentially access a remote catalog) and then does plan…
Specifically, one call gets a snapshot of all references, and they are then all resolved here: https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/core/src/execution/context/mod.rs#L1678-L1688
This has come up a number of times, and I will make a PR to try and clarify the rationale in the documentation.
3. Why is only the `table` function of SchemaProvider asynchronous, while the rest of the functions are not? Can the catalog and schema be obtained remotely?
This is the same answer as 2.
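The approach described in this reply (collect the table references first, fetch them in one batch, then plan against a local snapshot) can be sketched roughly as below. This is a toy model under stated assumptions, not DataFusion's code: `RemoteCatalog`, `referenced_tables`, and `plan` are hypothetical names, the "parser" just looks at the word after `FROM` or `JOIN`, and the fetch is synchronous here where a real one would be `async`.

```rust
use std::collections::{BTreeSet, HashMap};

/// Hypothetical stand-in for a remote catalog; each call models one round trip.
struct RemoteCatalog {
    /// Table name -> column names.
    tables: HashMap<String, Vec<String>>,
    round_trips: usize,
}

impl RemoteCatalog {
    /// Fetches all requested tables in a single batched request.
    /// In a real system this is the step that would be `async`.
    fn fetch_tables(&mut self, names: &BTreeSet<String>) -> HashMap<String, Vec<String>> {
        self.round_trips += 1;
        names
            .iter()
            .filter_map(|n| self.tables.get(n).map(|cols| (n.clone(), cols.clone())))
            .collect()
    }
}

/// Phase 1: walk the query and collect every table reference.
/// Toy rule (an assumption): a table name is the word after FROM or JOIN.
fn referenced_tables(sql: &str) -> BTreeSet<String> {
    let tokens: Vec<&str> = sql.split_whitespace().collect();
    tokens
        .windows(2)
        .filter(|w| w[0].eq_ignore_ascii_case("from") || w[0].eq_ignore_ascii_case("join"))
        .map(|w| w[1].trim_end_matches(';').to_string())
        .collect()
}

/// Phase 2: plan synchronously, consulting only the local snapshot.
fn plan(sql: &str, snapshot: &HashMap<String, Vec<String>>) -> Result<String, String> {
    let mut scans = Vec::new();
    for name in referenced_tables(sql) {
        let cols = snapshot
            .get(&name)
            .ok_or_else(|| format!("table '{name}' not found"))?;
        scans.push(format!("Scan({name}: {})", cols.join(", ")));
    }
    Ok(scans.join(" + "))
}
```

A caller would run `referenced_tables`, pass the result to `fetch_tables` once, and hand the returned map to `plan`, so the number of remote round trips stays at one regardless of how many tables the query mentions.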
-
I think the reasoning is largely historical: SchemaProvider was originally completely sync. #4607 added the necessary async shenanigans so that SchemaProvider::table could be async without forcing the planning machinery to be async as well, which, given planning's highly recursive nature, causes problems. However, this was not extended to other methods, in order to keep the scope of the change down.
I think other methods could possibly be made async; it is just a case of working the async through the various traits and methods. Async is infuriatingly viral in this way, so seemingly simple changes can quickly balloon into quite complex undertakings. I would not be surprised if making table_names async also required changes to the other catalog traits. However, as @alamb describes, the actual meat of planning is already decoupled from these traits, so this might not be totally intractable.
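To make the "viral" point concrete, here is a small std-only sketch of what happens when a recursive walk needs one async lookup at its leaves. It is an illustration under assumptions, not DataFusion's planner: `Plan`, `column_count`, `total_columns`, and the tiny `block_on` executor are all hypothetical. Once the leaf call is `async`, the recursive function has to return a boxed future, and every caller up the chain has to await it.

```rust
use std::future::Future;
use std::pin::Pin;
use std::sync::Arc;
use std::task::{Context, Poll, Wake, Waker};

/// Toy plan tree standing in for a recursive planner structure.
enum Plan {
    Table(String),
    Join(Box<Plan>, Box<Plan>),
}

/// Hypothetical async catalog call; a real one might go over the network.
async fn column_count(table: &str) -> usize {
    match table {
        "orders" => 3,
        "users" => 2,
        _ => 0,
    }
}

/// Because the leaf lookup is async, the recursive walk must be async too,
/// and recursion forces the future to be boxed.
fn total_columns<'a>(plan: &'a Plan) -> Pin<Box<dyn Future<Output = usize> + 'a>> {
    Box::pin(async move {
        match plan {
            Plan::Table(name) => column_count(name).await,
            Plan::Join(left, right) => {
                total_columns(left).await + total_columns(right).await
            }
        }
    })
}

/// Waker that does nothing; enough for futures that never actually suspend.
struct NoopWaker;
impl Wake for NoopWaker {
    fn wake(self: Arc<Self>) {}
}

/// Minimal busy-polling executor so the sketch runs without an async runtime.
fn block_on<F: Future>(future: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWaker));
    let mut cx = Context::from_waker(&waker);
    let mut future = Box::pin(future);
    loop {
        if let Poll::Ready(value) = future.as_mut().poll(&mut cx) {
            return value;
        }
    }
}
```

Resolving the leaves up front and then recursing over plain data, as described earlier in the thread, is one way to keep the recursive part synchronous.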
-
Here is a PR with more information / documentation about this: #8968
-
Related: #8805
One thing that bothers me is that you're telling the DataFusion-using programmers they should walk the Statement and construct a suitable CatalogProvider, only for DataFusion to walk the Statement again and request things from the CatalogProvider. That seems a little silly.
Also note that registering just the referred-to tables in a SchemaProvider is not enough, because
My use case: most likely schema data is cached in-memory already, but fetching it could fail (e.g. data corruption on disk). I still haven't figured out the best design for doing the right amount of work ahead of time, and I'm suffering from potential