Questions about catalog design #8887

web3creator · 2024-01-17T05:18:09Z

web3creator
Jan 17, 2024

/// Represent a list of named catalogs
pub trait CatalogList: Sync + Send {
    /// Returns the catalog list as [`Any`]
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Adds a new catalog to this catalog list
    /// If a catalog of the same name existed before, it is replaced in the list and returned.
    fn register_catalog(
        &self,
        name: String,
        catalog: Arc<dyn CatalogProvider>,
    ) -> Option<Arc<dyn CatalogProvider>>;

    /// Retrieves the list of available catalog names
    fn catalog_names(&self) -> Vec<String>;

    /// Retrieves a specific catalog by name, provided it exists.
    fn catalog(&self, name: &str) -> Option<Arc<dyn CatalogProvider>>;
}



/// Represents a catalog, comprising a number of named schemas.
pub trait CatalogProvider: Sync + Send {
    /// Returns the catalog provider as [`Any`]
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Retrieves the list of available schema names in this catalog.
    fn schema_names(&self) -> Vec<String>;

    /// Retrieves a specific schema from the catalog by name, provided it exists.
    fn schema(&self, name: &str) -> Option<Arc<dyn SchemaProvider>>;

    /// Adds a new schema to this catalog.
    ///
    /// If a schema of the same name existed before, it is replaced in
    /// the catalog and returned.
    ///
    /// By default returns a "Not Implemented" error
    fn register_schema(
        &self,
        name: &str,
        schema: Arc<dyn SchemaProvider>,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        // use variables to avoid unused variable warnings
        let _ = name;
        let _ = schema;
        not_impl_err!("Registering new schemas is not supported")
    }

    /// Removes a schema from this catalog. Implementations of this method should return
    /// errors if the schema exists but cannot be dropped. For example, in DataFusion's
    /// default in-memory catalog, [`MemoryCatalogProvider`], a non-empty schema
    /// will only be successfully dropped when `cascade` is true.
    /// This is equivalent to how DROP SCHEMA works in PostgreSQL.
    ///
    /// Implementations of this method should return None if schema with `name`
    /// does not exist.
    ///
    /// By default returns a "Not Implemented" error
    fn deregister_schema(
        &self,
        _name: &str,
        _cascade: bool,
    ) -> Result<Option<Arc<dyn SchemaProvider>>> {
        not_impl_err!("Deregistering new schemas is not supported")
    }
}

/// Represents a schema, comprising a number of named tables.
#[async_trait]
pub trait SchemaProvider: Sync + Send {
    /// Returns the schema provider as [`Any`](std::any::Any)
    /// so that it can be downcast to a specific implementation.
    fn as_any(&self) -> &dyn Any;

    /// Retrieves the list of available table names in this schema.
    fn table_names(&self) -> Vec<String>;

    /// Retrieves a specific table from the schema by name, provided it exists.
    async fn table(&self, name: &str) -> Option<Arc<dyn TableProvider>>;

    /// If supported by the implementation, adds a new table to this schema.
    /// If a table of the same name existed before, it returns "Table already exists" error.
    #[allow(unused_variables)]
    fn register_table(
        &self,
        name: String,
        table: Arc<dyn TableProvider>,
    ) -> Result<Option<Arc<dyn TableProvider>>> {
        exec_err!("schema provider does not support registering tables")
    }

    /// If supported by the implementation, removes an existing table from this schema and returns it.
    /// If no table of that name exists, returns Ok(None).
    #[allow(unused_variables)]
    fn deregister_table(&self, name: &str) -> Result<Option<Arc<dyn TableProvider>>> {
        exec_err!("schema provider does not support deregistering tables")
    }

    /// If supported by the implementation, checks the table exist in the schema provider or not.
    /// If no matched table in the schema provider, return false.
    /// Otherwise, return true.
    fn table_exist(&self, name: &str) -> bool;
}

1. Why doesn't CatalogList provide a delete function?
2. Why is only SchemaProvider's table function asynchronous, while SchemaProvider.table_names and CatalogList.catalog_names are not allowed to fetch data from a remote source?
3. Why is only the table function of SchemaProvider asynchronous, while the rest of the functions are not? Can the catalog and schema be obtained remotely?

Answered by alamb


alamb · 2024-01-17T21:36:38Z

alamb
Jan 17, 2024
Collaborator

1. Why doesn't CatalogList provide a delete function?

I don't think there is a good reason. Perhaps you could file a new ticket with the request?
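For illustration only, here is a rough sketch of what a delete operation on an in-memory catalog list might look like. The name `deregister_catalog`, its `cascade` flag (borrowed by analogy from `deregister_schema` in the question), the `String` error type, and the `StaticCatalog` and `InMemoryCatalogList` types are all assumptions made for this example; none of them is part of the trait quoted above.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Minimal stand-in for DataFusion's `CatalogProvider` trait.
trait CatalogProvider: Send + Sync {
    fn schema_names(&self) -> Vec<String>;
}

/// Hypothetical catalog with a fixed list of schema names.
struct StaticCatalog {
    schemas: Vec<String>,
}

impl CatalogProvider for StaticCatalog {
    fn schema_names(&self) -> Vec<String> {
        self.schemas.clone()
    }
}

/// In-memory catalog list. The lock provides interior mutability so that
/// methods can take `&self`, as the trait in the question does.
#[derive(Default)]
struct InMemoryCatalogList {
    catalogs: RwLock<HashMap<String, Arc<dyn CatalogProvider>>>,
}

impl InMemoryCatalogList {
    /// Adds a catalog, returning the one it replaced (if any).
    fn register_catalog(
        &self,
        name: String,
        catalog: Arc<dyn CatalogProvider>,
    ) -> Option<Arc<dyn CatalogProvider>> {
        self.catalogs.write().unwrap().insert(name, catalog)
    }

    /// Returns the catalog names in sorted order.
    fn catalog_names(&self) -> Vec<String> {
        let mut names: Vec<String> = self.catalogs.read().unwrap().keys().cloned().collect();
        names.sort();
        names
    }

    /// Hypothetical delete operation (an assumption, not an existing API).
    /// Returns `Ok(None)` if no such catalog exists, and refuses to drop a
    /// catalog that still has schemas unless `cascade` is true.
    fn deregister_catalog(
        &self,
        name: &str,
        cascade: bool,
    ) -> Result<Option<Arc<dyn CatalogProvider>>, String> {
        let mut catalogs = self.catalogs.write().unwrap();
        let non_empty = match catalogs.get(name) {
            None => return Ok(None),
            Some(catalog) => !catalog.schema_names().is_empty(),
        };
        if non_empty && !cascade {
            return Err(format!("catalog '{name}' is not empty"));
        }
        Ok(catalogs.remove(name))
    }
}
```

Returning the removed catalog mirrors how `register_catalog` returns the catalog it replaced, so callers can tell "removed" apart from "was never there".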

2. Why is only SchemaProvider's table function asynchronous, while SchemaProvider.table_names and CatalogList.catalog_names are not allowed to fetch data from a remote source?

The theory is that using a remote procedure call to list table names (and for the other functions) is likely to perform very poorly; most systems would want to batch access to the remote catalog.

What DataFusion itself does to plan SQL queries is walk over the query to find all schema / table references (in an async function, which could potentially access a remote catalog), and then plan with a snapshot of those.

Specifically, here is the call that gets a snapshot of all references:
https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/core/src/execution/context/mod.rs#L1667

It then resolves them all here:
https://github.com/apache/arrow-datafusion/blob/eb81ea299aa7e121bbe244e7e1ab56513d4ef800/datafusion/core/src/execution/context/mod.rs#L1678-L1688
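The "collect references, resolve them asynchronously, then plan against a snapshot" pattern can be sketched roughly as follows. This is a toy model, not DataFusion's code: `RemoteCatalog`, `TableMeta`, `table_references`, `snapshot`, `plan`, and `plan_query` are hypothetical names, the reference finder is a naive keyword scan rather than a real statement walk, and `block_on` is a tiny busy-polling executor included only so the sketch runs without an async runtime.

```rust
use std::collections::HashMap;
use std::future::Future;
use std::pin::pin;
use std::task::{Context, Poll, RawWaker, RawWakerVTable, Waker};

/// Stand-in for table metadata.
#[derive(Debug, Clone, PartialEq)]
struct TableMeta {
    columns: Vec<String>,
}

/// Hypothetical remote catalog. `fetch_table` is async because a real one
/// would do network I/O; this one just reads a local map.
struct RemoteCatalog {
    tables: HashMap<String, TableMeta>,
}

impl RemoteCatalog {
    async fn fetch_table(&self, name: &str) -> Option<TableMeta> {
        self.tables.get(name).cloned()
    }
}

/// Step 1 (sync): find the tables a query refers to. Toy version: take the
/// word after each FROM or JOIN keyword instead of walking a parsed statement.
fn table_references(sql: &str) -> Vec<String> {
    let words: Vec<&str> = sql.split_whitespace().collect();
    let mut refs = Vec::new();
    for pair in words.windows(2) {
        let keyword = pair[0].to_ascii_lowercase();
        if keyword == "from" || keyword == "join" {
            let name = pair[1].trim_matches(|c: char| c == ';' || c == ',').to_string();
            if !refs.contains(&name) {
                refs.push(name);
            }
        }
    }
    refs
}

/// Step 2 (async): resolve every reference up front into a snapshot.
async fn snapshot(catalog: &RemoteCatalog, refs: &[String]) -> HashMap<String, TableMeta> {
    let mut snap = HashMap::new();
    for name in refs {
        if let Some(meta) = catalog.fetch_table(name).await {
            snap.insert(name.clone(), meta);
        }
    }
    snap
}

/// Step 3 (sync): "plan" using only the snapshot; no I/O and no `.await`.
fn plan(refs: &[String], snap: &HashMap<String, TableMeta>) -> Result<Vec<String>, String> {
    refs.iter()
        .map(|name| {
            snap.get(name)
                .map(|meta| format!("scan {name} [{}]", meta.columns.join(", ")))
                .ok_or_else(|| format!("table '{name}' not found"))
        })
        .collect()
}

/// Only this outer function is async; the planner itself stays synchronous.
async fn plan_query(catalog: &RemoteCatalog, sql: &str) -> Result<Vec<String>, String> {
    let refs = table_references(sql);
    let snap = snapshot(catalog, &refs).await;
    plan(&refs, &snap)
}

fn noop_clone(_: *const ()) -> RawWaker {
    RawWaker::new(std::ptr::null(), &NOOP_VTABLE)
}
fn noop_op(_: *const ()) {}
static NOOP_VTABLE: RawWakerVTable = RawWakerVTable::new(noop_clone, noop_op, noop_op, noop_op);

/// Busy-polling executor with a no-op waker. Adequate for this sketch, where
/// futures never actually wait; a real program would use an async runtime.
fn block_on<F: Future>(fut: F) -> F::Output {
    let waker = unsafe { Waker::from_raw(RawWaker::new(std::ptr::null(), &NOOP_VTABLE)) };
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
    }
}
```

Calling `block_on(plan_query(&catalog, "SELECT a FROM t1"))` drives the whole flow. All remote access happens in one batchable step before planning starts, so the recursive planning code never needs to await anything.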

This has come up a number of times, so I will make a PR to try to clarify the rationale in the documentation.

3. Why is only the table function of SchemaProvider asynchronous, while the rest of the functions are not? Can the catalog and schema be obtained remotely?

This is the same answer as for question 2.


tustvold · 2024-01-18T00:42:50Z

tustvold
Jan 18, 2024
Collaborator

I think the reasoning is largely historical: SchemaProvider was originally completely sync. #4607 added the necessary async shenanigans so that SchemaProvider::table could be async without forcing the planning machinery to be async as well, which, given planning's highly recursive nature, causes problems. However, this was not extended to other methods, in order to keep the scope of the change down.

I think other methods could possibly be made async; it is just a case of working the async through the various traits and methods. Async is infuriatingly viral in this way, so seemingly simple changes can quickly balloon into quite complex undertakings. I would not be surprised if making table_names async also required changes to the other catalog traits. However, as @alamb describes, the actual meat of planning is already decoupled from these traits, so this might not be totally intractable.


alamb · 2024-01-23T15:56:49Z

alamb
Jan 23, 2024
Collaborator

Here is a PR with more information / documentation about this: #8968


tv42 · 2024-06-18T18:34:48Z

tv42
Jun 18, 2024

Related: #8805

One thing that bothers me is that you're telling programmers who use DataFusion to walk the Statement and construct a suitable CatalogProvider, only for DataFusion to walk the Statement again and request things from the CatalogProvider. That seems a little silly.

Also note that registering just the referred-to tables in a SchemaProvider is not enough, because information_schema should contain all of them. I guess you could detect that a Statement refers to information_schema and in that case load all table metadata?

My use case: most likely schema data is cached in-memory already, but fetching it could fail (e.g. data corruption on disk). I still haven't figured out the best design for doing the right amount of work ahead of time, and I'm suffering from potential unwrap panics in functions that aren't allowed to return errors.
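One possible way to frame the trade-off described above is to do all fallible loading in a single step that returns a `Result`, and hand the planner a snapshot whose accessors cannot fail. The sketch below assumes a hypothetical `MetadataStore` whose reads can error, a `build_snapshot` helper, and `(schema, table)` pairs already extracted from the statement; it loads everything when `information_schema` is referenced and only the referenced tables otherwise. It is a sketch under those assumptions, not a recommended design.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// Stand-in for table metadata.
#[derive(Debug, Clone, PartialEq)]
struct TableMeta {
    columns: Vec<String>,
}

/// Hypothetical metadata store whose reads can fail (e.g. on-disk corruption).
struct MetadataStore {
    tables: BTreeMap<String, Result<TableMeta, String>>,
}

impl MetadataStore {
    fn all_table_names(&self) -> Vec<String> {
        self.tables.keys().cloned().collect()
    }

    /// `Ok(None)` means "no such table"; `Err` means the read itself failed.
    fn load(&self, name: &str) -> Result<Option<TableMeta>, String> {
        match self.tables.get(name) {
            None => Ok(None),
            Some(Ok(meta)) => Ok(Some(meta.clone())),
            Some(Err(e)) => Err(format!("loading '{name}': {e}")),
        }
    }
}

/// Fully loaded snapshot. Every accessor is infallible, so methods that are
/// not allowed to return errors never need to `unwrap`.
struct SchemaSnapshot {
    tables: BTreeMap<String, TableMeta>,
}

impl SchemaSnapshot {
    fn table_names(&self) -> Vec<String> {
        self.tables.keys().cloned().collect()
    }

    fn table_exist(&self, name: &str) -> bool {
        self.tables.contains_key(name)
    }
}

/// Builds the snapshot ahead of planning. `referenced` holds the
/// (schema, table) pairs found in the statement.
fn build_snapshot(
    store: &MetadataStore,
    referenced: &[(&str, &str)],
) -> Result<SchemaSnapshot, String> {
    // information_schema must list every table, so load all metadata then.
    let needs_all = referenced
        .iter()
        .any(|(schema, _)| schema.eq_ignore_ascii_case("information_schema"));
    let wanted: BTreeSet<String> = if needs_all {
        store.all_table_names().into_iter().collect()
    } else {
        referenced.iter().map(|(_, table)| table.to_string()).collect()
    };

    let mut tables = BTreeMap::new();
    for name in wanted {
        // `?` surfaces load failures here, before any infallible API is used.
        if let Some(meta) = store.load(&name)? {
            tables.insert(name, meta);
        }
    }
    Ok(SchemaSnapshot { tables })
}
```

With this shape, a corrupted table only produces an error for queries that actually need it (or that touch `information_schema`), and the error appears at snapshot construction time rather than as a panic inside `table_names`.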

3 replies

alamb Jun 23, 2024
Collaborator

I think we would welcome help improving the API for your use case. Perhaps you could make an example program showing what you are trying to do in more detail, so we can evaluate some improvements.

I'm not sure if it is 100% related, but there are some recent discussions on making some of the API more public:

https://discord.com/channels/885562378132000778/1166447479609376850/1254114536013824091

tv42 Jun 24, 2024

For sure. I don't have my ducks in a row well enough yet to have a proposal, but I can't stop having shower thoughts about it.

alamb Jun 24, 2024
Collaborator

That is a good sign -- I think some of my best work has started with 🚿 thoughts 😆
