feat: Added a new trait to expose SchemaProvider#1621
feat: Added a new trait to expose SchemaProvider#1621parmesant wants to merge 1 commit intoparseablehq:mainfrom
Conversation
WalkthroughThe changes introduce a pluggable schema provider system via the Changes
Sequence DiagramsequenceDiagram
participant Query
participant SCHEMA_PROVIDER as SCHEMA_PROVIDER<br/>(Global Hook)
participant Provider as ParseableSchemaProvider<br/>(Custom/Default)
participant SessionContext
participant Storage
Query->>SCHEMA_PROVIDER: Check if custom provider configured
alt Custom provider set
SCHEMA_PROVIDER->>Provider: Get configured provider instance
else Default behavior
SCHEMA_PROVIDER->>Provider: Use default GlobalSchemaProvider
end
Provider->>Provider: new_provider(storage, tenant_id)
Provider-->>SessionContext: Return SchemaProvider instance
SessionContext->>Storage: Register schema with provider
SessionContext-->>Query: Schema ready for query execution
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/query/mod.rs (1)
77-100:⚠️ Potential issue | 🟠 MajorPrevent late
SCHEMA_PROVIDERregistration from silently no-oping.
SCHEMA_PROVIDERis only consulted when schemas are registered, butQUERY_SESSIONis a process-wideLazy. If the cell is set after the firstQUERY_SESSIONaccess, the defaultGlobalSchemaProviderremains registered for the lifetime of that session, so the new extension point never takes effect for existing schemas. Please either enforce provider initialization before any session access or rebuild the session context when the provider is installed.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/query/mod.rs` around lines 77 - 100, SCHEMA_PROVIDER can be registered too late and never affect the process-wide QUERY_SESSION (and QUERY_SESSION_STATE) because QUERY_SESSION is a Lazy created with the old provider; fix by ensuring provider initialization happens before any session access or by rebuilding the session when a provider is installed: update the code that sets SCHEMA_PROVIDER to, after successful OnceCell::set, call Query::create_session_context(PARSEABLE.storage()) and replace the stored session/context (QUERY_SESSION or its InMemorySessionContext.session_context) and likewise refresh QUERY_SESSION_STATE via Query::create_session_state(...) so the new ParseableSchemaProvider takes effect for existing sessions.
🧹 Nitpick comments (2)
src/lib.rs (1)
59-64: Clarify stability expectations for these new crate-root re-exports.At Line 59, Line 60, Line 61, Line 62, Line 64, and Line 73, these
pub useadditions expand the public API surface. Please document whether they are part of a stable contract (or move them under a dedicated namespace) to avoid accidental long-term semver lock-in.Also applies to: 73-73
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/lib.rs` around lines 59 - 64, These new crate-root re-exports (arrow_array, arrow_flight, arrow_ipc, catalog as parseable_catalog, datafusion, datafusion_proto) expand the public API surface; either mark them explicitly as unstable/internal or move them under a dedicated namespace/module (e.g., reexports::arrow::*) and add a clear doc-comment on each symbol indicating stability guarantees (stable API vs internal/experimental) so consumers won’t be accidentally semver-locked; update the lib.rs entries for the listed pub use items to point to the new module or add #[doc = "... stability: ..."] comments and/or cfg(feature = "internal-reexports") gating as appropriate.src/query/stream_schema_provider.rs (1)
701-701: Document the new public helper.
try_from_expris now part of the public surface. Please add rustdoc describing the accepted expression shapes and thetime_partition == Nonebehavior so custom schema providers can depend on it safely.🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/query/stream_schema_provider.rs` at line 701, Add Rustdoc for the newly public helper try_from_expr in stream_schema_provider.rs: describe the accepted expression shapes (e.g., exact/match on column names, literal values, supported Expr variants) and any constraints the function assumes from Expr, and explicitly document the behavior when time_partition is None (what is returned/assumed and how partition-sensitive logic behaves). Reference the function name try_from_expr and the Expr type in the doc so external/custom schema providers know how to call it and what results or None means; keep the doc concise and include examples of expected expression shapes in prose.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/query/mod.rs`:
- Around line 211-239: The schema provider registration is using
PARSEABLE.storage().get_object_store() instead of the storage passed into
create_session_context(storage), so schemas get registered against the wrong
backend; update both branches where SCHEMA_PROVIDER.new_provider and
GlobalSchemaProvider are constructed to use the caller's storage (the function
parameter named storage) by passing Some(storage.get_object_store()) or
storage.get_object_store() and tenant_id as before, and then call
catalog.register_schema as-is so the catalog is registered against the provided
storage rather than PARSEABLE.storage().
---
Outside diff comments:
In `@src/query/mod.rs`:
- Around line 77-100: SCHEMA_PROVIDER can be registered too late and never
affect the process-wide QUERY_SESSION (and QUERY_SESSION_STATE) because
QUERY_SESSION is a Lazy created with the old provider; fix by ensuring provider
initialization happens before any session access or by rebuilding the session
when a provider is installed: update the code that sets SCHEMA_PROVIDER to,
after successful OnceCell::set, call
Query::create_session_context(PARSEABLE.storage()) and replace the stored
session/context (QUERY_SESSION or its InMemorySessionContext.session_context)
and likewise refresh QUERY_SESSION_STATE via Query::create_session_state(...) so
the new ParseableSchemaProvider takes effect for existing sessions.
---
Nitpick comments:
In `@src/lib.rs`:
- Around line 59-64: These new crate-root re-exports (arrow_array, arrow_flight,
arrow_ipc, catalog as parseable_catalog, datafusion, datafusion_proto) expand
the public API surface; either mark them explicitly as unstable/internal or move
them under a dedicated namespace/module (e.g., reexports::arrow::*) and add a
clear doc-comment on each symbol indicating stability guarantees (stable API vs
internal/experimental) so consumers won’t be accidentally semver-locked; update
the lib.rs entries for the listed pub use items to point to the new module or
add #[doc = "... stability: ..."] comments and/or cfg(feature =
"internal-reexports") gating as appropriate.
In `@src/query/stream_schema_provider.rs`:
- Line 701: Add Rustdoc for the newly public helper try_from_expr in
stream_schema_provider.rs: describe the accepted expression shapes (e.g.,
exact/match on column names, literal values, supported Expr variants) and any
constraints the function assumes from Expr, and explicitly document the behavior
when time_partition is None (what is returned/assumed and how
partition-sensitive logic behaves). Reference the function name try_from_expr
and the Expr type in the doc so external/custom schema providers know how to
call it and what results or None means; keep the doc concise and include
examples of expected expression shapes in prose.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository UI
Review profile: CHILL
Plan: Pro
Run ID: 2b7d7002-715b-4e6c-a3c9-30b2bb3d8300
📒 Files selected for processing (4)
Cargo.tomlsrc/lib.rssrc/query/mod.rssrc/query/stream_schema_provider.rs
| let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() { | ||
| provider.new_provider( | ||
| Some(PARSEABLE.storage().get_object_store()), | ||
| &Some(t.to_owned()), | ||
| ) | ||
| } else { | ||
| Box::new(GlobalSchemaProvider { | ||
| storage: PARSEABLE.storage().get_object_store(), | ||
| tenant_id: Some(t.to_owned()), | ||
| }) | ||
| }; | ||
| let _ = catalog.register_schema(t, schema_provider.into()); | ||
| } | ||
| } | ||
| } else { | ||
| // register just one schema | ||
| let schema_provider = Arc::new(GlobalSchemaProvider { | ||
| storage: storage.get_object_store(), | ||
| tenant_id: None, | ||
| }); | ||
| let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() { | ||
| provider.new_provider(Some(PARSEABLE.storage().get_object_store()), &None) | ||
| } else { | ||
| Box::new(GlobalSchemaProvider { | ||
| storage: PARSEABLE.storage().get_object_store(), | ||
| tenant_id: None, | ||
| }) | ||
| }; | ||
|
|
||
| let _ = catalog.register_schema( | ||
| &state.config_options().catalog.default_schema, | ||
| schema_provider, | ||
| schema_provider.into(), | ||
| ); |
There was a problem hiding this comment.
Pass the caller's storage into the schema provider.
create_session_context(storage) accepts an ObjectStorageProvider, but both branches still hand PARSEABLE.storage().get_object_store() to the provider. Any caller creating a context over a different storage backend will register schemas against the wrong object store.
Suggested fix
pub fn create_session_context(storage: Arc<dyn ObjectStorageProvider>) -> SessionContext {
let state = Self::create_session_state(storage.clone());
+ let object_store = storage.get_object_store();
let catalog = state
.catalog_list()
.catalog(&state.config_options().catalog.default_catalog)
.expect("default catalog is provided by datafusion");
if PARSEABLE.options.is_multi_tenant() {
if let Some(tenants) = PARSEABLE.list_tenants() {
for t in tenants.iter() {
let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
provider.new_provider(
- Some(PARSEABLE.storage().get_object_store()),
+ Some(object_store.clone()),
&Some(t.to_owned()),
)
} else {
Box::new(GlobalSchemaProvider {
- storage: PARSEABLE.storage().get_object_store(),
+ storage: object_store.clone(),
tenant_id: Some(t.to_owned()),
})
};
let _ = catalog.register_schema(t, schema_provider.into());
}
}
} else {
let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
- provider.new_provider(Some(PARSEABLE.storage().get_object_store()), &None)
+ provider.new_provider(Some(object_store.clone()), &None)
} else {
Box::new(GlobalSchemaProvider {
- storage: PARSEABLE.storage().get_object_store(),
+ storage: object_store,
tenant_id: None,
})
};
let _ = catalog.register_schema(
&state.config_options().catalog.default_schema,
schema_provider.into(),
);
}
SessionContext::new_with_state(state)
}🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@src/query/mod.rs` around lines 211 - 239, The schema provider registration is
using PARSEABLE.storage().get_object_store() instead of the storage passed into
create_session_context(storage), so schemas get registered against the wrong
backend; update both branches where SCHEMA_PROVIDER.new_provider and
GlobalSchemaProvider are constructed to use the caller's storage (the function
parameter named storage) by passing Some(storage.get_object_store()) or
storage.get_object_store() and tenant_id as before, and then call
catalog.register_schema as-is so the catalog is registered against the provided
storage rather than PARSEABLE.storage().
Fixes #XXXX.
Description
This PR has:
Summary by CodeRabbit
New Features
Chores