Skip to content

feat: Added a new trait to expose SchemaProvider#1621

Open
parmesant wants to merge 1 commit intoparseablehq:mainfrom
parmesant:query-updates
Open

feat: Added a new trait to expose SchemaProvider#1621
parmesant wants to merge 1 commit intoparseablehq:mainfrom
parmesant:query-updates

Conversation

@parmesant
Copy link
Copy Markdown
Contributor

@parmesant parmesant commented Apr 15, 2026

Fixes #XXXX.

Description


This PR has:

  • been tested to ensure log ingestion and log query works.
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added documentation for new or modified features or behaviors.

Summary by CodeRabbit

  • New Features

    • Added customizable schema provider system for flexible schema management across different use cases
  • Chores

    • Updated dependencies to support additional data interchange formats
    • Enhanced public API for improved library integration and extensibility

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai bot commented Apr 15, 2026

Walkthrough

The changes introduce a pluggable schema provider system via the ParseableSchemaProvider trait and global configuration hook, expand the public API through new module re-exports in lib.rs, increase visibility of internal query methods, and add a new dependency on datafusion-proto.

Changes

Cohort / File(s) Summary
Dependency Management
Cargo.toml
Added direct dependency datafusion-proto = "51.0.0" alongside existing datafusion dependency.
Public API Expansion
src/lib.rs
Introduced six new public re-exports: arrow_array, arrow_flight, arrow_ipc, datafusion_proto, and aliased exports for catalog (as parseable_catalog) and utils (as parseable_utils).
Schema Provider System
src/query/mod.rs, src/query/stream_schema_provider.rs
Implemented pluggable schema provider architecture by adding ParseableSchemaProvider trait and SCHEMA_PROVIDER global static; updated schema registration to use custom provider when configured; made create_session_state public; refactored Query::execute to reduce repeated context lookups; exposed PartialTimeFilter::try_from_expr as public API.

Sequence Diagram

sequenceDiagram
    participant Query
    participant SCHEMA_PROVIDER as SCHEMA_PROVIDER<br/>(Global Hook)
    participant Provider as ParseableSchemaProvider<br/>(Custom/Default)
    participant SessionContext
    participant Storage

    Query->>SCHEMA_PROVIDER: Check if custom provider configured
    alt Custom provider set
        SCHEMA_PROVIDER->>Provider: Get configured provider instance
    else Default behavior
        SCHEMA_PROVIDER->>Provider: Use default GlobalSchemaProvider
    end
    Provider->>Provider: new_provider(storage, tenant_id)
    Provider-->>SessionContext: Return SchemaProvider instance
    SessionContext->>Storage: Register schema with provider
    SessionContext-->>Query: Schema ready for query execution
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Suggested reviewers

  • nitisht

Poem

🐰 A schema hooks into place, quite neat,
The provider pattern makes it complete,
With traits and statics taking flight,
The API grows—what a delight!
Re-exports bloom, the crate's so bright! 🌿

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (2 warnings)

Check name Status Explanation Resolution
Description check ⚠️ Warning The PR description uses only the template structure with no actual content filled in—all sections remain empty or as placeholders, providing no meaningful information about goals, rationale, or key changes. Fill in the Description section with the PR goal, rationale for the changes, and specific key changes made. Update the checklist to reflect whether testing and documentation requirements were met.
Docstring Coverage ⚠️ Warning Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (1 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main feature added: a new trait to expose SchemaProvider, which aligns with the primary changes in src/query/mod.rs.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/query/mod.rs (1)

77-100: ⚠️ Potential issue | 🟠 Major

Prevent late SCHEMA_PROVIDER registration from silently no-oping.

SCHEMA_PROVIDER is only consulted when schemas are registered, but QUERY_SESSION is a process-wide Lazy. If the cell is set after the first QUERY_SESSION access, the default GlobalSchemaProvider remains registered for the lifetime of that session, so the new extension point never takes effect for existing schemas. Please either enforce provider initialization before any session access or rebuild the session context when the provider is installed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/query/mod.rs` around lines 77 - 100, SCHEMA_PROVIDER can be registered
too late and never affect the process-wide QUERY_SESSION (and
QUERY_SESSION_STATE) because QUERY_SESSION is a Lazy created with the old
provider; fix by ensuring provider initialization happens before any session
access or by rebuilding the session when a provider is installed: update the
code that sets SCHEMA_PROVIDER to, after successful OnceCell::set, call
Query::create_session_context(PARSEABLE.storage()) and replace the stored
session/context (QUERY_SESSION or its InMemorySessionContext.session_context)
and likewise refresh QUERY_SESSION_STATE via Query::create_session_state(...) so
the new ParseableSchemaProvider takes effect for existing sessions.
🧹 Nitpick comments (2)
src/lib.rs (1)

59-64: Clarify stability expectations for these new crate-root re-exports.

At Line 59, Line 60, Line 61, Line 62, Line 64, and Line 73, these pub use additions expand the public API surface. Please document whether they are part of a stable contract (or move them under a dedicated namespace) to avoid accidental long-term semver lock-in.

Also applies to: 73-73

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/lib.rs` around lines 59 - 64, These new crate-root re-exports
(arrow_array, arrow_flight, arrow_ipc, catalog as parseable_catalog, datafusion,
datafusion_proto) expand the public API surface; either mark them explicitly as
unstable/internal or move them under a dedicated namespace/module (e.g.,
reexports::arrow::*) and add a clear doc-comment on each symbol indicating
stability guarantees (stable API vs internal/experimental) so consumers won’t be
accidentally semver-locked; update the lib.rs entries for the listed pub use
items to point to the new module or add #[doc = "... stability: ..."] comments
and/or cfg(feature = "internal-reexports") gating as appropriate.
src/query/stream_schema_provider.rs (1)

701-701: Document the new public helper.

try_from_expr is now part of the public surface. Please add rustdoc describing the accepted expression shapes and the time_partition == None behavior so custom schema providers can depend on it safely.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/query/stream_schema_provider.rs` at line 701, Add Rustdoc for the newly
public helper try_from_expr in stream_schema_provider.rs: describe the accepted
expression shapes (e.g., exact/match on column names, literal values, supported
Expr variants) and any constraints the function assumes from Expr, and
explicitly document the behavior when time_partition is None (what is
returned/assumed and how partition-sensitive logic behaves). Reference the
function name try_from_expr and the Expr type in the doc so external/custom
schema providers know how to call it and what results or None means; keep the
doc concise and include examples of expected expression shapes in prose.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@src/query/mod.rs`:
- Around line 211-239: The schema provider registration is using
PARSEABLE.storage().get_object_store() instead of the storage passed into
create_session_context(storage), so schemas get registered against the wrong
backend; update both branches where SCHEMA_PROVIDER.new_provider and
GlobalSchemaProvider are constructed to use the caller's storage (the function
parameter named storage) by passing Some(storage.get_object_store()) or
storage.get_object_store() and tenant_id as before, and then call
catalog.register_schema as-is so the catalog is registered against the provided
storage rather than PARSEABLE.storage().

---

Outside diff comments:
In `@src/query/mod.rs`:
- Around line 77-100: SCHEMA_PROVIDER can be registered too late and never
affect the process-wide QUERY_SESSION (and QUERY_SESSION_STATE) because
QUERY_SESSION is a Lazy created with the old provider; fix by ensuring provider
initialization happens before any session access or by rebuilding the session
when a provider is installed: update the code that sets SCHEMA_PROVIDER to,
after successful OnceCell::set, call
Query::create_session_context(PARSEABLE.storage()) and replace the stored
session/context (QUERY_SESSION or its InMemorySessionContext.session_context)
and likewise refresh QUERY_SESSION_STATE via Query::create_session_state(...) so
the new ParseableSchemaProvider takes effect for existing sessions.

---

Nitpick comments:
In `@src/lib.rs`:
- Around line 59-64: These new crate-root re-exports (arrow_array, arrow_flight,
arrow_ipc, catalog as parseable_catalog, datafusion, datafusion_proto) expand
the public API surface; either mark them explicitly as unstable/internal or move
them under a dedicated namespace/module (e.g., reexports::arrow::*) and add a
clear doc-comment on each symbol indicating stability guarantees (stable API vs
internal/experimental) so consumers won’t be accidentally semver-locked; update
the lib.rs entries for the listed pub use items to point to the new module or
add #[doc = "... stability: ..."] comments and/or cfg(feature =
"internal-reexports") gating as appropriate.

In `@src/query/stream_schema_provider.rs`:
- Line 701: Add Rustdoc for the newly public helper try_from_expr in
stream_schema_provider.rs: describe the accepted expression shapes (e.g.,
exact/match on column names, literal values, supported Expr variants) and any
constraints the function assumes from Expr, and explicitly document the behavior
when time_partition is None (what is returned/assumed and how
partition-sensitive logic behaves). Reference the function name try_from_expr
and the Expr type in the doc so external/custom schema providers know how to
call it and what results or None means; keep the doc concise and include
examples of expected expression shapes in prose.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Repository UI

Review profile: CHILL

Plan: Pro

Run ID: 2b7d7002-715b-4e6c-a3c9-30b2bb3d8300

📥 Commits

Reviewing files that changed from the base of the PR and between eacb1b9 and 2aa314b.

📒 Files selected for processing (4)
  • Cargo.toml
  • src/lib.rs
  • src/query/mod.rs
  • src/query/stream_schema_provider.rs

Comment thread src/query/mod.rs
Comment on lines +211 to 239
let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
provider.new_provider(
Some(PARSEABLE.storage().get_object_store()),
&Some(t.to_owned()),
)
} else {
Box::new(GlobalSchemaProvider {
storage: PARSEABLE.storage().get_object_store(),
tenant_id: Some(t.to_owned()),
})
};
let _ = catalog.register_schema(t, schema_provider.into());
}
}
} else {
// register just one schema
let schema_provider = Arc::new(GlobalSchemaProvider {
storage: storage.get_object_store(),
tenant_id: None,
});
let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
provider.new_provider(Some(PARSEABLE.storage().get_object_store()), &None)
} else {
Box::new(GlobalSchemaProvider {
storage: PARSEABLE.storage().get_object_store(),
tenant_id: None,
})
};

let _ = catalog.register_schema(
&state.config_options().catalog.default_schema,
schema_provider,
schema_provider.into(),
);
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Pass the caller's storage into the schema provider.

create_session_context(storage) accepts an ObjectStorageProvider, but both branches still hand PARSEABLE.storage().get_object_store() to the provider. Any caller creating a context over a different storage backend will register schemas against the wrong object store.

Suggested fix
 pub fn create_session_context(storage: Arc<dyn ObjectStorageProvider>) -> SessionContext {
     let state = Self::create_session_state(storage.clone());
+    let object_store = storage.get_object_store();

     let catalog = state
         .catalog_list()
         .catalog(&state.config_options().catalog.default_catalog)
         .expect("default catalog is provided by datafusion");

     if PARSEABLE.options.is_multi_tenant() {
         if let Some(tenants) = PARSEABLE.list_tenants() {
             for t in tenants.iter() {
                 let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
                     provider.new_provider(
-                        Some(PARSEABLE.storage().get_object_store()),
+                        Some(object_store.clone()),
                         &Some(t.to_owned()),
                     )
                 } else {
                     Box::new(GlobalSchemaProvider {
-                        storage: PARSEABLE.storage().get_object_store(),
+                        storage: object_store.clone(),
                         tenant_id: Some(t.to_owned()),
                     })
                 };
                 let _ = catalog.register_schema(t, schema_provider.into());
             }
         }
     } else {
         let schema_provider = if let Some(provider) = SCHEMA_PROVIDER.get() {
-            provider.new_provider(Some(PARSEABLE.storage().get_object_store()), &None)
+            provider.new_provider(Some(object_store.clone()), &None)
         } else {
             Box::new(GlobalSchemaProvider {
-                storage: PARSEABLE.storage().get_object_store(),
+                storage: object_store,
                 tenant_id: None,
             })
         };

         let _ = catalog.register_schema(
             &state.config_options().catalog.default_schema,
             schema_provider.into(),
         );
     }

     SessionContext::new_with_state(state)
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@src/query/mod.rs` around lines 211 - 239, The schema provider registration is
using PARSEABLE.storage().get_object_store() instead of the storage passed into
create_session_context(storage), so schemas get registered against the wrong
backend; update both branches where SCHEMA_PROVIDER.new_provider and
GlobalSchemaProvider are constructed to use the caller's storage (the function
parameter named storage) by passing Some(storage.get_object_store()) or
storage.get_object_store() and tenant_id as before, and then call
catalog.register_schema as-is so the catalog is registered against the provided
storage rather than PARSEABLE.storage().

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant