Skip to content

Conversation

slvrtrn
Copy link
Contributor

@slvrtrn slvrtrn commented May 20, 2025

Summary

Warning

This PR implements RBWNAT for Query only; Insert should be implemented as a follow-up.

First of all, let's abbreviate RowBinaryWithNamesAndTypes format as RBWNAT, and the regular RowBinary as just RB for simplicity.

There is a significant amount of issues created in the repo regarding schema incompatibility or obscure error messages in the repository (see the full list below). The reason is that the deserialization is effectively implemented in a "data-driven" way, where the user structures dictate the way the stream in RB should be (de)serialized, so it is possible to have a hiccup where two UInt32 may be deserialized as a single UInt64, which in worst case scenario may lead to corrupted data. For example:

This test will deserialize a wrong value on the main branch, cause DateTime64 is streamed as 8 bytes (Int64), and 2x(U)Int32 are also streamed as 8 bytes in total. It correctly throws an error on this branch now with enabled validation mode.

#[tokio::test]
#[cfg(feature = "time")]
async fn test_serde_with() {
    #[derive(Debug, Row, Serialize, Deserialize, PartialEq)]
    struct Data {
        #[serde(with = "clickhouse::serde::time::datetime64::millis")]
        n1: OffsetDateTime, // underlying is still Int64; should not compose it from two (U)Int32
    }

    let client = prepare_database!().with_struct_validation_mode(StructValidationMode::EachRow);
    let result = client
        .query("SELECT 42 :: UInt32 AS n1, 144 :: Int32 AS n2")
        .fetch_one::<Data>()
        .await;

    assert!(result.is_err());
    assert!(matches!(
        result.unwrap_err(),
        Error::InvalidColumnDataType { .. }
    ));
}

This PR introduces:

  • RBWNAT format usage instead of RB by default, which allows for stronger type safety guarantess. It is still possible to disable the validation using Client::with_disabled_validation if you really, really want it. In that case, RowBinary format is used as before.
  • New types internal crate that contains utils to deal with RBWNAT and Native data types strings parsing into a proper AST. Rustified from https://github.com/ClickHouse/clickhouse-js/blob/main/packages/client-common/src/parse/column_types.ts, but not entirely. The most important part is the correctness and the tests, the actual implementation detail can be adjusted in the follow-up.
  • An ability to conveniently deserialize map as a HashMap<K, V>, and not only as Vec<(K, V)>, which was confusing.
  • Clearer error messages and panics when a schema mismatch is detected.
  • A lot of tests covering various schema-related scenarios (see rbwnat.rs).
  • Support for "shuffled" structure definition, where the order of the fields does not match the DB, but the names and types are correct. If the validator detects wrong order of the fields, deserialize_struct switches from SeqAccess to a custom MapAccess that takes the schema field order into account. Performance loss in that case is only around 10%.

Source files to look at:

  • data_types.rs - RBWNAT/Native data types strings parsed as proper AST. See the tests for more output examples, like Variant data type parsing.
  • rbwnat.rs - main integration tests with tons of scenarios. Some scenarios will require insert to be implemented as well, so that's a follow-up.
  • validation.rs - validating Serde calls against the data types provided in the RBWNAT format header.

Current benchmarks results

Select numbers

This branch:

compress  validation    elapsed  throughput  received
    none    disabled     1.610s  2370 MiB/s  3815 MiB
    none     enabled     2.655s  1437 MiB/s  3815 MiB

Main branch:

compress  elapsed  throughput  received
    none   1.296s  2943 MiB/s  3815 MiB

Selecting a ton of records from system.numbers is strictly worse. It is still not clear to me why, cause validation can be simply disabled, and the code is more or less the as in the main, adding only one boolean check.

However...

NYC taxi data

This branch after ccfac33:

compress  validation    elapsed  throughput  received  access
    none    disabled  556.938ms   609 MiB/s   339 MiB     seq
    none     enabled  866.077ms   392 MiB/s   339 MiB     seq
    none     enabled  947.789ms   358 MiB/s   339 MiB     map
     lz4    disabled  669.376ms   507 MiB/s   151 MiB     seq
     lz4     enabled  994.901ms   341 MiB/s   151 MiB     seq
     lz4     enabled     1.103s   307 MiB/s   151 MiB     map

Main branch (using version from #227):

compress  elapsed  throughput  received
    none  939.392ms   361 MiB/s   339 MiB
     lz4  943.551ms   360 MiB/s   151 MiB

Issues overview

Note

If an issue is checked in the list, that means there is also a test that demonstrates proper error messages in case of schema mismatch.

Resolved issues

Related issues

Previously closed issues with unclear error messages

Follow-up issues

@mshustov mshustov requested review from Copilot and loyd May 20, 2025 12:36
Copilot

This comment was marked as resolved.

Copy link
Contributor Author

@slvrtrn slvrtrn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a few comments regarding the intermediate implementation.

@slvrtrn slvrtrn changed the title PoC (Query): RowBinaryWithNamesAndTypes for enchanced type safety feat (query): RowBinaryWithNamesAndTypes for enchanced type safety May 29, 2025
@slvrtrn slvrtrn changed the title feat (query): RowBinaryWithNamesAndTypes for enchanced type safety feat(query): RowBinaryWithNamesAndTypes for enchanced type safety May 29, 2025
fn deserialize_bool<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
if Validator::VALIDATION {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it not removed for empty validate()?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe this is not needed indeed, if everything inlines properly..

fn deserialize_unit<V: Visitor<'data>>(self, visitor: V) -> Result<V::Value> {
// TODO: revise this.
// TODO - skip validation?
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, it seems that it can break validation? Shouldn't we return Unsupported for this case?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have no idea what is the use case for this method, TBH :)

join_panic_schema_hint(&columns),
);
}
AccessType::WithSeqAccess // ignored
Copy link
Collaborator

@loyd loyd Jun 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case of (Row, T) it's impossible to get WithMapAccess, can be considered as a bug

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will address in a follow-up

@slvrtrn slvrtrn changed the title feat(query): RowBinaryWithNamesAndTypes feat!(query): RowBinaryWithNamesAndTypes Jun 23, 2025
@slvrtrn slvrtrn merged commit 466b873 into main Jun 23, 2025
6 checks passed
@aaronvg
Copy link

aaronvg commented Jun 26, 2025

I know this was just merged, but is there an expected date on when this would be released?

@slvrtrn
Copy link
Contributor Author

slvrtrn commented Jun 27, 2025

@aaronvg, no estimations now, sorry. At the very least, we need to finalize and merge #244, and probably a few smaller follow-ups, then we will be ready to release a public beta.

@sxlijin
Copy link

sxlijin commented Jul 17, 2025

Clearer error messages and panics when a schema mismatch is detected.

Hi: as a user this is very concerning. Is it possible to add deserialization failure as a separate error variant on the fetch_* methods? Otherwise we have no way of handling this, and we're running into this regularly enough during our development that I fully expect to hit this in prod.

@slvrtrn
Copy link
Contributor Author

slvrtrn commented Jul 18, 2025

Hi: as a user this is very concerning. Is it possible to add deserialization failure as a separate error variant on the fetch_* methods? Otherwise we have no way of handling this, and we're running into this regularly enough during our development that I fully expect to hit this in prod.

we can do that, but isn't it against the Rust book itself? https://doc.rust-lang.org/book/ch09-03-to-panic-or-not-to-panic.html

That's a classic case of data corruption, no?
What is your plan on recovering if your struct is defined incorrectly in the code? So if you hit it on dev, then you will know how to exactly fix this, as panic will say what is wrong with the struct definition.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
6 participants