Kafka Connect: Support VARIANT when record convert #15283
seokyun-ha-toss wants to merge 4 commits into apache:main from
Conversation
…ersion methods for nested structures
```java
    return convertLocalDateTime(value);
  }

  protected Variant convertVariantValue(Object value) {
```
Is there a reason this is protected? If it is intended to be overridden it would be good to document the contract that implementors must follow (and what the default implementation does).
I guess other methods in this class don't do that, but seems like maybe we should start?
I followed the existing pattern of the convertXXX() methods, which are all defined as protected. For consistency, I kept the same visibility here.
Please let me know if you think it should be adjusted.
```java
    if (value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      List<String> names = Lists.newArrayList();
      map.keySet().stream().map(Object::toString).forEach(names::add);
```
Why not do the iteration once by visiting the entry set below?
This also seems more consistent with the loop below on line 509
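The single-pass suggestion can be sketched with plain Java (an illustration, not the PR's actual code): visiting `entrySet()` yields each key and its value together, so the stringified names and the converted values can come from one traversal instead of two.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class EntrySetPass {
  // Collects stringified keys while the values are available in the same
  // iteration; in the real converter each value would be converted here too.
  static List<String> names(Map<?, ?> map) {
    List<String> names = new ArrayList<>();
    for (Map.Entry<?, ?> entry : map.entrySet()) {
      names.add(String.valueOf(entry.getKey()));
      // entry.getValue() is in hand here -- no second lookup or second pass
    }
    return names;
  }

  public static void main(String[] args) {
    Map<String, Object> map = new LinkedHashMap<>();
    map.put("a", 1);
    map.put("b", "x");
    System.out.println(names(map)); // [a, b]
  }
}
```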
```java
    return convertLocalDateTime(value);
  }

  protected Variant convertVariantValue(Object value) {
```
I'm not sure if it is generally done in this project but it might be good to consider having protection against stack overflow on recursion
If we're recursing through the schema/object structure, we don't assume the need to protect against stack overflow (either you can traverse the structure or not). I guess there may be some way to construct an object that will cycle, but that's not typical with how the object mapping works in kafka (it would probably fail upstream of this point anyway).
Good point. However, as @danielcweeks mentioned, actual Kafka values don't have cycles, so the recursion ends at some point.
In my opinion, the stack overflow concern is unavoidable when traversing arbitrarily nested values. It depends on the size and depth of the input data, and in many cases we cannot control it.
Also, some methods (convertValue(), convertStructValue(), convertListValue()) already work recursively.
My concern was more around very deeply nested values (not necessarily infinite recursion), but if we don't check for these elsewhere in the project, then I have no concern about not doing so here.
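For reference, a depth cap is cheap to add if the project ever wants one. The sketch below is hypothetical and not part of this PR (MAX_DEPTH is an assumed limit): it fails fast with a clear error on pathologically deep input instead of a StackOverflowError.

```java
import java.util.List;
import java.util.Map;

public class DepthGuard {
  static final int MAX_DEPTH = 1000; // assumed limit, not from the PR

  // Walks maps and lists, throwing once nesting exceeds MAX_DEPTH.
  static void checkDepth(Object value, int depth) {
    if (depth > MAX_DEPTH) {
      throw new IllegalArgumentException("Value nested deeper than " + MAX_DEPTH);
    }
    if (value instanceof Map) {
      for (Object v : ((Map<?, ?>) value).values()) {
        checkDepth(v, depth + 1);
      }
    } else if (value instanceof List) {
      for (Object v : (List<?>) value) {
        checkDepth(v, depth + 1);
      }
    }
  }

  static boolean withinLimit(Object value) {
    try {
      checkDepth(value, 0);
      return true;
    } catch (IllegalArgumentException e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Object deep = 0;
    for (int i = 0; i < MAX_DEPTH + 10; i++) {
      deep = List.of(deep); // build a list nested past the limit
    }
    System.out.println(withinLimit(Map.of("a", List.of(1)))); // true
    System.out.println(withinLimit(deep)); // false
  }
}
```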
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/RecordConverter.java
```java
  /**
   * Collects all field names (map keys) from the entire object tree. Used to build a single
   * VariantMetadata for the whole Variant (required for nested maps).
   */
  private static List<String> collectFieldNames(Object value) {
```
It seems like you could avoid some object churn by passing a Set down through the recursion and adding elements to it as you descend?
+1 to @emkornfield's comment. We should avoid navigating the object multiple times.
Great! I'll apply your feedback! Thanks!
I'm not sure a second pass is strictly avoidable, since it looks like we are trying to sort keys globally?
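If the goal is globally sorted, de-duplicated names, a single traversal that accumulates into a TreeSet gets both properties without per-level intermediate lists. A stdlib-only sketch, not the PR's code:

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

public class FieldNames {
  // Accumulates every map key in the tree into one sorted, de-duplicated set,
  // visiting each node exactly once.
  static void collect(Object value, SortedSet<String> names) {
    if (value instanceof Map) {
      for (Map.Entry<?, ?> e : ((Map<?, ?>) value).entrySet()) {
        names.add(String.valueOf(e.getKey()));
        collect(e.getValue(), names);
      }
    } else if (value instanceof List) {
      for (Object v : (List<?>) value) {
        collect(v, names);
      }
    }
  }

  public static void main(String[] args) {
    SortedSet<String> names = new TreeSet<>();
    collect(Map.of("user", Map.of("name", "alice", "address", Map.of("city", "Seoul"))), names);
    System.out.println(names); // [address, city, name, user]
  }
}
```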
```java
    if (value instanceof byte[]) {
      return Variants.of(ByteBuffer.wrap((byte[]) value));
    }
    if (value instanceof BigDecimal) {
```
I think this can be folded underneath Number handling?
```java
    if (num.doubleValue() == num.longValue()) {
      return Variants.of(num.longValue());
    }
    return Variants.of(num.doubleValue());
```
It seems safer to throw an exception here, otherwise we potentially have data loss (or at least this should be optional)?
Or we should convert it to BigDecimal.
I agree with failing if it's not a known type. We shouldn't try to massage numeric values if we don't know exactly what is being represented.
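A strict version of the numeric branch, sketched with stdlib types only (hedged: the real code would return Variants values; the point here is just the exhaustive instanceof chain with a hard failure at the end, rather than coercing via doubleValue()/longValue()):

```java
import java.math.BigDecimal;
import java.math.BigInteger;

public class StrictNumbers {
  // Maps known Number subclasses to an exact representation and refuses
  // anything else, so unexpected numeric types fail loudly instead of
  // being silently narrowed.
  static Object classify(Number num) {
    if (num instanceof Byte || num instanceof Short
        || num instanceof Integer || num instanceof Long) {
      return num.longValue();
    }
    if (num instanceof Float || num instanceof Double) {
      return num.doubleValue();
    }
    if (num instanceof BigDecimal) {
      return num; // keep the exact decimal, no lossy double round-trip
    }
    throw new IllegalArgumentException(
        "Unsupported numeric type: " + num.getClass().getName());
  }

  public static void main(String[] args) {
    System.out.println(classify(42)); // 42
    System.out.println(classify(1.5)); // 1.5
    try {
      classify(BigInteger.ONE);
    } catch (IllegalArgumentException e) {
      System.out.println("rejected: " + e.getMessage());
    }
  }
}
```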
```java
    assertThat(updateMap.get("st.ff").type()).isInstanceOf(DoubleType.class);
  }

  @Test
```
I think it might be more maintainable to write tests against convertVariantValue instead of the static helpers.
Also it seems like there should be tests for convertVariantValue specifically?
kafka-connect/kafka-connect/src/main/java/org/apache/iceberg/connect/data/RecordConverter.java
```java
    if (value instanceof Map) {
      Map<?, ?> map = (Map<?, ?>) value;
      List<String> names = Lists.newArrayList();
      map.keySet().stream().map(Object::toString).forEach(names::add);
```
Variant keys must be strings, but an arbitrary map passed in through KC does not have that requirement, so we should protect against it.
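One way to enforce that (an illustrative sketch, not the PR's code): validate each key before using it as a Variant field name, so a map with e.g. Integer keys fails with a clear message instead of being silently stringified by Object::toString.

```java
public class KeyCheck {
  // Variant object field names must be strings; reject anything else.
  static String fieldName(Object key) {
    if (key instanceof String) {
      return (String) key;
    }
    throw new IllegalArgumentException(
        "Variant object keys must be strings, got: "
            + (key == null ? "null" : key.getClass().getName()));
  }

  public static void main(String[] args) {
    System.out.println(fieldName("city")); // city
    try {
      fieldName(42);
    } catch (IllegalArgumentException e) {
      System.out.println(e.getMessage());
    }
  }
}
```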
Summary
Add support for converting arbitrary Java objects (e.g. Map<String, Object>, lists, primitives) into the Iceberg Variant type in the Kafka Connect RecordConverter. Nested maps and lists are converted recursively so that structures like {"user": {"name": "alice", "address": {"city": "Seoul", "zip": "12345"}}} are correctly represented as a single Variant.

Motivation
Kafka Connect payloads often come as schema-less or JSON-like maps. To write them into Iceberg tables with a Variant column, the connector must convert these Java objects into the Variant format (metadata + value) and support nested maps/arrays without losing structure or key names.
Behaviour
{"a": 1, "b": "x"} produces metadata ["a", "b"] and one ShreddedObject with two fields. Nested input such as {"user": {"name": "alice", "address": {"city": "Seoul", "zip": "12345"}}} is represented as a single Variant. Binary values use Variant.from(ByteBuffer) where appropriate.

Relates
Thanks, Good Day!