Add config option to preserve null values for collections #518

Merged: absurdfarce merged 4 commits into 1.x from issue507 on Jan 28, 2026

@absurdfarce (Collaborator) commented Dec 21, 2025

By default, the Java driver converts null or empty bytes for a collection column into an empty Java collection (see here for an example). This behaviour is implemented in the driver's codec layer, so by the time the data reaches dsbulk it has already been converted, and dsbulk has no means of distinguishing an empty collection generated this way from a legitimate empty collection.

This PR adds a config option which loads a custom codec for collection types. This custom codec simply returns an actual null value if null bytes (or empty bytes) are observed by the codec in the decode process. In all other cases the default behaviour of the codec is preserved.
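
The decode rule described above can be sketched roughly as follows. This is a simplified illustration, not the code in this PR and not the Java driver's codec API: `NullPreservingDecoder`, `NullCollectionsDemo`, `defaultDecode` and the comma-separated toy encoding are all hypothetical names and assumptions made up for the example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical wrapper around a decode function; stands in for a custom collection codec.
final class NullPreservingDecoder<T> implements Function<ByteBuffer, T> {
  private final Function<ByteBuffer, T> delegate;

  NullPreservingDecoder(Function<ByteBuffer, T> delegate) {
    this.delegate = delegate;
  }

  @Override
  public T apply(ByteBuffer bytes) {
    // Null or empty bytes: return an actual null rather than an empty collection.
    if (bytes == null || !bytes.hasRemaining()) {
      return null;
    }
    // In all other cases, preserve the default decode behaviour.
    return delegate.apply(bytes);
  }
}

final class NullCollectionsDemo {
  // Toy "default" decoder (assumption): maps null/empty bytes to an empty list,
  // and otherwise splits comma-separated ASCII text. The real wire format differs.
  static List<String> defaultDecode(ByteBuffer bytes) {
    List<String> out = new ArrayList<>();
    if (bytes == null || !bytes.hasRemaining()) {
      return out;
    }
    String text = StandardCharsets.US_ASCII.decode(bytes.duplicate()).toString();
    for (String s : text.split(",")) {
      out.add(s);
    }
    return out;
  }

  public static void main(String[] args) {
    Function<ByteBuffer, List<String>> wrapped =
        new NullPreservingDecoder<List<String>>(NullCollectionsDemo::defaultDecode);
    ByteBuffer data = ByteBuffer.wrap("a,b".getBytes(StandardCharsets.US_ASCII));

    System.out.println(defaultDecode(null));                   // default: empty list
    System.out.println(wrapped.apply(null));                   // wrapped: null
    System.out.println(wrapped.apply(ByteBuffer.allocate(0))); // wrapped: null
    System.out.println(wrapped.apply(data));                   // wrapped: delegates
  }
}
```

Wrapping the existing decoder, rather than reimplementing it, keeps the change limited to the null/empty-bytes case and leaves every other input on the default path.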

I've included a unit test for this functionality as well but the following manual test should be enough to demonstrate the issue (and the results of this fix):

CREATE TABLE test.baz (
 i int PRIMARY KEY,
 j text,
 k map<ascii,ascii>,
 l frozen<map<ascii,ascii>>
 );

insert into test.baz (i,j,k,l) values (1,'one',{},{});
insert into test.baz (i,j,k,l) values (2,'two',null,null);
insert into test.baz (i,j,k,l) values (3,'three',{'3': 'three'},{'3': 'three'});

$ bin/dsbulk unload -k test -t baz -c json 2> /dev/null
{"i":1,"j":"one","k":{},"l":{}}
{"i":2,"j":"two","k":{},"l":{}}
{"i":3,"j":"three","k":{"3":"three"},"l":{"3":"three"}}
$ bin/dsbulk unload --dsbulk.codec.allowNullCollections=true -k test -t baz -c json 2> /dev/null
{"i":1,"j":"one","k":null,"l":{}}
{"i":2,"j":"two","k":null,"l":null}
{"i":3,"j":"three","k":{"3":"three"},"l":{"3":"three"}}

I'm seeing some strange results when adding test values manually via cqlsh. I presume this is a Python driver issue, but that isn't especially relevant to this PR.
@absurdfarce linked an issue on Dec 21, 2025 that may be closed by this pull request
@absurdfarce (Collaborator, Author):

Note that there's an interesting question here about why this CQL:

insert into test.baz (i,j,k,l) values (1,'one',{},{});

results in this return value from dsbulk:

{"i":1,"j":"one","k":null,"l":{}}

That's pretty clearly wrong, but I don't think it's a dsbulk error. I'm adding these test values via cqlsh and I see the same results when I query the tables via cqlsh so I'm pretty sure there's a Python driver problem lurking there somewhere. Regardless it pretty clearly isn't a dsbulk issue.

@absurdfarce (Collaborator, Author):

Ping @adutra for review on this one as well

- schemaSettings.isAllowExtraFields(), schemaSettings.isAllowMissingFields());
+ schemaSettings.isAllowExtraFields(),
+ schemaSettings.isAllowMissingFields(),
+ codecSettings.allowsNullCollections());
Contributor:

There seems to be an inconsistency in naming here for boolean properties: isAllow vs allows.

@absurdfarce (Collaborator, Author):

Good catch, I'll clean this up.

@absurdfarce (Collaborator, Author):

Should be fixed now.

return builder.build();
}

public boolean allowsNullCollections() {
Contributor:

I wonder: doesn't this property belong in SchemaSettings rather than CodecSettings?

@absurdfarce (Collaborator, Author):

It's funny, because I originally had it in SchemaSettings and subsequently moved it. My thinking was that you're actually modifying the behaviour of the codec, i.e. changing how it interprets null values in its responses. So in the end CodecSettings seemed like a more natural home. After making the move it felt more right to me; the other entries in SchemaSettings didn't really seem to match up with what this config does.

Happy to discuss if you think SchemaSettings seems more appropriate.

@absurdfarce requested a review from adutra January 28, 2026 18:20
@absurdfarce added this to the 1.11.1 milestone Jan 28, 2026
@absurdfarce merged commit c411bea into 1.x Jan 28, 2026
1 check failed

Linked issue: Frozen field is exporting as {}, instead of null
