stevenschlansker
Contributor

@stevenschlansker stevenschlansker commented Jul 15, 2025

What does this PR do?

Introduce alternate "compact" row encoding that better uses knowledge of fixed-size types and sacrifices alignment to save space.

Introduce new Builder pattern to avoid explosion of Encoders static methods as more features are added to row format.

Optimizations include:

  • struct stores fixed-size fields (e.g. Int128, FixedSizeBinary) inline in fixed-data area without offset + size
  • struct of all fixed-sized fields is itself considered fixed-size to store in other struct or array
  • struct skips null bitmap if all fields are non-nullable
  • struct sorts fields by fixed-size for best-effort (but not guaranteed) alignment
  • struct can use less than 8 bytes for small data (int, short, etc)
  • struct null bitmap stored at end of struct to borrow alignment padding if possible
  • array stores fixed-size fields inline in fixed-data area without offset+size
  • array header uses 4 bytes for size (since Collection and array are only int-sized) and leaves remaining 4 bytes for start of null bitmap
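
To make the struct rules above concrete, here is a minimal sketch of how such a layout could be planned. It is not the code in this PR: `CompactLayoutSketch`, `Field`, `Layout`, and `plan` are hypothetical names, and it assumes one null bit per field whenever any field is nullable. It shows fields sorted widest-first for best-effort alignment, offsets pre-computed into an array, and the null bitmap skipped or placed at the end.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of compact struct layout planning; not the PR's actual classes.
final class CompactLayoutSketch {

  /** A fixed-size field; the names and widths used below are illustrative assumptions. */
  static final class Field {
    final String name;
    final int width; // size in bytes
    final boolean nullable;

    Field(String name, int width, boolean nullable) {
      this.name = name;
      this.width = width;
      this.nullable = nullable;
    }
  }

  /** Planned layout: field order, per-field offsets, bitmap offset (-1 if absent), total size. */
  static final class Layout {
    final List<Field> order;
    final int[] offsets;
    final int nullBitmapOffset;
    final int totalSize;

    Layout(List<Field> order, int[] offsets, int nullBitmapOffset, int totalSize) {
      this.order = order;
      this.offsets = offsets;
      this.nullBitmapOffset = nullBitmapOffset;
      this.totalSize = totalSize;
    }
  }

  static Layout plan(List<Field> fields) {
    // Widest fields first, so wide fields tend to land on aligned offsets (best effort only).
    List<Field> order = new ArrayList<>(fields);
    order.sort(Comparator.comparingInt((Field f) -> f.width).reversed());

    // Pre-compute every offset once, so lookups at access time are a plain array read.
    int[] offsets = new int[order.size()];
    int cursor = 0;
    boolean anyNullable = false;
    for (int i = 0; i < order.size(); i++) {
      offsets[i] = cursor;
      cursor += order.get(i).width;
      anyNullable |= order.get(i).nullable;
    }

    // The null bitmap is skipped when no field is nullable; otherwise it goes at the end,
    // where it can occupy bytes that would otherwise be padding.
    int bitmapOffset = -1;
    if (anyNullable) {
      bitmapOffset = cursor;
      cursor += (order.size() + 7) / 8; // assumption: one bit per field
    }
    return new Layout(order, offsets, bitmapOffset, cursor);
  }

  public static void main(String[] args) {
    Layout layout = plan(Arrays.asList(
        new Field("flag", 1, false),
        new Field("id", 8, false),
        new Field("count", 4, true),
        new Field("kind", 2, false)));
    for (int i = 0; i < layout.order.size(); i++) {
      System.out.println(layout.order.get(i).name + " @ " + layout.offsets[i]);
    }
    System.out.println("null bitmap @ " + layout.nullBitmapOffset + ", total " + layout.totalSize);
  }
}
```

With these four example fields the sketch places `id` at offset 0, `count` at 8, `kind` at 12, `flag` at 14, and a one-byte null bitmap at offset 15, for 16 bytes in total.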

Fixups include:

  • toString better handles varbinary / fixed-binary (hex dump of first 256 bytes)
  • start making Javadoc for row format

Compromises:

  • less alignment could increase access time, but this is opt-in, and I think it is not such a big deal on modern processors
  • increased complexity of offset lookup; try to pre-compute offsets in an array when possible, and use StableValue when it is GA

Not compatible with existing row format.

Related issues

Fixes #2337

Does this PR introduce any user-facing change?

New API for new Compact codec. Existing codec unchanged.

@stevenschlansker stevenschlansker added enhancement New feature or request java labels Jul 15, 2025
@stevenschlansker stevenschlansker marked this pull request as draft July 15, 2025 03:48
@stevenschlansker stevenschlansker force-pushed the compact-codec branch 13 times, most recently from 6f485ec to 899df1d Compare August 25, 2025 18:55
@stevenschlansker stevenschlansker force-pushed the compact-codec branch 11 times, most recently from e5d1d66 to 6a66f86 Compare August 29, 2025 00:01
@stevenschlansker stevenschlansker force-pushed the compact-codec branch 3 times, most recently from 21359f0 to 2a3aab8 Compare September 8, 2025 18:25
@stevenschlansker stevenschlansker force-pushed the compact-codec branch 4 times, most recently from fc22170 to 605ca8e Compare September 16, 2025 16:27
@stevenschlansker stevenschlansker marked this pull request as ready for review September 16, 2025 16:43
@stevenschlansker stevenschlansker changed the title Draft: introduce Compact Row Codec introduce Compact Row Codec Sep 16, 2025
@stevenschlansker stevenschlansker changed the title introduce Compact Row Codec feat(java): introduce Compact Row Codec Sep 16, 2025
@stevenschlansker
Contributor Author

Hi @chaokunyang , I believe this PR is ready for review now. No rush, I know it is a lot to go through. Happy to explain or change anything that does not make sense.

We run this now in our testing environment successfully with about a 30% - 50% space savings on our datasets. Will be taking compact codec to production over the coming weeks.

@chaokunyang
Collaborator

@stevenschlansker Great! 30% - 50% space savings is a huge gain! I will review this PR ASAP, and release it in 0.13.0

@stevenschlansker stevenschlansker force-pushed the compact-codec branch 2 times, most recently from de32964 to 9033f17 Compare September 18, 2025 02:13
}

@Override
protected int isNullBitmapOffset() {
Collaborator

@chaokunyang chaokunyang Sep 25, 2025

This looks like a method that returns a boolean. How about naming it nullabilityBitmapOffset or nullBitmapOffset?

import org.apache.fory.memory.MemoryBuffer;
import org.apache.fory.reflect.TypeRef;

interface CodecFormat {
Collaborator

Would the name Encoding align better with Encoder?

Collaborator

@chaokunyang chaokunyang left a comment

LGTM overall; I left some minor comments. Could we also add some documentation to row_format_guide.md in a later PR?

Another idea for compact is to inline strings. For strings shorter than 7 bytes, we can put them into the fixed-size region. Many strings are shorter than 7 bytes, so I think this could also bring some space savings.
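
One way the inline-string idea could look is sketched below. This is not implemented in the PR, and the slot format is an assumption made for illustration: an 8-byte slot whose top byte holds a marker bit plus the length, leaving 7 bytes for UTF-8 data. `InlineStringSketch`, `pack`, `unpack`, and `isInline` are hypothetical names.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical sketch of packing a short string into one 8-byte fixed-size slot.
final class InlineStringSketch {
  static final int MAX_INLINE_BYTES = 7;
  // Assumed marker: the highest bit of the slot distinguishes inline data from offset + size.
  static final long INLINE_FLAG = 0x80L;

  /** Packs up to 7 UTF-8 bytes into a long; the top byte holds the flag and the length. */
  static long pack(byte[] utf8) {
    if (utf8.length > MAX_INLINE_BYTES) {
      throw new IllegalArgumentException("too long to inline: " + utf8.length + " bytes");
    }
    long slot = (INLINE_FLAG | utf8.length) << 56;
    for (int i = 0; i < utf8.length; i++) {
      slot |= (utf8[i] & 0xFFL) << (8 * i); // byte i occupies bits [8i, 8i + 7]
    }
    return slot;
  }

  /** True when the marker bit is set, i.e. the slot holds the string itself. */
  static boolean isInline(long slot) {
    return slot < 0;
  }

  /** Recovers the bytes written by pack. */
  static byte[] unpack(long slot) {
    int length = (int) ((slot >>> 56) & 0x7F);
    byte[] out = new byte[length];
    for (int i = 0; i < length; i++) {
      out[i] = (byte) (slot >>> (8 * i));
    }
    return out;
  }

  public static void main(String[] args) {
    long slot = pack("fory".getBytes(StandardCharsets.UTF_8));
    System.out.println(isInline(slot));
    System.out.println(new String(unpack(slot), StandardCharsets.UTF_8));
  }
}
```

A real format would also have to guarantee that an ordinary offset + size slot can never have the marker bit set; that detail is left out here.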

* Encode to a buffer without an embedded size. Variants with embedded size are not compatible.
* Returns number of bytes written to the buffer.
*/
int bareEncode(MemoryBuffer buffer, T obj);
Collaborator

Is it possible to merge bareEncode and encode into one method?

Collaborator

This is only called on map/struct/array. Merging into one method and using a branch won't introduce much runtime cost, and it makes the interface simpler.
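
A rough sketch of what the merged method could look like follows. It uses `java.nio.ByteBuffer` as a stand-in for `MemoryBuffer`, and `SketchEncoder`, `Utf8SketchEncoder`, the `embedSize` flag, and the 4-byte length prefix are all assumptions for illustration, not the PR's API or wire format.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

// Hypothetical single-method encoder interface; ByteBuffer stands in for MemoryBuffer.
interface SketchEncoder<T> {
  /**
   * Writes obj to the buffer. When embedSize is true, a 4-byte length prefix precedes the
   * payload; when false, only the bare payload is written. Returns the number of bytes written.
   */
  int encode(ByteBuffer buffer, T obj, boolean embedSize);
}

final class Utf8SketchEncoder implements SketchEncoder<String> {
  @Override
  public int encode(ByteBuffer buffer, String obj, boolean embedSize) {
    byte[] payload = obj.getBytes(StandardCharsets.UTF_8);
    int start = buffer.position();
    if (embedSize) {
      buffer.putInt(payload.length); // the one branch that replaces a separate bareEncode
    }
    buffer.put(payload);
    return buffer.position() - start;
  }

  public static void main(String[] args) {
    ByteBuffer buffer = ByteBuffer.allocate(32);
    Utf8SketchEncoder encoder = new Utf8SketchEncoder();
    System.out.println(encoder.encode(buffer, "abc", false));
    System.out.println(encoder.encode(buffer, "abc", true));
  }
}
```

The trade-off is that the two output forms remain incompatible with each other, so callers still need to know which form a given buffer holds.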

import org.apache.fory.memory.MemoryBuffer;
import org.apache.fory.memory.MemoryUtils;

class BufferResettingMapEncoder<T> implements MapEncoder<T> {
Collaborator

Could we add some documentation to clarify when this should be used, and how it differs from BaseMapEncoder?

import org.apache.fory.memory.MemoryBuffer;
import org.apache.fory.memory.MemoryUtils;

class BaseMapEncoder<M> implements MapEncoder<M> {
Collaborator

BaseXXX looks like an abstract parent class, could we use a different name?

Collaborator

How about BinaryMapEncoder, to reflect the relationship with the BinaryRow and BinaryArray naming in the codebase?

Similar naming for the other encoders.

Collaborator

And since we have different kinds of encoders for the same type, we should add some Javadoc to document the distinctions.

@stevenschlansker
Contributor Author

Thank you for your review, I will make changes in the coming days.

compact encoding stores fixed-size fields inline and
relaxes alignment requirements, preserving alignment where possible
while using space more effectively
Labels
enhancement New feature or request java
Development

Successfully merging this pull request may close these issues.

Row format better data packing
2 participants