-
Notifications
You must be signed in to change notification settings - Fork 307
feat(java): introduce Compact Row Codec #2414
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
3ec5b0c
to
3a65e04
Compare
7b4586c
to
dd9bd6f
Compare
6f485ec
to
899df1d
Compare
e5d1d66
to
6a66f86
Compare
21359f0
to
2a3aab8
Compare
fc22170
to
605ca8e
Compare
605ca8e
to
8ed4400
Compare
Hi @chaokunyang , I believe this PR is ready for review now. No rush, I know it is a lot to go through. Happy to explain or change anything that does not make sense. We run this now in our testing environment successfully with about a 30% - 50% space savings on our datasets. Will be taking compact codec to production over the coming weeks. |
@stevenschlansker Great! 30% - 50% space savings is a huge gain! I will review this PR ASAP, and release it in |
de32964
to
9033f17
Compare
} | ||
|
||
@Override | ||
protected int isNullBitmapOffset() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this looks like a method return boolean
, how abouting naming it as nullabilityBitmapOffset/nullBitmapOffset
import org.apache.fory.memory.MemoryBuffer; | ||
import org.apache.fory.reflect.TypeRef; | ||
|
||
interface CodecFormat { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the name Encoding
better align with Encoder
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM overall, I left some minor comments. And could we also add some documents to row_format_guide.md
in later PR.
Another thing for compact is to inline string. For string less that 7 bytes, we can put it into fixed-size region. Many string are less than 7 bytes, I think this can also brings some space savings
* Encode to a buffer without an embedded size. Variants with embedded size are not compatible. | ||
* Returns number of bytes written to the buffer. | ||
*/ | ||
int bareEncode(MemoryBuffer buffer, T obj); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible to merge bareEncode
and encode
into one method?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is only called on map/struct/array, merge into one method and use a branch don't introduce lots of runtime cost, but we can make the interface more simple
import org.apache.fory.memory.MemoryBuffer; | ||
import org.apache.fory.memory.MemoryUtils; | ||
|
||
class BufferResettingMapEncoder<T> implements MapEncoder<T> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could we add some documents to clarify the situations when we need to use this and the difference between this and BaseMapEncoder
import org.apache.fory.memory.MemoryBuffer; | ||
import org.apache.fory.memory.MemoryUtils; | ||
|
||
class BaseMapEncoder<M> implements MapEncoder<M> { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BaseXXX
looks like an abstract parent class, could we use a different name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How about BinaryMapEncoder
to relfect the relationship with BinaryRow, BinaryArray naming in codebase
similiar naming for other encoders
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And since we have different kinds of encoder for same type, we'd better to add some javadoc to document the distinctions
Thank you for your review, I will make changes in the coming days. |
compact encoding will store fixed sized fields inline, relaxes alignment considerations to preserve alignment where possible but using space more effectively
9033f17
to
fc511d6
Compare
What does this PR do?
Introduce alternate "compact" row encoding that better uses knowledge of fixed-size types and sacrifices alignment to save space.
Introduce new Builder pattern to avoid explosion of
Encoders
static methods as more features are added to row format.Optimizations include:
Fixups include:
Compromises:
StableValue
when it is GANot compatible with existing row format.
Related issues
Fixes #2337
Does this PR introduce any user-facing change?
New API for new Compact codec. Existing codec unchanged.