Lance Format Cache Objects #1565

@maxweriz


Is your feature request related to a problem? Please describe.

The current cache formats do not enforce a schema on their content, and pickle and Parquet (pq) are relatively expensive to write and read. JSON is likely the most efficient, but it has its own downsides and can be extremely slow for point lookups. This matters a great deal for serialization and deserialization of vector data, which is going to be, imo, most use cases in ML nowadays (or should be). None of these formats is optimized for zero-copy or non-contiguous vector use cases.

Describe the solution you'd like

I’d be interested to know whether there is a way to schematize data into Arrow on the fly and write it to Lance format for faster lookups. The trade-off is additional storage space (compared to pq and JSON) in exchange for faster deserialization, point lookups, and memory mapping. I’m particularly interested in zero-copy I/O of objects like NumPy arrays.
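To make the trade-off concrete, here is a minimal standard-library sketch of the underlying idea: a fixed-width, schema-checked layout that is memory-mapped so a single row can be fetched without parsing or copying the rest of the file. This is not the Lance or Arrow API, and nothing here is taken from Hamilton; `write_vectors`, `VectorFile`, and the header layout are hypothetical names invented purely for illustration.

```python
import mmap
import struct

# Hypothetical layout (an assumption, not a real format):
# header = (row count, vector length), then float32 rows in native byte order.
HEADER = struct.Struct("=II")


def write_vectors(path, vectors):
    """Write equal-length float vectors as fixed-width float32 rows."""
    if not vectors:
        raise ValueError("need at least one vector")
    dim = len(vectors[0])
    # Minimal "schema enforcement": every row must have the same width.
    if any(len(v) != dim for v in vectors):
        raise ValueError("schema violation: vectors differ in length")
    with open(path, "wb") as f:
        f.write(HEADER.pack(len(vectors), dim))
        for v in vectors:
            f.write(struct.pack(f"={dim}f", *v))


class VectorFile:
    """Memory-mapped reader with constant-time, zero-copy row lookups."""

    def __init__(self, path):
        self._f = open(path, "rb")
        self._mm = mmap.mmap(self._f.fileno(), 0, access=mmap.ACCESS_READ)
        self.n_rows, self.dim = HEADER.unpack_from(self._mm, 0)
        # Reinterpret the mapped bytes after the header as float32 (no copy).
        self._data = memoryview(self._mm)[HEADER.size:].cast("f")

    def row(self, i):
        """Return row i as a memoryview over the mapped file."""
        if not 0 <= i < self.n_rows:
            raise IndexError(i)
        return self._data[i * self.dim:(i + 1) * self.dim]

    def close(self):
        # Views returned by row() must be released before calling this.
        self._data.release()
        self._mm.close()
        self._f.close()
```

A lookup such as `VectorFile(path).row(1)` touches only the pages that hold that row, which is the property JSON and pickle caches lack. The cost mirrors the one described above: rows are stored uncompressed at a fixed width, so the file is larger than a compressed columnar or text encoding would be.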

Describe alternatives you've considered

I’ve also considered Fury (https://fory.apache.org/blog/fury_blazing_fast_multiple_language_serialization_framework/), which could be a good option. Its multi-language support would be great to have for a workflow like the following:

  1. Define the DAG in Python with Hamilton
  2. Have DAG steps call PyO3 Rust bindings for optimized feature generation
  3. Have the DAG cache the outputs via Lance or Fury
  4. Potentially read from that cache in Rust, C++, etc. for features needed during inference/deployment

