Lance Format Cache Objects

**Is your feature request related to a problem? Please describe.**

The current cache formats do not enforce a schema on their content, and pickle and Parquet (pq) are relatively expensive to write and read. JSON is likely the most efficient, but it has its own downsides and can be extremely slow for point lookups. Serialization and deserialization cost matters a great deal for vector workloads, which, imo, are (or should be) most ML use cases nowadays. None of these formats is optimized for zero-copy access or for non-contiguous vector use cases.

**Describe the solution you'd like**

I’d be interested in whether there is a way to schematize data into Arrow on the fly and write it in Lance format for faster lookups. The trade-off is additional storage space (compared to Parquet and JSON) in exchange for faster deserialization, point lookups, and memory mapping. I’m particularly interested in zero-copy I/O for objects like NumPy arrays.

**Describe alternatives you've considered**

I’ve also considered that Fury (https://fory.apache.org/blog/fury_blazing_fast_multiple_language_serialization_framework/) could be a good option. Its multi-language support would be great to have for a workflow like the following:

1. Define the DAG in Python with Hamilton
2. Have DAG steps call PyO3 Rust bindings for optimized feature generation
3. Have the DAG cache the outputs via Lance or Fury
4. Potentially read from that cache in Rust, C++, etc. for features needed during inference/deployment



Lance Format Cache Objects #1565

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Lance Format Cache Objects #1565

Description

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions