Put data and deletion files in subdirectories #6030

@wjones127

When doing cleanup, we need to list all files to find untracked files. When we have lots of data files, this can be very slow.

The only portable way to parallelize the listing is to divide the data up into directories.

For example, imagine you have 1 million Lance files in the data prefix in S3. Each S3 list call returns at most 1,000 keys, and each page needs the continuation token from the previous one, so listing them all takes 1,000 list calls made sequentially. If we put them into 1,000 unique prefixes, we could do 1 list call to get the prefixes and then 100 parallel list calls at a time, getting all results in something like 20% of the time.
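
To make the arithmetic above concrete, here is a rough sketch that counts only sequential round trips. The function names are made up for illustration, and it assumes files are spread evenly across prefixes. It ignores request throttling, latency variance, and skew, so real-world gains would likely be smaller than the raw round count suggests, in line with the more conservative estimate above.

```rust
/// Maximum number of keys returned by a single S3 list call.
const KEYS_PER_LIST_CALL: u64 = 1_000;

/// Sequential round trips needed to list `num_files` under one flat prefix.
/// Pages must be fetched one after another, so rounds == pages.
fn flat_list_rounds(num_files: u64) -> u64 {
    num_files.div_ceil(KEYS_PER_LIST_CALL)
}

/// Sequential round trips when files are spread evenly over `num_prefixes`
/// prefixes and up to `parallelism` prefixes are listed concurrently.
/// Both `num_prefixes` and `parallelism` must be non-zero.
fn sharded_list_rounds(num_files: u64, num_prefixes: u64, parallelism: u64) -> u64 {
    // Listing the prefixes themselves is also paginated.
    let prefix_rounds = num_prefixes.div_ceil(KEYS_PER_LIST_CALL);
    // Pages needed inside each prefix (sequential within a prefix).
    let files_per_prefix = num_files.div_ceil(num_prefixes);
    let pages_per_prefix = files_per_prefix.div_ceil(KEYS_PER_LIST_CALL);
    // Prefixes are processed in waves of `parallelism` at a time.
    let waves = num_prefixes.div_ceil(parallelism);
    prefix_rounds + waves * pages_per_prefix
}
```

For the numbers in this issue, `flat_list_rounds(1_000_000)` is 1,000 sequential calls, while `sharded_list_rounds(1_000_000, 1_000, 100)` is 11 sequential rounds (1 to list prefixes plus 10 waves).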

Data files can have their naming scheme adjusted by changing:

https://github.com/lance-format/lance/blob/2a08ec9daa55198fe64ccd13ede2fead09261c71/rust/lance/src/dataset/fragment/write.rs#L43
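
As a rough illustration of what an adjusted naming scheme could look like (not what the linked code does today), the sketch below derives a subdirectory from the leading characters of the file name. The function `sharded_data_path`, the `data/` directory name, and the assumption that file names start with evenly distributed characters (for example, a random hex ID) are all hypothetical.

```rust
/// Hypothetical helper: place a data file under a subdirectory named after
/// the first `prefix_len` characters of its file name.
///
/// Assumes file names begin with evenly distributed characters, so files
/// spread roughly uniformly across the subdirectories.
fn sharded_data_path(file_name: &str, prefix_len: usize) -> String {
    // Take whole characters rather than bytes to avoid splitting UTF-8.
    let prefix: String = file_name.chars().take(prefix_len).collect();
    format!("data/{prefix}/{file_name}")
}
```

With two hex characters this would give up to 256 subdirectories, and three would give up to 4,096. For example, `sharded_data_path("a1b2c3d4.lance", 2)` yields `data/a1/a1b2c3d4.lance`.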

Deletion files also matter, as I suspect we can create a lot of them. Each time a delete, update, or merge-insert runs, we can create these. So datasets with lots of updates will have many more deletion files than data files. Unfortunately, right now the file name is defined in the spec:

lance/protos/table.proto

Lines 424 to 425 in 964ac0e

// The path of the deletion file is constructed as:
// {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}

🤦 Yeah, it was I who did that... We'll need a spec change to make that happen.
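
To show what such a spec change might involve, here is a sketch of the current path format quoted above next to one possible sharded variant. Only the first function mirrors the spec comment. The second, including the bucket subdirectory, the modulo-on-`fragment_id` choice, and the `num_buckets` parameter, is purely hypothetical and not a proposal from the spec. The integer types are also assumed.

```rust
/// Path as described in the spec comment:
/// {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
fn deletion_file_path(
    root: &str,
    fragment_id: u64,
    read_version: u64,
    id: u64,
    extension: &str,
) -> String {
    format!("{root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}")
}

/// Hypothetical sharded variant: adds a zero-padded bucket subdirectory
/// derived from the fragment ID. `num_buckets` must be non-zero.
fn sharded_deletion_file_path(
    root: &str,
    fragment_id: u64,
    read_version: u64,
    id: u64,
    extension: &str,
    num_buckets: u64,
) -> String {
    let bucket = fragment_id % num_buckets;
    format!("{root}/_deletions/{bucket:03}/{fragment_id}-{read_version}-{id}.{extension}")
}
```

One trade-off with bucketing by fragment ID is that readers can still compute the path directly from the manifest fields, without listing, but existing datasets written with the flat layout would need to stay readable.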
