When doing cleanup, we need to list all files to find untracked files. When we have lots of data files, this can be very slow.
The only portable way to get parallelism in this is to divide data up into directories.
For example, imagine you have 1 million Lance files under the data prefix in S3. Since each S3 list call returns at most 1,000 keys, listing them all takes 1,000 sequential list calls. If we instead spread the files across 1,000 unique prefixes, we could make 1 list call to discover the prefixes and then run the per-prefix list calls 100 at a time in parallel, getting all results in a small fraction of the time.
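A minimal sketch of the idea, assuming we hash each file id into one of N prefixes and can issue one scoped list call per prefix (the store here is an in-memory dict standing in for S3; the names and bucket count are hypothetical):

```python
import concurrent.futures
import hashlib

NUM_PREFIXES = 16  # illustrative; the issue suggests something like 1,000 in practice

def prefix_for(file_id: str) -> str:
    # Hash the file id into one of NUM_PREFIXES buckets so files
    # spread roughly evenly across prefixes.
    h = int(hashlib.md5(file_id.encode()).hexdigest(), 16)
    return f"{h % NUM_PREFIXES:02x}"

# Simulated object store: path -> placeholder contents.
store = {f"data/{prefix_for(f'file-{i}')}/file-{i}.lance": b"" for i in range(1000)}

def list_prefix(prefix: str) -> list[str]:
    # Stand-in for one S3 ListObjectsV2 call scoped to a single prefix.
    return [k for k in store if k.startswith(f"data/{prefix}/")]

# Fan out one "list call" per prefix; max_workers caps the parallelism,
# analogous to running 100 S3 list calls at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = pool.map(list_prefix, (f"{p:02x}" for p in range(NUM_PREFIXES)))
all_files = [path for batch in results for path in batch]
```

Because the prefixes partition the key space, each file is returned by exactly one scoped list call, so the parallel scans together see every file exactly once.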
Data files can have their naming scheme adjusted by changing:
Deletion files also matter, since I suspect we create a lot of them. Every delete, update, or merge-insert can create one, so datasets with many updates will have far more deletion files than data files. Unfortunately, right now the file name is defined in the spec:
Lines 424 to 425 in 964ac0e

```
// The path of the deletion file is constructed as:
// {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
```
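For concreteness, the spec'd path template above can be written out as a tiny helper (the argument values below are made up for illustration):

```python
def deletion_file_path(root: str, fragment_id: int, read_version: int,
                       file_id: int, extension: str) -> str:
    # Deletion file path as currently defined by the spec:
    # {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
    return f"{root}/_deletions/{fragment_id}-{read_version}-{file_id}.{extension}"

path = deletion_file_path("s3://bucket/dataset", 42, 7, 123456, "arrow")
```

Note that every deletion file lands under the single `_deletions/` prefix, which is exactly why listing them cannot be parallelized the way the data-prefix scheme above allows.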
🤦 Yeah, it was I who did that... We'll need a spec change to make that happen.