When doing cleanup, we need to list all files to find untracked files. When we have lots of data files, this can be very slow.
The only portable way to get parallelism in this is to divide data up into directories.
For example, imagine you have 1 million Lance files under the data prefix in S3. Since each S3 list call returns at most 1,000 keys, listing them all takes 1,000 sequential list calls. If we instead spread the files across 1,000 unique prefixes, we could make 1 list call to discover the prefixes and then run the per-prefix list calls 100 at a time in parallel, getting all results in a small fraction of the time.
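A minimal sketch of the idea, assuming we hash each file id into one of N prefixes and can issue one scoped list call per prefix (the store here is an in-memory dict standing in for S3; the names and bucket count are hypothetical):

```python
import concurrent.futures
import hashlib

NUM_PREFIXES = 16  # illustrative; the issue suggests something like 1,000 in practice

def prefix_for(file_id: str) -> str:
    # Hash the file id into one of NUM_PREFIXES buckets so files
    # spread roughly evenly across prefixes.
    h = int(hashlib.md5(file_id.encode()).hexdigest(), 16)
    return f"{h % NUM_PREFIXES:02x}"

# Simulated object store: path -> placeholder contents.
store = {f"data/{prefix_for(f'file-{i}')}/file-{i}.lance": b"" for i in range(1000)}

def list_prefix(prefix: str) -> list[str]:
    # Stand-in for one S3 ListObjectsV2 call scoped to a single prefix.
    return [k for k in store if k.startswith(f"data/{prefix}/")]

# Fan out one "list call" per prefix; max_workers caps the parallelism,
# analogous to running 100 S3 list calls at a time.
with concurrent.futures.ThreadPoolExecutor(max_workers=8) as pool:
    results = pool.map(list_prefix, (f"{p:02x}" for p in range(NUM_PREFIXES)))
all_files = [path for batch in results for path in batch]
```

Because the prefixes partition the key space, each file is returned by exactly one scoped list call, so the parallel scans together see every file exactly once.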
Data files can have their naming scheme adjusted by changing:
Deletion files also matter, since I suspect we create a lot of them. Every delete, update, or merge-insert can create one, so datasets with many updates will have far more deletion files than data files. Unfortunately, right now the file name is defined in the spec:
Lines 424 to 425 in 964ac0e

```
// The path of the deletion file is constructed as:
// {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
```
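For concreteness, the spec'd path template above can be written out as a tiny helper (the argument values below are made up for illustration):

```python
def deletion_file_path(root: str, fragment_id: int, read_version: int,
                       file_id: int, extension: str) -> str:
    # Deletion file path as currently defined by the spec:
    # {root}/_deletions/{fragment_id}-{read_version}-{id}.{extension}
    return f"{root}/_deletions/{fragment_id}-{read_version}-{file_id}.{extension}"

path = deletion_file_path("s3://bucket/dataset", 42, 7, 123456, "arrow")
```

Note that every deletion file lands under the single `_deletions/` prefix, which is exactly why listing them cannot be parallelized the way the data-prefix scheme above allows.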
🤦 Yeah, it was I who did that... We'll need a spec change to make that happen.