[WIP] enable producing DAG cars in inline mode (attached-storage approach) #468

Closed
parkan wants to merge 4 commits into main from feat/dag-emit

@parkan (Collaborator) commented Mar 26, 2025

this is the absolute minimal implementation that fixes #446

how this works:

  • we modify inline preps to accept an output storage, which is used only for the DAG car, and change the validation logic to require such an output unless --no-dag is specified
  • daggen now produces the DAG car automatically, since the conditional car generation simply checks for the presence of a storage writer
  • we modify pack's car generation check to depend on the value of --no-inline, so that in inline mode the data cars are still not written even though an output storage is attached, preserving the previous behavior
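
a rough sketch of the three checks described above, for illustration only. This is not the code from this PR: the `Prep` class, the function names, and the error strings are all hypothetical, and the rule that a non-inline prep needs an output storage is an assumption about the pre-existing behavior.

```python
from dataclasses import dataclass, field


@dataclass
class Prep:
    """Hypothetical stand-in for a preparation; fields mirror the CLI flags."""
    no_inline: bool = False
    no_dag: bool = False
    output_storages: list = field(default_factory=list)


def validate(prep):
    """Return an error string, or None if the prep is acceptable."""
    if prep.output_storages:
        return None
    if prep.no_inline:
        # Assumed pre-existing rule: data cars need somewhere to go.
        return "non-inline prep requires an output storage"
    if not prep.no_dag:
        # New rule from the description: the DAG car needs an output.
        return "inline prep requires an output storage unless --no-dag is set"
    return None


def pack_writes_data_cars(prep):
    """Data cars depend on --no-inline, not merely on an attached output."""
    return prep.no_inline and bool(prep.output_storages)


def daggen_writes_dag_car(prep):
    """daggen writes a car whenever a storage writer (an output) is present."""
    return not prep.no_dag and bool(prep.output_storages)


if __name__ == "__main__":
    inline_with_output = Prep(output_storages=["light_hail"])
    print(validate(inline_with_output))
    print(pack_writes_data_cars(inline_with_output))
    print(daggen_writes_dag_car(inline_with_output))
```

with an inline prep that has one output attached, the sketch accepts the prep, skips the data cars, and still emits the DAG car, which matches the session shown below.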

the DAG car is then available in the output storage:

$ ./singularity storage create local --path=/tmp/singularity
ID  Name         Type   Path              
1   hurt_galaxy  local  /tmp/singularity  
$ ./singularity storage create local --path=/tmp/dag
ID  Name        Type   Path      
2   light_hail  local  /tmp/dag  
$ ./singularity prep create --source=1 --output=2 --no-dag=false --no-inline=false
ID  Name             DeleteAfterExport  MaxSize      PieceSize    NoInline  NoDag  
1   adorable_forest  false              33822867456  34359738368  false     false  
    Source Storages:
        ID  Name         Type   Path              
        1   hurt_galaxy  local  /tmp/singularity  
    Output Storages:
        ID  Name        Type   Path      
        2   light_hail  local  /tmp/dag  
$ ./singularity prep start-scan 1 1
ID  Type  State  ErrorMessage  WorkerID  
1   scan  ready                <nil>     
$ ./singularity prep start-pack 1 1
$ ./singularity prep start-daggen 1 1
ID  Type    State  ErrorMessage  WorkerID  
2   daggen  ready                <nil>     
$ ./singularity run dataset-worker --exit-on-complete
// snip
$ ls -lh /tmp/dag
total 180K
-rw-r--r--. 1 arkadiy arkadiy 180K Mar 26 14:06 baga6ea4seaqp25dcxeyhswdjvest5qinfy2y2txj4h7k35nronoha4xn2mh74oi.car

as well as under the pieces for the prep:

$ ./singularity prep list-pieces 1
AttachmentID  SourceStorageID  
1             1                
    SourceStorage
        ID  Name         Type   Path              
        1   hurt_galaxy  local  /tmp/singularity  
    Pieces
        PieceCID                                                          PieceSize    RootCID                                                      FileSize   StoragePath                                                           
        baga6ea4seaqfdwcz4ts2kdg7a6xcvb7erkjxkcl2xjv7qckqsntnfgnkgvxyokq  34359738368  bafkreic7zndobkznnmagtmz2k7s7bm652zhxujzmivrakbr56phn35sqmq  147085548                                                                        
        baga6ea4seaqp25dcxeyhswdjvest5qinfy2y2txj4h7k35nronoha4xn2mh74oi  34359738368  bafybeichuteorojcgosy35mhsov6hhtk5merykvbpytpgiszx5whcv63nq  183689     baga6ea4seaqp25dcxeyhswdjvest5qinfy2y2txj4h7k35nronoha4xn2mh74oi.car

both can then be downloaded through singularity download. together they provide the structure needed to get the files out (they are concatenated here for demonstration purposes, but adding each one to ipfs with dag import should work as well). note that, as before, the data car does not allow extracting anything meaningful from its root, while the DAG car by itself does not contain the data:

$ car extract --file baga6ea4seaqfdwcz4ts2kdg7a6xcvb7erkjxkcl2xjv7qckqsntnfgnkgvxyokq.car
no files extracted
$ car extract --file baga6ea4seaqp25dcxeyhswdjvest5qinfy2y2txj4h7k35nronoha4xn2mh74oi.car
data for entry not found: /.devcontainer/container-prompt.md (skipping...)
data for entry not found: /.devcontainer/devcontainer.json (skipping...)
data for entry not found: /.devcontainer/scripts/build.sh (skipping...)
$ car concat baga6ea4seaqp25dcxeyhswdjvest5qinfy2y2txj4h7k35nronoha4xn2mh74oi.car baga6ea4seaqfdwcz4ts2kdg7a6xcvb7erkjxkcl2xjv7qckqsntnfgnkgvxyokq.car > out.car
$ car extract --file out.car 
extracted 2176 file(s)

as an outcome from this exercise we immediately have some significant questions:

  • is retrieval possible on the SP side when a DAG crosses piece boundaries? this concern affects all data prepared with singularity (including data prepared before this patch) and in fact all filecoin data where content DAGs don't align with pieces, e.g. any file larger than the piece size; I would therefore expect it to work, but we should verify
  • the obvious question of whether we want to write the DAG cars to storage; I personally like this as it would allow us to aggregate the (much smaller) DAG cars and then seal them as a bundle
  • the corresponding question of how the DAG piece should fit into the prep; right now it's listed as essentially any other piece (and in fact has the full 32GiB PieceSize, which I suspect happens with normal preps as well)
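
to put the 32GiB PieceSize of the DAG piece in perspective, here is a small worked calculation. It relies on the general Filecoin rule that a padded piece size is a power of two and holds 127/128 of its size in payload; the 128-byte floor and the function name are assumptions for illustration, and singularity itself may apply a different minimum.

```python
def min_padded_piece_size(payload_bytes, floor=128):
    """Smallest power-of-two padded piece size that fits payload_bytes.

    A padded piece of size P carries P * 127 / 128 bytes of payload
    (Fr32 padding). The floor of 128 bytes is an assumed lower bound.
    """
    size = floor
    while size * 127 // 128 < payload_bytes:
        size *= 2
    return size


dag_car_bytes = 183689          # FileSize of the DAG car in list-pieces
assigned_piece = 34359738368    # PieceSize reported for it (32 GiB)

minimal_piece = min_padded_piece_size(dag_car_bytes)
print(minimal_piece)                    # 262144 (256 KiB)
print(assigned_piece // minimal_piece)  # 131072
```

under these assumptions the DAG car would fit in a 256 KiB piece, so the reported PieceSize is 131072 times larger than strictly needed.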

as well as some smaller ones:

  • the scheduler does not appear to observe the scan/pack/daggen dependency order correctly, so if all 3 types of tasks are scheduled at once, daggen errors out and requires a separate start call after the other two complete
  • it appears that start-scan implies start-pack rather than vice versa, similar issue to above
  • tests etc
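
for the dependency-order issue, the kind of gating one would expect looks roughly like the sketch below. This is not singularity's scheduler: the `DEPENDS_ON` table, the state names other than "ready", and the `runnable` helper are hypothetical, and the ordering (pack after scan, daggen after both) is an assumption based on the description above.

```python
# Assumed ordering: pack needs scan, daggen needs scan and pack.
DEPENDS_ON = {
    "scan": (),
    "pack": ("scan",),
    "daggen": ("scan", "pack"),
}


def runnable(jobs):
    """Return the job types that may start now.

    jobs maps a job type to its state. A job is picked only when it is
    "ready" and every prerequisite is "complete". A prerequisite that
    was never scheduled blocks the job in this simplified version.
    """
    return [
        job_type
        for job_type, state in jobs.items()
        if state == "ready"
        and all(jobs.get(dep) == "complete" for dep in DEPENDS_ON[job_type])
    ]


if __name__ == "__main__":
    print(runnable({"scan": "ready", "pack": "ready", "daggen": "ready"}))
    print(runnable({"scan": "complete", "pack": "complete", "daggen": "ready"}))
```

with a gate like this, scheduling all three job types at once would leave daggen waiting instead of erroring out.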

Arkadiy Kukarkin added 4 commits September 19, 2024 15:06
- refactor validation logic for output storage requirements in CreateRequest and RemoveOutputStorageHandler
- update pack check logic to skip data car writing even with output storages attaches
- update error message
@parkan (Collaborator, Author) commented Mar 26, 2025

a bit more on why I feel OK having the DAG cars written to storage even in the inline case: basically, they are very, very small compared to the data/leaf cars!

in this case I deliberately chose a "bad" input (the singularity repo), which is only ~146M but has 2K+ files in many deeply nested subdirs (git objects) and therefore a relatively large tree; even so, the DAG car is only 180K (~0.1% of the data)
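
the ratio can be checked directly from the FileSize values in the list-pieces output above (the variable names are just for illustration):

```python
data_car_bytes = 147085548  # FileSize of the data piece in list-pieces
dag_car_bytes = 183689      # FileSize of the DAG piece in list-pieces

ratio = dag_car_bytes / data_car_bytes
print(f"{ratio:.2%}")  # 0.12%
```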

@parkan (Collaborator, Author) commented Mar 28, 2025

closing in favor of #472

Linked issue (may be closed by merging this pull request):

[Bug]: label field does not point to a root cid and no DAG present in the car when using inline prep