
[timelapse / a2] Support large file uploads for ETL script access #106

Open
@rudokemper

Description


Feature Request

As previously noted here and here, we need to establish a process for uploading very large archive files to cloud storage and for enabling access to those files by the Timelapse and Auditor2 ETL scripts (which does not exist yet; see #97).

The file input in a Windmill app won't work for large files, so we'll likely need to handle uploads outside of Windmill. For example, we could upload files to Azure Blob Storage and then use the Python Azure Storage SDK within the scripts to access the files.
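
For the Azure route, something like the following is roughly what I have in mind on the script side. This is just a minimal sketch using the azure-storage-blob package; the container name, blob path, local path, and environment variable are placeholders, and chunked streaming is just one way to handle archives that don't fit comfortably in memory:

```python
# Hypothetical sketch: stream a large archive from Azure Blob Storage to local disk.
# Container/blob names and the AZURE_STORAGE_CONNECTION_STRING env var are placeholders.
import os

from azure.storage.blob import BlobClient

blob = BlobClient.from_connection_string(
    conn_str=os.environ["AZURE_STORAGE_CONNECTION_STRING"],
    container_name="etl-archives",              # assumed container name
    blob_name="timelapse/export-2024.tar.gz",   # assumed blob path
)

# Stream the blob in chunks rather than loading the whole archive into memory.
with open("/tmp/export-2024.tar.gz", "wb") as f:
    downloader = blob.download_blob(max_concurrency=4)
    for chunk in downloader.chunks():
        f.write(chunk)
```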

At this point, it's unclear whether support for Azure alone will be sufficient, or whether we should aim to support other cloud storage providers (e.g., S3, GCS) as well. The workflow should ideally be flexible enough to accommodate different backends if needed. But given that CMI is likely to be the only user of these scripts in the short-to-medium term, it's fine to start with Azure only.
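
If flexibility across backends does become a requirement, one option (a sketch, not a decision) would be to go through fsspec so the scripts only deal with a URL plus a dict of storage options, and the backend is picked by the URL scheme. The URL and credentials below are placeholders, and each scheme needs its matching filesystem package installed:

```python
# Hypothetical sketch of a backend-agnostic download using fsspec.
# Requires the matching filesystem package (adlfs for az://, s3fs for s3://, gcsfs for gs://).
import shutil

import fsspec

# In practice these would come from script parameters or a Windmill resource.
archive_url = "az://etl-archives/timelapse/export-2024.tar.gz"          # placeholder
storage_options = {"account_name": "cmistorage", "sas_token": "<sas>"}  # placeholders

with fsspec.open(archive_url, "rb", **storage_options) as remote, \
        open("/tmp/export-2024.tar.gz", "wb") as local:
    shutil.copyfileobj(remote, local, length=8 * 1024 * 1024)  # copy in 8 MiB chunks
```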

In terms of the actual workflow: do we just upload files to blob storage manually, and provide whatever path and credentials are needed for the Python SDK when running or scheduling a script in Windmill? @IamJeffG I'm interested to learn what you've done in the past (feel free to just link to a code snippet if that's easiest).
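
To make that concrete, something like the following is the shape I imagine for the Windmill side: the blob SAS URL (or a resource holding the credentials) is supplied when the script is run or scheduled. Parameter names are placeholders and this is only a guess at the signature, not a proposal for the exact interface:

```python
# Hypothetical Windmill script entrypoint: the caller supplies a blob SAS URL
# when running or scheduling the script. Parameter names are placeholders.
from azure.storage.blob import BlobClient


def main(archive_sas_url: str, local_path: str = "/tmp/archive.tar.gz") -> str:
    # A SAS URL embeds the credential, so no separate secret is needed here.
    blob = BlobClient.from_blob_url(archive_sas_url)

    with open(local_path, "wb") as f:
        blob.download_blob().readinto(f)

    # ...hand local_path off to the existing ETL logic...
    return local_path
```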

To close this issue, let's:

  • Agree upon the workflow
  • Modify the scripts accordingly
  • Provide necessary documentation


Labels

  • connectors: Connector scripts for ETL from upstream data sources
  • feature: New specs for new behavior
