Add deferred pagination mode to GenericTransfer #44809
Following dependency check is failing in breeze:
@eladkal @potiuk This error is expected: I had to add the common sql provider as a dependency, because the GenericTransfer now relies on the newly introduced SQLExecuteQueryTrigger for the deferred paging mechanism. On reflection, though, it seems illogical to me that the GenericTransfer operator is part of the standard provider package, unless it supports more than transferring data from database to database. If it doesn't, wouldn't it make more sense for it to reside in the common sql provider, or am I missing something? I went through the code and checked the implementations of the get_records and insert_rows methods, which were all implemented by hooks extending the DbApiHook, but I suspect the DbApiHook was introduced after the GenericTransfer already existed.
Hello @potiuk @eladkal, could you check whether my question above makes sense? Thx
Absolutely. It should be added to
@potiuk Okay, but wouldn't this have an impact on imports? Or would you keep the same structure as is and move the GenericTransfer from the standard provider to common sql?
Generic Transfer has only been moved to the "standard" provider recently as part of the preparation for Airflow 3. And the "standard" provider is not YET released in a The only back-compatibility issue is that the old generic transfer should be redirected in Airflow 3, but we can simply redirect it to the new place in common.sql, no problem with it whatsoever:
…d reads (in deferred mode) and introduce a SQLExecuteQueryTrigger
…allows you to run deferrable operators in test common test utils
…if hook is instance of DbApiHook instead of checking presence of get_records and insert_rows method
…classes of the airflow operators module
@potiuk @eladkal I still have a breeze test failing for selected tests on main (why do we only test on the main branch?) since I moved the generic transfer to common sql. To me the expected values seem correct, but apparently I still get common.sql as an additional provider for the bash operator, which is weird, as the standard provider no longer needs that dependency now that I've moved the generic transfer. Or is it because we run it against the main branch, and if so, why?
The root cause is here: https://github.com/apache/airflow/actions/runs/12743121090/job/35512486483?pr=44809 You need to run pre-commit with your change to regenerate the .json file where we keep the cross-provider dependencies.
Ok, my bad. How many times have I forgotten that one already 😆 Thx @potiuk
I actually never remember it either. I simply run
# Conflicts: # providers/src/airflow/providers/standard/operators/generic_transfer.py # providers/tests/microsoft/conftest.py
As explained in my Airflow Medium blog post, I've refactored the GenericTransfer to support deferred paginated reads.
When dealing with large datasets, the whole dataset no longer needs to be read into memory before it is persisted, which could otherwise lead to out-of-memory errors on the worker executing the code.
I also took the opportunity to introduce an SQLExecuteQueryTrigger in the common sql provider, which allows the GenericTransfer to handle the paginated reads in deferred mode. This decouples the paginated reads from the writes: the worker is not continuously blocked, because it can offload the reads to the triggerer while persisting the previous page in the meantime.
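To illustrate the idea of overlapping reads and writes, here is a minimal sketch in plain asyncio. It does not use Airflow's triggerer or the SQLExecuteQueryTrigger; `read_page`, `write_page` and `pipelined_transfer` are hypothetical names, and an in-memory list stands in for the source and destination databases.

```python
import asyncio


async def read_page(data, offset, chunk_size):
    """Hypothetical stand-in for a non-blocking paginated read."""
    await asyncio.sleep(0)  # yield control, as a real async query would
    return data[offset:offset + chunk_size]


async def write_page(sink, rows):
    """Hypothetical stand-in for persisting one page of rows."""
    await asyncio.sleep(0)
    sink.extend(rows)


async def pipelined_transfer(data, sink, chunk_size):
    """Write the current page while the next page is already being read."""
    offset = 0
    rows = await read_page(data, offset, chunk_size)
    while rows:
        offset += chunk_size
        # Schedule the next read before writing the current page.
        next_read = asyncio.create_task(read_page(data, offset, chunk_size))
        await write_page(sink, rows)
        rows = await next_read
    return len(sink)


source_rows = list(range(10))
written = []
total = asyncio.run(pipelined_transfer(source_rows, written, chunk_size=4))
```

Only one page is in flight on each side at any time, so memory use stays bounded by roughly two pages regardless of the size of the dataset.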
Once the dialects PR is done, we could improve how the GenericTransfer handles the paginated SQL queries across different databases. At the moment, the paginated SQL query can be customized through the paginated_sql_statement_format parameter. The read size can be specified through the chunk_size parameter; a better name may exist, but I'll let you decide what to call it. If no chunk_size is specified, the original implementation is used and everything is read and persisted in one go.
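The behaviour described above can be sketched with the standard library's sqlite3 module. This is not the operator's code: the `transfer` function is hypothetical, and the `LIMIT ... OFFSET ...` format string is an assumed value for paginated_sql_statement_format that works for SQLite but not for every database.

```python
import sqlite3

# Assumed format; placeholders are (base query, page size, offset).
PAGINATED_SQL_STATEMENT_FORMAT = "{} LIMIT {} OFFSET {}"


def transfer(source, destination, sql, insert_sql, chunk_size=None):
    """Copy rows from source to destination; paginate only if chunk_size is set."""
    if chunk_size is None:
        # Original behaviour: read everything, then persist in one go.
        rows = source.execute(sql).fetchall()
        destination.executemany(insert_sql, rows)
        destination.commit()
        return len(rows)

    total = 0
    offset = 0
    while True:
        page_sql = PAGINATED_SQL_STATEMENT_FORMAT.format(sql, chunk_size, offset)
        rows = source.execute(page_sql).fetchall()
        if not rows:
            break  # an empty page means the source is exhausted
        destination.executemany(insert_sql, rows)
        destination.commit()
        total += len(rows)
        offset += chunk_size
    return total


source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE src (id INTEGER PRIMARY KEY, name TEXT)")
source.executemany("INSERT INTO src VALUES (?, ?)", [(i, f"row-{i}") for i in range(10)])

destination = sqlite3.connect(":memory:")
destination.execute("CREATE TABLE dst (id INTEGER PRIMARY KEY, name TEXT)")

copied = transfer(
    source,
    destination,
    "SELECT id, name FROM src ORDER BY id",
    "INSERT INTO dst VALUES (?, ?)",
    chunk_size=4,
)
```

The base query carries an `ORDER BY` so that pages are stable between reads; without a deterministic order, offset-based pagination may skip or repeat rows.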
Last but not least, I've moved the helper code for testing deferrable operators out of the microsoft azure provider and into the common test utils, so it can be reused across multiple modules.
Read the Pull Request Guidelines for more information.
In case of fundamental code changes, an Airflow Improvement Proposal (AIP) is needed.
In case of a new dependency, check compliance with the ASF 3rd Party License Policy.
In case of backwards incompatible changes please leave a note in a newsfragment file, named {pr_number}.significant.rst or {issue_number}.significant.rst, in newsfragments.