Skip to content

feat(PoC): adjust file-based and file uploader component to latest protocol changes. #457

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Conversation

aldogonzalez8
Copy link
Contributor

@aldogonzalez8 aldogonzalez8 commented Apr 3, 2025

What

This PR updates the file-based and file uploader components in the Airbyte Python CDK to align with the file transfer record protocol changes introduced in the platform. It introduces schema refinements, file path handling improvements, and new test cases.

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/12364

How

  • Updated file transfer logic to reflect new protocol structure.
  • Adjusted the uploader to handle file references at the record selector level.
  • Performed refactoring and linting cleanup across modules.

Review guide

File based changes:

  1. airbyte_cdk/models/airbyte_protocol.py: remove hacked protocol
  2. airbyte_cdk/models/file_transfer_record_message.py: remove hacked protocol
  3. airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py: remove hacked protocol
  4. airbyte_cdk/sources/file_based/file_based_stream_reader.py: change method verb and return type to AirbyteRecordMessageFileReference, also make _get_file_transfer_paths support method return a dict with path fields.
  5. airbyte_cdk/sources/file_based/file_record_data.py: helper model for record (metadata) of files.
  6. airbyte_cdk/sources/file_based/file_types/file_transfer.py: update to return record and file reference data.
  7. airbyte_cdk/sources/file_based/schema_helpers.py: schema of records (metadata) for file-based connectors.
  8. airbyte_cdk/sources/file_based/stream/concurrent/adapters.py: pass file_reference
  9. airbyte_cdk/sources/file_based/stream/default_file_based_stream.py: introduce changes to default file based stream to handle new file reference and records data besides fixed schema.
  10. airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py: update call to stream_data_to_airbyte_message
  11. airbyte_cdk/sources/types.py: remove old is_file_transfer_message flag
  12. airbyte_cdk/sources/utils/record_helper.py: remove handling of is_file_transfer_message flag
  13. airbyte_cdk/test/mock_http/response_builder.py: add helper method to get binary data from file for testing

File-api changes:

  1. airbyte_cdk/sources/declarative/retrievers/file_uploader.py: update latest protocol fields names.

User Impact

Developers using the file-based CDK and file uploader in declarative functionality will benefit from file_reference protocol support.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Apr 3, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@aldogonzalez8 aldogonzalez8 changed the title Aldogonzalez8/poc adjust file based cdk new protocol feat(PoC): adjust file-based cdk to new protocol Apr 4, 2025
@github-actions github-actions bot added the enhancement New feature or request label Apr 4, 2025
@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Apr 4, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formating issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@aldogonzalez8 aldogonzalez8 changed the base branch from maxi297/poc-file-upload to main April 15, 2025 18:50
@aldogonzalez8 aldogonzalez8 changed the base branch from main to maxi297/poc-file-upload April 15, 2025 18:51
@aldogonzalez8
Copy link
Contributor Author

aldogonzalez8 commented Apr 15, 2025

I asked a couple of questions to make sure I understand everything but I don't think anything will be blocking.

Also in theory, this is a breaking change but it feels like we will update all the connectors that populates AirbyteFileTransferRecordMessage in record or call AbstractFileBasedStreamReader.get_file, right? Are there other breaking changes we should be worried about?

I think there are no other things to worry about, as previously, we had a schema, but we ignored such schema (which actually is/was messy) as the only point of interest was to move the file. From now on, we will move data along the file.

We will need to refresh the schema for these.

Old (messy,, but we didn't care):

image

New (need to update pre-dev as I'm adding source-uri but same idea):

image

@aldogonzalez8 aldogonzalez8 changed the base branch from maxi297/poc-file-upload to main April 15, 2025 19:15
@aldogonzalez8 aldogonzalez8 changed the base branch from main to maxi297/poc-file-upload April 15, 2025 19:16
from pydantic.v1 import BaseModel


class FileRecordData(BaseModel):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If allowing additional fields is difficult, I'd suggest adding an "additional_properties" field that sources can populate however they want

bytes: int
source_uri: str
id: Optional[str] = None
updated_at: Optional[str] = None
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also have a created_at?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, makes sense, I will add it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is now there

@aldogonzalez8 aldogonzalez8 requested a review from maxi297 April 16, 2025 14:40
@aldogonzalez8
Copy link
Contributor Author

I asked a couple of questions to make sure I understand everything but I don't think anything will be blocking.
Also in theory, this is a breaking change but it feels like we will update all the connectors that populates AirbyteFileTransferRecordMessage in record or call AbstractFileBasedStreamReader.get_file, right? Are there other breaking changes we should be worried about?

I think there are no other things to worry about, as previously, we had a schema, but we ignored such schema (which actually is/was messy) as the only point of interest was to move the file. From now on, we will move data along the file.

We will need to refresh the schema for these.

Old (messy,, but we didn't care):

image

New (need to update pre-dev as I'm adding source-uri but same idea):

image

Anyway, I will do some testing on how it works on a workspace with the previous version of the connectors and later receive the upgrade and update the schema as part of the E2E test we plan for late this week.

aaronsteers and others added 2 commits April 16, 2025 13:02
Co-authored-by: octavia-squidington-iii <[email protected]>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Copy link
Contributor

@maxi297 maxi297 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm good with this change assuming the CAT changes were out of scope

@@ -0,0 +1,46 @@
# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
'''FAST Airbyte Standard Tests
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we have CATs tests related changes here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a terrible person and I'm taking credit for @aaronsteers work :( Sorry.

JK, it is because it is a stacked PR where both branches run behind main. Let me fix that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's fine! not a blocker for going forward with this

@aldogonzalez8 aldogonzalez8 merged commit b5b8197 into maxi297/poc-file-upload Apr 17, 2025
26 checks passed
@aldogonzalez8 aldogonzalez8 deleted the aldogonzalez8/poc-adjust-file-based-cdk-new-protocol branch April 17, 2025 15:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants