feat(file-api): new upload component for declarative cdk and update protocol support for file-based connectors #433

Merged
merged 12 commits into from
Apr 22, 2025

Contributor

@maxi297 maxi297 commented Mar 19, 2025

Context

This PR summarizes the changes in:

What

Add File API support and update the file-based sources to the latest protocol implementation.

API Sources

Introduce a new file upload component.

File-Based Sources: Remove Legacy Hacked Protocol for file-based connectors and introduce latest protocol changes

This PR updates the file-based and file uploader components in the Airbyte Python CDK to align with the file transfer record protocol. It introduces schema refinements, file path handling improvements, and new test cases.

Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/12364

How

  • Updated file transfer logic to reflect new protocol structure.
  • Adjusted the uploader to handle file references at the record selector level.
  • Performed refactoring and linting cleanup across modules.

Review guide

File-api changes:

  1. airbyte_cdk/sources/declarative/retrievers/file_uploader.py: new component that uploads documents for File API streams.

Remove Legacy Hacked Protocol for file-based connectors and introduce latest protocol changes

File based changes:

  1. airbyte_cdk/models/airbyte_protocol.py: remove hacked protocol
  2. airbyte_cdk/models/file_transfer_record_message.py: remove hacked protocol
  3. airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py: remove hacked protocol
  4. airbyte_cdk/sources/file_based/file_based_stream_reader.py: rename the method (get_file becomes upload) and change its return type to AirbyteRecordMessageFileReference; also make the _get_file_transfer_paths support method return a dict with path fields.
  5. airbyte_cdk/sources/file_based/file_record_data.py: helper model for record (metadata) of files.
  6. airbyte_cdk/sources/file_based/file_types/file_transfer.py: update to return record and file reference data.
  7. airbyte_cdk/sources/file_based/schema_helpers.py: schema of records (metadata) for file-based connectors.
  8. airbyte_cdk/sources/file_based/stream/concurrent/adapters.py: pass file_reference
  9. airbyte_cdk/sources/file_based/stream/default_file_based_stream.py: update the default file-based stream to handle the new file reference and record data, in addition to the fixed schema.
  10. airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py: update call to stream_data_to_airbyte_message
  11. airbyte_cdk/sources/types.py: remove old is_file_transfer_message flag
  12. airbyte_cdk/sources/utils/record_helper.py: remove handling of is_file_transfer_message flag
  13. airbyte_cdk/test/mock_http/response_builder.py: add helper method to get binary data from file for testing

User Impact

Developers using the file-based CDK and file uploader in declarative functionality will benefit from file_reference protocol support.

Can this PR be safely reverted and rolled back?

  • YES 💚
  • NO ❌

Summary by CodeRabbit

  • New Features

    • Added support for file-based streams and file transfer in declarative sources with a new file uploader component and schema enhancements.
    • Enabled streams to be marked as file-based, exposing file metadata and references during sync and discovery.
    • Introduced detailed file metadata models and integrated file reference handling in records.
  • Bug Fixes

    • Simplified and unified file transfer handling across stream implementations.
  • Tests

    • Added extensive unit tests covering file-based streams, file uploader functionality, and stream metadata exposure.
    • Updated existing test scenarios to include file-based stream attributes.
  • Chores

    • Updated dependency versions to support new protocol features.

@maxi297
Contributor Author

maxi297 commented Mar 19, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@maxi297 maxi297 marked this pull request as draft March 19, 2025 16:19
Contributor

coderabbitai bot commented Mar 19, 2025

📝 Walkthrough

This update introduces a unified and extensible file transfer mechanism into the Airbyte CDK, primarily by integrating a new FileUploader declarative component and updating the file-based streaming architecture. The changes refactor how file transfer records are represented, replacing the previous boolean flag and dedicated message class with a structured AirbyteRecordMessageFileReference. The declarative source schema, models, and factory logic are extended to support file uploaders, allowing streams to fetch and store files with customizable naming and extraction logic. Test infrastructure and scenarios are updated to validate file-based streaming, file reference metadata, and the new declarative manifest capabilities.
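
To make the flag-to-reference change concrete, here is a minimal sketch. It is not the CDK's code: FileReference and RecordSketch are hypothetical stand-ins for AirbyteRecordMessageFileReference and Record, and the three field names are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Dict, Mapping, Optional


@dataclass
class FileReference:
    """Hypothetical stand-in for AirbyteRecordMessageFileReference."""

    staging_file_url: str  # assumed field name
    source_file_relative_path: str  # assumed field name
    file_size_bytes: int  # assumed field name


@dataclass
class RecordSketch:
    """Hypothetical record: file_reference=None means 'not a file record'."""

    data: Mapping[str, Any]
    stream_name: str
    file_reference: Optional[FileReference] = None


def to_message(record: RecordSketch) -> Dict[str, Any]:
    """Build a plain-dict message; the reference is attached only when present."""
    message: Dict[str, Any] = {"stream": record.stream_name, "data": dict(record.data)}
    if record.file_reference is not None:
        message["file_reference"] = dict(vars(record.file_reference))
    return message
```

The point of the design is that a single optional attribute carries both the "is this a file record?" answer and the file details, so no separate boolean or message class is needed.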

Changes

| File(s) / Path(s) | Change Summary |
| --- | --- |
| airbyte_cdk/models/__init__.py | Added AirbyteRecordMessageFileReference to the public API. |
| airbyte_cdk/models/airbyte_protocol.py | Removed AirbyteFileTransferRecordMessage and simplified the AirbyteMessage dataclass to only use AirbyteRecordMessage for the record attribute. |
| airbyte_cdk/models/file_transfer_record_message.py | Deleted the AirbyteFileTransferRecordMessage dataclass. |
| airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py | Changed the argument in stream_data_to_airbyte_message from is_file_transfer_message to file_reference. |
| airbyte_cdk/sources/declarative/concurrent_declarative_source.py | Added logic to detect and propagate supports_file_transfer flag when grouping streams. |
| airbyte_cdk/sources/declarative/declarative_component_schema.yaml | Added a new file_uploader schema definition under DeclarativeStream for experimental file fetching support. |
| airbyte_cdk/sources/declarative/extractors/record_selector.py | Extended RecordSelector with an optional file_uploader attribute, invoking upload on each record if set. |
| airbyte_cdk/sources/declarative/models/declarative_component_schema.py | Added the FileUploader Pydantic model and an optional file_uploader field to DeclarativeStream. |
| airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py | Added support for the FileUploader component in the factory, including new creation methods and argument propagation. |
| airbyte_cdk/sources/declarative/retrievers/file_uploader.py | Introduced the FileUploader dataclass with logic for downloading, naming, and storing files, and updating records with file references. |
| airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py | Updated read to yield Record objects directly if already present, improving type handling. |
| airbyte_cdk/sources/file_based/file_based_stream_reader.py | Refactored method get_file to upload with a new return type; added _get_file_transfer_paths and constants for file path management. |
| airbyte_cdk/sources/file_based/file_record_data.py | Added the FileRecordData Pydantic model for structured file record metadata. |
| airbyte_cdk/sources/file_based/file_types/file_transfer.py | Refactored to use upload instead of get_file, updated return types, and simplified local directory handling. |
| airbyte_cdk/sources/file_based/schema_helpers.py | Updated file_transfer_schema to a detailed schema reflecting the new file record structure. |
| airbyte_cdk/sources/file_based/stream/concurrent/adapters.py | Removed conditional logic for file transfer messages, always using file_reference for record creation, and deleted _use_file_transfer. |
| airbyte_cdk/sources/file_based/stream/default_file_based_stream.py | Centralized file transfer logic using the new upload method and file reference; removed redundant methods and added as_airbyte_stream. |
| airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py | Removed the is_file_transfer_message argument from message creation. |
| airbyte_cdk/sources/streams/concurrent/default_stream.py | Added supports_file_transfer parameter and exposed is_file_based in the Airbyte stream representation. |
| airbyte_cdk/sources/types.py | Changed Record to use an optional file_reference instead of a boolean flag; added a property for access. |
| airbyte_cdk/sources/utils/files_directory.py | Added a utility for determining the files directory, with fallback logic. |
| airbyte_cdk/sources/utils/record_helper.py | Refactored stream_data_to_airbyte_message to accept an optional file_reference instead of a boolean flag, simplifying record message creation. |
| airbyte_cdk/test/mock_http/response_builder.py | Added find_binary_response to load binary HTTP response files for tests. |
| pyproject.toml | Updated airbyte-protocol-models-dataclasses dependency version from ^0.14 to ^0.15. |
| unit_tests/resource/http/response/file_api/article_attachments.json, unit_tests/resource/http/response/file_api/articles.json | Added static JSON files representing articles and attachments for test fixtures. |
| unit_tests/sources/declarative/file/file_stream_manifest.yaml, unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml | Added new declarative source manifest files for file-based stream and filename extraction testing. |
| unit_tests/sources/declarative/file/test_file_stream.py | Added comprehensive tests for file-based declarative streams, file reference validation, and discovery. |
| unit_tests/sources/file_based/in_memory_files_source.py, unit_tests/sources/file_based/test_file_based_stream_reader.py | Refactored test stream readers to use upload instead of get_file and updated related tests. |
| unit_tests/sources/file_based/scenarios/csv_scenarios.py, unit_tests/sources/file_based/scenarios/incremental_scenarios.py, unit_tests/sources/streams/concurrent/scenarios/stream_facade_scenarios.py, unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_scenarios.py | Added "is_file_based": False to expected catalog stream definitions in multiple test scenarios. |
| unit_tests/sources/file_based/stream/test_default_file_based_stream.py | Updated tests to use structured file reference objects and added tests for as_airbyte_stream with file-based flag. |
| unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py | Updated test mocks to use file_reference instead of is_file_transfer_message. |
| unit_tests/sources/streams/concurrent/test_default_stream.py | Added tests for is_file_based flag in DefaultStream.as_airbyte_stream, including file transfer support. |

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant DeclarativeSource
    participant ModelToComponentFactory
    participant DeclarativeStream
    participant FileUploader
    participant Requester
    participant FileSystem

    User->>DeclarativeSource: Initiate sync (with manifest)
    DeclarativeSource->>ModelToComponentFactory: Build stream (with file_uploader)
    ModelToComponentFactory->>DeclarativeStream: Instantiate (file_uploader attached)
    DeclarativeStream->>FileUploader: For each record, upload(record)
    FileUploader->>Requester: Send request (download_target)
    Requester-->>FileUploader: Return response (file content)
    FileUploader->>FileSystem: Write file to staging directory
    FileSystem-->>FileUploader: File path, size
    FileUploader->>DeclarativeStream: Attach file reference to record
    DeclarativeStream-->>DeclarativeSource: Yield record with file reference
    DeclarativeSource-->>User: Record (with file reference metadata)
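
The download, stage, and attach steps in the sequence diagram can be sketched roughly as follows. This is an illustration under assumptions, not the FileUploader implementation: upload_sketch, the download callable, the content_url field, and the keys of the returned dict are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Dict


def upload_sketch(
    record: Dict[str, object],
    download: Callable[[str], bytes],
    staging_dir: Path,
    relative_path: str,
) -> Dict[str, object]:
    """Download the file for a record, write it under staging_dir, describe it."""
    # "content_url" is a hypothetical field holding the download target.
    content = download(str(record["content_url"]))

    full_path = staging_dir / relative_path
    # Create intermediate folders; works even when relative_path has none.
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_bytes(content)

    # Hypothetical shape of the file reference attached to the record.
    return {
        "staging_file_url": str(full_path),
        "source_file_relative_path": relative_path,
        "file_size_bytes": full_path.stat().st_size,
    }
```

Passing the download step in as a callable keeps the sketch testable without any HTTP traffic; in the PR that role is played by the Requester.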

Suggested labels

enhancement

Suggested reviewers

  • maxi297
  • aldogonzalez8

Would you like to add more examples or documentation on how to use the new FileUploader declarative component, wdyt?


maxi297 and others added 6 commits April 2, 2025 10:19
Co-authored-by: Maxime Carbonneau-Leclerc <[email protected]>
Co-authored-by: octavia-squidington-iii <[email protected]>
…otocol changes. (#457)

Co-authored-by: Maxime Carbonneau-Leclerc <[email protected]>
Co-authored-by: octavia-squidington-iii <[email protected]>
Co-authored-by: Aaron ("AJ") Steers <[email protected]>
Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@aldogonzalez8 aldogonzalez8 marked this pull request as ready for review April 17, 2025 18:08
@aldogonzalez8 aldogonzalez8 changed the title PoC for file upload feat(file-api): new upload component for declarative cdk and update protocol support for file-based connectors Apr 17, 2025
Contributor

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🔭 Outside diff range comments (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)

59-67: 🛠️ Refactor suggestion

Instantiate FileTransfer per stream rather than as a shared class member?

_file_transfer is created as a class-level attribute, meaning all streams will reuse the very same instance.
If FileTransfer ever stores mutable state (e.g., temporary paths, counters, auth tokens), different streams could trample on each other during concurrent syncs or when the connector runs multithreaded. Would it be safer to create it inside __init__ and keep it on self instead, e.g.:

-    _file_transfer = FileTransfer()
+    def __init__(self, **kwargs: Any):
+        ...
+        self._file_transfer = FileTransfer()
+        super().__init__(**kwargs)

This avoids cross‑stream state bleed‑through, wdyt?
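
A small self-contained illustration of the concern, using made-up classes rather than the CDK's FileTransfer: a class-level attribute is one object shared by every instance, while an attribute assigned in __init__ is created per instance.

```python
from typing import List


class TransferSketch:
    """Hypothetical helper that happens to keep mutable state."""

    def __init__(self) -> None:
        self.seen_paths: List[str] = []


class SharedStream:
    # Evaluated once at class-definition time: every stream sees this object.
    _file_transfer = TransferSketch()


class IsolatedStream:
    def __init__(self) -> None:
        # Evaluated on each construction: every stream gets its own object.
        self._file_transfer = TransferSketch()


shared_a, shared_b = SharedStream(), SharedStream()
shared_a._file_transfer.seen_paths.append("a.csv")
# shared_b now also "sees" a.csv, because both names point at one object.

isolated_a, isolated_b = IsolatedStream(), IsolatedStream()
isolated_a._file_transfer.seen_paths.append("a.csv")
# isolated_b is unaffected.
```

If the helper is truly stateless the shared form is harmless, so this only matters once state is added to it.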

🧹 Nitpick comments (9)
airbyte_cdk/test/mock_http/response_builder.py (1)

201-206: Looks good, but would documenting the purpose be helpful?

The find_binary_response function implementation looks solid and follows the same pattern as find_template. It correctly handles loading binary response files for testing purposes.

Consider adding a function docstring to explain its purpose, expected inputs, and return value, similar to the find_template function. This would help other developers understand its usage, wdyt?

airbyte_cdk/sources/utils/files_directory.py (1)

10-15: Function looks good but lacks documentation

The implementation is clean and correctly checks for directory existence before deciding which path to use.

Consider adding a docstring to explain the function's purpose and behavior, especially regarding the fallback logic, wdyt?

 def get_files_directory() -> str:
+    """
+    Returns the directory path to use for file transfers.
+    
+    Prefers AIRBYTE_STAGING_DIRECTORY if it exists on the filesystem,
+    otherwise falls back to DEFAULT_LOCAL_DIRECTORY.
+    """
     return (
         AIRBYTE_STAGING_DIRECTORY
         if os.path.exists(AIRBYTE_STAGING_DIRECTORY)
         else DEFAULT_LOCAL_DIRECTORY
     )
unit_tests/resource/http/response/file_api/article_attachments.json (1)

1-19: Well-structured test fixture for file attachments.

This JSON test fixture provides a realistic representation of article attachments with comprehensive metadata. It includes all the necessary fields for testing file upload functionality: ID, URL, file name, content type, size, and timestamps. The structure matches what would be expected from a real API response.

Would it be helpful to include multiple attachments in the array to test iteration over multiple files, wdyt?

airbyte_cdk/sources/file_based/file_record_data.py (1)

1-24: Well-structured data model for file record metadata.

The FileRecordData class provides a clean Pydantic model for representing file metadata with appropriate field types. It fits nicely into the broader file transfer protocol changes.

However, there's a typo in the copyright year (2025) - shouldn't it be 2024? wdyt?

-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
airbyte_cdk/sources/file_based/schema_helpers.py (1)

22-33: Enhanced file transfer schema provides better structure

The updated file_transfer_schema now has well-defined properties for file metadata, which aligns with the new file transfer protocol. Having nullable types for optional fields like id, created_at, etc. is a good approach.

Would it make sense to add some required fields to ensure the minimum necessary information is always present? For example, should file_name or source_uri be required? wdyt?

airbyte_cdk/sources/declarative/retrievers/file_uploader.py (2)

38-44: Avoid recreating an InterpolatedString when one is already provided

InterpolatedString.create gracefully handles strings, but if the caller already gives an InterpolatedString we run the conversion again.
Would an isinstance(self.filename_extractor, InterpolatedString) guard make the intent clearer and skip unnecessary work?
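
A sketch of the guard being suggested, using a hypothetical TemplateSketch class in place of InterpolatedString so the example runs on its own; the real class's constructor and create signature are deliberately not reproduced here.

```python
from dataclasses import dataclass
from typing import Union


@dataclass(frozen=True)
class TemplateSketch:
    """Hypothetical stand-in for an interpolated-string type."""

    template: str


def as_template(value: Union[str, TemplateSketch]) -> TemplateSketch:
    """Wrap plain strings; hand back an existing template object untouched."""
    if isinstance(value, TemplateSketch):
        return value
    return TemplateSketch(template=value)
```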


70-78: Directory-creation race & empty‑path issue

file_relative_path may be just a filename (no sub‑folders).
full_path.parent.mkdir(parents=True, exist_ok=True) works, but earlier in _get_file_transfer_paths (reader side) you call makedirs(path.dirname(local_file_path), …) which will receive an empty string in the same situation and crash. Consider adopting the safer pattern used here (create Path(...).parent only when non‑empty) for symmetry, wdyt?
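
The difference between the two patterns can be reproduced with the standard library alone. The helper names below are hypothetical; the behaviour shown is that of os.makedirs and pathlib.Path.mkdir when the path is a bare filename.

```python
import os
from pathlib import Path


def makedirs_for(local_file_path: str) -> bool:
    """Create the parent folder via os.makedirs; report False if it raises."""
    try:
        # For a bare filename, os.path.dirname returns "" and makedirs("") fails.
        os.makedirs(os.path.dirname(local_file_path), exist_ok=True)
        return True
    except FileNotFoundError:
        return False


def mkdir_parent_for(local_file_path: str) -> bool:
    """Create the parent folder via pathlib; a bare filename has parent '.'."""
    Path(local_file_path).parent.mkdir(parents=True, exist_ok=True)
    return True
```

With a bare filename such as "file.csv", the first helper returns False while the second succeeds, because Path("file.csv").parent is the current directory rather than an empty string.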

airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

192-198: file_folder is empty when directory structure is not preserved

file_folder = path.dirname(source_file_relative_path) yields "" when preserve_directory_structure is False, yet you still add it to the returned dict. Downstream code may assume a non‑empty folder string. Should we normalise to None or omit the key when empty?
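
One way to normalise the empty case, sketched with a hypothetical helper name; whether None or an omitted key is the better contract is left to the authors.

```python
from os import path
from typing import Optional


def folder_or_none(source_file_relative_path: str) -> Optional[str]:
    """Return the containing folder, or None when the path is a bare filename."""
    folder = path.dirname(source_file_relative_path)
    # path.dirname("report.csv") is "", which is falsy, so it becomes None.
    return folder or None
```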

unit_tests/sources/declarative/file/test_file_stream.py (1)

153-191: Excellent filename extraction test.

This test nicely validates the custom filename extraction capability by:

  1. Testing with an alternate YAML manifest that includes the filename extractor
  2. Verifying that the custom filename pattern is used instead of the UUID
  3. Checking that the filename follows the expected format

One question: in the assertion on lines 188-190, you're checking that the path does NOT match the UUID pattern. Would it be clearer to also add a positive assertion that confirms it DOES match the expected custom filename pattern? wdyt?
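
A paired negative and positive check could look like the sketch below. The custom pattern is an assumption made up for illustration (a numeric record id folder followed by a file name); the real manifest's naming scheme may differ.

```python
import re

# Matches a canonical lowercase UUID anywhere in the path.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
# Hypothetical custom scheme: <numeric record id>/<file name with extension>.
CUSTOM_PATTERN = re.compile(r"^\d+/[\w.-]+\.\w+$")


def check_custom_filename(relative_path: str) -> None:
    """Assert both that the UUID default is absent and the custom scheme is present."""
    assert not UUID_PATTERN.search(relative_path), "path still looks UUID-based"
    assert CUSTOM_PATTERN.match(relative_path), "path does not follow the custom scheme"
```

The positive assertion catches the case where the path is neither a UUID nor the intended name, which the negative check alone would let through.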

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 24cbc51 and 87c20b6.

⛔ Files ignored due to path filters (2)
  • poetry.lock is excluded by !**/*.lock
  • unit_tests/resource/http/response/file_api/article_attachment_content.png is excluded by !**/*.png
📒 Files selected for processing (38)
  • airbyte_cdk/models/__init__.py (1 hunks)
  • airbyte_cdk/models/airbyte_protocol.py (1 hunks)
  • airbyte_cdk/models/file_transfer_record_message.py (0 hunks)
  • airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1 hunks)
  • airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
  • airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
  • airbyte_cdk/sources/declarative/extractors/record_selector.py (3 hunks)
  • airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
  • airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (10 hunks)
  • airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1 hunks)
  • airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (4 hunks)
  • airbyte_cdk/sources/file_based/file_record_data.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_types/file_transfer.py (1 hunks)
  • airbyte_cdk/sources/file_based/schema_helpers.py (1 hunks)
  • airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (2 hunks)
  • airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (6 hunks)
  • airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1 hunks)
  • airbyte_cdk/sources/streams/concurrent/default_stream.py (3 hunks)
  • airbyte_cdk/sources/types.py (3 hunks)
  • airbyte_cdk/sources/utils/files_directory.py (1 hunks)
  • airbyte_cdk/sources/utils/record_helper.py (3 hunks)
  • airbyte_cdk/test/mock_http/response_builder.py (1 hunks)
  • pyproject.toml (1 hunks)
  • unit_tests/resource/http/response/file_api/article_attachments.json (1 hunks)
  • unit_tests/resource/http/response/file_api/articles.json (1 hunks)
  • unit_tests/sources/declarative/file/file_stream_manifest.yaml (1 hunks)
  • unit_tests/sources/declarative/file/test_file_stream.py (1 hunks)
  • unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (1 hunks)
  • unit_tests/sources/file_based/in_memory_files_source.py (1 hunks)
  • unit_tests/sources/file_based/scenarios/csv_scenarios.py (8 hunks)
  • unit_tests/sources/file_based/scenarios/incremental_scenarios.py (15 hunks)
  • unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5 hunks)
  • unit_tests/sources/file_based/test_file_based_stream_reader.py (4 hunks)
  • unit_tests/sources/streams/concurrent/scenarios/stream_facade_scenarios.py (8 hunks)
  • unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_scenarios.py (7 hunks)
  • unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1 hunks)
  • unit_tests/sources/streams/concurrent/test_default_stream.py (5 hunks)
💤 Files with no reviewable changes (1)
  • airbyte_cdk/models/file_transfer_record_message.py
🧰 Additional context used
🧬 Code Graph Analysis (11)
unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1)
airbyte_cdk/sources/types.py (2)
  • file_reference (43-44)
  • file_reference (47-48)
airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1)
airbyte_cdk/sources/types.py (2)
  • file_reference (43-44)
  • file_reference (47-48)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)
airbyte_cdk/sources/utils/record_helper.py (1)
  • stream_data_to_airbyte_message (20-53)
unit_tests/sources/streams/concurrent/test_default_stream.py (3)
airbyte_cdk/sources/streams/concurrent/default_stream.py (4)
  • as_airbyte_stream (67-89)
  • DefaultStream (20-102)
  • namespace (53-54)
  • name (49-50)
airbyte_cdk/sources/streams/concurrent/abstract_stream.py (2)
  • as_airbyte_stream (80-83)
  • name (54-57)
airbyte_cdk/sources/streams/concurrent/cursor.py (1)
  • FinalStateCursor (85-124)
airbyte_cdk/test/mock_http/response_builder.py (1)
airbyte_cdk/test/utils/data.py (1)
  • get_unit_test_folder (6-14)
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (5)
airbyte_cdk/sources/types.py (3)
  • Record (21-72)
  • data (35-36)
  • associated_slice (39-40)
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1)
  • stream_name (301-302)
airbyte_cdk/sources/streams/concurrent/adapters.py (1)
  • stream_name (328-329)
airbyte_cdk/sources/streams/concurrent/partitions/partition.py (1)
  • stream_name (36-41)
unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_source_builder.py (1)
  • stream_name (120-121)
unit_tests/sources/file_based/in_memory_files_source.py (4)
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1)
  • upload (45-89)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
  • upload (157-176)
airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
  • upload (18-30)
unit_tests/sources/file_based/test_file_based_stream_reader.py (1)
  • upload (85-88)
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (3)
airbyte_cdk/sources/types.py (4)
  • data (35-36)
  • Record (21-72)
  • file_reference (43-44)
  • file_reference (47-48)
airbyte_cdk/sources/streams/concurrent/exceptions.py (1)
  • ExceptionWithDisplayMessage (8-18)
airbyte_cdk/sources/streams/concurrent/adapters.py (1)
  • stream_name (328-329)
airbyte_cdk/sources/utils/record_helper.py (1)
airbyte_cdk/sources/types.py (3)
  • file_reference (43-44)
  • file_reference (47-48)
  • data (35-36)
unit_tests/sources/declarative/file/test_file_stream.py (5)
airbyte_cdk/test/mock_http/response_builder.py (5)
  • Path (31-41)
  • find_binary_response (201-206)
  • find_template (189-198)
  • build (145-146)
  • build (179-181)
airbyte_cdk/sources/declarative/yaml_declarative_source.py (1)
  • YamlDeclarativeSource (17-67)
airbyte_cdk/test/catalog_builder.py (3)
  • CatalogBuilder (48-81)
  • ConfiguredAirbyteStreamBuilder (13-45)
  • with_name (27-29)
airbyte_cdk/test/entrypoint_wrapper.py (4)
  • EntrypointOutput (49-152)
  • discover (186-203)
  • read (206-244)
  • catalog (113-117)
airbyte_cdk/test/mock_http/mocker.py (1)
  • HttpMocker (25-185)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1)
  • FileUploader (29-89)
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17)
  • FileUploader (2069-2091)
  • Config (132-133)
  • Config (146-147)
  • Config (160-161)
  • Config (174-175)
  • Config (192-193)
  • Config (206-207)
  • Config (220-221)
  • Config (234-235)
  • Config (248-249)
  • Config (262-263)
  • Config (276-277)
  • Config (290-291)
  • Config (306-307)
  • Config (320-321)
  • Config (334-335)
  • Config (368-369)
🔇 Additional comments (71)
airbyte_cdk/sources/utils/files_directory.py (1)

6-7: Good environment variable fallback approach!

The constants are well-defined with a reasonable fallback value for the staging directory.

pyproject.toml (1)

33-34: Dependency update for protocol models looks good

Updating airbyte-protocol-models-dataclasses to version ^0.15 aligns with the protocol changes mentioned in the PR objectives, supporting the new file reference capabilities.

unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1)

89-89: Test setup updated correctly for model changes

This change aligns with the update from boolean flag is_file_transfer_message to the more expressive file_reference attribute. The None value correctly represents that this record does not reference a file.

airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)

64-64: Updated to use new file reference approach.

The removal of the explicit is_file_transfer_message=False parameter aligns with the broader refactoring that replaces boolean flags with structured file references. Since permission records are not file transfer messages, omitting this parameter allows the function to use its default file_reference=None implicitly, which is the correct behavior here. Nice cleanup!

airbyte_cdk/models/__init__.py (1)

22-22: Appropriate addition of AirbyteRecordMessageFileReference import.

This addition correctly exposes the AirbyteRecordMessageFileReference class as part of the public API, which is essential for the file reference refactoring throughout the codebase. The import positioning maintains the alphabetical order of the imports.

airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1)

152-152: Updated to use file reference instead of boolean flag.

This change correctly updates the parameter from the deprecated is_file_transfer_message boolean flag to the new file_reference attribute, completing the transition to the structured file reference approach. This adjustment maintains consistency with the refactored Record class and stream_data_to_airbyte_message function.

unit_tests/sources/streams/concurrent/scenarios/stream_facade_scenarios.py (1)

120-120: The addition of is_file_based: False improves test case catalog consistency.

These additions align the test scenarios with the new file transfer protocol support, ensuring that stream metadata explicitly indicates file transfer capabilities in the expected catalog. This properly validates that non-file-based streams will correctly expose this property in their stream metadata.

Also applies to: 166-166, 199-199, 233-233, 245-245, 277-277, 311-311, 345-345

unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_scenarios.py (1)

314-314: The addition of is_file_based: False ensures consistent stream metadata across test scenarios.

Similar to the changes in stream_facade_scenarios.py, these additions properly align the test scenarios with the file transfer protocol changes, making explicit that these test streams are not file-based. Good to see consistent implementation across related test files.

Also applies to: 355-355, 435-435, 448-448, 488-488, 530-530, 572-572

airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)

60-70: Preserving existing Record instances enables file reference metadata preservation.

This is a key improvement that allows pre-existing Record objects to flow through the system without losing their metadata. Previously, all mappings would be wrapped in a new Record, potentially losing additional properties like file references.

The change is particularly important for the new file upload functionality, ensuring file reference information is preserved when Records move through the streaming pipeline. Nice job on making this change type-safe with proper instance checking.
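A minimal sketch of the pass-through behaviour described above, assuming a simplified stand-in for the CDK's Record type. SketchRecord and to_records are hypothetical names used only for illustration; the actual change lives in declarative_partition_generator.py and may differ in detail.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Mapping, Optional, Union


@dataclass
class SketchRecord:
    """Hypothetical stand-in for the CDK's Record type."""

    data: Mapping[str, Any]
    stream_name: str
    file_reference: Optional[Any] = None


def to_records(
    items: Iterable[Union[SketchRecord, Mapping[str, Any]]], stream_name: str
) -> Iterator[SketchRecord]:
    for item in items:
        if isinstance(item, SketchRecord):
            # Already a record: yield it untouched so metadata such as
            # file_reference is not lost by re-wrapping.
            yield item
        else:
            # Plain mapping: wrap it in a fresh record.
            yield SketchRecord(data=item, stream_name=stream_name)
```

Re-wrapping an existing record would build a new object from its data only, which is where a file reference could silently disappear.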

airbyte_cdk/models/airbyte_protocol.py (1)

85-85: Simplified record type aligns with the new file transfer protocol.

This change from a union type to just AirbyteRecordMessage reflects the architectural decision to eliminate the separate AirbyteFileTransferRecordMessage class in favor of embedding file reference information directly in regular record messages.

This approach is cleaner and more consistent, consolidating the message types while still supporting the file transfer functionality through the new file_reference property. The change successfully removes the legacy protocol implementation as intended in the PR objectives.

unit_tests/sources/file_based/scenarios/incremental_scenarios.py (1)

95-95: Looks good - explicit flag for file-based streams in catalog schemas.

The addition of "is_file_based": False in all the test scenario stream schemas aligns with the broader changes to support the file transfer protocol. This makes the expected catalog definitions future-proof and explicit about the stream's capability.

Also applies to: 176-176, 275-275, 336-336, 453-453, 554-554, 681-681, 758-758, 821-821, 900-900, 983-983, 1136-1136, 1267-1267, 1458-1458, 1642-1642, 1759-1759, 1900-1900

unit_tests/sources/file_based/in_memory_files_source.py (1)

143-146: Method rename from get_file to upload looks great.

The method rename aligns with the broader refactoring in the codebase where file retrieval methods are renamed to better reflect their purpose in the new file transfer architecture. The implementation remains unchanged (returning an empty dict), which is appropriate for this test implementation.

airbyte_cdk/sources/declarative/extractors/record_selector.py (3)

18-18: Import addition for FileUploader looks good.

The import enables the integration with the new file uploader functionality.


46-46: New optional FileUploader parameter makes sense.

Adding an optional file_uploader field to the RecordSelector class enables file upload functionality without breaking existing implementations. The default of None keeps backward compatibility.


122-125: Elegant integration of file uploading during record processing.

This change neatly integrates file uploading into the record processing flow. The implementation:

  1. Creates the record first
  2. Conditionally calls the uploader if it exists
  3. Yields the record with potential modifications from the upload process

This approach maintains backward compatibility while enabling the new functionality. Just one question - should we handle any exceptions from the upload process here, or is that handled inside the uploader? wdyt?
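The three steps above can be sketched roughly as follows. This is an illustration under assumptions, not the RecordSelector code: SketchRecord, SketchUploader and select_records are hypothetical names, and the uploader here simply attaches a dictionary where the real component would attach an AirbyteRecordMessageFileReference.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Mapping, Optional


@dataclass
class SketchRecord:
    """Hypothetical record type carrying an optional file reference."""

    data: Mapping[str, Any]
    file_reference: Optional[Dict[str, Any]] = None


class SketchUploader:
    """Hypothetical uploader that mutates the record in place."""

    def upload(self, record: SketchRecord) -> None:
        record.file_reference = {"source_file_relative_path": str(record.data["id"])}


def select_records(
    rows: Iterable[Mapping[str, Any]], file_uploader: Optional[SketchUploader] = None
) -> Iterator[SketchRecord]:
    for row in rows:
        record = SketchRecord(data=row)  # 1. create the record first
        if file_uploader:  # 2. call the uploader only when one is configured
            file_uploader.upload(record)
        yield record  # 3. yield the (possibly enriched) record
```

In this sketch an exception raised by upload would propagate out of the generator, which is the situation the question about error handling is asking about.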

unit_tests/resource/http/response/file_api/articles.json (1)

1-37: Test fixture for article API response looks well-structured.

This JSON fixture provides a realistic API response for testing file upload functionality. It includes:

  • Pagination metadata (count, next_page)
  • A detailed article object with proper identifiers, timestamps, and metadata
  • HTML content with an embedded image URL which can be used to test attachment handling

Good test data is crucial for robust testing of the file upload features.

airbyte_cdk/sources/streams/concurrent/default_stream.py (1)

32-32: LGTM: Clean implementation of file transfer support.

The addition of the supports_file_transfer parameter with a sensible default value (False) and the corresponding exposure via the is_file_based property in the AirbyteStream looks good. This aligns well with the PR's objective of implementing file upload functionality in API sources.

Also applies to: 43-43, 73-73

unit_tests/sources/streams/concurrent/test_default_stream.py (2)

78-78: Updates to existing tests correctly preserve expected behavior.

Adding is_file_based=False to all the expected AirbyteStream instances ensures existing tests continue to pass with the new parameter. Good test maintenance!

Also applies to: 116-116, 154-154, 192-192, 223-223


229-259: Well-structured test for the new file transfer support.

The new test correctly verifies that setting supports_file_transfer=True results in an AirbyteStream with is_file_based=True. This completes the test coverage for the changes to the DefaultStream class.
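For readers unfamiliar with the mapping being tested, here is a minimal sketch, assuming simplified stand-ins for DefaultStream and AirbyteStream. The Sketch* classes are hypothetical and carry only the fields relevant here.

```python
from dataclasses import dataclass


@dataclass
class SketchAirbyteStream:
    """Hypothetical, trimmed-down catalog entry."""

    name: str
    is_file_based: bool = False


class SketchDefaultStream:
    """Hypothetical stream that exposes file transfer support in its catalog entry."""

    def __init__(self, name: str, supports_file_transfer: bool = False) -> None:
        self._name = name
        self._supports_file_transfer = supports_file_transfer

    def as_airbyte_stream(self) -> SketchAirbyteStream:
        # The constructor flag is surfaced as is_file_based on the catalog entry.
        return SketchAirbyteStream(
            name=self._name, is_file_based=self._supports_file_transfer
        )
```

Defaulting the flag to False is what lets the existing tests keep passing once is_file_based=False is added to their expected streams.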

airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (2)

7-7: Simplified import - nice cleanup.

Only importing what's actually used improves code clarity.


261-269: Cleaner file record handling.

The simplified approach to record data extraction and the switch from a boolean flag to a structured file reference improves code clarity and aligns with the broader refactoring in the PR.

unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)

693-693: Added new property for file transfer support

The addition of "is_file_based": False here and in other test scenarios aligns with the file transfer protocol changes mentioned in the PR objectives. This will ensure test scenarios properly validate the file transfer capabilities.

Just confirming - are you planning to add test scenarios for "is_file_based": True cases as well to validate both behavior modes? wdyt?


1144-1144: Consistent file-based flag pattern applied to test scenarios

I see you've systematically added the "is_file_based": False flag to multiple test scenarios, which ensures consistency across the testing framework. This supports the PR's goal of aligning with the updated file transfer record protocol.

Would it make sense to create a helper constant or method for adding this property to streamline future additions? Just a thought!

Also applies to: 1232-1233, 2111-2112, 2196-2197, 2214-2215, 2633-2634

airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3)

28-28: Added FileUploader import

This import establishes support for the file uploader component in the declarative framework. It's a necessary foundation for the file upload functionality.


210-212: Added detection for file transfer support

The implementation checks if a stream supports file transfer by looking for a "file_uploader" key in the stream's configuration. This is a clean approach to detect this capability.

I'm curious about error handling - what happens if the file_uploader configuration is missing required parameters? Would additional validation be helpful here, or is that handled elsewhere? wdyt?
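A rough sketch of the key-based detection, assuming the stream configuration is a plain mapping parsed from the manifest. The helper name supports_file_transfer and the example configurations are hypothetical.

```python
from typing import Any, Mapping


def supports_file_transfer(stream_config: Mapping[str, Any]) -> bool:
    # Only the presence of the key is checked; the contents of the
    # file_uploader block are not validated here.
    return "file_uploader" in stream_config


# Hypothetical stream configurations for illustration.
plain_stream = {"name": "articles", "retriever": {}}
file_stream = {"name": "article_attachments", "retriever": {}, "file_uploader": {}}
```

Because the check looks only at the key, an empty or incomplete file_uploader block would still count as file transfer support in this sketch, so validation of required parameters would have to happen elsewhere.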


330-330: Propagated file transfer support to DefaultStream

You've consistently added the supports_file_transfer parameter to all DefaultStream constructor calls, ensuring the capability is properly transmitted to downstream components. This implementation allows for seamless file transfer support in the declarative framework.

Also applies to: 362-363, 416-417

airbyte_cdk/sources/types.py (3)

9-9: Updated imports for file reference support

Added import for AirbyteRecordMessageFileReference to support the new file reference mechanism, replacing the older boolean flag approach. This is a good foundation for the enhanced file transfer support.


27-28: Enhanced file transfer with structured references

You've replaced the simple boolean is_file_transfer_message with a more structured file_reference parameter. This is a significant improvement that provides more flexibility and information about transferred files.

The change from a boolean flag to a structured reference type allows for more metadata and capabilities. Nice enhancement!

Also applies to: 32-33


42-49: Added getter and setter for file reference

The property pattern implementation for file_reference follows Python best practices, providing a clean interface for accessing and modifying the file reference.

Is there any validation needed in the setter to ensure the reference is properly formatted? Or is that handled at the AirbyteRecordMessageFileReference level? Just wondering!
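One possible shape for a validating setter, purely as a sketch. SketchFileReference and SketchRecord are hypothetical stand-ins; the field names are the ones listed for AirbyteRecordMessageFileReference elsewhere in this review, and the type check is an assumption about what validation could look like, not what the PR does.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Optional


@dataclass
class SketchFileReference:
    """Hypothetical stand-in for AirbyteRecordMessageFileReference."""

    staging_file_url: str
    source_file_relative_path: str
    file_size_bytes: int


class SketchRecord:
    def __init__(
        self, data: Mapping[str, Any], file_reference: Optional[SketchFileReference] = None
    ) -> None:
        self._data = data
        self._file_reference: Optional[SketchFileReference] = None
        self.file_reference = file_reference  # goes through the setter below

    @property
    def file_reference(self) -> Optional[SketchFileReference]:
        return self._file_reference

    @file_reference.setter
    def file_reference(self, value: Optional[SketchFileReference]) -> None:
        # Assumed validation: accept None or a reference object, reject anything else.
        if value is not None and not isinstance(value, SketchFileReference):
            raise TypeError("file_reference must be a SketchFileReference or None")
        self._file_reference = value
```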

airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)

1451-1486: New file uploader component looks great!

The new file_uploader schema is well-structured and provides a clear path for defining file upload functionality in declarative sources. The required properties ensure the minimum configuration needed, while the optional properties allow for flexibility.

I particularly like the detailed descriptions and examples provided for the filename_extractor - this will help users understand how to properly format their file paths.

unit_tests/sources/file_based/test_file_based_stream_reader.py (2)

85-88: Method signature updated to support new upload flow

The renaming from get_file to upload aligns with the PR's goal of updating file-based connectors to use the new file transfer protocol.


449-458: File path handling tests look comprehensive

The updated tests for _get_file_transfer_paths cover the various configuration scenarios for directory structure preservation.

These tests nicely verify that the path components are correctly constructed for different configuration options. I like how you're explicitly checking each component of the returned dictionary.

airbyte_cdk/sources/utils/record_helper.py (3)

12-12: Good addition of the new file reference type import.

The addition of AirbyteRecordMessageFileReference to the imports aligns well with the updated file transfer record protocol you're implementing.


25-25: Nice improvement to the function signature.

Replacing the boolean is_file_transfer_message with a typed file_reference parameter makes the code more type-safe and self-documenting. This change aligns well with the goal of refining file transfer handling.


39-44: Clean implementation of the updated message creation logic.

The updated implementation correctly uses the optional file_reference in the AirbyteRecordMessage constructor. This unifies the message creation logic and eliminates the need for a separate file transfer message type.

airbyte_cdk/sources/file_based/file_types/file_transfer.py (5)

5-5: Good update to importing the tuple type.

The import change reflects the updated return type using tuple for the file data and reference. This makes the API more structured and explicit.


7-11: Nice import updates for the new data structures.

The imports properly include the new AirbyteRecordMessageFileReference and FileRecordData types, along with the centralized get_files_directory utility.


16-16: Good centralization of directory path handling.

Using the new get_files_directory() utility function provides a consistent way to initialize the local directory path across the codebase.


18-23: Well-defined method signature update.

Renaming get_file to upload better reflects the purpose of the method. The return type is now a well-structured tuple of FileRecordData and AirbyteRecordMessageFileReference instead of a dictionary, which provides better type safety.


25-25: Consistent method name update.

The call to the stream reader now uses upload instead of get_file, maintaining consistency with the method's new name.

airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)

2069-2091: Well-structured new FileUploader model.

Great addition of the FileUploader model for declarative file upload support. The model includes all necessary components:

  • A requester for HTTP requests
  • Extract mechanisms for file locations and content
  • Filename extraction capabilities
  • Good documentation in the field descriptions

This addition enables the declarative source framework to support file uploads as described in the PR objectives.


2152-2156: Good integration with the DeclarativeStream model.

Adding the optional file_uploader field to the DeclarativeStream model with the experimental marking is a clean way to introduce this new capability without disrupting existing implementations.


2650-2650: Important forward reference update.

Don't forget to add the FileUploader.update_forward_refs() call to ensure proper Pydantic model initialization.

unit_tests/sources/file_based/stream/test_default_file_based_stream.py (6)

15-21: Good import updates for test coverage.

The imports now include the necessary types like AirbyteRecordMessageFileReference to support testing the new file transfer implementation.


34-34: Proper import for the new file record data model.

Adding the import for FileRecordData ensures the tests can properly work with the new structured data model.


286-296: Excellent test data structure updates.

Replacing the simple dictionary with proper FileRecordData and AirbyteRecordMessageFileReference objects provides much better test coverage for the new file reference implementation.


326-330: Updated test mocking to match the new API.

The mock update correctly targets the renamed upload method and returns the properly structured tuple of file data and reference objects. Good consistency with the implementation changes.


334-344: Well-updated assertions.

The assertions now correctly check for the new file reference structure, ensuring that tests validate the actual behavior of the updated implementation.


476-557: Excellent new test class for schema validation.

The new DefaultFileBasedStreamSchemaTest provides great coverage for the stream schema behavior, testing both when file transfer is enabled and disabled. This ensures the is_file_based flag is properly set on the AirbyteStream objects.

The tests are well-structured with clear mocking of dependencies and assertions that focus on the behavior being tested.

airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)

154-161: Bypass of validation / transformation when use_file_transfer is true

Inside the use_file_transfer branch you yield the raw file_record_data.dict(...) without passing it through record_passes_validation_policy or transform_record.

Is the intent that file_record_data already contains the _ab_source_file_* columns and schema‑validated payload? If not, we might emit inconsistent records compared to the non‑transfer path. Maybe we should still call transform_record (and possibly validation) on the dict prior to emitting, wdyt?


316-319: Guard against missing is_file_based field on AirbyteStream

as_airbyte_stream mutates file_stream.is_file_based, but the protocol model did not previously expose that attribute. Could you double‑check that the generated AirbyteStream class now contains it, otherwise a runtime AttributeError will surface?

unit_tests/sources/declarative/file/file_stream_manifest.yaml (4)

3-4: Nice type definition!

The DeclarativeSource type is properly specified here. This is a good example of how to structure a version 2.0.0 declarative source manifest.


25-30: Good use of SelectiveAuthenticator!

The selective authenticator pattern allows the connector to handle multiple authentication methods elegantly. This provides flexibility for users without complicating the implementation.


117-149: Well-designed parent-child stream relationship.

The SubstreamPartitionRouter is effectively used to create a dependency between the article_attachments stream and the articles stream. The incremental_dependency: true setting ensures that only attachments for articles that have changed will be synced, which is an efficient design pattern.


149-164: Great implementation of file uploader!

The file uploader component is well-structured with:

  1. A properly configured HTTP requester that inherits authentication from the main API
  2. A clear download target extractor pointing to "content_url"
  3. Reuse of the same authentication selection logic

This showcases the new file transfer capabilities nicely. The component will download files from URLs extracted from API responses while maintaining proper authentication.

unit_tests/sources/declarative/file/test_file_stream.py (5)

19-29: Nice config builder implementation.

The ConfigBuilder provides a clean way to generate test configurations with all required fields. This simplifies the test methods and makes the code more maintainable.


32-46: Good source factory function.

The _source helper function centralizes the creation of the declarative source, making tests more readable and reducing duplication. The default yaml_file parameter is a nice touch that simplifies the common case while allowing override for special tests.


81-97: Well-structured connection test.

The test correctly mocks the HTTP request to the articles endpoint and verifies that the check operation succeeds. This ensures that the connection check logic works properly with the new file transfer capabilities.


114-152: Great file reference validation!

This test thoroughly validates all aspects of the file reference:

  1. Presence of the file reference in the record
  2. Validation of the staging file URL format with regex
  3. Verification of the source file relative path
  4. Confirmation that file size is captured

The UUID pattern validation is particularly good for ensuring proper file path generation.
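A sketch of what a UUID-based path check can look like. The path layout (a staging root, then a UUID directory, then the file name) is an assumption for illustration, as are the names UUID_SEGMENT, STAGING_PATH_PATTERN and looks_like_staging_path; the actual test may use a different pattern.

```python
import re
import uuid

# Lowercase hexadecimal UUID in its canonical 8-4-4-4-12 form.
UUID_SEGMENT = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"

# Assumed layout: /<staging root>/<uuid>/<file name>
STAGING_PATH_PATTERN = re.compile(rf"^/.+/{UUID_SEGMENT}/.+$")


def looks_like_staging_path(candidate: str) -> bool:
    return STAGING_PATH_PATTERN.match(candidate) is not None


# Hypothetical example path built with a random UUID.
example_path = f"/tmp/staging/{uuid.uuid4()}/report.pdf"
```

Matching on the pattern rather than a fixed string keeps the assertion stable even though the UUID changes on every run.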


193-201: Good discovery test.

The test confirms that streams with file uploaders are correctly marked as file-based during discovery. This ensures that the platform correctly identifies file-based streams for special processing.

unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (1)

149-164: Well-structured file uploader with filename extraction.

The file uploader configuration is well-designed, reusing the authentication pattern from the main API and clearly defining the extraction paths.

The filename_extractor on line 164 uses a template to generate custom filenames based on record fields, which is a powerful feature. Just checking - is the trailing slash after {{ record.file_name }} intentional? It could create a directory structure with an empty directory name at the end. wdyt?

airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (10)

230-232: Clean import addition for FileUploader model.

Good job adding the import for the FileUploader model which defines the schema for the declarative component.


484-484: Appropriate import for the FileUploader implementation class.

The import for the actual FileUploader class implementation looks good. This follows the established pattern of separating models and their implementations.


682-682: Well-placed mapping entry for the FileUploader factory method.

This mapping links the FileUploaderModel to the create_file_uploader method that will instantiate the runtime component. Nicely added at the appropriate location in the dictionary.


1845-1849: Good implementation of file_uploader detection and creation in create_declarative_stream.

You've properly implemented the logic to detect and instantiate the file_uploader component when present in the stream model. This follows the established pattern for handling optional components.


1861-1861: Correctly propagated file_uploader to the retriever.

The file_uploader is correctly passed to the retriever creation, ensuring it's available for use during stream retrieval.


2807-2807: Properly updated create_record_selector signature with file_uploader parameter.

You've correctly updated the method signature to accept an optional file_uploader parameter, which is necessary for propagating the component through the creation hierarchy.


2845-2846: Good integration of file_uploader in RecordSelector construction.

The file_uploader is correctly passed to the RecordSelector constructor, enabling it to be used during record processing.


2903-2904: Well-updated create_simple_retriever signature with file_uploader parameter.

The method signature is correctly updated to accept and propagate the optional file_uploader parameter.


2924-2925: Correctly passed file_uploader to record_selector in create_simple_retriever.

The file_uploader is properly passed to the record_selector creation, maintaining the component chain.


3591-3613: Well-implemented create_file_uploader method.

The implementation of create_file_uploader follows the established patterns for component creation:

  1. Creates required sub-components (requester and download_target_extractor)
  2. Correctly handles parameters and configuration
  3. Properly handles optional filename_extractor
  4. Returns a fully initialized FileUploader instance

This is a clean implementation that aligns with the factory's existing coding style and patterns.


@aldogonzalez8 aldogonzalez8 left a comment


APPROVED

@github-actions github-actions bot added the enhancement New feature or request label Apr 17, 2025

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🔭 Outside diff range comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

1-212: 💡 Verification agent

🧩 Analysis chain

Format issue flagged by pipeline

The linter found a formatting issue with this file. Running the formatter would help ensure consistent code style.


🏁 Script executed:

#!/bin/bash
# Display formatting issues with the file
rg "^from typing import" airbyte_cdk/sources/file_based/file_based_stream_reader.py -A 2 -B 2

Length of output: 280


Could we verify the formatting with Black to ensure consistency? wdyt?


🏁 Script executed:

#!/bin/bash
# Install Black if needed and check formatting diffs
pip install black
black --check --diff airbyte_cdk/sources/file_based/file_based_stream_reader.py

Length of output: 1361


Could we try running Black via the Python module to bypass the PATH issue? wdyt?


🏁 Script executed:

#!/bin/bash
# Check formatting using Black through python -m
python3 -m black --check --diff airbyte_cdk/sources/file_based/file_based_stream_reader.py

Length of output: 2570


Apply Black formatting to file_based_stream_reader.py

Black reports that airbyte_cdk/sources/file_based/file_based_stream_reader.py would be reformatted. Could you run:

black airbyte_cdk/sources/file_based/file_based_stream_reader.py

to ensure consistent styling? wdyt?

🧰 Tools
🪛 GitHub Actions: Linters

[error] 1-1: Ruff formatting check failed. File would be reformatted.

♻️ Duplicate comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

203-203: ⚠️ Potential issue

Potential bug with empty directory paths

This line could raise a FileNotFoundError when source_file_relative_path is a bare filename and preserve_directory_structure is False, as path.dirname(local_file_path) would return an empty string.

Could we guard against this scenario by checking if the directory name is empty?

- makedirs(path.dirname(local_file_path), exist_ok=True)
+ dir_name = path.dirname(local_file_path)
+ if dir_name:
+     makedirs(dir_name, exist_ok=True)
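The failure mode behind this suggestion can be reproduced with the standard library alone. A minimal sketch follows; ensure_parent_dir is a hypothetical helper name, not a function in the CDK.

```python
import os


def ensure_parent_dir(local_file_path: str) -> None:
    """Create the parent directory of local_file_path, if it has one."""
    dir_name = os.path.dirname(local_file_path)
    # os.path.dirname("file.txt") is "", and os.makedirs("") raises
    # FileNotFoundError even with exist_ok=True, so skip the call in that case.
    if dir_name:
        os.makedirs(dir_name, exist_ok=True)
```

With the guard, a bare filename is treated as "write into the current directory", while nested paths still get their directories created.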
🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)

171-175: Updated docstring for better specificity

The docstring now clearly describes the return value fields in the AirbyteRecordMessageFileReference object. Could we enhance it further by also mentioning the first part of the tuple (FileRecordData)?

 Returns:
-    AirbyteRecordMessageFileReference: A file reference object containing:
-        - staging_file_url (str): The absolute path to the referenced file in the staging area.
-        - file_size_bytes (int): The size of the referenced file in bytes.
-        - source_file_relative_path (str): The relative path to the referenced file in source.
+    Tuple[FileRecordData, AirbyteRecordMessageFileReference]: A tuple containing:
+        - FileRecordData: Object with file metadata (folder, filename, bytes, source_uri, etc.)
+        - AirbyteRecordMessageFileReference: A file reference object containing:
+            - staging_file_url (str): The absolute path to the referenced file in the staging area.
+            - file_size_bytes (int): The size of the referenced file in bytes.
+            - source_file_relative_path (str): The relative path to the referenced file in source.

wdyt?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 87c20b6 and ab5d4e7.

📒 Files selected for processing (2)
  • airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1 hunks)
  • airbyte_cdk/sources/file_based/file_based_stream_reader.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • airbyte_cdk/sources/declarative/retrievers/file_uploader.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (5)
airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py (3)
  • include_identities_stream (65-81)
  • preserve_directory_structure (26-45)
  • use_file_transfer (18-23)
airbyte_cdk/sources/file_based/file_record_data.py (1)
  • FileRecordData (11-23)
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1)
  • upload (45-93)
airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
  • upload (18-30)
airbyte_cdk/sources/file_based/remote_file.py (1)
  • RemoteFile (11-18)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py

[error] 1-1: Ruff formatting check failed. File would be reformatted.

⏰ Context from checks skipped due to timeout of 90000ms (8)
  • GitHub Check: Check: 'source-pokeapi' (skip=false)
  • GitHub Check: Check: 'source-amplitude' (skip=false)
  • GitHub Check: Check: 'source-shopify' (skip=false)
  • GitHub Check: Check: 'source-hardcoded-records' (skip=false)
  • GitHub Check: Pytest (All, Python 3.11, Ubuntu)
  • GitHub Check: Pytest (All, Python 3.10, Ubuntu)
  • GitHub Check: Pytest (Fast)
  • GitHub Check: SDM Docker Image Build
🔇 Additional comments (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (5)

11-11: Added Tuple to typing imports - great addition for type safety

The addition of Tuple to the typing imports will help with the updated return type in the upload method. Good practice for maintaining type safety!


33-36: New class constants improve code readability

Adding these string constants as class attributes is a good practice as it makes the code more maintainable and prevents string literal duplication. These constants clearly define the keys used in the file paths dictionary.


157-159: Method renamed from get_file to upload with improved return type

Renaming to upload better represents the method's purpose, and the return type is now explicitly a tuple of FileRecordData and AirbyteRecordMessageFileReference types, which aligns with the new file reference protocol.


178-191: Refactored method with improved parameter names and validation

The method signature is clearer now, taking specific source_file_relative_path and staging_directory parameters instead of a RemoteFile object. The added validation for staging directory existence is a good safeguard.


205-211: Dictionary return value improves structure and reusability

Using a dictionary with named keys (leveraging the new class constants) makes the return value more structured and self-documenting. This approach is more maintainable than returning a list of values.
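A sketch of the constants-plus-dictionary pattern, under stated assumptions: the constant names and values, the class SketchStreamReader, and the path rules are hypothetical and chosen only to illustrate the idea; the real _get_file_transfer_paths may compute its fields differently. posixpath is used so the example behaves the same on every platform.

```python
import posixpath
from typing import Dict


class SketchStreamReader:
    # Hypothetical key names; the PR defines its own class constants.
    FILE_RELATIVE_PATH = "file_relative_path"
    LOCAL_FILE_PATH = "local_file_path"
    FILE_NAME = "file_name"
    FILE_FOLDER = "file_folder"

    def __init__(self, preserve_directory_structure: bool = True) -> None:
        self._preserve_directory_structure = preserve_directory_structure

    def get_file_transfer_paths(
        self, source_file_relative_path: str, staging_directory: str
    ) -> Dict[str, str]:
        if self._preserve_directory_structure:
            # Keep the source layout, minus any leading slash.
            file_relative_path = source_file_relative_path.lstrip("/")
        else:
            # Flatten: keep only the file name.
            file_relative_path = posixpath.basename(source_file_relative_path)
        return {
            self.FILE_RELATIVE_PATH: file_relative_path,
            self.LOCAL_FILE_PATH: posixpath.join(staging_directory, file_relative_path),
            self.FILE_NAME: posixpath.basename(source_file_relative_path),
            self.FILE_FOLDER: posixpath.dirname(source_file_relative_path),
        }
```

Callers then index the result by constant, for example paths[SketchStreamReader.LOCAL_FILE_PATH], instead of relying on positional order.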

@aldogonzalez8

aldogonzalez8 commented Apr 17, 2025

/autofix

Auto-Fix Job Info

This job attempts to auto-fix any linting or formatting issues. If any fixes are made,
those changes will be automatically committed and pushed back to the PR.

Note: This job can only be run by maintainers. On PRs from forks, this command requires
that the PR author has enabled the Allow edits from maintainers option.

PR auto-fix job started... Check job output.

✅ Changes applied successfully.

@aldogonzalez8

The error for source-google-drive is expected; we have PRs to update all file-based connectors:

Standard error:
resolve: process "python /airbyte/integration_code/main.py spec" did not complete successfully: exit code: 1
Stdout:

Stderr:
Traceback (most recent call last):
  File "/airbyte/integration_code/main.py", line 9, in <module>
    run()
  File "/airbyte/integration_code/source_google_drive/run.py", line 17, in run
    source = SourceGoogleDrive(
             ^^^^^^^^^^^^^^^^^^
  File "/airbyte/integration_code/source_google_drive/source.py", line 20, in __init__
    stream_reader=SourceGoogleDriveStreamReader(),
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: Can't instantiate abstract class SourceGoogleDriveStreamReader with abstract method upload
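The TypeError above is standard Python behaviour for abstract base classes: a subclass that still implements only the old method name cannot be instantiated once upload becomes abstract. A minimal sketch follows, with hypothetical class names and a simplified upload signature (the real method takes different arguments and returns a tuple).

```python
from abc import ABC, abstractmethod


class SketchStreamReader(ABC):
    """Hypothetical base class whose abstract method was renamed to upload."""

    @abstractmethod
    def upload(self, file_name: str) -> str:
        ...


class LegacyReader(SketchStreamReader):
    """Connector not yet migrated: still implements the old method name."""

    def get_file(self, file_name: str) -> str:
        return file_name


class UpdatedReader(SketchStreamReader):
    """Connector migrated to the new method name."""

    def upload(self, file_name: str) -> str:
        return file_name


try:
    LegacyReader()
    legacy_error = ""
except TypeError as exc:
    # The message names the missing abstract method, as in the traceback above.
    legacy_error = str(exc)
```

This is why each file-based connector needs its own follow-up change implementing upload before it can run against this CDK version.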

@aldogonzalez8 aldogonzalez8 merged commit 4f0fcbc into main Apr 22, 2025
30 of 32 checks passed
@aldogonzalez8 aldogonzalez8 deleted the maxi297/poc-file-upload branch April 22, 2025 14:41
Labels
enhancement New feature or request

3 participants