feat(file-api): new upload component for declarative cdk and update protocol support for file-based connectors #433
📝 Walkthrough
This update introduces a unified and extensible file transfer mechanism into the Airbyte CDK, primarily by integrating a new
Changes
Sequence Diagram(s)
sequenceDiagram
participant User
participant DeclarativeSource
participant ModelToComponentFactory
participant DeclarativeStream
participant FileUploader
participant Requester
participant FileSystem
User->>DeclarativeSource: Initiate sync (with manifest)
DeclarativeSource->>ModelToComponentFactory: Build stream (with file_uploader)
ModelToComponentFactory->>DeclarativeStream: Instantiate (file_uploader attached)
DeclarativeStream->>FileUploader: For each record, upload(record)
FileUploader->>Requester: Send request (download_target)
Requester-->>FileUploader: Return response (file content)
FileUploader->>FileSystem: Write file to staging directory
FileSystem-->>FileUploader: File path, size
FileUploader->>DeclarativeStream: Attach file reference to record
DeclarativeStream-->>DeclarativeSource: Yield record with file reference
DeclarativeSource-->>User: Record (with file reference metadata)
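The upload steps in the diagram can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `SketchRecord`, `SketchFileReference`, the `fetch` callable, and the field names are hypothetical stand-ins, not the CDK's actual classes or signatures.

```python
import tempfile
import uuid
from dataclasses import dataclass
from pathlib import Path
from typing import Callable, Dict, Optional


@dataclass
class SketchFileReference:
    """Hypothetical stand-in for the file reference attached to a record."""
    staging_file_url: str
    source_file_relative_path: str
    file_size_bytes: int


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a CDK record."""
    data: Dict[str, str]
    file_reference: Optional[SketchFileReference] = None


def upload(record: SketchRecord, fetch: Callable[[str], bytes], staging_dir: Path) -> SketchRecord:
    # 1. Ask the requester (here: any callable) for the file content.
    content = fetch(record.data["content_url"])
    # 2. Write the content under the staging directory, using a UUID as the file name.
    relative_path = Path(str(uuid.uuid4()))
    full_path = staging_dir / relative_path
    full_path.parent.mkdir(parents=True, exist_ok=True)
    full_path.write_bytes(content)
    # 3. Attach the resulting path and size to the record.
    record.file_reference = SketchFileReference(
        staging_file_url=str(full_path),
        source_file_relative_path=str(relative_path),
        file_size_bytes=full_path.stat().st_size,
    )
    return record


staging = Path(tempfile.mkdtemp())
record = upload(
    SketchRecord({"content_url": "https://example.com/attachment.png"}),
    fetch=lambda url: b"fake-bytes",
    staging_dir=staging,
)
print(record.file_reference)
```

Passing the download step in as a callable keeps the sketch free of any HTTP dependency; in the PR that role is played by the requester.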
Co-authored-by: Maxime Carbonneau-Leclerc <[email protected]> Co-authored-by: octavia-squidington-iii <[email protected]>
…otocol changes. (#457) Co-authored-by: Maxime Carbonneau-Leclerc <[email protected]> Co-authored-by: octavia-squidington-iii <[email protected]> Co-authored-by: Aaron ("AJ") Steers <[email protected]> Co-authored-by: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Actionable comments posted: 2
🔭 Outside diff range comments (1)
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (1)
59-67: 🛠️ Refactor suggestion
Instantiate `FileTransfer` per-stream rather than as a shared class member?
`_file_transfer` is created as a class-level attribute, meaning all streams will reuse the very same instance.
If `FileTransfer` ever stores mutable state (e.g., temporary paths, counters, auth tokens), different streams could trample on each other during concurrent syncs or when the connector is run in multithreaded fashion. Would it be safer to create it inside `__init__` and keep it on `self` instead, e.g.:
- _file_transfer = FileTransfer()
+ def __init__(self, **kwargs: Any):
+     ...
+     self._file_transfer = FileTransfer()
+     super().__init__(**kwargs)
This avoids cross-stream state bleed-through, wdyt?
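To make the concern concrete, here is a minimal sketch of how a class-level attribute is shared between instances while an attribute set in `__init__` is not. `CountingTransfer`, `SharedStream`, and `PerInstanceStream` are hypothetical names for illustration; whether the real `FileTransfer` holds mutable state is not shown in this review.

```python
from typing import List


class CountingTransfer:
    """Hypothetical helper that keeps mutable state (a list of uploaded names)."""

    def __init__(self) -> None:
        self.uploaded: List[str] = []

    def upload(self, name: str) -> None:
        self.uploaded.append(name)


class SharedStream:
    # Class-level attribute: evaluated once, shared by every instance.
    _file_transfer = CountingTransfer()


class PerInstanceStream:
    def __init__(self) -> None:
        # Instance attribute: each stream gets its own helper.
        self._file_transfer = CountingTransfer()


a, b = SharedStream(), SharedStream()
a._file_transfer.upload("a.csv")
shared_leak = b._file_transfer.uploaded  # state written via `a` is visible via `b`

c, d = PerInstanceStream(), PerInstanceStream()
c._file_transfer.upload("c.csv")
isolated = d._file_transfer.uploaded  # `d` is unaffected by `c`
```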
🧹 Nitpick comments (9)
airbyte_cdk/test/mock_http/response_builder.py (1)
201-206: Looks good, but would documenting the purpose be helpful?
The `find_binary_response` function implementation looks solid and follows the same pattern as `find_template`. It correctly handles loading binary response files for testing purposes.
Consider adding a function docstring to explain its purpose, expected inputs, and return value, similar to the `find_template` function. This would help other developers understand its usage, wdyt?
airbyte_cdk/sources/utils/files_directory.py (1)
10-15: Function looks good but lacks documentation
The implementation is clean and correctly checks for directory existence before deciding which path to use.
Consider adding a docstring to explain the function's purpose and behavior, especially regarding the fallback logic, wdyt?
 def get_files_directory() -> str:
+    """
+    Returns the directory path to use for file transfers.
+
+    Prefers AIRBYTE_STAGING_DIRECTORY if it exists on the filesystem,
+    otherwise falls back to DEFAULT_LOCAL_DIRECTORY.
+    """
     return (
         AIRBYTE_STAGING_DIRECTORY
         if os.path.exists(AIRBYTE_STAGING_DIRECTORY)
         else DEFAULT_LOCAL_DIRECTORY
     )
unit_tests/resource/http/response/file_api/article_attachments.json (1)
1-19: Well-structured test fixture for file attachments.
This JSON test fixture provides a realistic representation of article attachments with comprehensive metadata. It includes all the necessary fields for testing file upload functionality: ID, URL, file name, content type, size, and timestamps. The structure matches what would be expected from a real API response.
Would it be helpful to include multiple attachments in the array to test iteration over multiple files, wdyt?
airbyte_cdk/sources/file_based/file_record_data.py (1)
1-24: Well-structured data model for file record metadata.
The `FileRecordData` class provides a clean Pydantic model for representing file metadata with appropriate field types. It fits nicely into the broader file transfer protocol changes.
However, there's a typo in the copyright year (2025) - shouldn't it be 2024? wdyt?
-# Copyright (c) 2025 Airbyte, Inc., all rights reserved.
+# Copyright (c) 2024 Airbyte, Inc., all rights reserved.
airbyte_cdk/sources/file_based/schema_helpers.py (1)
22-33: Enhanced file transfer schema provides better structure
The updated `file_transfer_schema` now has well-defined properties for file metadata, which aligns with the new file transfer protocol. Having nullable types for optional fields like `id`, `created_at`, etc. is a good approach.
Would it make sense to add some required fields to ensure the minimum necessary information is always present? For example, should `file_name` or `source_uri` be required? wdyt?
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (2)
38-44: Avoid recreating an `InterpolatedString` when one is already provided
`InterpolatedString.create` gracefully handles strings, but if the caller already gives an `InterpolatedString` we run the conversion again.
Would an `isinstance(self.filename_extractor, InterpolatedString)` guard make the intent clearer and skip unnecessary work?
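The guard being suggested could look roughly like the following. `Template` and `as_template` are hypothetical stand-ins used so the snippet runs on its own; this is not the CDK's `InterpolatedString` API.

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class Template:
    """Hypothetical stand-in for an interpolated-string wrapper."""
    text: str


def as_template(value: Union[str, Template]) -> Template:
    # Already wrapped: return it unchanged instead of converting again.
    if isinstance(value, Template):
        return value
    # Plain string: wrap it once.
    return Template(text=value)


existing = Template("{{ record.id }}/{{ record.file_name }}")
same_object = as_template(existing) is existing
wrapped = as_template("{{ record.file_name }}")
```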
70-78: Directory-creation race & empty-path issue
`file_relative_path` may be just a filename (no sub-folders). `full_path.parent.mkdir(parents=True, exist_ok=True)` works, but earlier in `_get_file_transfer_paths` (reader side) you call `makedirs(path.dirname(local_file_path), …)` which will receive an empty string in the same situation and crash. Consider adopting the safer pattern used here (create `Path(...).parent` only when non-empty) for symmetry, wdyt?
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
192-198: `file_folder` is empty when directory structure is not preserved
`file_folder = path.dirname(source_file_relative_path)` yields `""` when `preserve_directory_structure` is `False`, yet you still add it to the returned dict. Downstream code may assume a non-empty folder string. Should we normalise to `None` or omit the key when empty?
unit_tests/sources/declarative/file/test_file_stream.py (1)
153-191: Excellent filename extraction test.
This test nicely validates the custom filename extraction capability by:
- Testing with an alternate YAML manifest that includes the filename extractor
- Verifying that the custom filename pattern is used instead of the UUID
- Checking that the filename follows the expected format
One question: in the assertion on lines 188-190, you're checking that the path does NOT match the UUID pattern. Would it be clearer to also add a positive assertion that confirms it DOES match the expected custom filename pattern? wdyt?
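A paired negative and positive check might be sketched as follows. The custom layout `<article_id>/<file_name>.png` and the helper name `check_relative_path` are assumptions made for illustration; the actual pattern produced by the manifest's filename extractor is not shown on this page.

```python
import re

# Matches a canonical lowercase UUID anywhere in the path.
UUID_PATTERN = re.compile(
    r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"
)
# Hypothetical custom layout: "<numeric article id>/<file name>.png".
CUSTOM_PATTERN = re.compile(r"\d+/[^/]+\.png")


def check_relative_path(relative_path: str) -> None:
    # Negative assertion: the default UUID file name must not be used.
    assert not UUID_PATTERN.search(relative_path), relative_path
    # Positive assertion: the whole path must follow the custom layout.
    assert CUSTOM_PATTERN.fullmatch(relative_path), relative_path


check_relative_path("12345/diagram.png")
```

The positive check catches cases the negative one misses, such as an empty or otherwise malformed path that merely happens not to look like a UUID.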
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
⛔ Files ignored due to path filters (2)
- poetry.lock is excluded by `!**/*.lock`
- unit_tests/resource/http/response/file_api/article_attachment_content.png is excluded by `!**/*.png`
📒 Files selected for processing (38)
- airbyte_cdk/models/__init__.py (1 hunks)
- airbyte_cdk/models/airbyte_protocol.py (1 hunks)
- airbyte_cdk/models/file_transfer_record_message.py (0 hunks)
- airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1 hunks)
- airbyte_cdk/sources/declarative/concurrent_declarative_source.py (5 hunks)
- airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1 hunks)
- airbyte_cdk/sources/declarative/extractors/record_selector.py (3 hunks)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3 hunks)
- airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (10 hunks)
- airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1 hunks)
- airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1 hunks)
- airbyte_cdk/sources/file_based/file_based_stream_reader.py (4 hunks)
- airbyte_cdk/sources/file_based/file_record_data.py (1 hunks)
- airbyte_cdk/sources/file_based/file_types/file_transfer.py (1 hunks)
- airbyte_cdk/sources/file_based/schema_helpers.py (1 hunks)
- airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (2 hunks)
- airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (6 hunks)
- airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1 hunks)
- airbyte_cdk/sources/streams/concurrent/default_stream.py (3 hunks)
- airbyte_cdk/sources/types.py (3 hunks)
- airbyte_cdk/sources/utils/files_directory.py (1 hunks)
- airbyte_cdk/sources/utils/record_helper.py (3 hunks)
- airbyte_cdk/test/mock_http/response_builder.py (1 hunks)
- pyproject.toml (1 hunks)
- unit_tests/resource/http/response/file_api/article_attachments.json (1 hunks)
- unit_tests/resource/http/response/file_api/articles.json (1 hunks)
- unit_tests/sources/declarative/file/file_stream_manifest.yaml (1 hunks)
- unit_tests/sources/declarative/file/test_file_stream.py (1 hunks)
- unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (1 hunks)
- unit_tests/sources/file_based/in_memory_files_source.py (1 hunks)
- unit_tests/sources/file_based/scenarios/csv_scenarios.py (8 hunks)
- unit_tests/sources/file_based/scenarios/incremental_scenarios.py (15 hunks)
- unit_tests/sources/file_based/stream/test_default_file_based_stream.py (5 hunks)
- unit_tests/sources/file_based/test_file_based_stream_reader.py (4 hunks)
- unit_tests/sources/streams/concurrent/scenarios/stream_facade_scenarios.py (8 hunks)
- unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_scenarios.py (7 hunks)
- unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1 hunks)
- unit_tests/sources/streams/concurrent/test_default_stream.py (5 hunks)
💤 Files with no reviewable changes (1)
- airbyte_cdk/models/file_transfer_record_message.py
🧰 Additional context used
🧬 Code Graph Analysis (11)
unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1)
- airbyte_cdk/sources/types.py (2): file_reference (43-44), file_reference (47-48)
airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1)
- airbyte_cdk/sources/types.py (2): file_reference (43-44), file_reference (47-48)
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)
- airbyte_cdk/sources/utils/record_helper.py (1): stream_data_to_airbyte_message (20-53)
unit_tests/sources/streams/concurrent/test_default_stream.py (3)
- airbyte_cdk/sources/streams/concurrent/default_stream.py (4): as_airbyte_stream (67-89), DefaultStream (20-102), namespace (53-54), name (49-50)
- airbyte_cdk/sources/streams/concurrent/abstract_stream.py (2): as_airbyte_stream (80-83), name (54-57)
- airbyte_cdk/sources/streams/concurrent/cursor.py (1): FinalStateCursor (85-124)
airbyte_cdk/test/mock_http/response_builder.py (1)
- airbyte_cdk/test/utils/data.py (1): get_unit_test_folder (6-14)
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (5)
- airbyte_cdk/sources/types.py (3): Record (21-72), data (35-36), associated_slice (39-40)
- airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (1): stream_name (301-302)
- airbyte_cdk/sources/streams/concurrent/adapters.py (1): stream_name (328-329)
- airbyte_cdk/sources/streams/concurrent/partitions/partition.py (1): stream_name (36-41)
- unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_source_builder.py (1): stream_name (120-121)
unit_tests/sources/file_based/in_memory_files_source.py (4)
- airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1): upload (45-89)
- airbyte_cdk/sources/file_based/file_based_stream_reader.py (1): upload (157-176)
- airbyte_cdk/sources/file_based/file_types/file_transfer.py (1): upload (18-30)
- unit_tests/sources/file_based/test_file_based_stream_reader.py (1): upload (85-88)
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (3)
- airbyte_cdk/sources/types.py (4): data (35-36), Record (21-72), file_reference (43-44), file_reference (47-48)
- airbyte_cdk/sources/streams/concurrent/exceptions.py (1): ExceptionWithDisplayMessage (8-18)
- airbyte_cdk/sources/streams/concurrent/adapters.py (1): stream_name (328-329)
airbyte_cdk/sources/utils/record_helper.py (1)
- airbyte_cdk/sources/types.py (3): file_reference (43-44), file_reference (47-48), data (35-36)
unit_tests/sources/declarative/file/test_file_stream.py (5)
- airbyte_cdk/test/mock_http/response_builder.py (5): Path (31-41), find_binary_response (201-206), find_template (189-198), build (145-146), build (179-181)
- airbyte_cdk/sources/declarative/yaml_declarative_source.py (1): YamlDeclarativeSource (17-67)
- airbyte_cdk/test/catalog_builder.py (3): CatalogBuilder (48-81), ConfiguredAirbyteStreamBuilder (13-45), with_name (27-29)
- airbyte_cdk/test/entrypoint_wrapper.py (4): EntrypointOutput (49-152), discover (186-203), read (206-244), catalog (113-117)
- airbyte_cdk/test/mock_http/mocker.py (1): HttpMocker (25-185)
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (2)
- airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1): FileUploader (29-89)
- airbyte_cdk/sources/declarative/models/declarative_component_schema.py (17): FileUploader (2069-2091), Config (132-133), Config (146-147), Config (160-161), Config (174-175), Config (192-193), Config (206-207), Config (220-221), Config (234-235), Config (248-249), Config (262-263), Config (276-277), Config (290-291), Config (306-307), Config (320-321), Config (334-335), Config (368-369)
🔇 Additional comments (71)
airbyte_cdk/sources/utils/files_directory.py (1)
6-7: Good environment variable fallback approach!
The constants are well-defined with a reasonable fallback value for the staging directory.
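For readers without the diff at hand, the fallback logic can be sketched like this. The environment variable name, the default paths, and both helper names are assumptions for illustration; the real constant values are not visible in this review.

```python
import os
import tempfile
from typing import Mapping


def staging_directory_from_env(env: Mapping[str, str], default: str = "/staging/files") -> str:
    # Assumed variable name; an unset variable falls back to the default path.
    return env.get("AIRBYTE_STAGING_DIRECTORY", default)


def pick_files_directory(staging_directory: str, default_directory: str) -> str:
    # Prefer the staging directory only when it actually exists on disk.
    return staging_directory if os.path.exists(staging_directory) else default_directory


existing = tempfile.mkdtemp()
missing = os.path.join(existing, "does-not-exist")

from_env = staging_directory_from_env({"AIRBYTE_STAGING_DIRECTORY": existing})
from_default = staging_directory_from_env({})
chosen_when_present = pick_files_directory(existing, "/tmp/fallback")
chosen_when_missing = pick_files_directory(missing, "/tmp/fallback")
```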
pyproject.toml (1)
33-34: Dependency update for protocol models looks good
Updating `airbyte-protocol-models-dataclasses` to version `^0.15` aligns with the protocol changes mentioned in the PR objectives, supporting the new file reference capabilities.
unit_tests/sources/streams/concurrent/test_concurrent_read_processor.py (1)
89-89: Test setup updated correctly for model changes
This change aligns with the update from the boolean flag `is_file_transfer_message` to the more expressive `file_reference` attribute. The `None` value correctly represents that this record does not reference a file.
airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py (1)
64-64: Updated to use new file reference approach.
The removal of the explicit `is_file_transfer_message=False` parameter aligns with the broader refactoring that replaces boolean flags with structured file references. Since permission records are not file transfer messages, omitting this parameter allows the function to use its default `file_reference=None` implicitly, which is the correct behavior here. Nice cleanup!
airbyte_cdk/models/__init__.py (1)
22-22: Appropriate addition of AirbyteRecordMessageFileReference import.
This addition correctly exposes the `AirbyteRecordMessageFileReference` class as part of the public API, which is essential for the file reference refactoring throughout the codebase. The import positioning maintains the alphabetical order of the imports.
airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py (1)
152-152: Updated to use file reference instead of boolean flag.
This change correctly updates the parameter from the deprecated `is_file_transfer_message` boolean flag to the new `file_reference` attribute, completing the transition to the structured file reference approach. This adjustment maintains consistency with the refactored `Record` class and `stream_data_to_airbyte_message` function.
unit_tests/sources/streams/concurrent/scenarios/stream_facade_scenarios.py (1)
120-120: The addition of `is_file_based: False` improves test case catalog consistency.
These additions align the test scenarios with the new file transfer protocol support, ensuring that stream metadata explicitly indicates file transfer capabilities in the expected catalog. This properly validates that non-file-based streams will correctly expose this property in their stream metadata.
Also applies to: 166-166, 199-199, 233-233, 245-245, 277-277, 311-311, 345-345
unit_tests/sources/streams/concurrent/scenarios/thread_based_concurrent_stream_scenarios.py (1)
314-314: The addition of `is_file_based: False` ensures consistent stream metadata across test scenarios.
Similar to the changes in `stream_facade_scenarios.py`, these additions properly align the test scenarios with the file transfer protocol changes, making explicit that these test streams are not file-based. Good to see consistent implementation across related test files.
Also applies to: 355-355, 435-435, 448-448, 488-488, 530-530, 572-572
airbyte_cdk/sources/declarative/stream_slicers/declarative_partition_generator.py (1)
60-70: Preserving existing Record instances enables file reference metadata preservation.
This is a key improvement that allows pre-existing `Record` objects to flow through the system without losing their metadata. Previously, all mappings would be wrapped in a new Record, potentially losing additional properties like file references.
The change is particularly important for the new file upload functionality, ensuring file reference information is preserved when Records move through the streaming pipeline. Nice job on making this change type-safe with proper instance checking.
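The pass-through behavior described here can be illustrated with a small sketch. `SketchRecord` and `to_records` are hypothetical stand-ins rather than the CDK's `Record` class or the partition generator's actual code.

```python
from typing import Any, Iterable, Iterator, Mapping, Optional, Union


class SketchRecord(dict):
    """Hypothetical stand-in for a record: a mapping plus extra metadata."""

    def __init__(
        self,
        data: Mapping[str, Any],
        stream_name: str,
        file_reference: Optional[str] = None,
    ) -> None:
        super().__init__(data)
        self.stream_name = stream_name
        self.file_reference = file_reference


def to_records(
    stream_data: Iterable[Union[SketchRecord, Mapping[str, Any]]],
    stream_name: str,
) -> Iterator[SketchRecord]:
    for item in stream_data:
        if isinstance(item, SketchRecord):
            # Already a record: yield it untouched so its metadata survives.
            yield item
        else:
            # Plain mapping: wrap it in a fresh record.
            yield SketchRecord(item, stream_name)


with_file = SketchRecord({"id": 1}, "attachments", file_reference="/staging/1")
out = list(to_records([with_file, {"id": 2}], "attachments"))
```

Re-wrapping `with_file` in a new record would have silently reset `file_reference` to `None`, which is the loss the review comment refers to.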
airbyte_cdk/models/airbyte_protocol.py (1)
85-85: Simplified record type aligns with the new file transfer protocol.
This change from a union type to just `AirbyteRecordMessage` reflects the architectural decision to eliminate the separate `AirbyteFileTransferRecordMessage` class in favor of embedding file reference information directly in regular record messages.
This approach is cleaner and more consistent, consolidating the message types while still supporting the file transfer functionality through the new `file_reference` property. The change successfully removes the legacy protocol implementation as intended in the PR objectives.
unit_tests/sources/file_based/scenarios/incremental_scenarios.py (1)
95-95: Looks good - explicit flag for file-based streams in catalog schemas.
The addition of `"is_file_based": False` in all the test scenario stream schemas aligns with the broader changes to support the file transfer protocol. This makes the expected catalog definitions future-proof and explicit about the stream's capability.
Also applies to: 176-176, 275-275, 336-336, 453-453, 554-554, 681-681, 758-758, 821-821, 900-900, 983-983, 1136-1136, 1267-1267, 1458-1458, 1642-1642, 1759-1759, 1900-1900
unit_tests/sources/file_based/in_memory_files_source.py (1)
143-146: Method rename from `get_file` to `upload` looks great.
The method rename aligns with the broader refactoring in the codebase where file retrieval methods are renamed to better reflect their purpose in the new file transfer architecture. The implementation remains unchanged (returning an empty dict), which is appropriate for this test implementation.
airbyte_cdk/sources/declarative/extractors/record_selector.py (3)
18-18: Import addition for FileUploader looks good.
The import enables the integration with the new file uploader functionality.
46-46: New optional FileUploader parameter makes sense.
Adding an optional `file_uploader` field to the RecordSelector class enables file upload functionality without breaking existing implementations. The default of `None` keeps backward compatibility.
122-125: Elegant integration of file uploading during record processing.
This change neatly integrates file uploading into the record processing flow. The implementation:
- Creates the record first
- Conditionally calls the uploader if it exists
- Yields the record with potential modifications from the upload process
This approach maintains backward compatibility while enabling the new functionality. Just one question - should we handle any exceptions from the upload process here, or is that handled inside the uploader? wdyt?
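The three steps listed above amount to something like the following sketch. All names here (`SketchRecord`, `TaggingUploader`, `select_records`) are hypothetical, and error handling is deliberately left out because the review leaves that question open.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable, Iterator, Optional, Protocol


@dataclass
class SketchRecord:
    """Hypothetical stand-in for a CDK record."""
    data: Dict[str, Any]
    file_reference: Optional[str] = None


class Uploader(Protocol):
    def upload(self, record: SketchRecord) -> None: ...


class TaggingUploader:
    """Hypothetical uploader that only records where a file would be staged."""

    def upload(self, record: SketchRecord) -> None:
        record.file_reference = f"/staging/{record.data['id']}"


def select_records(
    rows: Iterable[Dict[str, Any]],
    file_uploader: Optional[Uploader] = None,
) -> Iterator[SketchRecord]:
    for row in rows:
        record = SketchRecord(data=row)   # 1. create the record first
        if file_uploader:                 # 2. call the uploader only if configured
            file_uploader.upload(record)
        yield record                      # 3. yield the (possibly enriched) record


without_uploader = list(select_records([{"id": 1}]))
with_uploader = list(select_records([{"id": 1}], TaggingUploader()))
```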
unit_tests/resource/http/response/file_api/articles.json (1)
1-37: Test fixture for article API response looks well-structured.
This JSON fixture provides a realistic API response for testing file upload functionality. It includes:
- Pagination metadata (`count`, `next_page`)
- A detailed article object with proper identifiers, timestamps, and metadata
- HTML content with an embedded image URL which can be used to test attachment handling
Good test data is crucial for robust testing of the file upload features.
airbyte_cdk/sources/streams/concurrent/default_stream.py (1)
32-32: LGTM: Clean implementation of file transfer support.
The addition of the `supports_file_transfer` parameter with a sensible default value (False) and the corresponding exposure via the `is_file_based` property in the `AirbyteStream` looks good. This aligns well with the PR's objective of implementing file upload functionality in API sources.
Also applies to: 43-43, 73-73
unit_tests/sources/streams/concurrent/test_default_stream.py (2)
78-78: Updates to existing tests correctly preserve expected behavior.
Adding `is_file_based=False` to all the expected `AirbyteStream` instances ensures existing tests continue to pass with the new parameter. Good test maintenance!
Also applies to: 116-116, 154-154, 192-192, 223-223
229-259: Well-structured test for the new file transfer support.
The new test correctly verifies that setting `supports_file_transfer=True` results in an `AirbyteStream` with `is_file_based=True`. This completes the test coverage for the changes to the `DefaultStream` class.
airbyte_cdk/sources/file_based/stream/concurrent/adapters.py (2)
7-7: Simplified import - nice cleanup.
Only importing what's actually used improves code clarity.
261-269: Cleaner file record handling.
The simplified approach to record data extraction and the switch from a boolean flag to a structured file reference improves code clarity and aligns with the broader refactoring in the PR.
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
unit_tests/sources/file_based/scenarios/csv_scenarios.py (2)
693-693: Added new property for file transfer support
The addition of `"is_file_based": False` here and in other test scenarios aligns with the file transfer protocol changes mentioned in the PR objectives. This will ensure test scenarios properly validate the file transfer capabilities.
Just confirming - are you planning to add test scenarios for `"is_file_based": True` cases as well to validate both behavior modes? wdyt?
1144-1144: Consistent file-based flag pattern applied to test scenarios
I see you've systematically added the `"is_file_based": False` flag to multiple test scenarios, which ensures consistency across the testing framework. This supports the PR's goal of aligning with the updated file transfer record protocol.
Would it make sense to create a helper constant or method for adding this property to streamline future additions? Just a thought!
Also applies to: 1232-1233, 2111-2112, 2196-2197, 2214-2215, 2633-2634
airbyte_cdk/sources/declarative/concurrent_declarative_source.py (3)
28-28: Added FileUploader import
This import establishes support for the file uploader component in the declarative framework. It's a necessary foundation for the file upload functionality.
210-212: Added detection for file transfer support
The implementation checks if a stream supports file transfer by looking for a "file_uploader" key in the stream's configuration. This is a clean approach to detect this capability.
I'm curious about error handling - what happens if the file_uploader configuration is missing required parameters? Would additional validation be helpful here, or is that handled elsewhere? wdyt?
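As a rough illustration of the key-based detection, plus one possible shape for the extra validation being asked about: the required key names below are assumptions based on the components mentioned elsewhere in this review, and `missing_uploader_keys` is a hypothetical helper, not something the PR adds.

```python
from typing import Any, List, Mapping

# Assumed names of required uploader settings; not confirmed by the diff.
REQUIRED_UPLOADER_KEYS = ("requester", "download_target_extractor")


def supports_file_transfer(stream_config: Mapping[str, Any]) -> bool:
    # Detection as described in the comment: presence of the key is enough.
    return "file_uploader" in stream_config


def missing_uploader_keys(stream_config: Mapping[str, Any]) -> List[str]:
    # Optional stricter check: report which assumed-required keys are absent.
    uploader = stream_config.get("file_uploader")
    if uploader is None:
        return []
    return [key for key in REQUIRED_UPLOADER_KEYS if key not in uploader]


plain_stream = {"name": "articles"}
file_stream = {"name": "article_attachments", "file_uploader": {"requester": {}}}
```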
330-330: Propagated file transfer support to DefaultStream
You've consistently added the `supports_file_transfer` parameter to all DefaultStream constructor calls, ensuring the capability is properly transmitted to downstream components. This implementation allows for seamless file transfer support in the declarative framework.
Also applies to: 362-363, 416-417
airbyte_cdk/sources/types.py (3)
9-9: Updated imports for file reference support
Added import for `AirbyteRecordMessageFileReference` to support the new file reference mechanism, replacing the older boolean flag approach. This is a good foundation for the enhanced file transfer support.
27-28: Enhanced file transfer with structured references
You've replaced the simple boolean `is_file_transfer_message` with a more structured `file_reference` parameter. This is a significant improvement that provides more flexibility and information about transferred files.
The change from a boolean flag to a structured reference type allows for more metadata and capabilities. Nice enhancement!
Also applies to: 32-33
42-49: Added getter and setter for file reference
The property pattern implementation for `file_reference` follows Python best practices, providing a clean interface for accessing and modifying the file reference.
Is there any validation needed in the setter to ensure the reference is properly formatted? Or is that handled at the `AirbyteRecordMessageFileReference` level? Just wondering!
airbyte_cdk/sources/declarative/declarative_component_schema.yaml (1)
1451-1486: New file uploader component looks great!
The new `file_uploader` schema is well-structured and provides a clear path for defining file upload functionality in declarative sources. The required properties ensure the minimum configuration needed, while the optional properties allow for flexibility.
I particularly like the detailed descriptions and examples provided for the `filename_extractor` - this will help users understand how to properly format their file paths.
unit_tests/sources/file_based/test_file_based_stream_reader.py (2)
85-88: Method signature updated to support new upload flow
The renaming from `get_file` to `upload` aligns with the PR's goal of updating file-based connectors to use the new file transfer protocol.
449-458: File path handling tests look comprehensive
The updated tests for `_get_file_transfer_paths` cover the various configuration scenarios for directory structure preservation.
These tests nicely verify that the path components are correctly constructed for different configuration options. I like how you're explicitly checking each component of the returned dictionary.
airbyte_cdk/sources/utils/record_helper.py (3)
12-12: Good addition of the new file reference type import.
The addition of `AirbyteRecordMessageFileReference` to the imports aligns well with the updated file transfer record protocol you're implementing.
25-25: Nice improvement to the function signature.
Replacing the boolean `is_file_transfer_message` with a typed `file_reference` parameter makes the code more type-safe and self-documenting. This change aligns well with the goal of refining file transfer handling.
39-44: Clean implementation of the updated message creation logic.
The updated implementation correctly uses the optional file_reference in the AirbyteRecordMessage constructor. This unifies the message creation logic and eliminates the need for a separate file transfer message type.
airbyte_cdk/sources/file_based/file_types/file_transfer.py (5)
5-5: Good update to importing the tuple type.
The import change reflects the updated return type using tuple for the file data and reference. This makes the API more structured and explicit.
7-11: Nice import updates for the new data structures.
The imports properly include the new `AirbyteRecordMessageFileReference` and `FileRecordData` types, along with the centralized `get_files_directory` utility.
16-16: Good centralization of directory path handling.
Using the new `get_files_directory()` utility function provides a consistent way to initialize the local directory path across the codebase.
18-23: Well-defined method signature update.
Renaming `get_file` to `upload` better reflects the purpose of the method. The return type is now a well-structured tuple of `FileRecordData` and `AirbyteRecordMessageFileReference` instead of a dictionary, which provides better type safety.
25-25: Consistent method name update.
The call to the stream reader now uses `upload` instead of `get_file`, maintaining consistency with the method's new name.
airbyte_cdk/sources/declarative/models/declarative_component_schema.py (3)
2069-2091: Well-structured new FileUploader model.
Great addition of the `FileUploader` model for declarative file upload support. The model includes all necessary components:
- A requester for HTTP requests
- Extract mechanisms for file locations and content
- Filename extraction capabilities
- Good documentation in the field descriptions
This addition enables the declarative source framework to support file uploads as described in the PR objectives.
2152-2156: Good integration with the DeclarativeStream model.
Adding the optional `file_uploader` field to the `DeclarativeStream` model with the experimental marking is a clean way to introduce this new capability without disrupting existing implementations.
2650-2650: Important forward reference update.
Don't forget to add the `FileUploader.update_forward_refs()` call to ensure proper Pydantic model initialization.
unit_tests/sources/file_based/stream/test_default_file_based_stream.py (6)
15-21: Good import updates for test coverage.
The imports now include the necessary types like `AirbyteRecordMessageFileReference` to support testing the new file transfer implementation.
34-34: Proper import for the new file record data model.
Adding the import for `FileRecordData` ensures the tests can properly work with the new structured data model.
286-296: Excellent test data structure updates.
Replacing the simple dictionary with proper `FileRecordData` and `AirbyteRecordMessageFileReference` objects provides much better test coverage for the new file reference implementation.
326-330: Updated test mocking to match the new API.
The mock update correctly targets the renamed `upload` method and returns the properly structured tuple of file data and reference objects. Good consistency with the implementation changes.
334-344: Well-updated assertions.
The assertions now correctly check for the new file reference structure, ensuring that tests validate the actual behavior of the updated implementation.
476-557: Excellent new test class for schema validation.
The new `DefaultFileBasedStreamSchemaTest` provides great coverage for the stream schema behavior, testing both when file transfer is enabled and disabled. This ensures the `is_file_based` flag is properly set on the `AirbyteStream` objects.
The tests are well-structured with clear mocking of dependencies and assertions that focus on the behavior being tested.
airbyte_cdk/sources/file_based/stream/default_file_based_stream.py (2)
154-161: Bypass of validation / transformation when `use_file_transfer` is true
Inside the `use_file_transfer` branch you yield the raw `file_record_data.dict(...)` without passing it through `record_passes_validation_policy` or `transform_record`.
Is the intent that `file_record_data` already contains the `_ab_source_file_*` columns and schema-validated payload? If not, we might emit inconsistent records compared to the non-transfer path. Maybe we should still call `transform_record` (and possibly validation) on the dict prior to emitting, wdyt?
316-319: Guard against missing `is_file_based` field on `AirbyteStream`
`as_airbyte_stream` mutates `file_stream.is_file_based`, but the protocol model did not previously expose that attribute. Could you double-check that the generated `AirbyteStream` class now contains it, otherwise a runtime `AttributeError` will surface?
unit_tests/sources/declarative/file/file_stream_manifest.yaml (4)
3-4: Nice type definition!
The `DeclarativeSource` type is properly specified here. This is a good example of how to structure a version 2.0.0 declarative source manifest.
25-30: Good use of SelectiveAuthenticator!
The selective authenticator pattern allows the connector to handle multiple authentication methods elegantly. This provides flexibility for users without complicating the implementation.
117-149: Well-designed parent-child stream relationship.
The `SubstreamPartitionRouter` is effectively used to create a dependency between the article_attachments stream and the articles stream. The `incremental_dependency: true` setting ensures that only attachments for articles that have changed will be synced, which is an efficient design pattern.
149-164: Great implementation of file uploader!
The file uploader component is well-structured with:
- A properly configured HTTP requester that inherits authentication from the main API
- A clear download target extractor pointing to "content_url"
- Reuse of the same authentication selection logic
This showcases the new file transfer capabilities nicely. The component will download files from URLs extracted from API responses while maintaining proper authentication.
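The flow described here (extract a download target from the record, fetch it, write it to a staging directory, attach a reference) can be sketched roughly as follows. This is a stand-alone illustration under stated assumptions, not the CDK's `FileUploader`: `upload_sketch`, the `fetch` callable, and the dict-shaped reference are hypothetical stand-ins. Only the `content_url` and `file_name` record keys and the three reference field names come from this review.

```python
import os
import uuid
from typing import Callable, Dict


def upload_sketch(
    record: Dict[str, str],
    fetch: Callable[[str], bytes],
    staging_directory: str,
) -> Dict[str, object]:
    """Download the file a record points at and stage it locally.

    Hypothetical helper: `fetch` stands in for the authenticated HTTP requester.
    """
    # Plays the role of the download target extractor ("content_url").
    download_target = record["content_url"]
    content = fetch(download_target)

    # A unique sub-directory avoids collisions between files that share a name.
    relative_path = os.path.join(str(uuid.uuid4()), record["file_name"])
    full_path = os.path.join(staging_directory, relative_path)
    os.makedirs(os.path.dirname(full_path), exist_ok=True)
    with open(full_path, "wb") as staged_file:
        staged_file.write(content)

    # Field names mirror the file reference described in this review.
    return {
        "staging_file_url": full_path,
        "source_file_relative_path": relative_path,
        "file_size_bytes": os.path.getsize(full_path),
    }
```

Passing the requester in as a plain callable keeps the sketch testable without network access; the real component wires in an HTTP requester with the stream's authentication instead.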
unit_tests/sources/declarative/file/test_file_stream.py (5)
19-29: Nice config builder implementation.
The `ConfigBuilder` provides a clean way to generate test configurations with all required fields. This simplifies the test methods and makes the code more maintainable.
32-46: Good source factory function.
The `_source` helper function centralizes the creation of the declarative source, making tests more readable and reducing duplication. The default yaml_file parameter is a nice touch that simplifies the common case while allowing override for special tests.
81-97: Well-structured connection test.
The test correctly mocks the HTTP request to the articles endpoint and verifies that the check operation succeeds. This ensures that the connection check logic works properly with the new file transfer capabilities.
114-152: Great file reference validation!
This test thoroughly validates all aspects of the file reference:
- Presence of the file reference in the record
- Validation of the staging file URL format with regex
- Verification of the source file relative path
- Confirmation that file size is captured
The UUID pattern validation is particularly good for ensuring proper file path generation.
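A check of this kind might look like the sketch below. The path layout (`<staging root>/<uuid>/<file name>`) and the helper name `looks_like_staged_path` are assumptions for illustration; the actual pattern used in `test_file_stream.py` is not shown in this review.

```python
import re

# Canonical 8-4-4-4-12 hexadecimal UUID, as produced by str(uuid.uuid4()).
UUID_SEGMENT = r"[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}"


def looks_like_staged_path(staging_file_url: str, file_name: str) -> bool:
    """Return True if the path ends with /<uuid>/<file_name> (assumed layout)."""
    pattern = rf".*/{UUID_SEGMENT}/{re.escape(file_name)}$"
    return re.match(pattern, staging_file_url) is not None
```

Escaping the file name with `re.escape` matters because names such as `report.v2.pdf` contain regex metacharacters.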
193-201: Good discovery test.
The test confirms that streams with file uploaders are correctly marked as file-based during discovery. This ensures that the platform correctly identifies file-based streams for special processing.
unit_tests/sources/declarative/file/test_file_stream_with_filename_extractor.yaml (1)
149-164: Well-structured file uploader with filename extraction.
The file uploader configuration is well-designed, reusing the authentication pattern from the main API and clearly defining the extraction paths.
The `filename_extractor` on line 164 uses a template to generate custom filenames based on record fields, which is a powerful feature. Just checking - is the trailing slash after `{{ record.file_name }}` intentional? It could create a directory structure with an empty directory name at the end. wdyt?
airbyte_cdk/sources/declarative/parsers/model_to_component_factory.py (10)
230-232: Clean import addition for FileUploader model.
Good job adding the import for the FileUploader model which defines the schema for the declarative component.
484-484: Appropriate import for the FileUploader implementation class.
The import for the actual FileUploader class implementation looks good. This follows the established pattern of separating models and their implementations.
682-682: Well-placed mapping entry for the FileUploader factory method.
This mapping links the FileUploaderModel to the create_file_uploader method that will instantiate the runtime component. Nicely added at the appropriate location in the dictionary.
1845-1849: Good implementation of file_uploader detection and creation in create_declarative_stream.
You've properly implemented the logic to detect and instantiate the file_uploader component when present in the stream model. This follows the established pattern for handling optional components.
1861-1861: Correctly propagated file_uploader to the retriever.
The file_uploader is correctly passed to the retriever creation, ensuring it's available for use during stream retrieval.
2807-2807: Properly updated create_record_selector signature with file_uploader parameter.
You've correctly updated the method signature to accept an optional file_uploader parameter, which is necessary for propagating the component through the creation hierarchy.
2845-2846: Good integration of file_uploader in RecordSelector construction.
The file_uploader is correctly passed to the RecordSelector constructor, enabling it to be used during record processing.
2903-2904: Well-updated create_simple_retriever signature with file_uploader parameter.
The method signature is correctly updated to accept and propagate the optional file_uploader parameter.
2924-2925: Correctly passed file_uploader to record_selector in create_simple_retriever.
The file_uploader is properly passed to the record_selector creation, maintaining the component chain.
3591-3613: Well-implemented create_file_uploader method.
The implementation of create_file_uploader follows the established patterns for component creation:
- Creates required sub-components (requester and download_target_extractor)
- Correctly handles parameters and configuration
- Properly handles optional filename_extractor
- Returns a fully initialized FileUploader instance
This is a clean implementation that aligns with the factory's existing coding style and patterns.
APPROVED
Actionable comments posted: 0
🔭 Outside diff range comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
1-212: 💡 Verification agent
🧩 Analysis chain
Format issue flagged by pipeline
The linter found a formatting issue with this file. Running the formatter would help ensure consistent code style.
🏁 Script executed:
#!/bin/bash
# Display formatting issues with the file
rg "^from typing import" airbyte_cdk/sources/file_based/file_based_stream_reader.py -A 2 -B 2
Length of output: 280
Could we verify the formatting with Black to ensure consistency? wdyt?
🏁 Script executed:
#!/bin/bash
# Install Black if needed and check formatting diffs
pip install black
black --check --diff airbyte_cdk/sources/file_based/file_based_stream_reader.py
Length of output: 1361
Could we try running Black via the Python module to bypass the PATH issue? wdyt?
🏁 Script executed:
#!/bin/bash
# Check formatting using Black through python -m
python3 -m black --check --diff airbyte_cdk/sources/file_based/file_based_stream_reader.py
Length of output: 2570
Apply Black formatting to file_based_stream_reader.py
Black reports that `airbyte_cdk/sources/file_based/file_based_stream_reader.py` would be reformatted. Could you run:
black airbyte_cdk/sources/file_based/file_based_stream_reader.py
to ensure consistent styling? wdyt?
🧰 Tools
🪛 GitHub Actions: Linters
[error] 1-1: Ruff formatting check failed. File would be reformatted.
♻️ Duplicate comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
203-203: ⚠️ Potential issue
Potential bug with empty directory paths
This line could raise a `FileNotFoundError` when `source_file_relative_path` is a bare filename and `preserve_directory_structure` is `False`, as `path.dirname(local_file_path)` would return an empty string.
Could we guard against this scenario by checking if the directory name is empty?
- makedirs(path.dirname(local_file_path), exist_ok=True)
+ dir_name = path.dirname(local_file_path)
+ if dir_name:
+     makedirs(dir_name, exist_ok=True)
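The failure mode is easy to reproduce in isolation. The sketch below uses only the standard library; `ensure_parent_dir` is a hypothetical helper name for illustration, not a function in the CDK.

```python
from os import makedirs, path


def ensure_parent_dir(local_file_path: str) -> None:
    """Create the parent directory of a file path, if it has one."""
    dir_name = path.dirname(local_file_path)
    # path.dirname("report.csv") is "", and makedirs("") raises
    # FileNotFoundError even with exist_ok=True, so skip the call.
    if dir_name:
        makedirs(dir_name, exist_ok=True)
```

With the guard in place, a bare filename such as `report.csv` is a no-op, while `a/b/report.csv` still creates `a/b` as before.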
🧹 Nitpick comments (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (1)
171-175: Updated docstring for better specificity
The docstring now clearly describes the return value fields in the `AirbyteRecordMessageFileReference` object. Could we enhance it further by also mentioning the first part of the tuple (`FileRecordData`)?
 Returns:
-    AirbyteRecordMessageFileReference: A file reference object containing:
-        - staging_file_url (str): The absolute path to the referenced file in the staging area.
-        - file_size_bytes (int): The size of the referenced file in bytes.
-        - source_file_relative_path (str): The relative path to the referenced file in source.
+    Tuple[FileRecordData, AirbyteRecordMessageFileReference]: A tuple containing:
+        - FileRecordData: Object with file metadata (folder, filename, bytes, source_uri, etc.)
+        - AirbyteRecordMessageFileReference: A file reference object containing:
+            - staging_file_url (str): The absolute path to the referenced file in the staging area.
+            - file_size_bytes (int): The size of the referenced file in bytes.
+            - source_file_relative_path (str): The relative path to the referenced file in source.
wdyt?
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1 hunks)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- airbyte_cdk/sources/declarative/retrievers/file_uploader.py
🧰 Additional context used
🧬 Code Graph Analysis (1)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (5)
airbyte_cdk/sources/file_based/config/validate_config_transfer_modes.py (3)
include_identities_stream (65-81)
preserve_directory_structure (26-45)
use_file_transfer (18-23)
airbyte_cdk/sources/file_based/file_record_data.py (1)
FileRecordData (11-23)
airbyte_cdk/sources/declarative/retrievers/file_uploader.py (1)
upload (45-93)
airbyte_cdk/sources/file_based/file_types/file_transfer.py (1)
upload (18-30)
airbyte_cdk/sources/file_based/remote_file.py (1)
RemoteFile (11-18)
🪛 GitHub Actions: Linters
airbyte_cdk/sources/file_based/file_based_stream_reader.py
[error] 1-1: Ruff formatting check failed. File would be reformatted.
⏰ Context from checks skipped due to timeout of 90000ms (8)
- GitHub Check: Check: 'source-pokeapi' (skip=false)
- GitHub Check: Check: 'source-amplitude' (skip=false)
- GitHub Check: Check: 'source-shopify' (skip=false)
- GitHub Check: Check: 'source-hardcoded-records' (skip=false)
- GitHub Check: Pytest (All, Python 3.11, Ubuntu)
- GitHub Check: Pytest (All, Python 3.10, Ubuntu)
- GitHub Check: Pytest (Fast)
- GitHub Check: SDM Docker Image Build
🔇 Additional comments (5)
airbyte_cdk/sources/file_based/file_based_stream_reader.py (5)
11-11: Added Tuple to typing imports - great addition for type safety
The addition of `Tuple` to the typing imports will help with the updated return type in the `upload` method. Good practice for maintaining type safety!
33-36: New class constants improve code readability
Adding these string constants as class attributes is a good practice as it makes the code more maintainable and prevents string literal duplication. These constants clearly define the keys used in the file paths dictionary.
157-159: Method renamed from `get_file` to `upload` with improved return type
Renaming to `upload` better represents the method's purpose, and the return type is now explicitly a tuple of `FileRecordData` and `AirbyteRecordMessageFileReference` types, which aligns with the new file reference protocol.
178-191: Refactored method with improved parameter names and validation
The method signature is clearer now, taking specific `source_file_relative_path` and `staging_directory` parameters instead of a `RemoteFile` object. The added validation for staging directory existence is a good safeguard.
205-211: Dictionary return value improves structure and reusability
Using a dictionary with named keys (leveraging the new class constants) makes the return value more structured and self-documenting. This approach is more maintainable than returning a list of values.
/autofix
Error for source-google-drive is expected, we have PRs to update all file-based connectors:
Context
This pr summarizes the changes in:
What
File API support and update the file-based sources to the latest protocol implementations.
Api Sources
Introduce a new file upload component.
File-Based Sources: Remove Legacy Hacked Protocol for file-based connectors and introduce latest protocol changes
This PR updates the file-based and file uploader components in the Airbyte Python CDK to align with the file transfer record protocol. It introduces schema refinements, file path handling improvements, and new test cases.
Resolves https://github.com/airbytehq/airbyte-internal-issues/issues/12364
How
Review guide
File-api changes:
- airbyte_cdk/sources/declarative/retrievers/file_uploader.py: newest cool component to upload documents for file API streams.
Remove Legacy Hacked Protocol for file based connectors and introduce latest protocol changes
File based changes:
- airbyte_cdk/models/airbyte_protocol.py: remove hacked protocol
- airbyte_cdk/models/file_transfer_record_message.py: remove hacked protocol
- airbyte_cdk/sources/concurrent_source/concurrent_read_processor.py: remove hacked protocol
- airbyte_cdk/sources/file_based/file_based_stream_reader.py: change the method verb and return type to `AirbyteRecordMessageFileReference`; also make the `_get_file_transfer_paths` support method return a dict with path fields.
- airbyte_cdk/sources/file_based/file_record_data.py: helper model for record (metadata) of files.
- airbyte_cdk/sources/file_based/file_types/file_transfer.py: update to return record and file reference data.
- airbyte_cdk/sources/file_based/schema_helpers.py: schema of records (metadata) for file-based connectors.
- airbyte_cdk/sources/file_based/stream/concurrent/adapters.py: pass file_reference
- airbyte_cdk/sources/file_based/stream/default_file_based_stream.py: introduce changes to the default file-based stream to handle the new file reference and record data, in addition to the fixed schema.
- airbyte_cdk/sources/file_based/stream/permissions_file_based_stream.py: update call to `stream_data_to_airbyte_message`
- airbyte_cdk/sources/types.py: remove old `is_file_transfer_message` flag
- airbyte_cdk/sources/utils/record_helper.py: remove handling of `is_file_transfer_message` flag
- airbyte_cdk/test/mock_http/response_builder.py: add helper method to get binary data from file for testing
User Impact
Developers using the file-based CDK and file uploader in declarative functionality will benefit from file_reference protocol support.
Can this PR be safely reverted and rolled back?
Summary by CodeRabbit
New Features
Bug Fixes
Tests
Chores