[Storage] Add support for Arrow in List Blobs #46276
weirongw23-msft wants to merge 12 commits into Azure:main from
Conversation
```python
("VersionId", "version_id", None),
("IsCurrentVersion", "is_current_version", None),
("Etag", "etag", None),
("Deleted", "deleted", False),
```
Need to check the default values.

There might be an easier way to do this, but there are a lot of traps in pyarrow's `.as_py()` call: if a field is empty, it may crash the program.
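A minimal sketch of the kind of guard the comment is asking for (the helper name and the stand-in scalar class are hypothetical, not from the PR): check validity before calling `.as_py()` and fall back to the column's default.

```python
# Hypothetical helper (not the PR's code): read a pyarrow-style scalar defensively.
# pyarrow scalars expose .is_valid and .as_py(); calling .as_py() on a null scalar
# returns None, which silently replaces intended defaults such as False.
def scalar_or_default(scalar, default):
    if scalar is None or not getattr(scalar, "is_valid", True):
        return default
    value = scalar.as_py()
    return default if value is None else value

# Stand-in for a pyarrow scalar so the sketch runs without pyarrow installed.
class FakeScalar:
    def __init__(self, value, valid=True):
        self._value = value
        self.is_valid = valid

    def as_py(self):
        return self._value
```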
Pull request overview
This PR adds Apache Arrow-backed listing support to azure-storage-blob by introducing new generated container operations that request Arrow IPC streams, plus client-side paging helpers to parse Arrow responses (with XML fallback), along with new sync/async tests.
Changes:
- Added `use_arrow` (and `end_before`) support to `ContainerClient.list_blobs`/`walk_blobs` and their async equivalents, selecting new `*_apache_arrow` operations and Arrow-aware page iterators.
- Implemented Arrow IPC stream parsing to convert Arrow batches into `BlobProperties` instances.
- Added new sync/async tests covering paging behavior and mocked XML/Arrow parsing.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/storage/azure-storage-blob/tests/test_arrow.py | New sync tests for Arrow listing + pagination + mocked XML/Arrow parsing. |
| sdk/storage/azure-storage-blob/tests/test_arrow_async.py | New async tests for Arrow listing + pagination + mocked XML/Arrow parsing. |
| sdk/storage/azure-storage-blob/setup.py | Adds pyarrow as a dependency. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_list_blobs_helper.py | Adds Arrow response parsing and Arrow-aware page iterators (sync). |
| sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.py | Adds use_arrow/end_before plumbing for sync list_blobs/walk_blobs. |
| sdk/storage/azure-storage-blob/azure/storage/blob/aio/_list_blobs_helper.py | Adds Arrow-aware async page iterators (async). |
| sdk/storage/azure-storage-blob/azure/storage/blob/aio/_container_client_async.py | Adds use_arrow/end_before plumbing for async list_blobs/walk_blobs. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_generated/operations/_container_operations.py | Adds Arrow request builders and new *_apache_arrow sync operations. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_generated/aio/operations/_container_operations.py | Adds new *_apache_arrow async operations and request imports. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.pyi | Updates type stub for list_blobs (adds use_arrow, end_before). |
| sdk/storage/azure-storage-blob/azure/apiview-properties.json | Adds new operation entries for APIView configuration. |
```diff
 install_requires=[
     "azure-core>=1.37.0",
     "cryptography>=2.1.4",
     "typing-extensions>=4.6.0",
-    "isodate>=0.6.1"
+    "isodate>=0.6.1",
+    "pyarrow>=23.0.1",
 ],
```
Adding pyarrow to install_requires makes it a mandatory dependency of azure-storage-blob, which is a large native package and can break installation on platforms/Python versions where wheels aren’t available. Consider making Arrow support an optional extra (e.g., extras_require["arrow"]) and keeping the runtime ImportError guidance aligned with that extra, rather than requiring pyarrow for all users.
```diff
 from datetime import datetime
 from types import SimpleNamespace
 from typing import (
-    Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Union,
+    Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Tuple, Union,
     TYPE_CHECKING
 )
```
SimpleNamespace and Tuple are imported but not used in this module, which will trigger linting failures and adds noise. Please remove the unused imports (or use them if they were intended for the Arrow implementation).
```python
    return_context_and_deserialized,
    return_response_headers
)
from ._shared.uploads import IterStreamer
```
return_context_and_deserialized and IterStreamer are imported here but not referenced anywhere in the file. Please remove these unused imports to avoid linting issues and keep the module surface area minimal.
Suggested change:
```diff
-    return_context_and_deserialized,
-    return_response_headers
-)
-from ._shared.uploads import IterStreamer
+    return_response_headers
+)
```
```python
    self._client.container.list_blob_flat_segment_apache_arrow
    if use_arrow else self._client.container.list_blob_flat_segment,
    include=include,
    timeout=timeout,
    **kwargs
```
end_before is documented as a supported keyword here, but it’s never popped/validated before **kwargs is forwarded into the generated operation. If a caller passes end_before with use_arrow=False (or omits use_arrow), it will be forwarded to list_blob_flat_segment(...), which doesn’t accept that parameter and will raise TypeError. Please pop end_before and only pass it to the Apache Arrow operation (or raise a clear ValueError when end_before is provided without use_arrow=True).
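A sketch of the requested fix (the function and its handling of kwargs are illustrative, not the PR's actual code): pop `end_before` before `**kwargs` is forwarded, and only pass it on the Arrow path.

```python
# Hypothetical: pop end_before up front so it never leaks into the non-Arrow
# generated operation, and fail fast when it is used without use_arrow=True.
def resolve_list_command(container_ops, use_arrow, kwargs):
    end_before = kwargs.pop("end_before", None)
    if end_before is not None and not use_arrow:
        raise ValueError("end_before is only supported when use_arrow=True")
    if use_arrow:
        if end_before is not None:
            kwargs["end_before"] = end_before
        return container_ops.list_blob_flat_segment_apache_arrow
    return container_ops.list_blob_flat_segment
```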
```diff
     self._client.container.list_blob_hierarchy_segment,
     delimiter=delimiter,
     include=include,
     timeout=timeout,
-    **kwargs)
+    **kwargs
```
end_before is mentioned in the docstring, but it isn’t removed from kwargs when use_arrow is false. That means end_before will be forwarded via **kwargs into list_blob_hierarchy_segment(...), which doesn’t accept it and will raise TypeError. Please pop/validate end_before and only forward it for the Arrow request path (or fail fast with a clear error when end_before is provided without use_arrow=True).
```python
"""[Update] The List Blobs operation returns a list of the blobs under the specified container.
This operation is for Apache Arrow use case so response is returned as raw to be deserialized
by the client.

:param delimiter: When the request includes this parameter, the operation returns a BlobPrefix
 element in the response body that acts as a placeholder for all blobs whose names begin with
 the same substring up to the appearance of the delimiter character. The delimiter may be a
 single character or a string. Required.
:type delimiter: str
:param prefix: Filters the results to return only containers whose name begins with the
 specified prefix. Default value is None.
:type prefix: str
:param marker: A string value that identifies the portion of the list of containers to be
 returned with the next listing operation. The operation returns the NextMarker value within the
 response body if the listing operation did not return all containers remaining to be listed
 with the current page. The NextMarker value can be used as the value for the marker parameter
 in a subsequent call to request the next page of list items. The marker value is opaque to the
 client. Default value is None.
:type marker: str
:param maxresults: Specifies the maximum number of containers to return. If the request does
 not specify maxresults, or specifies a value greater than 5000, the server will return up to
 5000 items. Note that if the listing operation crosses a partition boundary, then the service
```
In the async generated list_blob_hierarchy_segment_apache_arrow docstring, there’s a stray "[Update]" prefix and multiple parameter descriptions still refer to “containers” instead of blobs (prefix, marker, maxresults). Please clean up the docstring so it accurately describes the blob listing operation.
```python
    raw_bytes = b"".join(deserialized)
    next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
    self._arrow_response = (next_marker, blob_items)
    return location_mode, raw_bytes
pipeline_response.http_response.load_body()
xml_response = self._deserializer("ListBlobsFlatSegmentResponse", pipeline_response.http_response)
```
In the async paging helper, pipeline_response.http_response.load_body() is called without await. For async transports, load_body() is typically an async coroutine, so this can leave the body unread (and may emit an un-awaited coroutine warning), breaking the XML fallback path. Consider restructuring the async Arrow operation/paging flow so the XML fallback reads the body via an awaited call (e.g., reading in the generated operation when content-type isn’t Arrow, or avoiding stream=True since the Arrow path already buffers the full response).
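The failure mode can be shown with a tiny fake response (all names here are illustrative, not from the PR): calling a coroutine method without `await` returns a coroutine object and the body is never read.

```python
import asyncio

class FakeAsyncResponse:
    """Stand-in for an async transport response whose load_body is a coroutine."""
    def __init__(self):
        self.body_loaded = False

    async def load_body(self):
        self.body_loaded = True

async def demo():
    resp = FakeAsyncResponse()
    coro = resp.load_body()          # BUG: missing await; the body is never read
    loaded_without_await = resp.body_loaded
    coro.close()                     # suppress the "coroutine was never awaited" warning
    await resp.load_body()           # fix: await the call
    return loaded_without_await, resp.body_loaded
```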
```python
def _arrow_cls(self, pipeline_response, deserialized, response_headers):
    content_type = response_headers.get("Content-Type", "")
    location_mode = getattr(pipeline_response.http_response, "location_mode", None)
    if _ARROW_CONTENT_TYPE in content_type:
        raw_bytes = b"".join(deserialized)
        next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
        self._arrow_response = (next_marker, blob_items)
```
deserialized passed into _arrow_cls comes from AsyncHttpResponse.stream_download(...) (see the generated async operation), which is an async iterator. Calling b"".join(deserialized) will fail at runtime because an async iterator isn’t directly iterable. The Arrow bytes should be collected using async for (likely in an async method like _extract_data_cb) or by switching the generated Arrow operation to stream=False and await response.read() to obtain full bytes before invoking cls.
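A sketch of collecting the stream correctly (the helper and the fake stream are illustrative): buffer an async iterator with `async for` before joining the chunks.

```python
import asyncio

async def collect_arrow_bytes(stream):
    # b"".join(stream) raises TypeError on an async iterator; iterate instead.
    chunks = []
    async for chunk in stream:
        chunks.append(chunk)
    return b"".join(chunks)

async def demo_collect():
    # Fake async stream standing in for AsyncHttpResponse.stream_download(...).
    async def fake_stream_download():
        yield b"ARROW1"
        yield b"\x00\x00"
    return await collect_arrow_bytes(fake_stream_download())
```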
```python
async def fake_list_blob_flat_segment_apache_arrow(**kwargs):
    nonlocal page_index
    cls = kwargs.get("cls")
    raw = iter([pages[page_index]])
```
In the async mock for the Arrow response, raw = iter([pages[page_index]]) produces a synchronous iterator. In real async runs, response.stream_download(...) yields an async iterator, so this mock won’t catch issues around consuming the stream (e.g., needing async for). Consider returning an async iterator here to better match the real transport behavior.
Suggested change:
```diff
-    raw = iter([pages[page_index]])
+    async def _raw_pages():
+        yield pages[page_index]
+    raw = _raw_pages()
```
```python
    raw = iter([pages[page_index]])
    page_index += 1
```
This async mock uses raw = iter([pages[page_index]]), but the real async pipeline returns an async iterator from stream_download. Using a synchronous iterator here can mask bugs in the Arrow parsing path. Please consider using an async iterator in the mock so the test exercises the same consumption pattern as production code.
Suggested change:
```diff
-    raw = iter([pages[page_index]])
-    page_index += 1
+    page = pages[page_index]
+    page_index += 1
+    async def _raw():
+        yield page
+    raw = _raw()
```
Most of the tests will not work because the current test accounts are not working and I cannot record them. However, I added mocked tests that verify XML and Arrow parsing, as well as pagination.