Skip to content

[Storage] Add support for Arrow in List Blobs#46276

Draft
weirongw23-msft wants to merge 12 commits intoAzure:mainfrom
weirongw23-msft:weirongw23/photon
Draft

[Storage] Add support for Arrow in List Blobs#46276
weirongw23-msft wants to merge 12 commits intoAzure:mainfrom
weirongw23-msft:weirongw23/photon

Conversation

@weirongw23-msft
Copy link
Copy Markdown
Member

Most of the tests will not work because the current test accounts are not working and I cannot record them. However, I made mocked tests to verify XML and Arrow parsing, as well as pagination.

@weirongw23-msft weirongw23-msft marked this pull request as ready for review April 13, 2026 07:54
Copilot AI review requested due to automatic review settings April 13, 2026 07:54
@github-actions github-actions bot added the Storage Storage Service (Queues, Blobs, Files) label Apr 13, 2026
("VersionId", "version_id", None),
("IsCurrentVersion", "is_current_version", None),
("Etag", "etag", None),
("Deleted", "deleted", False),
Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

need to check default values

might be an easier to way to do this but there's a lot of traps in .as_py() call before from pyarrow, because if the field is empty, it may crash the program

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds Apache Arrow-backed listing support to azure-storage-blob by introducing new generated container operations that request Arrow IPC streams, plus client-side paging helpers to parse Arrow responses (with XML fallback), along with new sync/async tests.

Changes:

  • Added use_arrow (and end_before) support to ContainerClient.list_blobs / walk_blobs and their async equivalents, selecting new *_apache_arrow operations and Arrow-aware page iterators.
  • Implemented Arrow IPC stream parsing to convert Arrow batches into BlobProperties instances.
  • Added new sync/async tests covering paging behavior and mocked XML/Arrow parsing.

Reviewed changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated 17 comments.

Show a summary per file
File Description
sdk/storage/azure-storage-blob/tests/test_arrow.py New sync tests for Arrow listing + pagination + mocked XML/Arrow parsing.
sdk/storage/azure-storage-blob/tests/test_arrow_async.py New async tests for Arrow listing + pagination + mocked XML/Arrow parsing.
sdk/storage/azure-storage-blob/setup.py Adds pyarrow as a dependency.
sdk/storage/azure-storage-blob/azure/storage/blob/_list_blobs_helper.py Adds Arrow response parsing and Arrow-aware page iterators (sync).
sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.py Adds use_arrow/end_before plumbing for sync list_blobs/walk_blobs.
sdk/storage/azure-storage-blob/azure/storage/blob/aio/_list_blobs_helper.py Adds Arrow-aware async page iterators (async).
sdk/storage/azure-storage-blob/azure/storage/blob/aio/_container_client_async.py Adds use_arrow/end_before plumbing for async list_blobs/walk_blobs.
sdk/storage/azure-storage-blob/azure/storage/blob/_generated/operations/_container_operations.py Adds Arrow request builders and new *_apache_arrow sync operations.
sdk/storage/azure-storage-blob/azure/storage/blob/_generated/aio/operations/_container_operations.py Adds new *_apache_arrow async operations and request imports.
sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.pyi Updates type stub for list_blobs (adds use_arrow, end_before).
sdk/storage/azure-storage-blob/azure/apiview-properties.json Adds new operation entries for APIView configuration.

Comment on lines 81 to 87
install_requires=[
"azure-core>=1.37.0",
"cryptography>=2.1.4",
"typing-extensions>=4.6.0",
"isodate>=0.6.1"
"isodate>=0.6.1",
"pyarrow>=23.0.1",
],
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding pyarrow to install_requires makes it a mandatory dependency of azure-storage-blob, which is a large native package and can break installation on platforms/Python versions where wheels aren’t available. Consider making Arrow support an optional extra (e.g., extras_require["arrow"]) and keeping the runtime ImportError guidance aligned with that extra, rather than requiring pyarrow for all users.

Copilot uses AI. Check for mistakes.
Comment on lines 10 to 15
from datetime import datetime
from types import SimpleNamespace
from typing import (
Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Union,
Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Tuple, Union,
TYPE_CHECKING
)
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SimpleNamespace and Tuple are imported but not used in this module, which will trigger linting failures and adds noise. Please remove the unused imports (or use them if they were intended for the Arrow implementation).

Copilot uses AI. Check for mistakes.
Comment on lines +57 to +60
return_context_and_deserialized,
return_response_headers
)
from ._shared.uploads import IterStreamer
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_context_and_deserialized and IterStreamer are imported here but not referenced anywhere in the file. Please remove these unused imports to avoid linting issues and keep the module surface area minimal.

Suggested change
return_context_and_deserialized,
return_response_headers
)
from ._shared.uploads import IterStreamer
return_response_headers
)

Copilot uses AI. Check for mistakes.
Comment on lines +848 to 852
self._client.container.list_blob_flat_segment_apache_arrow
if use_arrow else self._client.container.list_blob_flat_segment,
include=include,
timeout=timeout,
**kwargs
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end_before is documented as a supported keyword here, but it’s never popped/validated before **kwargs is forwarded into the generated operation. If a caller passes end_before with use_arrow=False (or omits use_arrow), it will be forwarded to list_blob_flat_segment(...), which doesn’t accept that parameter and will raise TypeError. Please pop end_before and only pass it to the Apache Arrow operation (or raise a clear ValueError when end_before is provided without use_arrow=True).

Copilot uses AI. Check for mistakes.
Comment on lines 991 to +995
self._client.container.list_blob_hierarchy_segment,
delimiter=delimiter,
include=include,
timeout=timeout,
**kwargs)
**kwargs
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

end_before is mentioned in the docstring, but it isn’t removed from kwargs when use_arrow is false. That means end_before will be forwarded via **kwargs into list_blob_hierarchy_segment(...), which doesn’t accept it and will raise TypeError. Please pop/validate end_before and only forward it for the Arrow request path (or fail fast with a clear error when end_before is provided without use_arrow=True).

Copilot uses AI. Check for mistakes.
Comment on lines +1932 to +1953
"""[Update] The List Blobs operation returns a list of the blobs under the specified container.
This operation is for Apache Arrow use case so response is returned as raw to be deserialized
by the client.

:param delimiter: When the request includes this parameter, the operation returns a BlobPrefix
element in the response body that acts as a placeholder for all blobs whose names begin with
the same substring up to the appearance of the delimiter character. The delimiter may be a
single character or a string. Required.
:type delimiter: str
:param prefix: Filters the results to return only containers whose name begins with the
specified prefix. Default value is None.
:type prefix: str
:param marker: A string value that identifies the portion of the list of containers to be
returned with the next listing operation. The operation returns the NextMarker value within the
response body if the listing operation did not return all containers remaining to be listed
with the current page. The NextMarker value can be used as the value for the marker parameter
in a subsequent call to request the next page of list items. The marker value is opaque to the
client. Default value is None.
:type marker: str
:param maxresults: Specifies the maximum number of containers to return. If the request does
not specify maxresults, or specifies a value greater than 5000, the server will return up to
5000 items. Note that if the listing operation crosses a partition boundary, then the service
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the async generated list_blob_hierarchy_segment_apache_arrow docstring, there’s a stray "[Update]" prefix and multiple parameter descriptions still refer to “containers” instead of blobs (prefix, marker, maxresults). Please clean up the docstring so it accurately describes the blob listing operation.

Copilot uses AI. Check for mistakes.
Comment on lines +265 to +270
raw_bytes = b"".join(deserialized)
next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
self._arrow_response = (next_marker, blob_items)
return location_mode, raw_bytes
pipeline_response.http_response.load_body()
xml_response = self._deserializer("ListBlobsFlatSegmentResponse", pipeline_response.http_response)
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the async paging helper, pipeline_response.http_response.load_body() is called without await. For async transports, load_body() is typically an async coroutine, so this can leave the body unread (and may emit an un-awaited coroutine warning), breaking the XML fallback path. Consider restructuring the async Arrow operation/paging flow so the XML fallback reads the body via an awaited call (e.g., reading in the generated operation when content-type isn’t Arrow, or avoiding stream=True since the Arrow path already buffers the full response).

Copilot uses AI. Check for mistakes.
Comment on lines +261 to +267
def _arrow_cls(self, pipeline_response, deserialized, response_headers):
content_type = response_headers.get("Content-Type", "")
location_mode = getattr(pipeline_response.http_response, "location_mode", None)
if _ARROW_CONTENT_TYPE in content_type:
raw_bytes = b"".join(deserialized)
next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
self._arrow_response = (next_marker, blob_items)
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

deserialized passed into _arrow_cls comes from AsyncHttpResponse.stream_download(...) (see the generated async operation), which is an async iterator. Calling b"".join(deserialized) will fail at runtime because an async iterator isn’t directly iterable. The Arrow bytes should be collected using async for (likely in an async method like _extract_data_cb) or by switching the generated Arrow operation to stream=False and await response.read() to obtain full bytes before invoking cls.

Copilot uses AI. Check for mistakes.
async def fake_list_blob_flat_segment_apache_arrow(**kwargs):
nonlocal page_index
cls = kwargs.get("cls")
raw = iter([pages[page_index]])
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the async mock for the Arrow response, raw = iter([pages[page_index]]) produces a synchronous iterator. In real async runs, response.stream_download(...) yields an async iterator, so this mock won’t catch issues around consuming the stream (e.g., needing async for). Consider returning an async iterator here to better match the real transport behavior.

Suggested change
raw = iter([pages[page_index]])
async def _raw_pages():
yield pages[page_index]
raw = _raw_pages()

Copilot uses AI. Check for mistakes.
Comment on lines +471 to +472
raw = iter([pages[page_index]])
page_index += 1
Copy link

Copilot AI Apr 13, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This async mock uses raw = iter([pages[page_index]]), but the real async pipeline returns an async iterator from stream_download. Using a synchronous iterator here can mask bugs in the Arrow parsing path. Please consider using an async iterator in the mock so the test exercises the same consumption pattern as production code.

Suggested change
raw = iter([pages[page_index]])
page_index += 1
page = pages[page_index]
page_index += 1
async def _raw():
yield page
raw = _raw()

Copilot uses AI. Check for mistakes.
@jalauzon-msft jalauzon-msft changed the title [Storage] Project Photon [Storage] Add support for Arrow in List Blobs Apr 13, 2026
@weirongw23-msft weirongw23-msft marked this pull request as draft April 13, 2026 18:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Storage Storage Service (Queues, Blobs, Files)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants