[Storage] Add support for Arrow in List Blobs #46276
weirongw23-msft wants to merge 12 commits into Azure:main from
Conversation
```python
("VersionId", "version_id", None),
("IsCurrentVersion", "is_current_version", None),
("Etag", "etag", None),
("Deleted", "deleted", False),
```
Need to check the default values.

There might be an easier way to do this, but there are a lot of traps in pyarrow's `.as_py()` call: if a field is empty, it may crash the program.
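A minimal sketch of the kind of guard the comment is asking for (the helper name and the stand-in scalar class are hypothetical, not from the PR): check validity before calling `.as_py()` and fall back to the column's default.

```python
# Hypothetical helper (not the PR's code): read a pyarrow-style scalar defensively.
# pyarrow scalars expose .is_valid and .as_py(); calling .as_py() on a null scalar
# returns None, which silently replaces intended defaults such as False.
def scalar_or_default(scalar, default):
    if scalar is None or not getattr(scalar, "is_valid", True):
        return default
    value = scalar.as_py()
    return default if value is None else value

# Stand-in for a pyarrow scalar so the sketch runs without pyarrow installed.
class FakeScalar:
    def __init__(self, value, valid=True):
        self._value = value
        self.is_valid = valid

    def as_py(self):
        return self._value
```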
Pull request overview
This PR adds Apache Arrow-backed listing support to azure-storage-blob by introducing new generated container operations that request Arrow IPC streams, plus client-side paging helpers to parse Arrow responses (with XML fallback), along with new sync/async tests.
Changes:
- Added `use_arrow` (and `end_before`) support to `ContainerClient.list_blobs`/`walk_blobs` and their async equivalents, selecting new `*_apache_arrow` operations and Arrow-aware page iterators.
- Implemented Arrow IPC stream parsing to convert Arrow batches into `BlobProperties` instances.
- Added new sync/async tests covering paging behavior and mocked XML/Arrow parsing.
Reviewed changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated 17 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/storage/azure-storage-blob/tests/test_arrow.py | New sync tests for Arrow listing + pagination + mocked XML/Arrow parsing. |
| sdk/storage/azure-storage-blob/tests/test_arrow_async.py | New async tests for Arrow listing + pagination + mocked XML/Arrow parsing. |
| sdk/storage/azure-storage-blob/setup.py | Adds pyarrow as a dependency. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_list_blobs_helper.py | Adds Arrow response parsing and Arrow-aware page iterators (sync). |
| sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.py | Adds use_arrow/end_before plumbing for sync list_blobs/walk_blobs. |
| sdk/storage/azure-storage-blob/azure/storage/blob/aio/_list_blobs_helper.py | Adds Arrow-aware async page iterators (async). |
| sdk/storage/azure-storage-blob/azure/storage/blob/aio/_container_client_async.py | Adds use_arrow/end_before plumbing for async list_blobs/walk_blobs. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_generated/operations/_container_operations.py | Adds Arrow request builders and new *_apache_arrow sync operations. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_generated/aio/operations/_container_operations.py | Adds new *_apache_arrow async operations and request imports. |
| sdk/storage/azure-storage-blob/azure/storage/blob/_container_client.pyi | Updates type stub for list_blobs (adds use_arrow, end_before). |
| sdk/storage/azure-storage-blob/azure/apiview-properties.json | Adds new operation entries for APIView configuration. |
```diff
 install_requires=[
     "azure-core>=1.37.0",
     "cryptography>=2.1.4",
     "typing-extensions>=4.6.0",
-    "isodate>=0.6.1"
+    "isodate>=0.6.1",
+    "pyarrow>=23.0.1",
 ],
```
Adding pyarrow to install_requires makes it a mandatory dependency of azure-storage-blob, which is a large native package and can break installation on platforms/Python versions where wheels aren’t available. Consider making Arrow support an optional extra (e.g., extras_require["arrow"]) and keeping the runtime ImportError guidance aligned with that extra, rather than requiring pyarrow for all users.
```diff
 from datetime import datetime
 from types import SimpleNamespace
 from typing import (
-    Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Union,
+    Any, AnyStr, cast, Dict, List, IO, Iterable, Iterator, Optional, overload, Tuple, Union,
     TYPE_CHECKING
 )
```
SimpleNamespace and Tuple are imported but not used in this module, which will trigger linting failures and adds noise. Please remove the unused imports (or use them if they were intended for the Arrow implementation).
```python
    return_context_and_deserialized,
    return_response_headers
)
from ._shared.uploads import IterStreamer
```
return_context_and_deserialized and IterStreamer are imported here but not referenced anywhere in the file. Please remove these unused imports to avoid linting issues and keep the module surface area minimal.
Suggested change:
```diff
-    return_context_and_deserialized,
-    return_response_headers
-)
-from ._shared.uploads import IterStreamer
+    return_response_headers
+)
```
```python
    self._client.container.list_blob_flat_segment_apache_arrow
    if use_arrow else self._client.container.list_blob_flat_segment,
    include=include,
    timeout=timeout,
    **kwargs
```
end_before is documented as a supported keyword here, but it’s never popped/validated before **kwargs is forwarded into the generated operation. If a caller passes end_before with use_arrow=False (or omits use_arrow), it will be forwarded to list_blob_flat_segment(...), which doesn’t accept that parameter and will raise TypeError. Please pop end_before and only pass it to the Apache Arrow operation (or raise a clear ValueError when end_before is provided without use_arrow=True).
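A sketch of the requested fix (the function and its handling of kwargs are illustrative, not the PR's actual code): pop `end_before` before `**kwargs` is forwarded, and only pass it on the Arrow path.

```python
# Hypothetical: pop end_before up front so it never leaks into the non-Arrow
# generated operation, and fail fast when it is used without use_arrow=True.
def resolve_list_command(container_ops, use_arrow, kwargs):
    end_before = kwargs.pop("end_before", None)
    if end_before is not None and not use_arrow:
        raise ValueError("end_before is only supported when use_arrow=True")
    if use_arrow:
        if end_before is not None:
            kwargs["end_before"] = end_before
        return container_ops.list_blob_flat_segment_apache_arrow
    return container_ops.list_blob_flat_segment
```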
```diff
     self._client.container.list_blob_hierarchy_segment,
     delimiter=delimiter,
     include=include,
     timeout=timeout,
-    **kwargs)
+    **kwargs
```
end_before is mentioned in the docstring, but it isn’t removed from kwargs when use_arrow is false. That means end_before will be forwarded via **kwargs into list_blob_hierarchy_segment(...), which doesn’t accept it and will raise TypeError. Please pop/validate end_before and only forward it for the Arrow request path (or fail fast with a clear error when end_before is provided without use_arrow=True).
```python
"""[Update] The List Blobs operation returns a list of the blobs under the specified container.
This operation is for Apache Arrow use case so response is returned as raw to be deserialized
by the client.

:param delimiter: When the request includes this parameter, the operation returns a BlobPrefix
 element in the response body that acts as a placeholder for all blobs whose names begin with
 the same substring up to the appearance of the delimiter character. The delimiter may be a
 single character or a string. Required.
:type delimiter: str
:param prefix: Filters the results to return only containers whose name begins with the
 specified prefix. Default value is None.
:type prefix: str
:param marker: A string value that identifies the portion of the list of containers to be
 returned with the next listing operation. The operation returns the NextMarker value within the
 response body if the listing operation did not return all containers remaining to be listed
 with the current page. The NextMarker value can be used as the value for the marker parameter
 in a subsequent call to request the next page of list items. The marker value is opaque to the
 client. Default value is None.
:type marker: str
:param maxresults: Specifies the maximum number of containers to return. If the request does
 not specify maxresults, or specifies a value greater than 5000, the server will return up to
 5000 items. Note that if the listing operation crosses a partition boundary, then the service
```
In the async generated list_blob_hierarchy_segment_apache_arrow docstring, there’s a stray "[Update]" prefix and multiple parameter descriptions still refer to “containers” instead of blobs (prefix, marker, maxresults). Please clean up the docstring so it accurately describes the blob listing operation.
```python
    raw_bytes = b"".join(deserialized)
    next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
    self._arrow_response = (next_marker, blob_items)
    return location_mode, raw_bytes
pipeline_response.http_response.load_body()
xml_response = self._deserializer("ListBlobsFlatSegmentResponse", pipeline_response.http_response)
```
In the async paging helper, pipeline_response.http_response.load_body() is called without await. For async transports, load_body() is typically an async coroutine, so this can leave the body unread (and may emit an un-awaited coroutine warning), breaking the XML fallback path. Consider restructuring the async Arrow operation/paging flow so the XML fallback reads the body via an awaited call (e.g., reading in the generated operation when content-type isn’t Arrow, or avoiding stream=True since the Arrow path already buffers the full response).
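The failure mode can be shown with a tiny fake response (all names here are illustrative, not from the PR): calling a coroutine method without `await` returns a coroutine object and the body is never read.

```python
import asyncio

class FakeAsyncResponse:
    """Stand-in for an async transport response whose load_body is a coroutine."""
    def __init__(self):
        self.body_loaded = False

    async def load_body(self):
        self.body_loaded = True

async def demo():
    resp = FakeAsyncResponse()
    coro = resp.load_body()          # BUG: missing await; the body is never read
    loaded_without_await = resp.body_loaded
    coro.close()                     # suppress the "coroutine was never awaited" warning
    await resp.load_body()           # fix: await the call
    return loaded_without_await, resp.body_loaded
```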
```python
def _arrow_cls(self, pipeline_response, deserialized, response_headers):
    content_type = response_headers.get("Content-Type", "")
    location_mode = getattr(pipeline_response.http_response, "location_mode", None)
    if _ARROW_CONTENT_TYPE in content_type:
        raw_bytes = b"".join(deserialized)
        next_marker, blob_items = _parse_arrow_response(raw_bytes, self.container)
        self._arrow_response = (next_marker, blob_items)
```
deserialized passed into _arrow_cls comes from AsyncHttpResponse.stream_download(...) (see the generated async operation), which is an async iterator. Calling b"".join(deserialized) will fail at runtime because an async iterator isn’t directly iterable. The Arrow bytes should be collected using async for (likely in an async method like _extract_data_cb) or by switching the generated Arrow operation to stream=False and await response.read() to obtain full bytes before invoking cls.
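A sketch of collecting the stream correctly (the helper and the fake stream are illustrative): buffer an async iterator with `async for` before joining the chunks.

```python
import asyncio

async def collect_arrow_bytes(stream):
    # b"".join(stream) raises TypeError on an async iterator; iterate instead.
    chunks = []
    async for chunk in stream:
        chunks.append(chunk)
    return b"".join(chunks)

async def demo_collect():
    # Fake async stream standing in for AsyncHttpResponse.stream_download(...).
    async def fake_stream_download():
        yield b"ARROW1"
        yield b"\x00\x00"
    return await collect_arrow_bytes(fake_stream_download())
```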
```python
async def fake_list_blob_flat_segment_apache_arrow(**kwargs):
    nonlocal page_index
    cls = kwargs.get("cls")
    raw = iter([pages[page_index]])
```
In the async mock for the Arrow response, raw = iter([pages[page_index]]) produces a synchronous iterator. In real async runs, response.stream_download(...) yields an async iterator, so this mock won’t catch issues around consuming the stream (e.g., needing async for). Consider returning an async iterator here to better match the real transport behavior.
Suggested change:
```diff
-    raw = iter([pages[page_index]])
+    async def _raw_pages():
+        yield pages[page_index]
+    raw = _raw_pages()
```
```python
    raw = iter([pages[page_index]])
    page_index += 1
```
This async mock uses raw = iter([pages[page_index]]), but the real async pipeline returns an async iterator from stream_download. Using a synchronous iterator here can mask bugs in the Arrow parsing path. Please consider using an async iterator in the mock so the test exercises the same consumption pattern as production code.
Suggested change:
```diff
-    raw = iter([pages[page_index]])
-    page_index += 1
+    page = pages[page_index]
+    page_index += 1
+    async def _raw():
+        yield page
+    raw = _raw()
```
Most of the tests will not work because the current test accounts are not working and I cannot record them. However, I added mocked tests that verify XML and Arrow parsing, as well as pagination.