perf(cosmos): strip unused fields from partition key range cache to reduce memory #46297
tvaron3 wants to merge 2 commits into Azure:main
5afe162 to fc47cf0
kushagraThapar
left a comment
Thanks @tvaron3
I am curious, are there other places where we build the collection routing map? Shall we fix those as well?
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
Outdated
@@ -39,6 +64,8 @@ class PartitionKeyRange(object):
class Range(object):
    """description of class"""

    __slots__ = ('min', 'max', 'isMinInclusive', 'isMaxInclusive')
__slots__ tells Python to store instance attributes in fixed per-instance slots instead of a per-instance __dict__ dictionary. The only thing to watch out for is that assigning an attribute not listed in __slots__ raises an AttributeError at runtime, but we don't do that.
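To make the trade-off concrete, here is a minimal sketch of the behavior described in this comment. `PlainRange` and `SlottedRange` are hypothetical stand-ins written for illustration, not the SDK's `Range` class.

```python
class PlainRange:
    """Regular class: every instance carries its own __dict__."""

    def __init__(self, range_min, range_max):
        self.min = range_min
        self.max = range_max


class SlottedRange:
    """Same attributes, but declared up front via __slots__."""

    __slots__ = ("min", "max")

    def __init__(self, range_min, range_max):
        self.min = range_min
        self.max = range_max


plain = PlainRange("", "FF")
slotted = SlottedRange("", "FF")

# The slotted instance has no per-instance dictionary at all.
plain_has_dict = hasattr(plain, "__dict__")
slotted_has_dict = hasattr(slotted, "__dict__")

# Adding an undeclared attribute works on the plain class...
plain.extra = 1

# ...but is rejected on the slotted one.
rejected = False
try:
    slotted.extra = 1
except AttributeError:
    rejected = True

print(plain_has_dict, slotted_has_dict, rejected)
```

The memory saving comes from dropping the per-instance dictionary; the exact number of bytes saved depends on the Python version and the number of attributes.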
Can we please add a comment above to explain this? Thanks!
Can we use the __slots__ approach in _PartitionHealthInfo? I do not expect the attributes there to be added or removed dynamically.
Yeah, I will add it to PartitionHealthInfo as well and add a comment explaining it.
sdk/cosmos/azure-cosmos/azure/cosmos/_routing/aio/routing_map_provider.py
6b801a2 to 378f07e
for parentId in parents:
    parentIds.add(parentId)
return (
    PKRange(id=r[routing_range.PartitionKeyRange.Id],
It'd be helpful to understand if the PKRange reference can be used as-is in _GlobalPartitionEndpointManagerForCircuitBreaker and _GlobalPartitionEndpointManagerForPerPartitionAutomaticFailoverAsync.
Superseded by shared cache approach.
378f07e to 342d80a
1. Share CollectionRoutingMap cache across clients per endpoint. Eliminates N-1 redundant copies when N clients target the same account.
2. Add __slots__ to Range class (64 bytes vs ~250 bytes per instance).
3. Skip .upper() when string is already uppercase.

PPCB overhead (150 clients, tracemalloc): Original: 27.4 MB -> Patched: ~0 MB (-100%)
At customer scale (200K partitions x 152 clients): ~2.1 GB -> ~14 MB

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
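The idea of sharing one cache per endpoint can be sketched roughly as follows. This is an illustration only: the registry, `get_shared_routing_cache`, `ClientSketch`, and the example endpoints are all hypothetical names, and the commit's actual wiring may differ.

```python
import threading

# Hypothetical module-level registry: one routing-map cache per account endpoint.
_shared_caches = {}
_shared_caches_lock = threading.Lock()


def get_shared_routing_cache(endpoint):
    """Return the cache dict shared by every client that targets `endpoint`."""
    with _shared_caches_lock:
        # setdefault creates the cache once; later callers get the same object.
        return _shared_caches.setdefault(endpoint, {})


class ClientSketch:
    """Stand-in for a client; only the cache wiring is modeled here."""

    def __init__(self, endpoint):
        self.routing_cache = get_shared_routing_cache(endpoint)


# Three clients on the same account share a single cache object.
clients = [ClientSketch("https://account-a.example") for _ in range(3)]
# A client on a different account gets its own cache.
other = ClientSketch("https://account-b.example")

clients[0].routing_cache["collection-1"] = "routing map placeholder"
print(clients[2].routing_cache.get("collection-1"))
```

With N clients on one endpoint, only one cache object exists instead of N, which is where the "N-1 redundant copies" saving comes from. A real implementation would also need to decide when entries are evicted and how refreshes are coordinated between clients.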
128d459 to 8b03fa2
…storage

Convert raw service response dicts to PKRange namedtuples in both full refresh (_build_routing_map_from_ranges) and incremental update (process_fetched_ranges) paths. PKRange retains only 4 fields (id, minInclusive, maxExclusive, parents) and supports dict-style access for backward compatibility.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
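A namedtuple that also answers dict-style lookups might look like the sketch below. It is one possible shape, not the PR's code: the `__getitem__` override, the `compact` helper, and the sample response values are assumptions made for illustration.

```python
from collections import namedtuple

_PKRangeBase = namedtuple(
    "PKRange", ["id", "minInclusive", "maxExclusive", "parents"]
)


class PKRange(_PKRangeBase):
    """Compact partition key range; a sketch, not the SDK's definition."""

    __slots__ = ()  # keep instances as small as a plain tuple

    def __getitem__(self, key):
        # String keys behave like dict lookups on the four retained fields;
        # ints and slices keep normal tuple semantics.
        if isinstance(key, str):
            if key not in self._fields:
                raise KeyError(key)
            return getattr(self, key)
        return super().__getitem__(key)


def compact(raw):
    """Keep only the four fields the routing map uses (hypothetical helper)."""
    return PKRange(
        id=raw["id"],
        minInclusive=raw["minInclusive"],
        maxExclusive=raw["maxExclusive"],
        parents=raw.get("parents"),
    )


# Sample response dict with made-up values; extra fields are dropped.
raw_range = {
    "id": "0",
    "minInclusive": "",
    "maxExclusive": "FF",
    "parents": [],
    "_rid": "placeholder",
    "_etag": "placeholder",
    "status": "placeholder",
}

pk = compact(raw_range)
print(pk["id"], pk.maxExclusive, pk[0])
```

Existing call sites that index with `range["id"]` keep working, while lookups of a dropped field such as `_etag` raise `KeyError`, just as they would on a dict that never had that key.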
Summary
Three optimizations to reduce CollectionRoutingMap memory footprint when PPCB (Per Partition Circuit Breaker) is enabled. Each CosmosClient maintains its own routing map cache containing all partition key ranges. For accounts with many partitions and many client instances, this dominates memory usage.

Changes
1. Strip unused fields → compact PKRange namedtuple
_routing/aio/routing_map_provider.py + _routing/routing_map_provider.py
The service returns 13 fields per partition key range, but CollectionRoutingMap only uses 4 (id, minInclusive, maxExclusive, parents). After fetching, we now convert to a PKRange namedtuple that supports dict-style [key] access for backward compatibility.
Dropped fields: _rid, _etag, ridPrefix, _self, throughputFraction, status, ownedArchivalPKRangeIds, _ts, lsn

2. Add __slots__ to Range class
_routing/routing_range.py
Range objects store 4 instance attributes (min, max, isMinInclusive, isMaxInclusive). Adding __slots__ eliminates the per-instance __dict__, saving ~100 bytes per Range object. With 100 partitions x 150 clients, that is 15K Range objects.

3. Skip redundant .upper() on hex strings
_routing/routing_range.py
Range.__init__ calls .upper() unconditionally on min/max strings. The Cosmos service returns uppercase hex (e.g. 10F0F0F0...). We now check first and skip the copy when already uppercase.

Memory Profiling Results
Test setup:
tracemalloc (retained memory)
read_item + 1 upsert_item
AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER=True

Current Memory (MB)
PPCB Overhead Reduction
Reproduction Script