feat(cosmos/workloads): add performance metrics collection for DR drill testing#46271
tvaron3 wants to merge 26 commits into Azure:main
Conversation
…hroughput - Uncomment concurrent upsert/read/query calls - Remove manual timing counters and log_request_counts - Set THROUGHPUT to 1000000 in workload_configs.py - Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline. New files: - perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots - perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults) - perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth Modified files: - workload_configs.py: All configs now driven by environment variables - workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction), only successful operations record latency to avoid polluting percentiles - All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management Key design decisions: - Sorted-list percentiles (no hdrhistogram native dependency) - psutil for CPU/memory with /proc fallback on Linux - Cached psutil.Process() instance for accurate CPU readings - CosmosClient stored and closed properly to avoid resource leaks - sdk_language='python', sdk_version from azure.cosmos.__version__ - PPCB disabled by default - All reporter errors caught and logged as warnings (never crash workload) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files - WORKLOAD_OPERATIONS env var controls which ops run (read,write,query) - WORKLOAD_USE_PROXY env var enables Envoy proxy routing - WORKLOAD_USE_SYNC env var enables sync client - Validate operation names at import time with clear error - Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query) - Fixed memory usage (~40KB per histogram vs unbounded list growth) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rkload.py Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars. Kept: r_w_q_with_incorrect_client_workload.py (special test case) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
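The timing change above can be illustrated with a minimal sketch (the `timed_ms` helper name is an assumption for illustration, not the PR's actual wrapper): `perf_counter_ns()` returns an integer, so the only floating-point operation is a single division at the end.

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Time a call in milliseconds using the monotonic nanosecond clock."""
    # perf_counter_ns() returns an int, so there is no floating-point
    # multiplication artifact as with time.perf_counter() * 1000.
    start_ns = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    return result, elapsed_ms

result, elapsed = timed_ms(sum, range(1000))  # result is 499500
```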
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…faultAzureCredential Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rfResult Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t-toolkit Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR modernizes the Cosmos DB DR drill workload harness by consolidating multiple per-scenario scripts into a single environment-variable-driven workload runner, and adds an optional performance metrics collection/reporting layer that can upsert results to a separate Cosmos DB account for dashboarding and analysis.
Changes:
- Consolidates prior workload entrypoints into a single workload.py controlled by environment variables.
- Adds performance statistics collection (per-operation latency + errors) and a background reporter that periodically upserts PerfResult documents.
- Refactors workload configuration and operation helpers to be environment-variable-driven and to support timed operation wrappers.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/workloads/workload.py | New unified workload entrypoint (sync/async + proxy + perf hooks). |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_utils.py | Adds timed operation wrappers and Cosmos error status/substatus extraction for perf stats. |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_configs.py | Moves workload/client configuration to environment variables with parsing/validation. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_stats.py | New thread-safe per-operation latency histograms + error aggregation (HdrHistogram-backed). |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_reporter.py | New daemon-thread reporter to drain stats and upsert PerfResult/Error docs to a results Cosmos account. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_config.py | New perf reporter configuration builder from environment variables (includes git SHA + defaults). |
| sdk/cosmos/azure-cosmos/cspell.json | Adds workload/perf-specific dictionary exceptions for spell checking. |
| sdk/cosmos/azure-cosmos/tests/workloads/w_workload.py | Removes legacy write-only workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/w_proxy_workload.py | Removes legacy write-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_workload.py | Removes legacy read/query workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_workload.py | Removes legacy mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_with_incorrect_client_workload.py | Removes legacy “incorrect client usage” script (replaced by WORKLOAD_SKIP_CLOSE). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_sync_workload.py | Removes legacy sync mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_proxy_workload.py | Removes legacy mixed workload + proxy script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_proxy_workload.py | Removes legacy read/query-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/setup_env.sh | Removes local environment bootstrap script (moved out of repo per PR description). |
| sdk/cosmos/azure-cosmos/tests/workloads/run_workloads.sh | Removes legacy multi-script launcher (superseded by env-var-driven workload). |
| sdk/cosmos/azure-cosmos/tests/workloads/dev.md | Removes legacy local runbook doc (moved out of repo per PR description). |
Extract _extra_kwargs(), _timed_call(), and _timed_call_async() to eliminate duplicated excluded_locations branching and timing/error recording across 6 operation functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… flush lock 1. Use public azure.core.pipeline.transport import for AioHttpTransport 2. Fail fast with RuntimeError if WORKLOAD_USE_SYNC + WORKLOAD_USE_PROXY 3. Add threading.Lock around _flush(), skip final flush if thread alive Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…filing Allows spawning multiple CosmosClient instances in a single process via WORKLOAD_NUM_CLIENTS env var (default: 1). Useful for reproducing memory scaling issues with many client instances. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    # min 1 microsecond, max 60 seconds (in microseconds), 3 significant digits
    self._histograms: dict[str, HdrHistogram] = {}
    self._error_counts: dict[str, int] = {}
    self._errors: list[dict] = []
Maybe put a cap on the errors we record? Or a first-in-first-out type of queue, so that we only keep the last few errors?
Addressed in 8c53514. Replaced unbounded list with collections.deque(maxlen=2000) to keep the last 2000 errors. In practice, errors at this level should be minimal since the SDK handles retries internally — what surfaces here are non-retryable failures like 404s.
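The bounded-buffer behavior of `collections.deque(maxlen=...)` can be seen in a small stdlib-only sketch (a maxlen of 3 here for brevity; the PR uses 2000):

```python
from collections import deque

# Bounded error buffer: appending beyond maxlen evicts the oldest entry,
# so memory stays capped while the most recent errors are retained.
errors = deque(maxlen=3)
for code in (500, 503, 404, 429):
    errors.append({"status_code": code})

# The oldest entry (500) has been evicted; the last three remain.
```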
    async def upsert_item_concurrently(container, excluded_locations, num_upserts):
        random_item = get_existing_random_item()

    def _do_query():
I like how you did it in the async version: _do_query(ri=random_item)
Good eye — fixed in 8c53514. Updated sync version to use def _do_query(ri=random_item) matching the async pattern.
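The underlying pitfall is Python's late binding of closure variables; a minimal sketch of the difference the fix addresses:

```python
# Late binding: every closure reads `item` when called, after the loop ended,
# so all of them see the final value.
late = [lambda: item for item in range(3)]

# Default-argument capture: each closure binds the current value at definition
# time, which is the pattern the sync query code was changed to use.
bound = [lambda item=item: item for item in range(3)]

late_results = [f() for f in late]    # [2, 2, 2]
bound_results = [f() for f in bound]  # [0, 1, 2]
```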
    for s in summaries:
        doc = {
            "id": str(uuid.uuid4()),
            "partition_key": str(uuid.uuid4()),
curious if we can be more streamlined in generation of pk - like using something related to workload id or something else which is more stable and deterministic rather than using uuid now. Since we are really scaling here, using UUID will just create too many cross pk queries, etc.
We only use change feed on the results container (→ ADX via Cosmos DB data connection), not cross-partition queries directly. The random partition_key distributes writes evenly across physical partitions. Could follow up with workload_id-based partitioning if we add direct querying on the results container in the future.
            "config_skip_close": skip_close,
        }
        try:
            self._container.upsert_item(doc)
so every flush call upserts one item at a time? won't this be a perf bottleneck?
At the default 5-minute report interval (PERF_REPORT_INTERVAL=300), each flush upserts ~3-4 items (one per operation type + errors). That's <0.5s of work on a background daemon thread every 5 minutes — not a bottleneck. The Python sync SDK also doesn't have a bulk/batch API. If we reduce the interval significantly, we could revisit.
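A minimal stdlib sketch of this kind of interval-based background reporter (class and method names are assumptions mirroring the discussion, not the PR's actual implementation):

```python
import threading
import time

class PerfReporter:
    """Sketch: a daemon thread wakes every `interval` seconds and flushes."""
    def __init__(self, interval, flush_fn):
        self._interval = interval
        self._flush_fn = flush_fn
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns False on
        # timeout (flush and keep going) and True once stop() sets the event.
        while not self._stop_event.wait(self._interval):
            self._flush_fn()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        self._flush_fn()  # final flush once the thread has exited

flushes = []
reporter = PerfReporter(0.01, lambda: flushes.append("flush"))
reporter.start()
time.sleep(0.05)
reporter.stop()
```

With a real 300-second interval the flush work is rare and happens entirely off the workload's hot path, which is why it is not a bottleneck.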
…ture - perf_stats: Replace unbounded _errors list with deque(maxlen=2000). Errors at this level are minimal due to SDK internal retries. - workload_utils: Capture loop variable in sync query_items with def _do_query(ri=random_item) to match async pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    tasks = []
    for i in range(WORKLOAD_NUM_CLIENTS):
        client_id = f"{prefix}-c{i}"
        tasks.append(run_workload_async(client_id, client_logger))
🟡 Recommendation — Resource Waste in Multi-Client Mode
In multi-client mode (WORKLOAD_NUM_CLIENTS > 1), each async task creates its own Stats, PerfReporter, daemon thread, and sync CosmosClient. Three issues:
- Fragmented workload IDs: Each Stats calls get_perf_config(), which generates a fresh uuid.uuid4() for workload_id if PERF_WORKLOAD_ID isn't set. Results from the same process get different IDs, making correlation impossible in Grafana/ADX.
- Fragmented metrics: N separate summaries per flush interval, each with 1/Nth of the data. P99 from 1/5th of operations is statistically meaningless.
- Resource waste: N daemon threads + N sync CosmosClient instances (each with its own HTTP connection pool) just for writing metrics.
Suggestion: Create one shared Stats + PerfReporter at the run_multi_client_async level and pass it into each task. Stats is already thread-safe (uses locks).
Addressed in 314405e. run_multi_client_async now creates a single shared Stats + PerfReporter and passes them to each run_workload_async task. Single workload_id, consolidated metrics, one daemon thread. Individual tasks only create their own reporter if none is passed (single-client mode).
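The shared-stats pattern can be sketched with a plain dict standing in for the thread-safe Stats object (names are illustrative; asyncio tasks on one event loop do not race on the increment):

```python
import asyncio

async def run_workload_async(client_id, stats):
    # Each task records into the one shared stats object instead of
    # creating its own reporter, so all metrics share one workload_id.
    for _ in range(10):
        stats["ops"] = stats.get("ops", 0) + 1
        await asyncio.sleep(0)  # yield to interleave tasks

async def run_multi_client_async(num_clients):
    stats = {}  # single shared Stats stand-in, created once at this level
    tasks = [run_workload_async(f"wl-c{i}", stats) for i in range(num_clients)]
    await asyncio.gather(*tasks)
    return stats

shared = asyncio.run(run_multi_client_async(5))  # shared["ops"] == 50
```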
        _MIN_VALUE_US, _MAX_VALUE_US, 3
    )
    self._error_counts[operation] += 1
    self._errors.append(
🟡 Recommendation — Unbounded Memory Growth
_errors accumulates every error as a dict (with full traceback.format_exc() strings of ~2-5KB each) with no size cap. Between 5-minute drain intervals, a sustained failure scenario could spike memory.
This tool is designed for DR drill testing — the exact scenario where error storms (region failovers) are expected. With 100 concurrent requests failing at ~12 errors/sec × 300 seconds = ~3,600 errors × ~5KB each ≈ 18MB per drain interval.
The aggregate error count is already tracked separately in _error_counts, so no count information is lost.
Suggestion: Add a maximum error buffer size (e.g., 1000). Keep _error_counts for accurate totals, cap the detailed list.
Addressed in earlier commit (8c53514). Replaced with collections.deque(maxlen=2000) — keeps last 2000 errors (~10MB max). Errors at this level should be minimal due to SDK internal retries; what surfaces are non-retryable failures. _error_counts still tracks accurate totals regardless of the cap.
    for s in summaries:
        doc = {
            "id": str(uuid.uuid4()),
            "partition_key": str(uuid.uuid4()),
🟡 Recommendation — Partition Key Anti-Pattern
Both id and partition_key are set to independent random UUIDs for every document. This creates a new logical partition for every single result document.
This is a known Cosmos DB anti-pattern:
- Querying "all results for workload X" or "all results from this host" requires expensive cross-partition fan-out queries.
- Results from the same test run are scattered across thousands of partitions.
- For ADX/Grafana dashboards doing time-series queries, this means high RU cost and latency.
Suggestion: Use workload_id as the partition key value. All results from one workload run land in one partition → efficient point reads and range scans for dashboards.
We only use change feed on the results container (→ ADX via Cosmos DB data connection), not cross-partition queries directly. The random partition_key distributes writes evenly. Could follow up with workload_id-based partitioning if direct querying on the results container is needed in the future.
    self._errors = deque(maxlen=2000)
    return summaries, error_details

    def drain_summaries(self) -> list[dict]:
🟡 Recommendation — Data-Loss API Trap
drain_summaries() and drain_errors() both delegate to drain_all(), which atomically clears all histograms, error counts, and error details. Calling drain_summaries() silently and permanently discards all accumulated errors (and vice versa).
The method names suggest "drain only summaries" but the implementation drains (and discards) everything. A future caller might reasonably write:
    summaries = stats.drain_summaries()
    errors = stats.drain_errors()  # Returns empty! Data was already cleared.

Suggestion: Either (a) remove these methods, forcing callers to use drain_all(), or (b) add a prominent docstring warning about the destructive behavior.
Addressed in 314405e. Removed drain_summaries() and drain_errors() — callers now use drain_all() only, which is the only method used by PerfReporter._flush().
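A minimal sketch of the atomic drain_all() contract described above (a hypothetical simplified Stats; plain counts stand in for the histograms):

```python
import threading
from collections import deque

class Stats:
    """Sketch: snapshot and clear everything under one lock, so a single
    drain_all() call always sees a consistent summaries+errors view."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}
        self._errors = deque(maxlen=2000)

    def record(self, operation):
        with self._lock:
            self._counts[operation] = self._counts.get(operation, 0) + 1

    def record_error(self, operation, detail):
        with self._lock:
            self._errors.append({"operation": operation, "detail": detail})

    def drain_all(self):
        with self._lock:
            summaries = dict(self._counts)
            error_details = list(self._errors)
            self._counts = {}
            self._errors = deque(maxlen=2000)
        return summaries, error_details

s = Stats()
s.record("ReadItem"); s.record("ReadItem"); s.record_error("ReadItem", "404")
summaries, errors = s.drain_all()   # consistent snapshot
empty_s, empty_e = s.drain_all()    # second drain sees nothing
```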
    NUMBER_OF_LOGICAL_PARTITIONS = int(
        os.environ.get("COSMOS_NUMBER_OF_LOGICAL_PARTITIONS", "10000")
    )
    THROUGHPUT = int(os.environ.get("COSMOS_THROUGHPUT", "1000000"))
🟡 Recommendation — Silent 10x Cost Increase
Default THROUGHPUT changed from 100000 (100K RU/s) to 1000000 (1M RU/s) — a 10x increase.
The untouched initial-setup.py imports THROUGHPUT and uses it in ThroughputProperties(THROUGHPUT) when creating containers. Anyone running initial-setup.py without explicitly setting COSMOS_THROUGHPUT will provision 10x the previous default (~$80/hr vs ~$8/hr).
Suggestion: If intentional for CI pipelines, add a comment explaining the increase and document it as a breaking change. If not, revert to 100000.
Addressed in 314405e. Reverted default to 100K RU/s. Added comment that DR drills should set COSMOS_THROUGHPUT=1000000 explicitly.
    USE_MULTIPLE_WRITABLE_LOCATIONS = (
        os.environ.get("COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS", "false").lower() == "true"
    )
    CONCURRENT_REQUESTS = int(os.environ.get("COSMOS_CONCURRENT_REQUESTS", "100"))
🟡 Recommendation — Crash on Invalid Input
Five config values use bare int() without error handling: CONCURRENT_REQUESTS, CONCURRENT_QUERIES, WORKLOAD_NUM_CLIENTS, NUMBER_OF_LOGICAL_PARTITIONS, and THROUGHPUT.
Meanwhile, perf_config.py in the same PR introduces _safe_int() for exactly this purpose.
A typo like COSMOS_CONCURRENT_REQUESTS=1O0 (letter O) crashes at import time with an unhelpful ValueError.
Suggestion: Use the _safe_int pattern from perf_config.py for consistency, or wrap with try/except that identifies which env var is invalid.
Addressed in 314405e. All int env vars in workload_configs.py now use _safe_int() from perf_config.py with fallback defaults.
    sys_cpu = _get_system_cpu_percent()
    sys_total, sys_used = _get_system_memory()

    concurrency = _safe_int_env("COSMOS_CONCURRENT_REQUESTS", 100)
🟡 Recommendation — Configuration Inconsistency
Eight os.environ.get() / _safe_int_env() calls inside _flush() re-read configuration from the environment on every flush interval (every 5 minutes). But workload_configs.py captures these same values once at import time.
Two issues:
- Semantic inconsistency: If env vars were modified after startup (unusual but possible), the workload uses old values while metrics report new ones.
- Ambiguous source of truth: workload_configs.py reads at import; _flush() re-reads every interval.
Suggestion: Capture these values once in __init__() or start() and store as instance attributes. Or accept a config dict from the caller.
The re-reading in _flush() is intentional — it allows config changes at runtime without restarting workloads (useful during DR drills when excluded regions change). Will refactor workload_configs.py to also re-read on access in a follow-up for full consistency.
    client_logger.error(e)
    finally:
        if reporter:
            reporter.stop()
🟡 Recommendation — Inconsistent Error Handling
The sync path does not protect reporter.stop() in the finally block, but the async path (line ~102) wraps it in try/except.
If the workload throws an exception AND reporter.stop() also throws (e.g., network error flushing final metrics), the original exception is masked by the cleanup exception.
This matters because DR drills are the scenario most likely to have both workload errors AND reporter errors simultaneously.
Suggestion: Apply the same try/except pattern as the async path. Consider logging rather than silently swallowing.
Addressed in 314405e. Added try/except around reporter.stop() in the sync path, matching the async pattern.
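The cleanup-masking fix can be sketched as follows (names are illustrative; a list stands in for the logger):

```python
def run_with_reporter(workload, reporter, log):
    """Guard cleanup so a failing stop() never masks the workload error."""
    try:
        workload()
    finally:
        if reporter:
            try:
                reporter.stop()
            except Exception as e:  # log, don't mask the original exception
                log.append(f"reporter.stop failed: {e}")

class FlakyReporter:
    def stop(self):
        raise RuntimeError("flush failed")

def failing_workload():
    raise ValueError("workload error")

logs = []
try:
    run_with_reporter(failing_workload, FlakyReporter(), logs)
except ValueError as e:
    caught = str(e)  # the original workload error survives the bad cleanup
```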
    def create_logger(file_name):
        os.environ["AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER"] = str(CIRCUIT_BREAKER_ENABLED)
🟢 Suggestion — Misplaced Side Effect
A function named create_logger() shouldn't set global environment variables that control SDK behavior. The circuit breaker config isn't applied until a logger is created, which is invisible to callers.
This was flagged in a past PR review (#40843) as a misplaced concern.
Suggestion: Move the env var setting to workload_configs.py (at import time, alongside the other config), or to the __main__ block in workload.py.
Addressed in 314405e. Moved os.environ["AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER"] setting from create_logger() to workload_configs.py alongside the other config.
    client_logger.info("Exception in application layer")
    client_logger.error(e)
    finally:
        if not WORKLOAD_SKIP_CLOSE:
🟢 Suggestion — Duplicated Client Constructor
Both branches construct the client identically — only the __aenter__() call differs. If constructor args are modified, both branches must be updated.
Suggestion:
    client = AsyncClient(COSMOS_URI, COSMOS_CREDENTIAL, **client_kwargs)
    if not WORKLOAD_SKIP_CLOSE:
        await client.__aenter__()
Addressed in 314405e. Consolidated to a single constructor call with conditional __aenter__().
    "workload_id": self._config["workload_id"],
    "commit_sha": self._config["commit_sha"],
    "hostname": self._hostname,
    "TIMESTAMP": now,
🟢 Suggestion — Inconsistent Field Naming
Every field in the PerfResult document uses snake_case except TIMESTAMP which is ALL_CAPS. This creates inconsistency in Kusto/ADX queries.
Suggestion: Rename to timestamp for consistency, or add a comment if the ALL_CAPS convention is required for Rust SDK schema compatibility.
Keeping TIMESTAMP in ALL_CAPS — it's required for Rust SDK PerfResults schema compatibility (both SDKs write to the same ADX table). Added a comment explaining this.
    logger.warning("PerfReporter error upsert failed: %s", e)

    def _safe_int_env(name: str, default: int) -> int:
🟢 Suggestion — Duplicated Safe-Int Helper
This PR introduces two nearly identical safe-int-parsing helpers:
- perf_config.py: _safe_int(value, default) parses a value
- perf_reporter.py: _safe_int_env(name, default) reads an env var, then parses it
Bug fixes in one won't propagate to the other.
Suggestion: Consolidate into a single helper in perf_config.py and import it here.
Addressed in 314405e. Consolidated _safe_int_env into perf_config.py and imported it in perf_reporter.py.
…ated helpers 1. Share Stats+PerfReporter across multi-client tasks (single workload_id) 2. Remove drain_summaries/drain_errors (use drain_all only) 3. Revert THROUGHPUT default to 100K (set COSMOS_THROUGHPUT=1000000 for DR drills) 4. Use _safe_int for all int env vars in workload_configs.py 5. Add try/except to sync reporter.stop() matching async path 6. Move circuit breaker env var from create_logger to workload_configs 7. Consolidate duplicated client constructor 8. Add TIMESTAMP schema compatibility comment 9. Consolidate _safe_int_env into perf_config.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Unified Workload with Performance Metrics for Cosmos DB DR Drill Testing
Summary
Consolidates 8 separate workload scripts into a single workload.py controlled entirely by environment variables, and adds a performance metrics collection layer that reports latency, throughput, errors, and resource utilization to a Cosmos DB results account, enabling Grafana dashboards and ADX-powered analysis for DR drill testing.
Motivation
Previously, testing different workload configurations (read-only, write-only, proxy, sync, etc.) required copying and modifying entire workload files.
Adding observability meant manually instrumenting each file. This PR solves
both problems.
Architecture
    workload.py
    ├── WORKLOAD_OPERATIONS → read, write, query (comma-separated)
    ├── WORKLOAD_USE_PROXY  → route through Envoy proxy
    ├── WORKLOAD_USE_SYNC   → sync vs async client
    ├── WORKLOAD_SKIP_CLOSE → simulate applications that don't close the client
    └── PerfReporter (daemon thread, configurable interval)
        ├── HdrHistogram per operation (p50/p90/p99, O(1) record)
        ├── psutil for CPU/memory metrics
        ├── Error status code + sub-status code tracking
        └── Upserts PerfResult docs to results Cosmos DB account
New Files
workload.py, perf_stats.py, perf_reporter.py, perf_config.py

Modified Files

workload_configs.py and workload_utils.py (timed operation wrappers via time.perf_counter_ns() with error status code capture)

Deleted Files

r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py, r_w_q_with_incorrect_client_workload.py, run_workloads.sh, dev.md

All replaced by workload.py with environment variable configuration. Infrastructure/orchestration scripts (run_workloads.sh, dev.md) moved to the cosmos-sdk-copilot-toolkit repo.
Environment Variables
Workload behavior:
| Variable | Default |
|---|---|
| WORKLOAD_OPERATIONS | read,write,query |
| WORKLOAD_USE_PROXY | false |
| WORKLOAD_USE_SYNC | false |
| WORKLOAD_SKIP_CLOSE | false |

Cosmos DB client:

| Variable | Default |
|---|---|
| COSMOS_URI | |
| COSMOS_PREFERRED_LOCATIONS | "" |
| COSMOS_CLIENT_EXCLUDED_LOCATIONS | "" |
| COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS | false |
| AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER | false |
| COSMOS_CONCURRENT_REQUESTS | 100 |
| COSMOS_THROUGHPUT | 1000000 |

Metrics reporting:

| Variable | Default |
|---|---|
| RESULTS_COSMOS_URI | "" |
| PERF_ENABLED | true |
| PERF_REPORT_INTERVAL | 300 |

PerfResult Document Schema
Each document matches the Rust SDK perf tool schema, enabling
both SDKs to feed the same ADX → Grafana pipeline:
    {
      "operation": "ReadItem",
      "count": 12345,
      "errors": 0,
      "min_ms": 1.2,
      "max_ms": 50.3,
      "mean_ms": 5.4,
      "p50_ms": 4.8,
      "p90_ms": 8.2,
      "p99_ms": 15.1,
      "cpu_percent": 45.2,
      "memory_bytes": 104857600,
      "system_cpu_percent": 62.1,
      "sdk_language": "python",
      "sdk_version": "4.15.0",
      "config_concurrency": 100,
      "config_ppcb_enabled": false
    }

Usage Examples

    # All operations, direct connection
    WORKLOAD_OPERATIONS=read,write,query python3 workload.py

    # Read-only via Envoy proxy
    WORKLOAD_OPERATIONS=read WORKLOAD_USE_PROXY=true python3 workload.py

    # Write-only with metrics reporting
    WORKLOAD_OPERATIONS=write \
      RESULTS_COSMOS_URI=https://my-results.documents.azure.com:443/ \
      python3 workload.py

    # Simulate unclosed client
    WORKLOAD_SKIP_CLOSE=true python3 workload.py

    # Scale to 5 processes
    for i in $(seq 5); do
      WORKLOAD_OPERATIONS=read,write,query nohup python3 workload.py &
    done