feat(cosmos/workloads): add performance metrics collection for DR drill testing#46271
tvaron3 wants to merge 26 commits into Azure:main
Conversation
…hroughput - Uncomment concurrent upsert/read/query calls - Remove manual timing counters and log_request_counts - Set THROUGHPUT to 1000000 in workload_configs.py - Keep CIRCUIT_BREAKER_ENABLED = False (PPCB disabled) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add a performance metrics library that reports PerfResult documents to a Cosmos DB results account, matching the Rust perf tool schema exactly so both SDKs feed the same ADX → Grafana pipeline. New files: - perf_stats.py: Thread-safe latency histogram with sorted-list percentile calculation and atomic drain_all() for consistent summary+error snapshots - perf_config.py: All config from environment variables (RESULTS_COSMOS_URI, PERF_REPORT_INTERVAL=300s, perfdb/perfresults defaults) - perf_reporter.py: Background daemon thread that drains Stats every 5 min and upserts PerfResult documents via sync CosmosClient with AAD auth Modified files: - workload_configs.py: All configs now driven by environment variables - workload_utils.py: Added timed operation wrappers with error tracking (CosmosHttpResponseError status_code/sub_status extraction), only successful operations record latency to avoid polluting percentiles - All *_workload.py files: Integrated Stats + PerfReporter with try/finally lifecycle management Key design decisions: - Sorted-list percentiles (no hdrhistogram native dependency) - psutil for CPU/memory with /proc fallback on Linux - Cached psutil.Process() instance for accurate CPU readings - CosmosClient stored and closed properly to avoid resource leaks - sdk_language='python', sdk_version from azure.cosmos.__version__ - PPCB disabled by default - All reporter errors caught and logged as warnings (never crash workload) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
psutil is now a hard import (not optional). Removed all /proc/meminfo and /proc/self/status fallback parsing — if psutil is not installed, the import will fail immediately rather than silently degrading. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- Single workload.py replaces 6 operation-specific files - WORKLOAD_OPERATIONS env var controls which ops run (read,write,query) - WORKLOAD_USE_PROXY env var enables Envoy proxy routing - WORKLOAD_USE_SYNC env var enables sync client - Validate operation names at import time with clear error - Replace manual sorted-list percentiles with hdrhistogram (O(1) record/query) - Fixed memory usage (~40KB per histogram vs unbounded list growth) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rkload.py Removed: r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py All replaced by workload.py with WORKLOAD_OPERATIONS and WORKLOAD_USE_PROXY env vars. Kept: r_w_q_with_incorrect_client_workload.py (special test case) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replaces r_w_q_with_incorrect_client_workload.py with an env var: WORKLOAD_SKIP_CLOSE=true creates the client without a context manager, simulating applications that don't properly close the Cosmos client. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from time.perf_counter() * 1000 to time.perf_counter_ns() / 1_000_000 for nanosecond precision without floating-point multiplication artifacts. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
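The timing change above can be illustrated with a minimal sketch (the `timed_ms` helper name is an assumption for illustration, not the PR's actual wrapper): `perf_counter_ns()` returns an integer, so the only floating-point operation is a single division at the end.

```python
import time

def timed_ms(fn, *args, **kwargs):
    """Time a call in milliseconds using the monotonic nanosecond clock."""
    # perf_counter_ns() returns an int, so there is no floating-point
    # multiplication artifact as with time.perf_counter() * 1000.
    start_ns = time.perf_counter_ns()
    result = fn(*args, **kwargs)
    elapsed_ms = (time.perf_counter_ns() - start_ns) / 1_000_000
    return result, elapsed_ms

result, elapsed = timed_ms(sum, range(1000))  # result is 499500
```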
Infra/orchestration scripts belong in the cosmos-sdk-copilot-toolkit repo, not in the SDK repo. Workload code (workload.py, perf_*, workload_utils.py) stays here. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…istogram The pip package is 'hdrhistogram' but the Python module is 'hdrh'. Import changed from 'from hdrhistogram import HdrHistogram' to 'from hdrh.histogram import HdrHistogram'. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Reports COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS in the config snapshot so it's visible in the Grafana dashboard and queryable from Kusto. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
The variable was used but never defined — caused pylint E0602. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…, histogram clamp, safe parsing Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…dictionary Move cspell words to sdk/cosmos/azure-cosmos/cspell.json instead of root .vscode/cspell.json to keep changes within cosmos folder scope. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…faultAzureCredential Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…rfResult Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…t-toolkit Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR modernizes the Cosmos DB DR drill workload harness by consolidating multiple per-scenario scripts into a single environment-variable-driven workload runner, and adds an optional performance metrics collection/reporting layer that can upsert results to a separate Cosmos DB account for dashboarding and analysis.
Changes:
- Consolidates prior workload entrypoints into a single workload.py controlled by environment variables.
- Adds performance statistics collection (per-operation latency + errors) and a background reporter that periodically upserts PerfResult documents.
- Refactors workload configuration and operation helpers to be environment-variable-driven and to support timed operation wrappers.
Reviewed changes
Copilot reviewed 18 out of 18 changed files in this pull request and generated 3 comments.
Show a summary per file
| File | Description |
|---|---|
| sdk/cosmos/azure-cosmos/tests/workloads/workload.py | New unified workload entrypoint (sync/async + proxy + perf hooks). |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_utils.py | Adds timed operation wrappers and Cosmos error status/substatus extraction for perf stats. |
| sdk/cosmos/azure-cosmos/tests/workloads/workload_configs.py | Moves workload/client configuration to environment variables with parsing/validation. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_stats.py | New thread-safe per-operation latency histograms + error aggregation (HdrHistogram-backed). |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_reporter.py | New daemon-thread reporter to drain stats and upsert PerfResult/Error docs to a results Cosmos account. |
| sdk/cosmos/azure-cosmos/tests/workloads/perf_config.py | New perf reporter configuration builder from environment variables (includes git SHA + defaults). |
| sdk/cosmos/azure-cosmos/cspell.json | Adds workload/perf-specific dictionary exceptions for spell checking. |
| sdk/cosmos/azure-cosmos/tests/workloads/w_workload.py | Removes legacy write-only workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/w_proxy_workload.py | Removes legacy write-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_workload.py | Removes legacy read/query workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_workload.py | Removes legacy mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_with_incorrect_client_workload.py | Removes legacy “incorrect client usage” script (replaced by WORKLOAD_SKIP_CLOSE). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_sync_workload.py | Removes legacy sync mixed workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_w_q_proxy_workload.py | Removes legacy mixed workload + proxy script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/r_proxy_workload.py | Removes legacy read/query-via-proxy workload script (replaced by workload.py). |
| sdk/cosmos/azure-cosmos/tests/workloads/setup_env.sh | Removes local environment bootstrap script (moved out of repo per PR description). |
| sdk/cosmos/azure-cosmos/tests/workloads/run_workloads.sh | Removes legacy multi-script launcher (superseded by env-var-driven workload). |
| sdk/cosmos/azure-cosmos/tests/workloads/dev.md | Removes legacy local runbook doc (moved out of repo per PR description). |
Extract _extra_kwargs(), _timed_call(), and _timed_call_async() to eliminate duplicated excluded_locations branching and timing/error recording across 6 operation functions. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
… flush lock 1. Use public azure.core.pipeline.transport import for AioHttpTransport 2. Fail fast with RuntimeError if WORKLOAD_USE_SYNC + WORKLOAD_USE_PROXY 3. Add threading.Lock around _flush(), skip final flush if thread alive Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…filing Allows spawning multiple CosmosClient instances in a single process via WORKLOAD_NUM_CLIENTS env var (default: 1). Useful for reproducing memory scaling issues with many client instances. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    # min 1 microsecond, max 60 seconds (in microseconds), 3 significant digits
    self._histograms: dict[str, HdrHistogram] = {}
    self._error_counts: dict[str, int] = {}
    self._errors: list[dict] = []
Maybe put a cap on the errors we record? Or a first-in-first-out type of queue, so that we only keep the last few errors?
Addressed in 8c53514. Replaced unbounded list with collections.deque(maxlen=2000) to keep the last 2000 errors. In practice, errors at this level should be minimal since the SDK handles retries internally — what surfaces here are non-retryable failures like 404s.
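The bounded-buffer behavior of `collections.deque(maxlen=...)` can be seen in a small stdlib-only sketch (a maxlen of 3 here for brevity; the PR uses 2000):

```python
from collections import deque

# Bounded error buffer: appending beyond maxlen evicts the oldest entry,
# so memory stays capped while the most recent errors are retained.
errors = deque(maxlen=3)
for code in (500, 503, 404, 429):
    errors.append({"status_code": code})

# The oldest entry (500) has been evicted; the last three remain.
```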
    async def upsert_item_concurrently(container, excluded_locations, num_upserts):
        random_item = get_existing_random_item()

    def _do_query():
I like how you did it in the async version: _do_query(ri=random_item)
Good eye — fixed in 8c53514. Updated sync version to use def _do_query(ri=random_item) matching the async pattern.
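The underlying pitfall is Python's late binding of closure variables; a minimal sketch of the difference the fix addresses:

```python
# Late binding: every closure reads `item` when called, after the loop ended,
# so all of them see the final value.
late = [lambda: item for item in range(3)]

# Default-argument capture: each closure binds the current value at definition
# time, which is the pattern the sync query code was changed to use.
bound = [lambda item=item: item for item in range(3)]

late_results = [f() for f in late]    # [2, 2, 2]
bound_results = [f() for f in bound]  # [0, 1, 2]
```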
    for s in summaries:
        doc = {
            "id": str(uuid.uuid4()),
            "partition_key": str(uuid.uuid4()),
curious if we can be more streamlined in generation of pk - like using something related to workload id or something else which is more stable and deterministic rather than using uuid now. Since we are really scaling here, using UUID will just create too many cross pk queries, etc.
We only use change feed on the results container (→ ADX via Cosmos DB data connection), not cross-partition queries directly. The random partition_key distributes writes evenly across physical partitions. Could follow up with workload_id-based partitioning if we add direct querying on the results container in the future.
            "config_skip_close": skip_close,
        }
        try:
            self._container.upsert_item(doc)
so every flush call upserts one item at a time? won't this be a perf bottleneck?
At the default 5-minute report interval (PERF_REPORT_INTERVAL=300), each flush upserts ~3-4 items (one per operation type + errors). That's <0.5s of work on a background daemon thread every 5 minutes — not a bottleneck. The Python sync SDK also doesn't have a bulk/batch API. If we reduce the interval significantly, we could revisit.
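A minimal stdlib sketch of this kind of interval-based background reporter (class and method names are assumptions mirroring the discussion, not the PR's actual implementation):

```python
import threading
import time

class PerfReporter:
    """Sketch: a daemon thread wakes every `interval` seconds and flushes."""
    def __init__(self, interval, flush_fn):
        self._interval = interval
        self._flush_fn = flush_fn
        self._stop_event = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Event.wait doubles as an interruptible sleep: it returns False on
        # timeout (flush and keep going) and True once stop() sets the event.
        while not self._stop_event.wait(self._interval):
            self._flush_fn()

    def stop(self):
        self._stop_event.set()
        self._thread.join()
        self._flush_fn()  # final flush once the thread has exited

flushes = []
reporter = PerfReporter(0.01, lambda: flushes.append("flush"))
reporter.start()
time.sleep(0.05)
reporter.stop()
```

With a real 300-second interval the flush work is rare and happens entirely off the workload's hot path, which is why it is not a bottleneck.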
…ture - perf_stats: Replace unbounded _errors list with deque(maxlen=2000). Errors at this level are minimal due to SDK internal retries. - workload_utils: Capture loop variable in sync query_items with def _do_query(ri=random_item) to match async pattern. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
    tasks = []
    for i in range(WORKLOAD_NUM_CLIENTS):
        client_id = f"{prefix}-c{i}"
        tasks.append(run_workload_async(client_id, client_logger))
🟡 Recommendation — Resource Waste in Multi-Client Mode
In multi-client mode (WORKLOAD_NUM_CLIENTS > 1), each async task creates its own Stats, PerfReporter, daemon thread, and sync CosmosClient. Three issues:
- Fragmented workload IDs: Each Stats calls get_perf_config(), which generates a fresh uuid.uuid4() for workload_id if PERF_WORKLOAD_ID isn't set. Results from the same process get different IDs, making correlation impossible in Grafana/ADX.
- Fragmented metrics: N separate summaries per flush interval, each with 1/Nth of the data. P99 from 1/5th of operations is statistically meaningless.
- Resource waste: N daemon threads + N sync CosmosClient instances (each with its own HTTP connection pool) just for writing metrics.
Suggestion: Create one shared Stats + PerfReporter at the run_multi_client_async level and pass it into each task. Stats is already thread-safe (uses locks).
Addressed in 314405e. run_multi_client_async now creates a single shared Stats + PerfReporter and passes them to each run_workload_async task. Single workload_id, consolidated metrics, one daemon thread. Individual tasks only create their own reporter if none is passed (single-client mode).
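The shared-stats pattern can be sketched with a plain dict standing in for the thread-safe Stats object (names are illustrative; asyncio tasks on one event loop do not race on the increment):

```python
import asyncio

async def run_workload_async(client_id, stats):
    # Each task records into the one shared stats object instead of
    # creating its own reporter, so all metrics share one workload_id.
    for _ in range(10):
        stats["ops"] = stats.get("ops", 0) + 1
        await asyncio.sleep(0)  # yield to interleave tasks

async def run_multi_client_async(num_clients):
    stats = {}  # single shared Stats stand-in, created once at this level
    tasks = [run_workload_async(f"wl-c{i}", stats) for i in range(num_clients)]
    await asyncio.gather(*tasks)
    return stats

shared = asyncio.run(run_multi_client_async(5))  # shared["ops"] == 50
```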
        _MIN_VALUE_US, _MAX_VALUE_US, 3
    )
    self._error_counts[operation] += 1
    self._errors.append(
🟡 Recommendation — Unbounded Memory Growth
_errors accumulates every error as a dict (with full traceback.format_exc() strings of ~2-5KB each) with no size cap. Between 5-minute drain intervals, a sustained failure scenario could spike memory.
This tool is designed for DR drill testing — the exact scenario where error storms (region failovers) are expected. With 100 concurrent requests failing at ~12 errors/sec × 300 seconds = ~3,600 errors × ~5KB each ≈ 18MB per drain interval.
The aggregate error count is already tracked separately in _error_counts, so no count information is lost.
Suggestion: Add a maximum error buffer size (e.g., 1000). Keep _error_counts for accurate totals, cap the detailed list.
Addressed in earlier commit (8c53514). Replaced with collections.deque(maxlen=2000) — keeps last 2000 errors (~10MB max). Errors at this level should be minimal due to SDK internal retries; what surfaces are non-retryable failures. _error_counts still tracks accurate totals regardless of the cap.
    for s in summaries:
        doc = {
            "id": str(uuid.uuid4()),
            "partition_key": str(uuid.uuid4()),
🟡 Recommendation — Partition Key Anti-Pattern
Both id and partition_key are set to independent random UUIDs for every document. This creates a new logical partition for every single result document.
This is a known Cosmos DB anti-pattern:
- Querying "all results for workload X" or "all results from this host" requires expensive cross-partition fan-out queries.
- Results from the same test run are scattered across thousands of partitions.
- For ADX/Grafana dashboards doing time-series queries, this means high RU cost and latency.
Suggestion: Use workload_id as the partition key value. All results from one workload run land in one partition → efficient point reads and range scans for dashboards.
We only use change feed on the results container (→ ADX via Cosmos DB data connection), not cross-partition queries directly. The random partition_key distributes writes evenly. Could follow up with workload_id-based partitioning if direct querying on the results container is needed in the future.
    self._errors = deque(maxlen=2000)
    return summaries, error_details

    def drain_summaries(self) -> list[dict]:
🟡 Recommendation — Data-Loss API Trap
drain_summaries() and drain_errors() both delegate to drain_all(), which atomically clears all histograms, error counts, and error details. Calling drain_summaries() silently and permanently discards all accumulated errors (and vice versa).
The method names suggest "drain only summaries" but the implementation drains (and discards) everything. A future caller might reasonably write:
    summaries = stats.drain_summaries()
    errors = stats.drain_errors()  # Returns empty! Data was already cleared.

Suggestion: Either (a) remove these methods, forcing callers to use drain_all(), or (b) add a prominent docstring warning about the destructive behavior.
Addressed in 314405e. Removed drain_summaries() and drain_errors() — callers now use drain_all() only, which is the only method used by PerfReporter._flush().
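A minimal sketch of the atomic drain_all() contract described above (a hypothetical simplified Stats; plain counts stand in for the histograms):

```python
import threading
from collections import deque

class Stats:
    """Sketch: snapshot and clear everything under one lock, so a single
    drain_all() call always sees a consistent summaries+errors view."""
    def __init__(self):
        self._lock = threading.Lock()
        self._counts = {}
        self._errors = deque(maxlen=2000)

    def record(self, operation):
        with self._lock:
            self._counts[operation] = self._counts.get(operation, 0) + 1

    def record_error(self, operation, detail):
        with self._lock:
            self._errors.append({"operation": operation, "detail": detail})

    def drain_all(self):
        with self._lock:
            summaries = dict(self._counts)
            error_details = list(self._errors)
            self._counts = {}
            self._errors = deque(maxlen=2000)
        return summaries, error_details

s = Stats()
s.record("ReadItem"); s.record("ReadItem"); s.record_error("ReadItem", "404")
summaries, errors = s.drain_all()   # consistent snapshot
empty_s, empty_e = s.drain_all()    # second drain sees nothing
```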
    NUMBER_OF_LOGICAL_PARTITIONS = int(
        os.environ.get("COSMOS_NUMBER_OF_LOGICAL_PARTITIONS", "10000")
    )
    THROUGHPUT = int(os.environ.get("COSMOS_THROUGHPUT", "1000000"))
🟡 Recommendation — Silent 10x Cost Increase
Default THROUGHPUT changed from 100000 (100K RU/s) to 1000000 (1M RU/s) — a 10x increase.
The untouched initial-setup.py imports THROUGHPUT and uses it in ThroughputProperties(THROUGHPUT) when creating containers. Anyone running initial-setup.py without explicitly setting COSMOS_THROUGHPUT will provision 10x the previous default (~$80/hr vs ~$8/hr).
Suggestion: If intentional for CI pipelines, add a comment explaining the increase and document it as a breaking change. If not, revert to 100000.
Addressed in 314405e. Reverted default to 100K RU/s. Added comment that DR drills should set COSMOS_THROUGHPUT=1000000 explicitly.
    USE_MULTIPLE_WRITABLE_LOCATIONS = (
        os.environ.get("COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS", "false").lower() == "true"
    )
    CONCURRENT_REQUESTS = int(os.environ.get("COSMOS_CONCURRENT_REQUESTS", "100"))
🟡 Recommendation — Crash on Invalid Input
Five config values use bare int() without error handling: CONCURRENT_REQUESTS, CONCURRENT_QUERIES, WORKLOAD_NUM_CLIENTS, NUMBER_OF_LOGICAL_PARTITIONS, and THROUGHPUT.
Meanwhile, perf_config.py in the same PR introduces _safe_int() for exactly this purpose.
A typo like COSMOS_CONCURRENT_REQUESTS=1O0 (letter O) crashes at import time with an unhelpful ValueError.
Suggestion: Use the _safe_int pattern from perf_config.py for consistency, or wrap with try/except that identifies which env var is invalid.
Addressed in 314405e. All int env vars in workload_configs.py now use _safe_int() from perf_config.py with fallback defaults.
    sys_cpu = _get_system_cpu_percent()
    sys_total, sys_used = _get_system_memory()

    concurrency = _safe_int_env("COSMOS_CONCURRENT_REQUESTS", 100)
🟡 Recommendation — Configuration Inconsistency
Eight os.environ.get() / _safe_int_env() calls inside _flush() re-read configuration from the environment on every flush interval (every 5 minutes). But workload_configs.py captures these same values once at import time.
Two issues:
- Semantic inconsistency: If env vars were modified after startup (unusual but possible), the workload uses old values while metrics report new ones.
- Ambiguous source of truth: workload_configs.py reads at import; _flush() re-reads every interval.
Suggestion: Capture these values once in __init__() or start() and store as instance attributes. Or accept a config dict from the caller.
The re-reading in _flush() is intentional — it allows config changes at runtime without restarting workloads (useful during DR drills when excluded regions change). Will refactor workload_configs.py to also re-read on access in a follow-up for full consistency.
    client_logger.error(e)
    finally:
        if reporter:
            reporter.stop()
🟡 Recommendation — Inconsistent Error Handling
The sync path does not protect reporter.stop() in the finally block, but the async path (line ~102) wraps it in try/except.
If the workload throws an exception AND reporter.stop() also throws (e.g., network error flushing final metrics), the original exception is masked by the cleanup exception.
This matters because DR drills are the scenario most likely to have both workload errors AND reporter errors simultaneously.
Suggestion: Apply the same try/except pattern as the async path. Consider logging rather than silently swallowing.
Addressed in 314405e. Added try/except around reporter.stop() in the sync path, matching the async pattern.
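The cleanup-masking fix can be sketched as follows (names are illustrative; a list stands in for the logger):

```python
def run_with_reporter(workload, reporter, log):
    """Guard cleanup so a failing stop() never masks the workload error."""
    try:
        workload()
    finally:
        if reporter:
            try:
                reporter.stop()
            except Exception as e:  # log, don't mask the original exception
                log.append(f"reporter.stop failed: {e}")

class FlakyReporter:
    def stop(self):
        raise RuntimeError("flush failed")

def failing_workload():
    raise ValueError("workload error")

logs = []
try:
    run_with_reporter(failing_workload, FlakyReporter(), logs)
except ValueError as e:
    caught = str(e)  # the original workload error survives the bad cleanup
```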
    def create_logger(file_name):
        os.environ["AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER"] = str(CIRCUIT_BREAKER_ENABLED)
🟢 Suggestion — Misplaced Side Effect
A function named create_logger() shouldn't set global environment variables that control SDK behavior. The circuit breaker config isn't applied until a logger is created, which is invisible to callers.
This was flagged in a past PR review (#40843) as a misplaced concern.
Suggestion: Move the env var setting to workload_configs.py (at import time, alongside the other config), or to the __main__ block in workload.py.
Addressed in 314405e. Moved os.environ["AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER"] setting from create_logger() to workload_configs.py alongside the other config.
    client_logger.info("Exception in application layer")
    client_logger.error(e)
    finally:
        if not WORKLOAD_SKIP_CLOSE:
🟢 Suggestion — Duplicated Client Constructor
Both branches construct the client identically — only the __aenter__() call differs. If constructor args are modified, both branches must be updated.
Suggestion:
    client = AsyncClient(COSMOS_URI, COSMOS_CREDENTIAL, **client_kwargs)
    if not WORKLOAD_SKIP_CLOSE:
        await client.__aenter__()
Addressed in 314405e. Consolidated to a single constructor call with conditional __aenter__().
    "workload_id": self._config["workload_id"],
    "commit_sha": self._config["commit_sha"],
    "hostname": self._hostname,
    "TIMESTAMP": now,
🟢 Suggestion — Inconsistent Field Naming
Every field in the PerfResult document uses snake_case except TIMESTAMP which is ALL_CAPS. This creates inconsistency in Kusto/ADX queries.
Suggestion: Rename to timestamp for consistency, or add a comment if the ALL_CAPS convention is required for Rust SDK schema compatibility.
Keeping TIMESTAMP in ALL_CAPS — it's required for Rust SDK PerfResults schema compatibility (both SDKs write to the same ADX table). Added a comment explaining this.
    logger.warning("PerfReporter error upsert failed: %s", e)

    def _safe_int_env(name: str, default: int) -> int:
🟢 Suggestion — Duplicated Safe-Int Helper
This PR introduces two nearly identical safe-int-parsing helpers:
- perf_config.py: _safe_int(value, default) parses a value
- perf_reporter.py: _safe_int_env(name, default) reads an env var, then parses it
Bug fixes in one won't propagate to the other.
Suggestion: Consolidate into a single helper in perf_config.py and import it here.
Addressed in 314405e. Consolidated _safe_int_env into perf_config.py and imported it in perf_reporter.py.
…ated helpers 1. Share Stats+PerfReporter across multi-client tasks (single workload_id) 2. Remove drain_summaries/drain_errors (use drain_all only) 3. Revert THROUGHPUT default to 100K (set COSMOS_THROUGHPUT=1000000 for DR drills) 4. Use _safe_int for all int env vars in workload_configs.py 5. Add try/except to sync reporter.stop() matching async path 6. Move circuit breaker env var from create_logger to workload_configs 7. Consolidate duplicated client constructor 8. Add TIMESTAMP schema compatibility comment 9. Consolidate _safe_int_env into perf_config.py Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Unified Workload with Performance Metrics for Cosmos DB DR Drill Testing
Summary
Consolidates 8 separate workload scripts into a single workload.py controlled entirely by environment variables, and adds a performance metrics collection layer that reports latency, throughput, errors, and resource utilization to a Cosmos DB results account, enabling Grafana dashboards and ADX-powered analysis for DR drill testing.
Motivation
Previously, testing different workload configurations (read-only, write-only, proxy, sync, etc.) required copying and modifying entire workload files.
Adding observability meant manually instrumenting each file. This PR solves
both problems.
Architecture
    workload.py
    ├── WORKLOAD_OPERATIONS → read, write, query (comma-separated)
    ├── WORKLOAD_USE_PROXY  → route through Envoy proxy
    ├── WORKLOAD_USE_SYNC   → sync vs async client
    ├── WORKLOAD_SKIP_CLOSE → simulate applications that don't close the client
    └── PerfReporter (daemon thread, configurable interval)
        ├── HdrHistogram per operation (p50/p90/p99, O(1) record)
        ├── psutil for CPU/memory metrics
        ├── Error status code + sub-status code tracking
        └── Upserts PerfResult docs to results Cosmos DB account
New Files
workload.py, perf_stats.py, perf_reporter.py, perf_config.py

Modified Files

workload_configs.py and workload_utils.py (timed operation wrappers via time.perf_counter_ns() with error status code capture)

Deleted Files

r_workload.py, w_workload.py, r_proxy_workload.py, w_proxy_workload.py, r_w_q_workload.py, r_w_q_proxy_workload.py, r_w_q_sync_workload.py, r_w_q_with_incorrect_client_workload.py, run_workloads.sh, dev.md

All replaced by workload.py with environment variable configuration. Infrastructure/orchestration scripts (run_workloads.sh, dev.md) moved to the cosmos-sdk-copilot-toolkit repo.
Environment Variables
Workload behavior:
| Variable | Default |
|---|---|
| WORKLOAD_OPERATIONS | read,write,query |
| WORKLOAD_USE_PROXY | false |
| WORKLOAD_USE_SYNC | false |
| WORKLOAD_SKIP_CLOSE | false |

Cosmos DB client:

| Variable | Default |
|---|---|
| COSMOS_URI | |
| COSMOS_PREFERRED_LOCATIONS | "" |
| COSMOS_CLIENT_EXCLUDED_LOCATIONS | "" |
| COSMOS_USE_MULTIPLE_WRITABLE_LOCATIONS | false |
| AZURE_COSMOS_ENABLE_CIRCUIT_BREAKER | false |
| COSMOS_CONCURRENT_REQUESTS | 100 |
| COSMOS_THROUGHPUT | 1000000 |

Metrics reporting:

| Variable | Default |
|---|---|
| RESULTS_COSMOS_URI | "" |
| PERF_ENABLED | true |
| PERF_REPORT_INTERVAL | 300 |

PerfResult Document Schema
Each document matches the Rust SDK perf tool schema, enabling
both SDKs to feed the same ADX → Grafana pipeline:
    {
      "operation": "ReadItem",
      "count": 12345,
      "errors": 0,
      "min_ms": 1.2,
      "max_ms": 50.3,
      "mean_ms": 5.4,
      "p50_ms": 4.8,
      "p90_ms": 8.2,
      "p99_ms": 15.1,
      "cpu_percent": 45.2,
      "memory_bytes": 104857600,
      "system_cpu_percent": 62.1,
      "sdk_language": "python",
      "sdk_version": "4.15.0",
      "config_concurrency": 100,
      "config_ppcb_enabled": false
    }

Usage Examples

    # All operations, direct connection
    WORKLOAD_OPERATIONS=read,write,query python3 workload.py

    # Read-only via Envoy proxy
    WORKLOAD_OPERATIONS=read WORKLOAD_USE_PROXY=true python3 workload.py

    # Write-only with metrics reporting
    WORKLOAD_OPERATIONS=write \
      RESULTS_COSMOS_URI=https://my-results.documents.azure.com:443/ \
      python3 workload.py

    # Simulate unclosed client
    WORKLOAD_SKIP_CLOSE=true python3 workload.py

    # Scale to 5 processes
    for i in $(seq 5); do
      WORKLOAD_OPERATIONS=read,write,query nohup python3 workload.py &
    done