371 changes: 371 additions & 0 deletions .github/skills/release-monitoring-report/SKILL.md

Large diffs are not rendered by default.

@@ -0,0 +1,132 @@
# Kusto operational runbook (release monitoring)

How to actually *run* the queries and feed the results into the report. The per-query
purpose + scenario→MV→column catalog lives in [`../queries/README.md`](../queries/README.md);
this file is the mechanics.

## Table of contents
- [Auth + prerequisites](#auth--prerequisites)
- [Running a query (run-kql.ps1)](#running-a-query-run-kqlps1)
- [The output JSON shape](#the-output-json-shape)
- [Filling tokens in a .kql before running](#filling-tokens-in-a-kql-before-running)
- [End-to-end loop](#end-to-end-loop)
- [Version resolution recipes](#version-resolution-recipes)
- [Hard gotchas](#hard-gotchas)

## Auth + prerequisites
- `az login` must be current (`az account show`). Kusto access is via the caller's Entra token.
- Node (for `compare-versions.js`) — `node -v`. Python is NOT assumed present.
- Both clusters are read with the **same** `run-kql.ps1`; only `-Cluster`/`-Database` differ.

## Running a query (run-kql.ps1)
`run-kql.ps1` defaults to the **Broker** cluster/db. Pipe or pass a query; capture JSON.

```powershell
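$S = ".github/skills/release-monitoring-report/assets/scripts"
$Q = ".github/skills/release-monitoring-report/assets/queries"   # .kql templates
# $DATA = the _data/<slug>-<date> folder that bootstrap-report.ps1 created

# Broker (defaults)
$q = Get-Content "$Q\broker-adoption.kql" -Raw
& "$S\run-kql.ps1" -Query $q -Out "$DATA\broker-adoption.json"

# Authenticator (override cluster + db); load an Authenticator template first
# and fill its <TOKENS> before running
$q = Get-Content "$Q\auth-pn-checkforauth-completion.kql" -Raw
& "$S\run-kql.ps1" -Query $q -Out "$DATA\auth-mfa-pn.json" `
    -Cluster https://idsharedeus2.eastus2.kusto.windows.net `
    -Database d496be22d62a46b0a3cf67ea2e736fd8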
```

`run-kql.ps1` writes the JSON to the **mandatory `-Out` path** (it does not stream to stdout).
`$DATA` = the `_data/<slug>-<date>` folder `bootstrap-report.ps1` created. Keep raw payloads
there so the report is reproducible.

## The output JSON shape
`run-kql.ps1` emits **array-form**, first row = column names:

```json
{ "results": { "items": [ ["broker_version","devices"], ["16.1.0", 76400000], ["16.0.1", 585300000] ] } }
```

`compare-versions.js` reads exactly this. Do not reshape it.
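
For ad-hoc inspection it can help to view the array-form rows as objects in memory, while leaving the file on disk untouched. The following is a minimal sketch, assuming only the shape shown above; `rowsToObjects` is a hypothetical helper and not part of the skill's scripts.

```javascript
// Read-only view over the array-form payload: first row = column names.
// The payload itself is not modified, so compare-versions.js can still read it.
function rowsToObjects(payload) {
  const [header, ...rows] = payload.results.items;
  return rows.map((row) =>
    Object.fromEntries(header.map((name, i) => [name, row[i]]))
  );
}

// Same sample as the JSON block above.
const sample = {
  results: {
    items: [
      ["broker_version", "devices"],
      ["16.1.0", 76400000],
      ["16.0.1", 585300000],
    ],
  },
};

const records = rowsToObjects(sample);
console.log(records);
```

Each record is keyed by the names in the header row, for example `records[0].broker_version`.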

## Filling tokens in a .kql before running
Templates carry `<TOKENS>` (see queries/README → token convention). Substitute in PowerShell:

```powershell
$q = (Get-Content "$Q\broker-top-errors-by-version.kql" -Raw).
Replace('<FIRST>','16.1.0').Replace('<SECOND>','16.0.1').
Replace('<START>','2026-06-01').Replace('<END>','2026-06-15')
```

For Broker `<VERSIONS>` (multi-version filter) substitute a dynamic literal:
`.Replace('<VERSIONS>','dynamic(["16.1.0","16.0.1","16.2.0"])')`.
For `<DCOUNT>` use `true` (distinct-device columns) or `false` (raw counts).
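
If the substitution is done from Node instead, a small guard against unfilled placeholders is cheap to add. This is a sketch under assumptions: `fillTokens` and `dynamicLiteral` are hypothetical helpers, the template string is a stand-in rather than one of the real `.kql` files, and the leftover check assumes tokens are upper-case names in angle brackets.

```javascript
// Hypothetical helper: substitute <TOKENS> and fail loudly if any are left.
function fillTokens(template, values) {
  let out = template;
  for (const [name, value] of Object.entries(values)) {
    out = out.split(`<${name}>`).join(String(value));
  }
  // Assumption: any remaining <UPPER_CASE> run is an unfilled token.
  const leftover = out.match(/<[A-Z_]+>/g);
  if (leftover) {
    throw new Error(`Unfilled tokens: ${[...new Set(leftover)].join(", ")}`);
  }
  return out;
}

// Render a version list as the dynamic([...]) literal that <VERSIONS> expects.
function dynamicLiteral(versions) {
  return `dynamic([${versions.map((v) => JSON.stringify(v)).join(",")}])`;
}

// Stand-in template, NOT the contents of a real query file.
const template =
  "T | where broker_version in (<VERSIONS>) " +
  "| where EventInfo_Time >= datetime(<START>) and EventInfo_Time < datetime(<END>)";

const filled = fillTokens(template, {
  VERSIONS: dynamicLiteral(["16.1.0", "16.0.1"]),
  START: "2026-06-01",
  END: "2026-06-15",
});
console.log(filled);
```

Throwing on a leftover token turns a silently malformed query into an immediate, local error.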

## End-to-end loop
1. **Resolve versions** — run `broker-adoption.kql` and the cheap Authenticator resolver
([recipe below](#version-resolution-recipes)). Pick `<FIRST>` (rolling out) and `<SECOND>`
(previous, by volume) unless the user named them. "Baseline = all versions" → omit the
version filter / pass every version in `<VERSIONS>`.
2. **Pull** each query for both apps into `$DATA\*.json`.
3. **Compare** — feed the version-per-row payloads to `compare-versions.js rows`, and the
error-movers payload to `compare-versions.js movers --lower-is-better true` (error-share
growth is bad). See script header for flags.
4. **Fill** the bootstrapped HTML in place with the real numbers + the verdict the deltas imply.
5. **Validate** — `validate-report.ps1 -Path <file> -BrokerVersion <bv> -AuthVersion <av>`.

## Version resolution recipes
**Authenticator (cheap, validated)** — avoid `union *`; read a high-volume MV:
```kusto
Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1
| where EventDate >= datetime(<START>) and EventDate <= datetime(<END>)
| where isnotempty(AppVersion)
| summarize Devices = sum(NotificationInitiatedDCount) by AppVersion
| order by Devices desc
```
Newest `6.YYMM.BUILD` (highest `YYMM`) = current train. The two highest-volume recent
versions are usually `<FIRST>`/`<SECOND>`.

**Broker** — `broker-adoption.kql` already returns `dcount` devices by `broker_version`;
sort desc and read off the top two.
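
Reading off the top two can be scripted against the array-form payload. The sketch below assumes that, of the two highest-volume versions, the newer one is the build rolling out (`<FIRST>`) and the older one is the baseline (`<SECOND>`); `pickFirstSecond` and `compareVersions` are hypothetical helpers, and the `15.9.0` row is invented for illustration.

```javascript
// Compare dotted numeric versions segment by segment ("16.1.0" vs "16.0.1",
// or "6.2606.3817" vs an older 6.YYMM.BUILD).
function compareVersions(a, b) {
  const pa = a.split(".").map(Number);
  const pb = b.split(".").map(Number);
  for (let i = 0; i < Math.max(pa.length, pb.length); i++) {
    const diff = (pa[i] || 0) - (pb[i] || 0);
    if (diff !== 0) return diff;
  }
  return 0;
}

// Take the two highest-volume versions, then order them newest first.
function pickFirstSecond(payload, versionCol, volumeCol) {
  const [header, ...rows] = payload.results.items;
  const v = header.indexOf(versionCol);
  const n = header.indexOf(volumeCol);
  if (v < 0 || n < 0) throw new Error("column not found");
  const topTwo = rows
    .slice()
    .sort((x, y) => y[n] - x[n])
    .slice(0, 2)
    .map((r) => r[v]);
  if (topTwo.length < 2) throw new Error("need at least two versions");
  const [newer, older] = topTwo.sort((a, b) => compareVersions(b, a));
  return { first: newer, second: older };
}

const adoption = {
  results: {
    items: [
      ["broker_version", "devices"],
      ["16.1.0", 76400000],
      ["16.0.1", 585300000],
      ["15.9.0", 1200000], // invented low-volume row
    ],
  },
};

const picked = pickFirstSecond(adoption, "broker_version", "devices");
console.log(picked);
```

If the user named the versions explicitly, those take precedence over anything this heuristic returns.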

## Hard gotchas
- **Distinct devices (Broker):** `dcount_hll(hll_merge(countDevicesHll))`. Never `sum(countDevices)`.
- **Percentiles (Broker):** `percentiles_array_tdigest(tdigest_merge(responseTimeTDigest), …)`.
Never average/sum percentiles across rows.
- **Raw dims (Broker):** wrap with `MergeAccountType()` / `MergeIsSharedDevice()` if you add
account-type / shared-device filters.
- **Broker per host app:** Broker MVs (`ErrorStatsMetrics`, `*AuthStats*Metrics`,
`BrokerAdoptionStatsUpdated`) carry `active_broker_package_name` (the host app acting as broker)
and `AppInfo_Version`. For a given host package, `AppInfo_Version` **is that host app's version**
— e.g. for `com.azure.authenticator` it equals the Authenticator `AppVersion` (`6.2606.3817`),
NOT `broker_version`. To isolate "the broker as it runs inside one app's release", filter
`active_broker_package_name == "<pkg>"` and compare by `AppInfo_Version`. Don't attribute a
fleet-wide `broker_version` delta to an app — it can be dominated by another host (Link to
Windows `com.microsoft.appmanager` ≈122 M devices).
- **Device-share masks per-span spikes:** `broker-top-errors-by-host-app.kql` is a device-share
(devices hitting code X anywhere ÷ devices on that version) — it dedups a device across all spans.
A code can read flat/down there while its **per-request** rate climbs inside one `span_name` (seen:
`invalid_grant` on `AcquireTokenSilent` rose +1.19 pp while its device-share fell). Re-slice with
`broker-errors-by-host-app-span.kql` (request-level rate per span) before writing "no regression",
and separate an early-rollout spike-that-decays from a steady gap with a daily trend (rate by
version by day). For eSTS-returned codes (`invalid_grant`/`interaction_required`) correlate the
trigger to a PR in the bundled broker version range: `git log v<PREV>..v<NEW>` in `broker/`+`common/`,
then `find-suspect-prs.ps1 -Range`; weight device-PoP/PRT/cache changes.
- **Authenticator outcomes:** Registration/Auth MVs have only `Initiated/Succeeded/Failed`
(+`…DCount`) — no `Cancelled`/`PartiallySucceeded`. PN completion needs the two-table join
(init MV ⋈ `_Results_MV_V1`).
- **MSA NGC vs SA:** both the MSA PN init MV and its results MV carry `IsNGC`
(`"true"`=NGC, `"false"`=SA) — filter both join sides.
- **Volume guard:** treat scenarios with < ~1K initiates as noise, not a regression
(`compare-versions.js` `--volume-floor`). Always pull initiate volume alongside rates.
- **A moved metric is a question, not a verdict:** before calling any version-over-version delta a
regression, run the diagnostic ladder in [`investigation-patterns.md`](investigation-patterns.md) —
normalize count→rate, compare new-build vs old-build rate (substitution), the **code-frozen control**
(did the previous version's rate move too → environmental), dimensional decomposition via the
`*_Errors_MV_V1` companions, benign-vs-defect classification, raw `passkeyoperations` sub-code drill,
and the `git diff <prevTag>..<newTag>` gate-logic check.
- **Know what an MV counts before drilling:** `.show materialized-view <Name> | project Query` prints
its source table + `OperationName`/`RequestType`/`PasskeyFlow` filters — so you drill the right raw
request family. Every Authenticator scenario also has a `*_Errors_MV_V1` companion (reason × OsLevel
× AppVersion × DeviceInfoMake) for the "why".
- **UTF-8 trap:** never write report HTML through a PowerShell `@'…'@` heredoc (strips
emoji/arrows). Use `node fs.writeFileSync` or
`[IO.File]::WriteAllText($p,$t,[System.Text.UTF8Encoding]::new($false))`.
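
The volume guard and the count→rate normalization above can be sketched together. This is an illustration only: it does not reproduce `compare-versions.js`, whose internals are not shown here, and `compareSuccessRate`, its field names, and the sample numbers are all hypothetical.

```javascript
// Compare two versions' success rates, refusing to judge below a volume floor.
// Inputs are { initiated, succeeded } counts for one scenario.
function compareSuccessRate(first, second, volumeFloor = 1000) {
  const rate = (r) => (r.initiated > 0 ? r.succeeded / r.initiated : null);
  const belowFloor =
    first.initiated < volumeFloor || second.initiated < volumeFloor;
  const firstRate = rate(first);
  const secondRate = rate(second);
  // Delta in percentage points (pp), not relative percent.
  const deltaPp =
    firstRate === null || secondRate === null
      ? null
      : (firstRate - secondRate) * 100;
  return {
    firstRate,
    secondRate,
    deltaPp,
    verdict: belowFloor ? "noise" : "compare",
  };
}

// Invented numbers: a large drop on 400 initiates is flagged as noise...
const thin = compareSuccessRate(
  { initiated: 400, succeeded: 360 },
  { initiated: 50000, succeeded: 47500 }
);
// ...while a smaller drop on 20,000 initiates is worth investigating.
const solid = compareSuccessRate(
  { initiated: 20000, succeeded: 18800 },
  { initiated: 50000, succeeded: 47500 }
);
console.log(thin.verdict, solid.verdict);
```

A `"compare"` verdict here only means the volume is sufficient; the diagnostic ladder still decides whether the delta is a regression.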
147 changes: 147 additions & 0 deletions .github/skills/release-monitoring-report/assets/queries/README.md
@@ -0,0 +1,147 @@
# Release-monitoring query catalog

Templates for **version-over-version** release monitoring of the Android **Broker** and the
**Authenticator** app. Every `.kql` here is a placeholder template — substitute the
`<TOKENS>` before running. All were validated against live Kusto.

## Clusters

| App | Cluster | Database | Version dimension | Time column |
|-----|---------|----------|-------------------|-------------|
| Broker | `https://idsharedeus2.kusto.windows.net` | `ad-accounts-android-otel` | `broker_version` (e.g. `16.1.0`) | `EventInfo_Time` |
| Authenticator | `https://idsharedeus2.eastus2.kusto.windows.net` | `d496be22d62a46b0a3cf67ea2e736fd8` | `AppVersion` (e.g. `6.2606.3817`) | `EventDate` (MVs) / `EventInfo_Time` (raw `union *`) |

`run-kql.ps1` defaults to the Broker cluster/db. For Authenticator pass
`-Cluster https://idsharedeus2.eastus2.kusto.windows.net -Database d496be22d62a46b0a3cf67ea2e736fd8`.

## Shared token convention

`<FIRST>` = version **rolling out** · `<SECOND>` = **previous / baseline** version ·
`<START>` `<END>` = `yyyy-mm-dd` window bounds · `<DCOUNT>` = `true` → distinct-device
(`…DCount`) columns, `false` → raw event counts · `<VERSIONS>` = a `dynamic([...])` list
used by the Broker templates that filter many versions at once.

## Broker queries

| File | Purpose | Key tokens |
|------|---------|-----------|
| `broker-adoption.kql` | Distinct devices per `broker_version`. **Run first** to resolve exact version strings + pick `<FIRST>`/`<SECOND>` by volume. | `<START> <END>` |
| `broker-error-rate-by-version.kql` | Headline overall **device error rate** per version (devices hitting any non-success error ÷ total devices). | `<VERSIONS> <START> <END>` |
| `broker-reliability-by-version.kql` | Silent + Interactive reliability (request and device) per version from the canonical `*AllRequestsMetrics` / `*RequestsWithoutExpectedErrorMetrics` MVs. | `<VERSIONS> <START> <END>` |
| `broker-top-errors-by-version.kql` | **The "why".** Per-`error_code` device + request counts on `<FIRST>` vs `<SECOND>` with device-share delta (pp). Top regressions/improvements. | `<FIRST> <SECOND> <START> <END>` |
| `broker-latency-by-version.kql` | P50/P75/P90/P95/P99 of `responseTime` per version (optionally one `span_name`). | `<VERSIONS> <START> <END>` |
| `broker-by-host-app.kql` | **Broker scoped to ONE host app**, compared by that app's version. Headline device error rate + silent/interactive reliability for `active_broker_package_name == <PACKAGE>`, keyed on `AppInfo_Version` (= the host app's version). | `<PACKAGE> <FIRST> <SECOND> <START> <END>` |
| `broker-top-errors-by-host-app.kql` | The "why" for the host-scoped view: per-`error_code` device-share delta for one host app's two versions. | `<PACKAGE> <FIRST> <SECOND> <START> <END>` |
| `broker-errors-by-host-app-span.kql` | **Span drill-down — the complement of the device-share movers.** Per-`span_name` **request-level** rate (errored ÷ total in that span) for a specific code list, one host app, two versions. Surfaces a per-span spike that device-share dedup hides. | `<PACKAGE> <FIRST> <SECOND> <CODES> <START> <END>` |

**Broker attributed to a host app (e.g. Authenticator):** the Broker runs *inside* a host app,
and the Broker MVs also carry `active_broker_package_name` (the host) and `AppInfo_Version`
(which, for a given host package, **is that host app's version** — e.g. for
`com.azure.authenticator`, `AppInfo_Version == 6.2606.3817` is the Authenticator AppVersion, not
`broker_version`). Use the two `*-by-host-app.kql` templates to answer *"did the Authenticator
rollout move the broker?"* without contamination from other hosts. This matters because
fleet-wide `broker_version` deltas can be dominated by a host you are **not** shipping — e.g. Link
to Windows (`com.microsoft.appmanager`, ≈122 M devices) can swing an aggregate `io_error` figure
that has nothing to do with the Authenticator release. Top hosts by volume: `com.microsoft.appmanager`,
`com.azure.authenticator`, `com.microsoft.windowsintune.companyportal`.

**Broker gotchas:** distinct devices = `dcount_hll(hll_merge(countDevicesHll))`; never
`sum(countDevices)`. Never sum or average percentiles across rows; merge the digests with
`percentiles_array_tdigest(tdigest_merge(...))`.
`MergeAccountType()` / `MergeIsSharedDevice()` normalize the raw dimensions if you add filters.
**Device-share masks per-span request spikes:** `broker-top-errors-by-host-app.kql` dedups a device
across all spans, so a code can read flat/down there while its per-request rate climbs inside one
span (e.g. `invalid_grant` on `AcquireTokenSilent`). When an eSTS code is suspected, re-slice with
`broker-errors-by-host-app-span.kql` and separate early-rollout decay from a steady gap via a daily
trend before concluding. The trigger PR lives in the bundled broker version range — correlate with
`assets/scripts/find-suspect-prs.ps1 -Range v<PREV>..v<NEW>`.
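
A worked example may make the masking effect concrete. All numbers below are invented to show how the two metrics can move in opposite directions; they are not measurements from the Broker MVs, and the object layout is an assumption for illustration.

```javascript
// Invented inputs for one error code on two versions of one host app.
const second = {
  devices: 1000000,
  devicesWithCode: 30000,
  spans: { AcquireTokenSilent: { total: 10000000, errored: 100000 } },
};
const first = {
  devices: 200000,
  devicesWithCode: 5000,
  spans: { AcquireTokenSilent: { total: 2000000, errored: 40000 } },
};

// Device-share: devices hitting the code anywhere / devices on that version.
const deviceShare = (v) => v.devicesWithCode / v.devices;

// Request-level rate inside one span: errored requests / total requests.
const spanRate = (v, span) => v.spans[span].errored / v.spans[span].total;

// Fraction -> percentage points, rounded to 2 decimals.
const toPp = (x) => Math.round(x * 10000) / 100;

const deviceShareDeltaPp = toPp(deviceShare(first) - deviceShare(second));
const spanRateDeltaPp = toPp(
  spanRate(first, "AcquireTokenSilent") - spanRate(second, "AcquireTokenSilent")
);

console.log({ deviceShareDeltaPp, spanRateDeltaPp });
```

With these inputs the device-share falls by 0.5 pp while the per-request rate inside `AcquireTokenSilent` rises by 1 pp, which is the pattern the span drill-down is meant to surface.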

## Authenticator queries

| File | Purpose | Applies to |
|------|---------|-----------|
| `auth-version-resolve.kql` | Resolve candidate `AppVersion`s (newest `yymm` = current train). Auto-detect `<FIRST>`/`<SECOND>`. Uses `union *` (heavy) — prefer the cheap fallback below if it is slow. | all |
| `auth-scenario-success-rate.kql` | Per-version Initiated/Succeeded/Failed + SuccessRate. The headline per scenario. | single-MV **Registration / Authentication** scenarios |
| `auth-scenario-initiates.kql` | Per-version initiate volume (guards against reading noise as a regression). | any scenario (swap `<INIT_COL>`) |
| `auth-pn-checkforauth-completion.kql` | Two-table join: notifications initiated vs results reaching a terminal `FinalResult`. CompletionRate / DropRate. | **PN + CheckForAuth** families (MFA / PSI / MSA) |
| `auth-reacted-notification-split.kql` | Approved / Denied / Error split of reacted notifications. | **PN + CheckForAuth Results** families |
| `auth-stats.kql` | Fleet/adoption stats: total devices, adoption-over-time, DAU, version share, OEM/OS/Country. Raw `union *`. | app-wide |
| `authenticator-crash-denominator.kql` | Active devices for `<FIRST>` **and** `<SECOND>` in one query — the denominator for crashes-per-1k-active-devices (numerator from App Center). | crash/stability layer |

### Crash / stability (Authenticator)

Crash clusters are **not** in Kusto. Pull them from **App Center** with
`assets/scripts/fetch-appcenter-crashes.js`, then divide by the device counts from
`authenticator-crash-denominator.kql` for an honest crashes-per-1k rate. Read
`assets/docs/crash-sources.md` first: it covers auth/token, the `errorGroupId`-is-version-scoped
and share-vs-rate gotchas, the retirement of App Center Analytics, secret handling, and Play
Console Phase 2.

### Cheap version-resolution fallback

`union *` in `auth-version-resolve.kql` scans every table. If it is slow, resolve versions
from a high-volume MV instead (validated):

```kusto
Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1
| where EventDate >= datetime(<START>) and EventDate <= datetime(<END>)
| where isnotempty(AppVersion)
| summarize Devices = sum(NotificationInitiatedDCount) by AppVersion
| order by Devices desc
```

### Authenticator scenario → MV → column catalog

Outcome columns each have a `…DCount` distinct-device twin. **Registration / Authentication
MVs expose only `Initiated / Succeeded / Failed (+DCount)` and `TotalUniqueDevices` — there is
NO `Cancelled` / `PartiallySucceeded` column.** PN MVs carry only an initiated counter; the
terminal outcome lives in the paired `_Results_MV_V1`.

| Scenario | Registration/Auth MV (success-rate) | Initiate column | PN init MV | PN init column | PN results MV (`FinalResult`) | results init column |
|----------|-------------------------------------|-----------------|-----------|----------------|------------------------------|---------------------|
| Passkey WebAuthN Reg | `Passkey_WebAuthN_Registration_MV_V1` | `Initiated` | — | — | — | — |
| Passkey InApp Reg | `Passkey_InApp_Registration_MV_V1` | `Initiated` | — | — | — | — |
| Passkey WebAuthN Auth | `Passkey_WebAuthN_Authentication_MV_V1` | `Initiated` | — | — | — | — |
| Entra MFA Reg (QR) | `Entra_MFA_Registration_QR_Code_Flow_MV_V1` | `Initiated` | — | — | — | — |
| Entra MFA Reg (Manual/Non-QR) | `Entra_MFA_Registration_Manual_Flow_MV_V1` + `Entra_MFA_Registration_Non_QR_Code_Flow_MV_V1` | `Initiated` | — | — | — | — |
| Entra MFA PN+CFA | — | — | `Entra_MFA_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationInitiated` | `Entra_MFA_Push_Notification_And_CheckForAuth_Results_MV_V1` | `RequestTimeInitiated` |
| Entra PSI Reg | `Entra_PSI_Registration_MV_V1` | `Initiated` | — | — | — | — |
| Entra PSI PN-Reg | `Entra_PSI_Push_Notification_Registration_MV_V1` | `RegistrationStarted` | — | — | — | — |
| Entra PSI PN+CFA | — | — | `Entra_PSI_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationInitiated` | `Entra_PSI_Push_Notification_And_CheckForAuth_Results_MV_V1` | `RequestTimeInitiated` |
| MSA NGC Reg | `Entra_MSA_NGC_Registration_MV_V1` | `Initiated` | — | — | — | — |
| MSA SA Reg | `Entra_MSA_SA_Registration_MV_V1` | `Initiated` | — | — | — | — |
| MSA NGC/SA PN+CFA | — | — | `Entra_MSA_Push_Notification_And_CheckForAuth_MV_V1` | `NotificationReceivedInitiated` | `Entra_MSA_Push_Notification_And_CheckForAuth_Results_MV_V1` | `SessionTimeInitiated` |

**MSA NGC vs SA split:** the MSA PN init MV **and** its results MV both carry `IsNGC`
(`"true"` → NGC, `"false"` → SA). Apply the same `| where IsNGC == "..."` filter on both
sides of the join.

`FinalResult` ∈ {`Approved`, `Denied`, `Error`}. Completion = (Approved + Denied) ÷ initiated.
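
The completion arithmetic, with Approved and Denied summed before dividing by initiated, can be sketched as follows. The function names and counts are hypothetical; the split over reacted notifications assumes "reacted" means any row with a `FinalResult`.

```javascript
// initiated: initiate counter from the PN init MV.
// results: row counts by FinalResult from the paired results MV.
function completionRate(initiated, results) {
  if (initiated <= 0) return null;
  const completed = (results.Approved || 0) + (results.Denied || 0);
  return completed / initiated;
}

// Share of each FinalResult among reacted notifications (assumption: reacted
// = Approved + Denied + Error).
function reactedSplit(results) {
  const approved = results.Approved || 0;
  const denied = results.Denied || 0;
  const error = results.Error || 0;
  const reacted = approved + denied + error;
  if (reacted === 0) return null;
  return {
    approved: approved / reacted,
    denied: denied / reacted,
    error: error / reacted,
  };
}

// Invented counts for one version.
const counts = { Approved: 8000, Denied: 500, Error: 300 };
const completion = completionRate(10000, counts);
const split = reactedSplit(counts);
console.log(completion, split);
```

Note the two denominators differ: completion divides by initiated notifications, while the split divides by reacted ones.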

### Drilling below the outcome MVs (the "why" behind a moved metric)

The outcome MVs answer *what* (rate up/down); they do **not** explain *why*. Two layers sit beneath
them, plus one check on what the MV itself counts. Climb down per
[`../docs/investigation-patterns.md`](../docs/investigation-patterns.md):

1. **`*_Errors_MV_V1` companion (reason + dimension).** Essentially every scenario has one, named by
inserting `Errors` before the `MV_V1` suffix of the outcome MV name, e.g.
`Passkey_WebAuthN_Registration_MV_V1` → `Passkey_WebAuthN_Registration_Errors_MV_V1`,
`Passkey_WebAuthN_Authentication_MV_V1` → `Passkey_WebAuthN_Authentication_Errors_MV_V1`
(PN families use `…_And_CheckForAuth_Errors_MV_V1`).
Schema is uniform: `EventDate, Error, OsLevel, AppVersion, DeviceInfoMake, ErrorCount, ErrorDCount,
TotalUniqueDevices`. This is the **reason breakdown of `Failed`**, already sliced by OS major and
OEM — exactly the dimensional decomposition (P6) and benign-vs-real classification (P5) the patterns
need. It carries **counts only**, so always pair it with the outcome MV's `Initiated` for the rate.

2. **Raw `passkeyoperations` (structured sub-code).** When `Error` is a coarse bucket, the raw table
has the finer code. Key fields: `OperationName` (`PasskeyCredentialRequest{Initiated,Succeeded,
Failed}`, plus sub-operations like `PasskeyBeginGetCredential*`), `AppInfo_Version`,
`DeviceInfo_OsVersion` (`osLevel = tostring(split(DeviceInfo_OsVersion," ")[0])`), `DeviceInfo_Make`,
`DeviceInfo_Id`, `EventInfo_Time`, and `AllProperties` (JSON string — `todynamic()` it). Useful
`AllProperties` keys: `RequestType` (`CreatePasskeyCredentialRequest` = registration,
`GetPasskeyCredentialRequest` = authentication), `PasskeyFlow`
(`WEB_AUTH_N_REGISTRATION`/`WEB_AUTH_N_AUTHENTICATION`/`IN_APP_REGISTRATION`), `Error`,
`ErrorSource`, `IsCrossDevice`, `DeviceUnauthenticatedErrorCode` (Android `BiometricPrompt` code —
5/10/13/14 = abandonment, 1/7/9 = device/hard), `DeviceUnauthenticatedErrorMessage`, `Source`.

3. **Know what a metric counts before you drill (P9):** `.show materialized-view <Name> | project Query`
reveals the source table and the `OperationName`/`RequestType`/`PasskeyFlow` filters — e.g.
Registration MVs count only `CreatePasskeyCredentialRequest`, Authentication only
`GetPasskeyCredentialRequest` — so you query the right request family in the raw table.