SQL-414: Do allowed_roles check on internal endpoints by SangJunBak · Pull Request #37260 · MaterializeInc/materialize

SangJunBak · 2026-06-23T21:09:18Z

There exists a security bug where self managed environments with password/oidc auth can have
any non-system user access our internal HTTP routes. These include injecting audit events,
leader promotion, and etc. This PR fixes this by extending the schema of our server listener configurations.

Motivation

Fixes https://linear.app/materializeinc/issue/SQL-414/http-endpoints-only-authenticate-but-dont-authorize

Description

To review, I'd recommend reading the plan in the commit "Outline the problem and solution"
769b7bd (this PR) to understand the problem. The solution proposed there is quite 1:1 with the implementation. The implementation aims to simplify the authorization code in our web server handling, so much of the code added in this PR are simple refactors. I'd recommend reviewing commit by commit.

Verification

We base it off the regression test Dennis initially created and make sure it passes our existing test suite.

def-

Nightly failures look very related! https://buildkite.com/materialize/nightly/builds/16892

def- · 2026-06-26T11:11:41Z

+            # Older environmentd requires a listeners config but only
+            # parses the legacy bool-route schema, so swap to the matching file
+            # under `listener_configs/v0_147_0/` for those versions.
+            if image_version is not None and image_version < "v26.31.0-dev":


We're on v26.32.0-dev, rebase and bump

def- · 2026-06-26T11:13:25Z

Are we ok with this file allowing full access to internal routes for the emulator? Edit:

diff --git a/test/http-auth/mzcompose.py b/test/http-auth/mzcompose.py index 74cb1a914c..ba5f737f23 100644 --- a/test/http-auth/mzcompose.py +++ b/test/http-auth/mzcompose.py @@ -111,6 +111,12 @@ def workflow_default(c: Composition) -> None: check_livez_coordinator_coupling(c) + check_emulator_baked_in_password_config(c) + + +def workflow_emulator_password_config(c: Composition) -> None: + check_emulator_baked_in_password_config(c) + def check_livez_coordinator_coupling(c: Composition) -> None: """Regression test for SQL-343: previously having group_claim_for above @@ -198,3 +204,36 @@ def check_internal_endpoints_require_authorization( f"/api/catalog/inject-audit-events: status={r.status_code}, " f"audit rows={forged_rows}" ) + + +def check_emulator_baked_in_password_config(c: Composition) -> None: + with c.override( + Materialized( + listeners_config_path=f"{MZ_ROOT}/src/materialized/ci/listener_configs/v26_31_0/password.json" + ) + ): + c.up("materialized") + ext = f"http://localhost:{c.port('materialized', 6876)}" + + # mz_system authenticates with the baked-in login password + # (MZ_EXTERNAL_LOGIN_PASSWORD_MZ_SYSTEM) to provision a normal login role. + admin = ("mz_system", "password") + normal = ("sql414_normal", "sql414_normal_pw") + r = requests.post( + f"{ext}/api/sql", + auth=admin, + json={"query": f"CREATE ROLE {normal[0]} LOGIN PASSWORD '{normal[1]}'"}, + ) + assert r.status_code == 200, f"role setup failed: {r.status_code}: {r.text}" + + # `/api/catalog/dump` and `/api/coordinator/dump` are the `internal` + # group, `/prof/` the `profiling` group. Coordinator dump carries pgwire + # cancellation secret keys. + for path in ("/api/catalog/dump", "/api/coordinator/dump", "/prof/"): + slug = path.strip("/").replace("/", "_") + with c.test_case(f"emulator_password_{slug}_blocks_normal_user"): + r = requests.get(f"{ext}{path}", auth=normal) + assert r.status_code in (401, 403), ( + f"normal role {normal[0]} reached {path} on the external " + f"emulator port: status={r.status_code}, {len(r.text)} bytes" + )

Running bin/mzcompose --find http-auth run emulator-password-config fails with:

==> Running test case emulator_password_prof_blocks_normal_user ==> mzcompose: test case emulator_password_prof_blocks_normal_user failed: builtins.AssertionError: normal role sql414_normal reached /prof/ on the external emulator port: status=200, 1564 bytes Traceback (most recent call last): File "/home/deen/git/materialize/misc/python/materialize/mzcompose/composition.py", line 740, in test_case yield File "/home/deen/git/materialize/test/http-auth/mzcompose.py", line 250, in check_emulator_baked_in_password_config assert r.status_code in (401, 403), ( ^^^^^^^^^^^^^^^^^^^^^^^^^^^ AssertionError: normal role sql414_normal reached /prof/ on the external emulator port: status=200, 1564 bytes

Hmm I can see what happens if I make it more strict, but I essentially copied over the same configuration as the old listener config for password.json.

I may follow up for this however in a followup PR given my changes doesn't change the existent behavior. Also I can see how it'd be convenient to relax the access control in the listener configs for our tests

SangJunBak · 2026-06-26T16:28:54Z

Converting into draft to address nightlies

Copy over the test Dennis created for this bug

I've outlined the root cause of the bug. It would be easy to do a one-off fix and do an authorization check for the internal routes, however that would violate the semantics of the listener config and make our implementation more confusing than it already is. The solution I propose aims to make our server code simpler while preserving the config's semantics, at the cost of a bigger refactor.

We modify the schema of the listener config: - Creates new RouteGroup struct rather than a bool representing if a group of routes is enabled - Lowers `allowed_roles` to each RouteGroup - Creates a separate JSON serialization representing Rust's associated enum type We also implement a single middleware for all authorization and decouple it from authentication - We take advantage that all authentication creates an AuthedUser. Since Authz/authorization is just a function of the user name and the route's policy, We can create a middleware to keep concerns separate. We already pass along the AuthedUser as an extension, so we can attach another extension to carry over the route's policy.

We update test utils and misc to follow the new schema

SangJunBak · 2026-06-27T02:23:07Z

@def- I looked into why many of the nightlies (scalability, feature, workload replay) are failing, and it seems to be because each are comparing the branch's version against current main. However because main is at the same dev version (with a different sha) but doesn't contain my listener config changes, my version guard fails and the "ancestor" version panics. It seems to me like this is self healing however once this PR gets merged since future dev versions of v26.32.0 will know how to support the new listener config.

For fast lookup, this my commit for version guarding my change in Materialized: 27f5f57 (this PR)

Other nightlies (like the platform checks) I've fixed as well.

def-

New issue:

diff --git a/ci/nightly/pipeline.template.yml b/ci/nightly/pipeline.template.yml
index b511272acd..5f4113eb62 100644
--- a/ci/nightly/pipeline.template.yml
+++ b/ci/nightly/pipeline.template.yml
@@ -2456,6 +2456,20 @@ steps:
         agents:
           queue: hetzner-aarch64-16cpu-32gb

+      - id: orchestratord-listener-config-version-skew
+        label: Orchestratord listener config version skew
+        artifact_paths: ["mz_debug_*.zip"]
+        depends_on: devel-docker-tags
+        timeout_in_minutes: 60
+        plugins:
+          - ./ci/plugins/mzcompose:
+              composition: orchestratord
+              run: listener-config-version-skew
+              args: [--recreate-cluster]
+              ci-builder: stable
+        agents:
+          queue: hetzner-aarch64-16cpu-32gb
+
       - id: orchestratord-oidc-auth
         label: Orchestratord OIDC auth end-to-end
         artifact_paths: ["mz_debug_*.zip"]
diff --git a/test/orchestratord/mzcompose.py b/test/orchestratord/mzcompose.py
index ff75fd5a8b..f1d318ab72 100644
--- a/test/orchestratord/mzcompose.py
+++ b/test/orchestratord/mzcompose.py
@@ -3528,6 +3528,127 @@ def workflow_revert_rollout(c: Composition, parser: WorkflowArgumentParser) -> N
     )


+# environmentd in this half-open band only understands the legacy listener-config
+# schema; the new schema's parser first ships at the upper bound. The buggy
+# orchestratord gate sits at the lower bound, so it mis-serves the new schema to
+# this whole band. See `workflow_listener_config_version_skew`.
+LEGACY_LISTENER_SCHEMA_BAND = ("v26.31.0-dev.0", "v26.32.0-dev.0")
dev git diff > out                                  ~/git/materialize3@jun/fix-internal-authorization-bug
dev cat out                                         ~/git/materialize3@jun/fix-internal-authorization-bug
diff --git a/ci/nightly/pipeline.template.yml b/ci/nightly/pipeline.template.yml
index b511272acd..5f4113eb62 100644
--- a/ci/nightly/pipeline.template.yml
+++ b/ci/nightly/pipeline.template.yml
@@ -2456,6 +2456,20 @@ steps:
         agents:
           queue: hetzner-aarch64-16cpu-32gb

+      - id: orchestratord-listener-config-version-skew
+        label: Orchestratord listener config version skew
+        artifact_paths: ["mz_debug_*.zip"]
+        depends_on: devel-docker-tags
+        timeout_in_minutes: 60
+        plugins:
+          - ./ci/plugins/mzcompose:
+              composition: orchestratord
+              run: listener-config-version-skew
+              args: [--recreate-cluster]
+              ci-builder: stable
+        agents:
+          queue: hetzner-aarch64-16cpu-32gb
+
       - id: orchestratord-oidc-auth
         label: Orchestratord OIDC auth end-to-end
         artifact_paths: ["mz_debug_*.zip"]
diff --git a/test/orchestratord/mzcompose.py b/test/orchestratord/mzcompose.py
index ff75fd5a8b..fb46a01c2b 100644
--- a/test/orchestratord/mzcompose.py
+++ b/test/orchestratord/mzcompose.py
@@ -3528,6 +3528,171 @@ def workflow_revert_rollout(c: Composition, parser: WorkflowArgumentParser) -> N
     )


+# environmentd in this half-open band only understands the legacy listener-config
+# schema; the new schema's parser first ships at the upper bound. The buggy
+# orchestratord gate sits at the lower bound, so it mis-serves the new schema to
+# this whole band. See `workflow_listener_config_version_skew`.
+LEGACY_LISTENER_SCHEMA_BAND = ("v26.31.0-dev.0", "v26.32.0-dev.0")
+
+
+def newest_legacy_listener_schema_version() -> MzVersion | None:
+    """Newest published environmentd in the band an operator built from this repo
+    can mis-serve, or None if empty. Queries the operator helm index directly
+    because `fetch_self_managed_versions` drops the prereleases (v26.31.0-rc.*)
+    that currently populate the band."""
+    low, high = (MzVersion.parse_mz(v) for v in LEGACY_LISTENER_SCHEMA_BAND)
+    index = yaml.safe_load(
+        requests.get(
+            "https://materializeinc.github.io/materialize/index.yaml", timeout=30
+        ).text
+    )
+    in_band = sorted(
+        v
+        for entry in index["entries"]["materialize-operator"]
+        if low <= (v := MzVersion.parse_mz(entry["appVersion"])) < high
+    )
+    return in_band[-1] if in_band else None
+
+
+def get_listeners_config() -> dict[str, Any] | None:
+    """Parsed `listeners.json` from the rendered listeners configmap, or None if
+    the operator has not created it yet."""
+    cms = spawn.capture(
+        ["kubectl", "get", "cm", "-n", "materialize-environment", "-o", "name"],
+        stderr=subprocess.STDOUT,
+    ).splitlines()
+    cm = next((cm for cm in cms if "listeners" in cm), None)
+    if cm is None:
+        return None
+    return json.loads(
+        spawn.capture(
+            [
+                "kubectl",
+                "get",
+                cm,
+                "-n",
+                "materialize-environment",
+                "-o",
+                r"jsonpath={.data.listeners\.json}",
+            ],
+            stderr=subprocess.STDOUT,
+        )
+    )
+
+
+def environmentd_container_status() -> dict[str, Any] | None:
+    """containerStatuses[0] of the environmentd pod, or None if not present yet."""
+    try:
+        items = get_environmentd_data().get("items", [])
+    except subprocess.CalledProcessError:
+        return None
+    if not items:
+        return None
+    statuses = items[0].get("status", {}).get("containerStatuses")
+    return statuses[0] if statuses else None
+
+
+def workflow_listener_config_version_skew(
+    c: Composition,
+    parser: WorkflowArgumentParser,
+) -> None:
+    # Regression test for the stale orchestratord listener-config version gate
+    # (`PER_ROUTE_GROUP_ROLES_VERSION` in
+    # src/orchestratord/src/controller/materialize/generation.rs). environmentd
+    # parses MZ_LISTENERS_CONFIG_PATH strictly with no fallback, and the operator
+    # must serve each environmentd (per its target image ref) the schema its
+    # binary understands. A gate set below where the new schema actually ships
+    # serves the new schema to a legacy-only environmentd, which then crash-loops
+    # on boot. Other workflows miss this: they move operator and environmentd in
+    # lockstep, and the affected band holds only prereleases that
+    # get_all_self_managed_versions filters out. So we keep the operator on this
+    # repo's build and pin environmentd back to a published band version.
+    parser.add_argument(
+        "--recreate-cluster",
+        action=argparse.BooleanOptionalAction,
+        help="Recreate cluster if it exists already",
+    )
+    parser.add_argument("--tag", type=str, help="Custom version tag to use")
+    parser.add_argument(
+        "--orchestratord-override",
+        default=True,
+        action=argparse.BooleanOptionalAction,
+        help="Override orchestratord tag",
+    )
+    args = parser.parse_args()
+
+    skew_version = newest_legacy_listener_schema_version()
+    if skew_version is None:
+        print(f"No published environmentd in {LEGACY_LISTENER_SCHEMA_BAND}, skipping")
+        return
+
+    definition = setup(c, args)
+    definition["materialize"]["spec"]["environmentdImageRef"] = get_image(
+        c.compose["services"]["environmentd"]["image"], str(skew_version)
+    )
+    print(f"operator {get_version(args.tag)} managing environmentd {skew_version}")
+
+    init(definition)
+    apply_materialize(definition)
+
+    # Let environmentd boot on the config the operator served it, and watch for a
+    # crash loop (the bug) vs a healthy start (the fix). The legacy binary cannot
+    # parse the new per-route-group schema, so it errors out of run() on boot and
+    # k8s backs it off into CrashLoopBackOff.
+    crashed = False
+    for _ in range(120):
+        time.sleep(5)
+        cs = environmentd_container_status()
+        if cs is None:
+            continue
+        if cs.get("ready"):
+            break
+        waiting = cs.get("state", {}).get("waiting", {}).get("reason")
+        if waiting == "CrashLoopBackOff" or cs.get("restartCount", 0) >= 3:
+            crashed = True
+            break
+
+    if crashed:
+        served = get_listeners_config() or {}
+        print(f"operator served listeners.json version={served.get('version')!r}")
+        # The previous (terminated) container's logs hold the listener-config
+        # parse error that aborted boot.
+        for cmd in (
+            [
+                "kubectl",
+                "logs",
+                "-l",
+                "app=environmentd",
+                "-n",
+                "materialize-environment",
+                "--previous",
+                "--tail=50",
+            ],
+            [
+                "kubectl",
+                "describe",
+                "pod",
+                "-l",
+                "app=environmentd",
+                "-n",
+                "materialize-environment",
+            ],
+        ):
+            try:
+                spawn.runv(cmd)
+            except subprocess.CalledProcessError:
+                pass
+        raise AssertionError(
+            f"environmentd {skew_version} crash-looped on boot: the operator "
+            f"served it the new listener-config schema it cannot parse. The "
+            f"PER_ROUTE_GROUP_ROLES_VERSION gate is stale (must match "
+            f"{LEGACY_LISTENER_SCHEMA_BAND[1]})."
+        )
+
+    # Not crash-looping: confirm environmentd booted and the rollout completed.
+    post_run_check(definition, expect_fail=False)
+
+
 def workflow_rollout_timeout(c: Composition, parser: WorkflowArgumentParser) -> None:
     # Tests CLO-81: orchestratord automatically cancels an in-progress rollout
     # once it has been running longer than `spec.rolloutRequestTimeout`. A new

Running bin/mzcompose --find orchestratord run listener_config_version_skew --recreate-cluster crash loops:

2026-06-27T07:09:53.658741Z  thread 'main' panicked at src/environmentd/src/environmentd/main.rs:654:9:
environmentd: fatal: invalid type: map, expected a boolean at line 1 column 361
   5: core::panicking::panic_fmt
   6: mz_environmentd::environmentd::main::main
   7: environmentd::main
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

The issue seems to just be the version gate being wrong now.

Thanks for explaining the nightly failures, they make sense! A bit unfortunate that we'll only see the tests succeeding after merging to main, but ok for now.

def- · 2026-06-27T07:09:35Z

+/// `v0_147_0::ListenersConfig` schema, so we serve that to them.
+static PER_ROUTE_GROUP_ROLES_VERSION: LazyLock<Version> = LazyLock::new(|| Version {
+    major: 26,
+    minor: 31,


This is wrong, we're on 32

Suggested change

minor: 31,

minor: 32,

Validated that the test you posted is fixed in another PR #37326 where I bumped the version number 32. I feel like we should make this workflow more generic since it seems important to have a workflow where orchestratord is at a higher version than environmentd, but I might do that in a followup PR.

- Version gate the new listener config - Implement a migration from the old listener config to the current one to avoid having to redefine one in orchestratord when gating. - This is the change that will actually have an effect for all deployments in Cloud and self managed

We now match what we'd deploy in production and set allowed_roles to Internal for the internal/profiling routes

Almost all of these changes are simple one line refactors / moving files around. - For Password/SASL/OIDC listener configs, set internal from NormalAndInternal to Internal to maintain consistency - Move v0_145_0 configs to its own folder, as well as v26_31_0 - Update materialized docker image to copy a version's folder

We create a separate function for resolving the listener config depending on the version / whether there's an upgrade. This is important such that our upgrade tests don't break. For anything that above that overrides these listener configs, we override these too

We can now remove the plan. It was mostly for the purpose of documentation / helping the reviewer.

alex-hunt-materialize · 2026-06-29T11:35:09Z

 use serde::{Deserialize, Serialize};

+const LISTENERS_CONFIG_VERSION_0_147_0: &str = "0.147.0";
+const LISTENERS_CONFIG_VERSION_26_32_0: &str = "26.32.0";


Should we use an enum wrapper instead? Having version constants that we match on and manually set feels bad. Environmentd doesn't verify the version field anyway.

Maybe we could have this just have two isolated modules with their own ListenersConfig types, and then in Orchestratord we could operate on something like:

#[derive(Serialize, Deserialize)] #[serde(tag = "version")] pub enum VersionedListenersConfig { #[serde(rename = "0.147.0")] V0_147_0(v0_147_0::ListenersConfig), #[serde(rename = "26.32.0")] V26_32_0(v26_32_0::ListenersConfig), }

Environmentd would then only need to know about v26_32_0::ListenersConfig.

This makes sense to me!

Resolved here c89af1f

…enum Upon Alex's request, we simplify the constants by converting it into an enum that serializes the "version" field via serde macro arguments. Note: we name the values of the enum V1 and V2 since we have a Rust lint rule that bans camel case for enum values (underscores count as camel case)

def-

No further complaints from QA side

aljoscha

Nice!

aljoscha · 2026-06-30T08:32:36Z

+/// Version at which HTTP listeners moved `allowed_roles` to be per route group
+/// (the `ListenersConfig` schema). Older `environmentd` parses the legacy
+/// `v0_147_0::ListenersConfig` schema, so we serve that to them.
+static PER_ROUTE_GROUP_ROLES_VERSION: LazyLock<Version> = LazyLock::new(|| Version {


Gotta make sure this is up to date when merging

SangJunBak requested a review from alex-hunt-materialize June 23, 2026 21:09

SangJunBak force-pushed the jun/fix-internal-authorization-bug branch 6 times, most recently from f3062db to 1294913 Compare June 26, 2026 07:30

SangJunBak removed the request for review from alex-hunt-materialize June 26, 2026 07:35

SangJunBak force-pushed the jun/fix-internal-authorization-bug branch from 1294913 to 7be0604 Compare June 26, 2026 07:52

SangJunBak marked this pull request as ready for review June 26, 2026 07:53

SangJunBak requested review from a team and ggevay as code owners June 26, 2026 07:53

SangJunBak requested review from alex-hunt-materialize and jubrad and removed request for jubrad June 26, 2026 07:53

def- requested changes Jun 26, 2026

View reviewed changes

SangJunBak mentioned this pull request Jun 26, 2026

dev docs: update guidance on code changes #37274

Merged

SangJunBak marked this pull request as draft June 26, 2026 16:29

SangJunBak added 4 commits June 26, 2026 16:34

test: Copy over Dennis' regression test

02bd5d2

Copy over the test Dennis created for this bug

environmentd,sqllogictest: Update misc with new schema

6eb5f99

We update test utils and misc to follow the new schema

SangJunBak force-pushed the jun/fix-internal-authorization-bug branch 3 times, most recently from c85efd3 to 9472d30 Compare June 27, 2026 01:14

SangJunBak requested review from aljoscha and removed request for ggevay June 27, 2026 01:15

SangJunBak requested a review from def- June 27, 2026 02:23

SangJunBak marked this pull request as ready for review June 27, 2026 02:23

def- requested changes Jun 27, 2026

View reviewed changes

SangJunBak added 6 commits June 27, 2026 07:01

test: Fix listener config for regression test

f1a1565

We now match what we'd deploy in production and set allowed_roles to Internal for the internal/profiling routes

Adjust regression test for /api/login

fa83705

Remove temporary plan

8fdf1fd

We can now remove the plan. It was mostly for the purpose of documentation / helping the reviewer.

SangJunBak force-pushed the jun/fix-internal-authorization-bug branch from 9472d30 to 8fdf1fd Compare June 27, 2026 11:42

SangJunBak mentioned this pull request Jun 27, 2026

Regression test for auth bug #37326

Closed

alex-hunt-materialize approved these changes Jun 29, 2026

View reviewed changes

SangJunBak requested a review from def- June 29, 2026 14:21

SangJunBak force-pushed the jun/fix-internal-authorization-bug branch from dcea41e to c89af1f Compare June 29, 2026 17:47

def- approved these changes Jun 29, 2026

View reviewed changes

aljoscha approved these changes Jun 30, 2026

View reviewed changes

SangJunBak merged commit 59b1f16 into MaterializeInc:main Jun 30, 2026
128 checks passed

Uh oh!

Conversation

SangJunBak commented Jun 23, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Motivation

Description

Verification

Uh oh!

def- left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

def- Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SangJunBak Jun 26, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SangJunBak commented Jun 26, 2026

Uh oh!

SangJunBak commented Jun 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

SangJunBak Jun 29, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

def- left a comment

Choose a reason for hiding this comment

Uh oh!

aljoscha left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

SangJunBak commented Jun 23, 2026 •

edited

Loading

def- left a comment •

edited

Loading

def- Jun 26, 2026 •

edited

Loading

SangJunBak Jun 26, 2026 •

edited

Loading

SangJunBak commented Jun 27, 2026 •

edited

Loading

SangJunBak Jun 29, 2026 •

edited

Loading