fix(workflows): in-place updates, deletion-forgery fix, webhook-secret & channel immutability#1340
Conversation
…ent UUID Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…nd enforce channel immutability on updates Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
…hannel_id update scenario Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
AaronGoldsmith
left a comment
There was a problem hiding this comment.
Self-review: flagging the non-obvious / higher-blast-radius bits so they're easy to scrutinize.
| match state.db.get_workflow(wf_id).await { | ||
| Ok(wf) => { | ||
| if wf.owner_pubkey != owner_bytes { | ||
| return Err(anyhow::anyhow!("forbidden: deletion owner mismatch")); |
There was a problem hiding this comment.
🔒 Security (deletion forgery). NIP-09 a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID. A forged 30620:<attacker>:<victim_uuid> could delete another user's workflow. We now load the row and assert owner_pubkey before deleting. This extra lookup is the ownership gate, not a redundant query.
| } | ||
|
|
||
| // Check if workflow already exists to perform update or create checks | ||
| let existing = state.db.get_workflow(workflow_id).await; |
There was a problem hiding this comment.
🔍 Ordering is load-bearing (info-leak). The is_member_cached gate above (line 580) runs before this lookup on purpose, so unauthorized users can't probe for valid workflow UUIDs via differing error responses. Please don't hoist this lookup earlier for efficiency — it silently reintroduces the leak.
| "forbidden: cannot update a workflow owned by another user".into(), | ||
| )); | ||
| } | ||
| if record.channel_id != Some(channel_id) { |
There was a problem hiding this comment.
⛓️ Channel immutability + legacy NULL. record.channel_id != Some(channel_id) is deliberate: it rejects both some→some channel moves and the legacy NULL→some case. A naive unwrap/compare would let legacy global workflows be silently re-scoped.
| let secret = webhook_secret::generate_webhook_secret(); | ||
| webhook_secret::inject_secret(&mut definition_json, &secret); | ||
| Some(secret) | ||
| let existing_secret = existing_record |
There was a problem hiding this comment.
🔑 Webhook secret preservation. On update we re-inject the existing secret from the stored row (returning None so it isn't rotated/re-surfaced). Previously every edit regenerated _webhook_secret, breaking existing webhook callers.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9e34ed4b9e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| state | ||
| .db | ||
| .update_workflow(workflow_id, &workflow_name, &definition_json_final, &hash) | ||
| .await | ||
| .map_err(|e| IngestError::Internal(format!("error: db update_workflow: {e}")))?; |
There was a problem hiding this comment.
Ignore stale workflow definition replays
When an older kind:30620 event for the same workflow is delivered after a newer one (for example after a reconnect/retry, while still inside the relay's timestamp window), this unconditional update_workflow overwrites the workflows table with the stale YAML. The command path uses persist_command_event rather than the NIP-33 replace_parameterized_event stale-write check, so clients will still see the newer event as the latest while the workflow engine reads and runs the older definition from the DB. Please compare the incoming event's (created_at, id) against the current live coordinate before applying the update, or route workflow defs through the same replacement logic.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
🤖 Addressed in efa13ed.
The command path now goes through the same NIP-33 replacement logic as the normal store path. Rather than duplicate the check, I extracted the lock + (created_at, id) dominance test + soft-delete into a shared buzz_db::nip33_stale_write_guard(&mut conn, …) and call it from both Db::replace_parameterized_event and persist_command_event, so the two paths cannot drift.
- A dominated (older/retried) event now returns
PersistResult::Duplicate, so the handler skipsupdate_workflowentirely — the stale YAML never reaches the DB, and clients no longer see a newer event than the engine runs. - Among command kinds, only
KIND_WORKFLOW_DEF(30620) is parameterized-replaceable (DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest. - Covered by
test_workflow_update_and_delete(passing against a branch-built relay). Note: Nostrcreated_atis second-granularity, so same-second edits are resolved by the id tie-break — documented inline where the test sleeps 1s to stay deterministic.
👍 Useful catch — thanks.
Workflow def events (kind 30620) are routed to the command executor and never reach Db::replace_parameterized_event, so persist_command_event inserted them with plain ON CONFLICT DO NOTHING and no (created_at, id) replacement check. An older event delivered after a newer one (reconnect /retry within the timestamp window) was inserted and applied, overwriting the workflows row with stale YAML while clients still saw the newer event as latest. Extract the lock + dominance check + soft-delete into a single shared buzz_db::nip33_stale_write_guard(&mut conn, ...) and call it from both Db::replace_parameterized_event and persist_command_event, so the command and normal store paths apply identical NIP-33 replacement rules and cannot drift. A dominated event is reported as an idempotent duplicate, so the handler skips its mutation. Only KIND_WORKFLOW_DEF among command kinds is parameterized-replaceable (DM/approval kinds are 41xxx/46xxx), so behavior is unchanged for the rest. Document in the e2e test why a 1s sleep is required: created_at is second-granularity and same-second edits are resolved by the id tie-break. Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com> Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
8889232 to
efa13ed
Compare
There was a problem hiding this comment.
These comments are from my buzz agent :)
Adversarial review: I found one CHANGE blocker.
CHANGE: workflow deletion removes the DB row but leaves the live kind:30620 definition event queryable
The PR fixes the DB-side workflow delete path, but the NIP-09 a-tag workflow branch still does not soft-delete the parameterized-replaceable event row. In handle_a_tag_deletion, the KIND_WORKFLOW_DEF UUID branch loads the workflow, checks owner, and calls state.db.delete_workflow(wf_id), but then stops there (crates/buzz-relay/src/handlers/side_effects.rs). The generic NIP-33 soft_delete_by_coordinate branch is explicitly skipped for workflow defs.
That means an authorized workflow deletion can stop the scheduler/webhook DB record, but the latest kind:30620 definition event remains live in events.
Why this matters:
- Desktop/CLI workflow reads are event-backed, not DB-backed. Desktop
get_channel_workflowsqueries relay events withkinds: [30620], "#h": [channel_id], andget_workflowquerieskinds: [30620], "#d": [workflow_id](desktop/src-tauri/src/commands/workflows.rs). CLI does the same (crates/buzz-cli/src/commands/workflows.rs). - Since the workflow definition event is not soft-deleted, a deleted workflow can still appear in the workflows list/detail after refresh even though the DB row is gone.
- The new e2e coverage only checks
db.get_workflow(workflow_id)returnsNotFoundafter delete, so it misses the client-facing read model regression.
Suggested fix: when a workflow delete is authorized, also soft-delete the live kind:30620 coordinate (KIND_WORKFLOW_DEF, owner_pubkey, d_tag) inside the workflow branch (or fall through/share the generic NIP-33 coordinate delete after the owner check). Then add a regression assertion that a relay query for kind:30620 + #d=<workflow_id> returns no live definition after authorized deletion.
Otherwise
The in-place update model, owner/channel checks before update, webhook secret preservation, and shared NIP-33 stale-write guard are the right direction. I specifically like moving workflow-def command events onto the same (created_at, id) dominance rule as normal parameterized-replaceable events.
Verdict: Request changes until workflow deletion also removes the live definition event from event queries.
After delete_workflow removes the DB row, also call soft_delete_by_coordinate so relay REQs stop returning the live kind:30620 event. Clients (Desktop/CLI) read workflows from events, not the DB, so a DB-only delete left deleted workflows visible. Adds regression assertion in test_workflow_update_and_delete: after authorized NIP-09 a-tag delete, a relay REQ for kind:30620 + #d=<id> must return no events. Fixes: wesbillman CHANGES_REQUESTED blocker on PR block#1340 Co-authored-by: Aaron Goldsmith <aargoldsmith@gmail.com> Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com>
The update_workflow SQL only set name/definition/definition_hash, leaving updated_at frozen at create time. There is no DB trigger to auto-update it, so DB consumers and diagnostics always saw stale modification times. Add updated_at = NOW() to the SET clause. Co-authored-by: Aaron Goldsmith <aargoldsmith@gmail.com> Signed-off-by: Aaron Goldsmith <aargoldsmith@gmail.com> Signed-off-by: Aaron Goldsmith <aarong@squareup.com> Co-authored-by: Codex <noreply@openai.com>
Summary
Workflows that were edited or deleted kept running concurrently. The root cause: the relay called
create_workflow, which generated a random DB primary key on every publish, so each edit inserted a new active row (the old one kept firing) and client-side deletions never matched a row.This PR makes the client-supplied UUID (the Nostr parameterized-replaceable-event
dtag) the stable primary key, so edits update in place and deletions resolve correctly. It also fixes three security/stability issues found in review:handle_workflow_defresolves the existing row by UUID and callsupdate_workflowinstead of unconditionally inserting. No more duplicate concurrent cron jobs.a-tag deletion previously only checked the actor matched the coordinate pubkey, then deleted by UUID without confirming the DB row's owner. A forged30620:<attacker>:<victim_uuid>could delete another user's workflow. Now the row is loaded andowner_pubkeyis verified before deleting._webhook_secret, breaking existing callers. The existing secret is now extracted from the stored row and re-injected on update.record.channel_id != Some(channel_id)rejects moving a workflow between channels, including the legacyNULL-channel case.replace_parameterized_event, so an older event delivered after a newer one (reconnect/retry) could overwrite the row with stale YAML. The lock +(created_at, id)dominance check + soft-delete is now extracted into a sharedbuzz_db::nip33_stale_write_guard, called by bothreplace_parameterized_eventandpersist_command_eventso the two paths can't drift. Dominated events become idempotent duplicates and skip the mutation.Changes
buzz-db/src/workflow.rs—create_workflownow takesid: Uuidinstead of generating one internally.buzz-db/src/lib.rs— extractednip33_stale_write_guard(&mut conn, …);replace_parameterized_eventnow calls it (behavior unchanged) so the command path can share it.buzz-relay/src/handlers/command_executor.rs—handle_workflow_defparses thed-tag UUID, looks up any existing row (after the membership gate), enforces owner + channel-immutability, preserves the webhook secret, and routes toupdate_workflowvscreate_workflow.persist_command_eventapplies the shared NIP-33 stale-write guard for parameterized-replaceable command kinds.buzz-relay/src/handlers/side_effects.rs—handle_a_tag_deletionloads the workflow and assertsowner_pubkeybefore deleting; missing rows are logged and treated as a no-op.buzz-test-client/tests/e2e_workflows.rs—test_workflow_update_and_deleteexpanded to assert in-place update, webhook-secret stability, channel-change rejection, and deletion-forgery rejection (asserting the row persists).Verification
Manual before/after: a 1-minute cron edited to v2/v3 previously left multiple duplicate jobs firing and could not be deleted; after the fix only the latest definition fires and a NIP-09 deletion stops it immediately.
Reviewer note — pre-existing duplication left out of scope
While extracting
nip33_stale_write_guard, I deduped the two NIP-33 replacement copies the reviewer flagged (replace_parameterized_event+ the new command path). A third copy of the FNV-1a advisory-lock-key snippet still lives inDb::replace_addressable_event(relay-signed NIP-29 group metadata, kinds 39000–39002). It predates this PR (present onmain) and was intentionally left untouched here:channel_id(the relay is the author, so pubkey isn't the natural key), so it isn't a drop-in for the same helper.Flagging it as a good standalone follow-up (factor the lock-key into a shared
fnv_coordinate_lock_keyand have all three call it).