Skip to content

[ZEPPELIN-6419] Fix clone paragraph content loss caused by overly aggressive shortCircuit seq filtering#5254

Open
kevinjmh wants to merge 2 commits into
apache:masterfrom
kevinjmh:ZEPPELIN-6419
Open

[ZEPPELIN-6419] Fix clone paragraph content loss caused by overly aggressive shortCircuit seq filtering#5254
kevinjmh wants to merge 2 commits into
apache:masterfrom
kevinjmh:ZEPPELIN-6419

Conversation

@kevinjmh
Copy link
Copy Markdown
Member

What is this PR for?

Problem
When cloning a paragraph in zeppelin-web-angular, the cloned paragraph only contains the interpreter binding (e.g., %mysql) but not the actual editor content (e.g., select 1 a, 2 a). The old zeppelin-web handles this correctly.

Root Cause Analysis
The backend copyParagraph is a two-step operation:

1. insertParagraph() → broadcasts PARAGRAPH_ADDED (empty new paragraph, text="%mysql\n")
2. updateParagraph() → broadcasts PARAGRAPH        (full text="%mysql\nselect 1 a, 2 a")

Step 2's PARAGRAPH response gets silently discarded by the frontend shortCircuit mechanism in message.ts. The original filter logic compares the currently-sent message sequence number against the received response's sequence number:

// OLD logic — overly aggressive
if (this.lastMsgIdSeqSent > msgIdSeqReceived) {
  // "message is already updated by shortcircuit" → discard!
  return false;
}

The problem: between sending COPY_PARAGRAPH (seq=49) and receiving its PARAGRAPH response, other unrelated messages like EDITOR_SETTING (seq=50) may be sent. This makes lastMsgIdSeqSent (50) > msgIdSeqReceived (49), causing the legitimate PARAGRAPH response for the cloned paragraph to be incorrectly filtered out.

image

Solution
Replace the implicit sequence-number comparison with an explicit tracking set (shortCircuitedParagraphMsgIds). Only messages that were explicitly passed to shortCircuit() get filtered — not all messages where lastMsgIdSeqSent > receivedSeq.

What type of PR is it?

Bug Fix

Todos

  • - Task

What is the Jira issue?

How should this be tested?

  • Strongly recommended: add automated unit tests for any new or changed behavior
  • Outline any manual steps to test the PR here.

Screenshots (if appropriate)

Questions:

  • Does the license files need to update?
  • Is there breaking changes for older versions?
  • Does this needs documentation?

@pan3793 pan3793 requested a review from tbonelee May 23, 2026 12:53
Copy link
Copy Markdown
Contributor

@tbonelee tbonelee left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for tracking this down. I've been going through both the client and server side and wanted to share a few observations worth weighing before landing. I might be missing context on some of these, so please correct me where I'm wrong.

1. The new tracking Set may not be populated under current code paths

The only caller of shortCircuit() I could find is runParagraph() (message.ts:432), which invokes it with op: OP.PARAGRAPH_STATUS and no msgId. The guard in the new shortCircuit():

if (message.op === OP.PARAGRAPH && message.msgId) { ... add to set ... }

doesn't appear to match in that path, so shortCircuitedParagraphMsgIds would stay empty during normal usage. If that's right, the receive-side filter would always pass through, and the runtime effect of this PR becomes "PARAGRAPH filter is bypassed." That does fix clone loss, but the Set / cap / eviction logic could end up being dead in practice. Worth checking whether there's a future caller you have in mind.

2. The 2020 heuristic may have been targeting something different from what it actually dropped

The server emits OP.PARAGRAPH in two regimes:

With echoed msgId (visible to the filter):

  • COMMIT_PARAGRAPH: NotebookServer.java:1095
  • PARAGRAPH_CLEAR_OUTPUT: NotebookServer.java:1241
  • COPY_PARAGRAPH: NotebookServer.java:1464 calls updateParagraph which broadcasts at :1095. This seems to be the clone-loss path.
  • Personalized-mode RUN_PARAGRAPH unicast: NotebookServer.java:1549

Without msgId (bypassed by if (!message.msgId) return true):

  • Paragraph execution lifecycle (PENDING / RUNNING / FINISHED): NotebookServer.java:2008 uses MSG_ID_NOT_DEFINED
  • Background output clears: :1756
  • runAllParagraphs fallback: :1499

Non-personalized runParagraph (NotebookServer.java:1528-1565) doesn't seem to echo OP.PARAGRAPH with msgId from onSuccess. The status transitions arrive via the lifecycle path without msgId and bypass the filter anyway. The server path may have looked different at the time ZEPPELIN-4985 was introduced, so the heuristic might have been more accurate originally.

3. Suggestion: keep shortCircuit, but drop the dedup branch

The instant-PENDING feedback from shortCircuit({op: OP.PARAGRAPH_STATUS, ...}) is routed to onParagraphStatus (paragraph-base.ts:81), so it's independent of the PARAGRAPH filter. And paragraph-base.ts:218 isUpdateRequired() looks like it would absorb at least some of the no-op updates from redundant echoes. Given those two, dropping the PARAGRAPH filter branch entirely might be a simpler fix than rebuilding the dedup on top of the Set.

Roughly -21 / +0 in message.ts:

receive<K extends keyof MessageReceiveDataTypeMap>(op: K): Observable<...> {
  return this.received$.pipe(
    filter(message => message.op === op),
    map(message => message.data)
  ) as Observable<...>;
}

shortCircuit(message: WebSocketMessage<MessageReceiveDataTypeMap>) {
  this.received$.next(this.interceptReceived(message));
}

Plus removing the MAX_SHORT_CIRCUIT_SIZE constant and shortCircuitedParagraphMsgIds field.

4. Local check

With an artificial Thread.sleep on the server's paragraph run path:

  • Clone preserves text correctly
  • PENDING / RUNNING / FINISHED transitions remain instant through the existing shortCircuit(PARAGRAPH_STATUS) path

I might be missing scenarios the original code was trying to handle, so feel free to push back if anything here doesn't add up.

@tbonelee
Copy link
Copy Markdown
Contributor

@jongyoul @voidmatcha Could you take a look at this when you get a chance? I only verified the clone preservation and PENDING / RUNNING / FINISHED transition scenarios locally, and there might be other cases the original filter was trying to handle that I didn't think to check.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants