[Issue 1448][Producer] Fix producer stuck in Send() function when message chunk number is larger than MaxPendingMessages#1462

Closed
geniusjoe wants to merge 1 commit into apache:master from
geniusjoe:bugfix/fix-producer-stuck-when-sending-chunk-message-with-block-pending-queue

Conversation

@geniusjoe
Contributor

Fixes #1448

Motivation

Refer to issue:

When we produce a message with payload > maxChunkSize * MaxPendingMessages, this single message occupies all of the partition producer's p.publishSemaphore permits yet can never acquire the remaining permits it needs, so none of them are ever released and the entire partition producer's sending progress blocks forever.

The primary reason the Java SDK does not have this message size limit is its different chunk-sending strategy:
In Java, each chunk split from the original message acquires a semaphore permit for one message and is then written to the pendingQueue for asynchronous sending.
In Go, however, the producer must wait until all permits for the current message are acquired before sending the entire batch of chunks to the pendingQueue. The Java per-chunk behavior (also exercised by testBlockIfQueueFullWhenChunking()) can be seen in ProducerImpl:

https://github.com/apache/pulsar/blob/f0ec07b3d8c5cfe36942957fc0ad32e40d69320d/pulsar-client/src/main/java/org/apache/pulsar/client/impl/ProducerImpl.java#L657

for (int chunkId = 0; chunkId < totalChunks; chunkId++) {
    ...
    // check pendingQueue permit 
    if (chunkId > 0 && conf.isBlockIfQueueFull() && !canEnqueueRequest(callback,
            message.getSequenceId(), 0 /* The memory was already reserved */)) {
        ...
        return;
    }
    // send chunk message individually 
    synchronized (this) {
        // Update the message metadata before computing the payload chunk size
        // to avoid a large message cannot be split into chunks.
        final long sequenceId = updateMessageMetadataSequenceId(msgMetadata);
        String uuid = totalChunks > 1 ? String.format("%s-%d", producerName, sequenceId) : null;
        serializeAndSendMessage(msg, payload, sequenceId, uuid, chunkId, totalChunks,
                readStartIndex, payloadChunkSize, compressedPayload, compressed,
                compressedPayload.readableBytes(), callback, chunkedMessageCtx, messageId);
        readStartIndex = ((chunkId + 1) * payloadChunkSize);
    }
}
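
In contrast, the Go client's acquire-all-then-enqueue strategy described above can be sketched as follows. This is a minimal illustration with a channel-backed semaphore; all names (semaphore, acquireForChunks) are hypothetical and not the actual pulsar-client-go internals:

```go
package main

import "fmt"

// semaphore is a minimal channel-backed counting semaphore.
type semaphore chan struct{}

func newSemaphore(permits int) semaphore { return make(semaphore, permits) }

// tryAcquire returns false instead of blocking when no permit is free.
// A blocking acquire here is exactly what deadlocks the real producer:
// the message holding the permits is the only one that could release them.
func (s semaphore) tryAcquire() bool {
	select {
	case s <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s semaphore) release() { <-s }

// acquireForChunks models the Go client acquiring every permit for a chunked
// message up front; it returns how many permits it managed to acquire.
func acquireForChunks(sem semaphore, totalChunks int) int {
	acquired := 0
	for i := 0; i < totalChunks; i++ {
		if !sem.tryAcquire() {
			break
		}
		acquired++
	}
	return acquired
}

func main() {
	const maxPendingMessages = 3
	sem := newSemaphore(maxPendingMessages)

	// A message needing 5 chunks can only ever obtain 3 permits; with a
	// blocking acquire it would wait forever for the remaining 2.
	acquired := acquireForChunks(sem, 5)
	fmt.Printf("acquired %d of 5 permits\n", acquired)

	// Cleanup path: the partially acquired permits must be released,
	// which is the second issue this PR addresses.
	for i := 0; i < acquired; i++ {
		sem.release()
	}
	fmt.Printf("permits in use after cleanup: %d\n", len(sem))
}
```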

Modifications

  1. Added validation for message payload size and pendingQueue capacity in pulsar/producer_partition.go#updateChunkInfo. Note that this bugfix does not resolve the potential deadlock caused by multiple concurrent chunked messages; see https://github.com/apache/pulsar/blob/f0ec07b3d8c5cfe36942957fc0ad32e40d69320d/pulsar-broker/src/test/java/org/apache/pulsar/client/impl/MessageChunkingTest.java#L706

  2. Added cleanup handling for the case where a chunked message holds only a portion of the semaphore permits, fixing the issue where those permits were not released after the request is destroyed in sendRequest.done()
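
The validation in modification 1 can be sketched roughly as below. The function name validateChunkCount and its signature are hypothetical; only the "reject when totalChunks exceeds the pending-queue capacity" comparison and the ErrMessageTooLarge result come from the PR description:

```go
package main

import (
	"errors"
	"fmt"
)

// errMessageTooLarge stands in for the client's ErrMessageTooLarge.
var errMessageTooLarge = errors.New("message size exceeds MaxMessageSize")

// validateChunkCount sketches the check added to updateChunkInfo: because the
// Go client acquires every permit for a message before queuing any chunk, a
// message whose chunk count exceeds MaxPendingMessages could never acquire
// them all, so it is rejected up front instead of blocking forever.
func validateChunkCount(totalChunks, maxPendingMessages int) error {
	if totalChunks > maxPendingMessages {
		return errMessageTooLarge
	}
	return nil
}

func main() {
	fmt.Println(validateChunkCount(10, 10)) // exactly fills the queue: allowed
	fmt.Println(validateChunkCount(11, 10)) // would deadlock: rejected
}
```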

Verifying this change

  • Make sure that the change passes the CI checks.

This change added tests and can be verified as follows:

  • TestChunkBlockIfQueueFullWithoutTimeout
  • TestSemaphoreStateWithChunkAndTimeout

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)

Documentation

  • Does this pull request introduce a new feature? (no)

@geniusjoe geniusjoe force-pushed the bugfix/fix-producer-stuck-when-sending-chunk-message-with-block-pending-queue branch from c5cb6a9 to 93cb0b6 on January 29, 2026 15:33
@RobertIndie RobertIndie requested a review from Copilot January 30, 2026 08:43
Contributor

Copilot AI left a comment


Pull request overview

This PR fixes a producer deadlock when sending a chunked message whose chunk count exceeds MaxPendingMessages, and ensures semaphore permits are correctly released in partial-failure/timeout paths.

Changes:

  • Add chunk-count vs publishSemaphore capacity validation in updateChunkInfo, returning ErrMessageTooLarge when a single message would require more semaphores than the pending queue can support.
  • Track and release the exact number of acquired semaphore permits via reservedSemaphore, including for chunked messages and partial-acquire failure paths, and recycle the parent sendRequest after splitting into per-chunk sendRequests.
  • Extend the internal Semaphore interface with MaxPermits and CurrentPermits accessors and add tests around semaphore state and chunking behavior to verify no permit leakage and correct behavior when queue limits are exceeded.
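
The extended semaphore surface listed above might look like the following sketch. The method names MaxPermits and CurrentPermits come from the PR description; the channel-based implementation and the semantics chosen here (CurrentPermits reporting permits currently in use) are assumptions for illustration, not the actual pulsar/internal/semaphore.go code:

```go
package main

import "fmt"

// Semaphore sketches the extended interface with the two new accessors.
type Semaphore interface {
	TryAcquire() bool
	Release()
	MaxPermits() int32     // total capacity (MaxPendingMessages)
	CurrentPermits() int32 // permits currently in use (assumed semantics)
}

// chanSemaphore is an illustrative channel-backed implementation.
type chanSemaphore struct {
	max int32
	ch  chan struct{}
}

func NewSemaphore(max int32) Semaphore {
	return &chanSemaphore{max: max, ch: make(chan struct{}, max)}
}

func (s *chanSemaphore) TryAcquire() bool {
	select {
	case s.ch <- struct{}{}:
		return true
	default:
		return false
	}
}

func (s *chanSemaphore) Release()              { <-s.ch }
func (s *chanSemaphore) MaxPermits() int32     { return s.max }
func (s *chanSemaphore) CurrentPermits() int32 { return int32(len(s.ch)) }

func main() {
	sem := NewSemaphore(2)
	sem.TryAcquire()
	fmt.Println(sem.MaxPermits(), sem.CurrentPermits()) // 2 1
	sem.Release()
	fmt.Println(sem.CurrentPermits()) // 0
}
```

With these accessors, updateChunkInfo can compare a message's totalChunks against MaxPermits(), and tests can assert that CurrentPermits() returns to zero after failures, i.e. that no permits leak.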

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

  • pulsar/producer_partition.go — Enforces a cap on chunks per message relative to MaxPendingMessages, adjusts semaphore reservation/release logic for chunked sends, and recycles the original sendRequest after chunk splitting.
  • pulsar/message_chunking_test.go — Adds regression tests for blocking behavior without timeouts and for semaphore state (with and without chunking, and with timeouts) to ensure no stuck producers and no permit leaks.
  • pulsar/internal/semaphore.go — Extends the semaphore interface/implementation with MaxPermits() and CurrentPermits() to support the new validation and testing around pending queue capacity.


}

// Since a single message first generates all chunks and then acquires the semaphore at once,
// we need to ensure the number of chunks generated by a single message is less than the size of the pendingQueue.

Copilot AI Jan 30, 2026


The comment here says the number of chunks must be "less than" the size of the pending queue, but the actual condition allows totalChunks == MaxPermits() (it only rejects when totalChunks > MaxPermits). To avoid confusion for future maintainers, please either update the wording to "less than or equal to" or tighten the condition to match the current wording.

Suggested change
// we need to ensure the number of chunks generated by a single message is less than the size of the pendingQueue.
// we need to ensure the number of chunks generated by a single message is less than or equal to the size of the pendingQueue.

Comment on lines +533 to +539
func TestChunkBlockIfQueueFullWithoutTimeout(t *testing.T) {
client, err := NewClient(ClientOptions{
URL: lookupURL,
})
if err != nil {
t.Fatal(err)
}

Copilot AI Jan 30, 2026


NewClient allocates network resources but this test never calls client.Close(), which can leave goroutines and connections open across tests. To keep resource usage consistent with other tests in this file, consider deferring client.Close() after successful client creation.

})
if err != nil {
t.Fatal(err)
}

Copilot AI Jan 30, 2026


Similarly to TestSemaphoreState and TestSemaphoreStateWithChunk, this test creates a client but never calls client.Close(), which can leak resources across tests. Please add a defer client.Close() after client creation for consistency and to avoid lingering connections.

Suggested change
}
}
defer client.Close()


ctx, cancel := context.WithTimeout(context.Background(), 10*time.Second)
defer cancel()
// Large messages will be split into 11 chunks, exceeding the length of pending queue

Copilot AI Jan 30, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The comment says this payload will be split into 11 chunks, but with ChunkMaxMessageSize set to 10 and a payload size of 100, the chunking logic computes totalChunks = ceil(100 / 10) = 10. To prevent confusion when someone adjusts these tests or debugging chunking behavior, please update the comment to reflect the actual number of chunks (or rephrase it generically as "more chunks than the pending queue capacity").

Suggested change
// Large messages will be split into 11 chunks, exceeding the length of pending queue
// Large messages will be split into more chunks than the length of the pending queue

Member

@RobertIndie RobertIndie left a comment


Under the default block-if-queue-full behavior, users expect maxPendingMessages to act purely as a throttling mechanism, not as a trigger for exceptions.

In this case, could we allow temporarily exceeding maxPendingMessages? I don’t think we need strict or exact enforcement here.

@geniusjoe
Contributor Author

geniusjoe commented Feb 24, 2026

Under the default block-if-queue-full behavior, users expect maxPendingMessages to act purely as a throttling mechanism, not as a trigger for exceptions.
In this case, could we allow temporarily exceeding maxPendingMessages? I don’t think we need strict or exact enforcement here.

I agree with your assessment. The implementation in this PR, which uses exceptions to reject chunked messages, does not adequately meet user requirements. We will close this PR for now and consider implementing a new approach based on throttling in the future.

@geniusjoe geniusjoe closed this Feb 24, 2026


Development

Successfully merging this pull request may close these issues.

[Bug] Producer will stuck in Send() when a message chunk number larger than MaxPendingMessages

3 participants